Evaluation Super Agent Tutorial

Objective

Use the AI Refinery SDK to create and run an evaluation system that assesses the performance of your utility agents. The Evaluation Super Agent provides a structured approach to measuring agent performance across various metrics and generating comprehensive performance reports.

What is the Evaluation Super Agent?

The Evaluation Super Agent is a specialized agent designed to evaluate the performance of utility agents within the AI Refinery framework. It works by:

  1. Generating or using predefined test queries tailored to the agent being evaluated
  2. Collecting responses from the agent for each query
  3. Evaluating those responses based on configurable metrics
  4. Providing detailed evaluation reports with scores, insights, and recommendations

This automated evaluation system helps identify strengths and weaknesses in your agent implementations, allowing for continuous improvement of your AI solutions.

Steps

1. Creating the Configuration File

The first step is to create a YAML configuration file that defines:

  • The orchestration setup
  • The Evaluation Super Agent configuration
  • The agents to be evaluated
  • The evaluation metrics and sample queries

Here's a sample configuration file:

orchestrator:
  agent_list:
    - agent_name: "Evaluation Super Agent"

super_agents:
  - agent_class: EvaluationSuperAgent
    agent_name: "Evaluation Super Agent"
    agent_description: "Evaluates the response quality of target utility agents based on predefined metrics, rubrics and scales."
    config:
      agent_list:
        - agent_name: "Search Agent"
          evaluation_config:
            metrics:
              - metric_name: "Relevance"
                rubric: "Assess whether the response directly answers the query."
                scale: "1-5"
              - metric_name: "Coherence"
                rubric: "Check if the response is logically structured and understandable."
                scale: "1-5"
              - metric_name: "Accuracy"
                rubric: "Evaluate if the response provides factually correct information."
                scale: "1-5"
              - metric_name: "Conciseness"
                rubric: "Determine if the response is clear and to the point without unnecessary details."
                scale: "1-5"
              - metric_name: "Source Quality"
                rubric: "Evaluate the credibility and reliability of the sources cited in the response."
                scale: "1-5"
            sample_queries:
              - sample: "What is the capital of France?"
                ground_truth_answer: "Paris"
              - sample: "Who is the third president of United States?"
                ground_truth_answer: "Thomas Jefferson"

utility_agents:
  - agent_class: SearchAgent
    agent_name: "Search Agent"
    agent_description: "The agent provides answers based on online search results, retrieving information from the internet to respond to user queries."

Configuration Key Components

  1. Orchestrator Section: Lists the agents available in your project, including the Evaluation Super Agent.

  2. Super Agents Section: Defines the Evaluation Super Agent and its configuration:

    • `agent_class`: Specifies the class name, EvaluationSuperAgent

    • `agent_name`: Custom name for the agent

    • `agent_description`: Description of the agent's function

    • `config`: The evaluation configuration, including:

      • `agent_list`: List of agents to evaluate

  3. Evaluation Configuration: Set per agent under `evaluation_config`:

    • `metrics`: List of evaluation criteria, each with:

      • `metric_name`: Name of the metric

      • `rubric`: Description of what the metric measures

      • `scale`: Scale for measurement (e.g., "1-5")

    • `sample_queries`: List of test queries, each with:

      • `sample`: The query text

      • `ground_truth_answer`: The expected answer (optional)

  4. Utility Agents Section: Defines the agents to be evaluated.
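
For reference, the skeleton below distills the configuration structure described above. The placeholder values in angle brackets stand in for your own settings; the full sample configuration earlier in this section shows them filled in:

super_agents:
  - agent_class: EvaluationSuperAgent
    agent_name: "<name for the evaluation agent>"
    agent_description: "<what the evaluation agent does>"
    config:
      agent_list:
        - agent_name: "<utility agent to evaluate>"
          evaluation_config:
            metrics:                # one or more evaluation criteria
              - metric_name: "<name>"
                rubric: "<what to evaluate>"
                scale: "<e.g., 1-5>"
            sample_queries:         # optional; omit to auto-generate test queries
              - sample: "<query text>"
                ground_truth_answer: "<expected answer, optional>"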

2. Creating the Python Script

Next, create a Python script to execute the evaluation using the AI Refinery SDK:

import os
import asyncio
import traceback
from air import login, DistillerClient

# Authentication setup
auth = login(
    account=str(os.getenv("ACCOUNT")),
    api_key=str(os.getenv("API_KEY")),
)
base_url = os.getenv("AIREFINERY_ADDRESS", "")

async def run_evaluation():
    # Create a distiller client
    print("Initializing DistillerClient...")
    distiller_client = DistillerClient(base_url=base_url)
    config_file = "evaluation_config.yaml"  # Your configuration file name
    project_name = "agent_evaluation"  # Your project name

    print(f"Creating project with config: {config_file}...")
    try:
        # Upload evaluation config file to register a new project
        distiller_client.create_project(config_path=config_file, project=project_name)
        print(f"Project {project_name} created successfully.")
    except Exception as e:
        print(f"ERROR creating project: {str(e)}")
        traceback.print_exc()
        return

    # Define any custom agents if needed
    custom_agent_gallery = {}

    print("Initializing client session...")
    async with distiller_client(
        project=project_name,
        uuid="evaluation_session",
        custom_agent_gallery=custom_agent_gallery,
    ) as dc:
        print("Sending query...")
        try:
            responses = await dc.query(query="Please evaluate the Search Agent.")
            print("Query sent successfully, waiting for responses...")

            # Process each response message as it arrives,
            # printing everything except the raw JSON output section
            async for response in responses:
                text = response["content"]
                cutoff_index = text.find("## Raw JSON output")
                if cutoff_index == -1:
                    print(text)
                else:
                    # Only print the content that precedes the raw JSON output
                    print(text[:cutoff_index])
        except Exception as e:
            print(f"ERROR during query execution: {str(e)}")
            traceback.print_exc()

if __name__ == "__main__":
    print(f"Using base_url: {base_url}")
    print(f"Account: {auth.account}")
    try:
        asyncio.run(run_evaluation())
    except Exception as e:
        print(f"CRITICAL ERROR: {str(e)}")
        traceback.print_exc()

3. Running the Evaluation

After setting up your configuration and script:

  1. Save the YAML configuration as evaluation_config.yaml

  2. Save the Python script as run_evaluation.py

  3. Make sure your environment variables are set:

    • `ACCOUNT`: Your AI Refinery account

    • `API_KEY`: Your API key

    • `AIREFINERY_ADDRESS`: The base URL (if not using the default)

  4. Run the script:

    python run_evaluation.py
    

The script will:

  1. Authenticate with AI Refinery
  2. Create a project using your configuration
  3. Send a request to evaluate the Search Agent
  4. Receive and display the evaluation results

4. Understanding the Evaluation Results

The evaluation results include:

  1. Per-Query Assessments: Each test query is individually evaluated against the metrics.
  2. Metrics Scoring: Scores for each metric (e.g., Relevance, Coherence, Accuracy).
  3. Detailed Feedback: Qualitative feedback explaining the scores.

Customization Options

Custom Metrics

You can define your own evaluation metrics by modifying the metrics section in the configuration file. Each metric requires:

  • A name (metric_name)
  • A rubric explaining what to evaluate
  • A scale for measurement

Example of adding a custom "User Satisfaction" metric:

metrics:
  - metric_name: "User Satisfaction"
    rubric: "Evaluate how likely a user would be satisfied with this response."
    scale: "1-10"

Custom Test Queries

You can define your own test queries in the sample_queries section. Adding ground truth answers helps the evaluation agent better assess accuracy.

Example of adding custom queries:

sample_queries:
  - sample: "Explain quantum computing in simple terms."
    ground_truth_answer: null  # No specific ground truth
  - sample: "What year was the Declaration of Independence signed?"
    ground_truth_answer: "1776"

Automatic Query Generation

If you don't specify sample_queries, the Evaluation Super Agent can automatically generate test queries based on the agent's description. This is useful when:

  • You're not sure what to test
  • You want a diverse set of test cases
  • You want to avoid bias in your evaluation

To use automatic query generation, simply omit the sample_queries section in your configuration.
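
For example, a configuration that relies on automatic query generation is a minimal variant of the earlier sample: the evaluation_config keeps its metrics but leaves out sample_queries entirely.

config:
  agent_list:
    - agent_name: "Search Agent"
      evaluation_config:
        metrics:
          - metric_name: "Relevance"
            rubric: "Assess whether the response directly answers the query."
            scale: "1-5"
        # No sample_queries section: the Evaluation Super Agent generates
        # test queries from the Search Agent's description.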

Advanced Use Cases

Evaluating Multiple Agents

To evaluate multiple agents, simply add them to the agent_list in your configuration:

config:
  agent_list:
    - agent_name: "Search Agent"
      evaluation_config:
        metrics: [...]
    - agent_name: "Research Agent"
      evaluation_config:
        metrics: [...]
    - agent_name: "Coding Agent"
      evaluation_config:
        metrics: [...]
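
As in the sample configuration, each agent referenced in agent_list also needs a matching entry in the utility_agents section. The sketch below shows the idea; the ResearchAgent and CodingAgent class names and descriptions are illustrative assumptions, so replace them with the agent classes actually available in your project.

utility_agents:
  - agent_class: SearchAgent
    agent_name: "Search Agent"
    agent_description: "The agent provides answers based on online search results."
  - agent_class: ResearchAgent   # assumed class name; replace with your agent's class
    agent_name: "Research Agent"
    agent_description: "Performs in-depth research on user topics."
  - agent_class: CodingAgent     # assumed class name; replace with your agent's class
    agent_name: "Coding Agent"
    agent_description: "Generates and explains code."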

Conclusion

The Evaluation Super Agent provides a powerful framework for assessing and improving your AI agents. By systematically evaluating performance across various metrics, you can identify strengths and weaknesses, make targeted improvements, and track progress over time.

For more detailed information, refer to the Evaluation Super Agent page in the Agent Library/super_agents documentation.