Evaluation Super Agent Tutorial¶
Objective¶
Use the AI Refinery SDK to create and run an evaluation system that assesses the performance of your utility agents. The Evaluation Super Agent provides a structured approach to measuring agent performance across various metrics and generating comprehensive performance reports.
What is the Evaluation Super Agent?¶
The Evaluation Super Agent is a specialized agent designed to evaluate the performance of utility agents within the AI Refinery framework. It works by:
- Generating or using predefined test queries tailored to the agent being evaluated
- Collecting responses from the agent for each query
- Evaluating those responses based on configurable metrics
- Providing detailed evaluation reports with scores, insights, and recommendations
This automated evaluation system helps identify strengths and weaknesses in your agent implementations, allowing for continuous improvement of your AI solutions.
Steps¶
1. Creating the Configuration File¶
The first step is to create a YAML configuration file that defines:
- The orchestration setup
- The Evaluation Super Agent configuration
- The agents to be evaluated
- The evaluation metrics and sample queries
Here's a sample configuration file:
```yaml
orchestrator:
  agent_list:
    - agent_name: "Evaluation Super Agent"

super_agents:
  - agent_class: EvaluationSuperAgent
    agent_name: "Evaluation Super Agent"
    agent_description: "Evaluates the response quality of target utility agents based on predefined metrics, rubrics and scales."
    config:
      agent_list:
        - agent_name: "Search Agent"
          evaluation_config:
            metrics:
              - metric_name: "Relevance"
                rubric: "Assess whether the response directly answers the query."
                scale: "1-5"
              - metric_name: "Coherence"
                rubric: "Check if the response is logically structured and understandable."
                scale: "1-5"
              - metric_name: "Accuracy"
                rubric: "Evaluate if the response provides factually correct information."
                scale: "1-5"
              - metric_name: "Conciseness"
                rubric: "Determine if the response is clear and to the point without unnecessary details."
                scale: "1-5"
              - metric_name: "Source Quality"
                rubric: "Evaluate the credibility and reliability of the sources cited in the response."
                scale: "1-5"
            sample_queries:
              - sample: "What is the capital of France?"
                ground_truth_answer: "Paris"
              - sample: "Who is the third president of the United States?"
                ground_truth_answer: "Thomas Jefferson"

utility_agents:
  - agent_class: SearchAgent
    agent_name: "Search Agent"
    agent_description: "The agent provides answers based on online search results, retrieving information from the internet to respond to user queries."
```
Configuration Key Components¶
- Orchestrator Section: Lists the agents available in your project, including the Evaluation Super Agent.
- Super Agents Section: Defines the Evaluation Super Agent and its configuration:
  - `agent_class`: Specifies the class name as "EvaluationSuperAgent"
  - `agent_name`: Custom name for the agent
  - `agent_description`: Description of the agent's function
  - `config`: The evaluation configuration, including:
    - `agent_list`: List of agents to evaluate
- Evaluation Configuration:
  - `metrics`: List of evaluation criteria, each with:
    - `metric_name`: Name of the metric
    - `rubric`: Description of what the metric measures
    - `scale`: Scale for measurement (e.g., "1-5")
  - `sample_queries`: List of test queries, each with:
    - `sample`: The query text
    - `ground_truth_answer`: The expected answer (optional)
- Utility Agents Section: Defines the agents to be evaluated.
2. Creating the Python Script¶
Next, create a Python script to execute the evaluation using the AI Refinery SDK:
```python
import os
import asyncio
import traceback

from air import login, DistillerClient

# Authentication setup
auth = login(
    account=str(os.getenv("ACCOUNT")),
    api_key=str(os.getenv("API_KEY")),
)

base_url = os.getenv("AIREFINERY_ADDRESS", "")


async def run_evaluation():
    # Create a distiller client
    print("Initializing DistillerClient...")
    distiller_client = DistillerClient(base_url=base_url)

    config_file = "evaluation_config.yaml"  # Your configuration file name
    project_name = "agent_evaluation"  # Your project name

    print(f"Creating project with config: {config_file}...")
    try:
        # Upload the evaluation config file to register a new project
        distiller_client.create_project(config_path=config_file, project=project_name)
        print(f"Project {project_name} created successfully.")
    except Exception as e:
        print(f"ERROR creating project: {str(e)}")
        traceback.print_exc()
        return

    # Define any custom agents if needed
    custom_agent_gallery = {}

    print("Initializing client session...")
    async with distiller_client(
        project=project_name,
        uuid="evaluation_session",
        custom_agent_gallery=custom_agent_gallery,
    ) as dc:
        print("Sending query...")
        try:
            responses = await dc.query(query="Please evaluate the Search Agent.")
            print("Query sent successfully, waiting for responses...")

            # Process each response message as it comes in,
            # without printing the raw JSON output
            async for response in responses:
                text = response["content"]
                cutoff_index = text.find("## Raw JSON output")
                if cutoff_index == -1:
                    print(text)
                else:
                    # Print only the part of the message before the raw JSON output
                    print(text[:cutoff_index])
        except Exception as e:
            print(f"ERROR during query execution: {str(e)}")
            traceback.print_exc()


if __name__ == "__main__":
    print(f"Using base_url: {base_url}")
    print(f"Account: {auth.account}")
    try:
        asyncio.run(run_evaluation())
    except Exception as e:
        print(f"CRITICAL ERROR: {str(e)}")
        traceback.print_exc()
```
3. Running the Evaluation¶
After setting up your configuration and script:
1. Save the YAML configuration as `evaluation_config.yaml`.
2. Save the Python script as `run_evaluation.py`.
3. Make sure your environment variables are set:
   - `ACCOUNT`: Your AI Refinery account
   - `API_KEY`: Your API key
   - `AIREFINERY_ADDRESS`: The base URL (if not using the default)
4. Run the script:
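For example, from the directory containing both files:

```bash
python run_evaluation.py
```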
The script will:
- Authenticate with AI Refinery
- Create a project using your configuration
- Send a request to evaluate the Search Agent
- Receive and display the evaluation results
4. Understanding the Evaluation Results¶
The evaluation results include:
- Per-Query Assessments: Each test query is individually evaluated against the metrics.
- Metrics Scoring: Scores for each metric (e.g., Relevance, Coherence, Accuracy).
- Detailed Feedback: Qualitative feedback explaining the scores.
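If you want to keep a copy of the report for later review, you can collect the streamed messages and write them to a file. The sketch below is not part of the SDK; it simply reuses the response-handling pattern from `run_evaluation.py` above, and the `save_evaluation_report` helper name and output filename are our own choices.

```python
async def save_evaluation_report(responses, path: str = "evaluation_report.md") -> None:
    """Collect streamed evaluation messages and write them to a file.

    `responses` is the async iterator returned by `dc.query(...)` in
    run_evaluation.py; the raw JSON output section is stripped, as in that script.
    """
    sections = []
    async for response in responses:
        text = response["content"]
        cutoff_index = text.find("## Raw JSON output")
        sections.append(text if cutoff_index == -1 else text[:cutoff_index])

    with open(path, "w", encoding="utf-8") as f:
        f.write("\n\n".join(sections))
```

Inside the `async with` block, you could then call `await save_evaluation_report(responses)` instead of printing each message.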
Customization Options¶
Custom Metrics¶
You can define your own evaluation metrics by modifying the `metrics` section in the configuration file. Each metric requires:
- A name (`metric_name`)
- A rubric explaining what to evaluate
- A scale for measurement
Example of adding a custom "User Satisfaction" metric:
```yaml
metrics:
  - metric_name: "User Satisfaction"
    rubric: "Evaluate how likely a user would be satisfied with this response."
    scale: "1-10"
```
Custom Test Queries¶
You can define your own test queries in the `sample_queries` section. Adding ground truth answers helps the evaluation agent better assess accuracy.
Example of adding custom queries:
```yaml
sample_queries:
  - sample: "Explain quantum computing in simple terms."
    ground_truth_answer: null  # No specific ground truth
  - sample: "What year was the Declaration of Independence signed?"
    ground_truth_answer: "1776"
```
Automatic Query Generation¶
If you don't specify `sample_queries`, the Evaluation Super Agent can automatically generate test queries based on the agent's description. This is useful when:
- You're not sure what to test
- You want a diverse set of test cases
- You want to avoid bias in your evaluation
To use automatic query generation, simply omit the `sample_queries` section in your configuration.
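For example, the Search Agent's entry from the earlier configuration would look like the sketch below with automatic query generation; this assumes the same structure as before, just without the `sample_queries` section.

```yaml
config:
  agent_list:
    - agent_name: "Search Agent"
      evaluation_config:
        metrics:
          - metric_name: "Relevance"
            rubric: "Assess whether the response directly answers the query."
            scale: "1-5"
          # ... remaining metrics as before; with no sample_queries section,
          # test queries are generated automatically
```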
Advanced Use Cases¶
Evaluating Multiple Agents¶
To evaluate multiple agents, simply add them to the `agent_list` in your configuration:
```yaml
config:
  agent_list:
    - agent_name: "Search Agent"
      evaluation_config:
        metrics: [...]
    - agent_name: "Research Agent"
      evaluation_config:
        metrics: [...]
    - agent_name: "Coding Agent"
      evaluation_config:
        metrics: [...]
```
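Each agent named here also needs to be available in your project. Presumably that means defining the Research Agent and Coding Agent in the `utility_agents` section alongside the Search Agent, as in the sketch below; the agent classes and descriptions for the two additional agents are illustrative placeholders, not part of the earlier example.

```yaml
utility_agents:
  - agent_class: SearchAgent
    agent_name: "Search Agent"
    agent_description: "The agent provides answers based on online search results."
  # The two entries below are illustrative placeholders
  - agent_class: ResearchAgent
    agent_name: "Research Agent"
    agent_description: "The agent performs in-depth research and summarizes its findings."
  - agent_class: CodingAgent
    agent_name: "Coding Agent"
    agent_description: "The agent writes and explains code."
```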
Conclusion¶
The Evaluation Super Agent provides a powerful framework for assessing and improving your AI agents. By systematically evaluating performance across various metrics, you can identify strengths and weaknesses, make targeted improvements, and track progress over time.
For more detailed information, refer to the Agent Library/super_agents Documentation on the Evaluation Super Agent.