Evaluation Super Agent Tutorial¶
Objective¶
Use the AI Refinery SDK to create and run an evaluation system that assesses the performance of your utility agents. The Evaluation Super Agent provides a structured approach to measuring agent performance across various metrics and generating comprehensive performance reports.
What is the Evaluation Super Agent?¶
The Evaluation Super Agent is a specialized agent designed to evaluate the performance of utility agents within the AI Refinery framework. It works by:
- Generating or using predefined test queries tailored to the agent being evaluated
- Collecting responses from the agent for each query
- Evaluating those responses based on configurable metrics
- Providing detailed evaluation reports with scores, insights, and recommendations
This automated evaluation system helps identify strengths and weaknesses in your agent implementations, allowing for continuous improvement of your AI solutions. It supports both text-based and image-based agents, enabling multimodal evaluation workflows out of the box.
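Conceptually, the loop the Evaluation Super Agent runs is simple: for each test query, collect a response, then score it against each metric. The sketch below illustrates that flow in plain Python; `query_agent`, `score_response`, and the report layout are hypothetical stand-ins, not SDK APIs:

```python
# Hypothetical sketch of the evaluation loop; `query_agent` and
# `score_response` stand in for the real agent call and the LLM judge.
def evaluate_agent(queries, metrics, query_agent, score_response):
    report = []
    for q in queries:
        response = query_agent(q["sample"])  # collect the agent's answer
        scores = {
            m["metric_name"]: score_response(response, m["rubric"], m["scale"])
            for m in metrics
        }
        report.append({"query": q["sample"], "response": response, "scores": scores})
    return report

# Toy usage with stub callables standing in for a real agent and judge
queries = [{"sample": "What is the capital of France?"}]
metrics = [{"metric_name": "Relevance", "rubric": "...", "scale": "1-5"}]
report = evaluate_agent(
    queries,
    metrics,
    query_agent=lambda q: "Paris",
    score_response=lambda resp, rubric, scale: 5,
)
print(report[0]["scores"])  # → {'Relevance': 5}
```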
Steps¶
1. Creating the Configuration File¶
The first step is to create a YAML configuration file that defines:
- The orchestration setup
- The Evaluation Super Agent configuration
- The agents to be evaluated
- The evaluation metrics and sample queries
Here's a sample configuration file:
```yaml
orchestrator:
  agent_list:
    - agent_name: "Evaluation Super Agent"

super_agents:
  - agent_class: EvaluationSuperAgent
    agent_name: "Evaluation Super Agent"
    agent_description: "Evaluates the response quality of target utility agents based on predefined metrics, rubrics, and scales."
    config:
      agent_list:
        - agent_name: "Search Agent"
          evaluation_config:
            metrics:
              - metric_name: "Relevance"
                rubric: "Assess whether the response directly answers the query."
                scale: "1-5"
              - metric_name: "Coherence"
                rubric: "Check if the response is logically structured and understandable."
                scale: "1-5"
              - metric_name: "Accuracy"
                rubric: "Evaluate if the response provides factually correct information."
                scale: "1-5"
              - metric_name: "Conciseness"
                rubric: "Determine if the response is clear and to the point without unnecessary details."
                scale: "1-5"
              - metric_name: "Source Quality"
                rubric: "Evaluate the credibility and reliability of the sources cited in the response."
                scale: "1-5"
            sample_queries:
              - sample: "What is the capital of France?"
                ground_truth_answer: "Paris"
              - sample: "Who is the third president of the United States?"
                ground_truth_answer: "Thomas Jefferson"
        - agent_name: "Image Generation Agent"
          evaluation_config:
            metrics:
              - metric_name: "Visual Quality"
                rubric: "Assess the overall visual quality, clarity, and resolution of the generated image."
                scale: "1-5"
              - metric_name: "Prompt Adherence"
                rubric: "Evaluate how well the generated image matches the textual description provided."
                scale: "1-5"
              - metric_name: "Creativity"
                rubric: "Assess the artistic creativity and originality of the generated image."
                scale: "1-5"
              - metric_name: "Composition"
                rubric: "Evaluate the balance, framing, and overall composition of the image."
                scale: "1-5"
            sample_queries:
              - sample: "Generate a photorealistic sunset over mountains with orange and purple sky"
                ground_truth_answer: null
                expected_output_type: "image"
              - sample: "Create an image of a futuristic city with flying cars and tall skyscrapers"
                ground_truth_answer: null
                expected_output_type: "image"
              - sample: "Generate an abstract representation of joy using bright colors"
                ground_truth_answer: null
                expected_output_type: "image"
              - sample: "Transform this image into a watercolor painting style"
                ground_truth_answer: null
                input_image: "https://picsum.photos/id/237/640/480"
                expected_output_type: "image"
        - agent_name: "Image Understanding Agent"
          evaluation_config:
            metrics:
              - metric_name: "Accuracy"
                rubric: "Assess whether the description accurately identifies objects, scenes, and context in the image."
                scale: "1-5"
              - metric_name: "Completeness"
                rubric: "Evaluate whether all significant elements in the image are described."
                scale: "1-5"
              - metric_name: "Detail Level"
                rubric: "Assess the level of detail provided in the image description."
                scale: "1-5"
              - metric_name: "Relevance"
                rubric: "Evaluate whether the response focuses on relevant aspects based on the query."
                scale: "1-5"
            sample_queries:
              - sample: "Describe what you see in this image in detail"
                ground_truth_answer: null
                input_image: "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/640px-PNG_transparency_demonstration_1.png"
                expected_output_type: "text"
              - sample: "What objects and colors are present in this image?"
                ground_truth_answer: null
                input_image: "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/640px-Cat03.jpg"
                expected_output_type: "text"
              - sample: "Analyze the composition and mood of this scene"
                ground_truth_answer: null
                input_image: "https://upload.wikimedia.org/wikipedia/commons/thumb/0/0a/The_Great_Wave_off_Kanagawa.jpg/640px-The_Great_Wave_off_Kanagawa.jpg"
                expected_output_type: "text"

utility_agents:
  - agent_class: SearchAgent
    agent_name: "Search Agent"
    agent_description: "The agent provides answers based on online search results, retrieving information from the internet to respond to user queries."
  - agent_class: ImageGenerationAgent
    agent_name: "Image Generation Agent"
    agent_description: "This agent can help you generate an image from a prompt."
    config:
      rewriter_config: false
      text2image_config:
        model: flux_schnell/text2image
      image2image_config:
        model: flux_schnell/image2image
  - agent_class: ImageUnderstandingAgent
    agent_name: "Image Understanding Agent"
    agent_description: "This agent can help you understand and analyze an image."
    config:
      output_style: "conversational"
      vlm_config:
        model: "Qwen/Qwen3-VL-32B-Instruct"
```
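YAML indentation errors are easy to make, so it can help to sanity-check the parsed configuration before creating a project. Assuming the file has already been loaded into a dict (e.g. with PyYAML's `yaml.safe_load`; loading is omitted here), a minimal structural check might look like:

```python
# Minimal structural check over a parsed evaluation config (a plain dict).
# The dict shape mirrors the YAML `config` section above.
def check_evaluation_config(config: dict) -> list:
    problems = []
    agents = config.get("agent_list", [])
    if not agents:
        problems.append("config.agent_list is empty")
    for agent in agents:
        name = agent.get("agent_name", "<unnamed>")
        eval_cfg = agent.get("evaluation_config", {})
        # Every metric needs a name, a rubric, and a scale
        for metric in eval_cfg.get("metrics", []):
            for key in ("metric_name", "rubric", "scale"):
                if key not in metric:
                    problems.append(f"{name}: metric missing '{key}'")
        # Every sample query needs at least the query text itself
        for query in eval_cfg.get("sample_queries", []):
            if "sample" not in query:
                problems.append(f"{name}: sample query missing 'sample'")
    return problems

config = {
    "agent_list": [
        {
            "agent_name": "Search Agent",
            "evaluation_config": {
                "metrics": [
                    {"metric_name": "Relevance", "rubric": "...", "scale": "1-5"}
                ],
                "sample_queries": [{"sample": "What is the capital of France?"}],
            },
        }
    ]
}
print(check_evaluation_config(config))  # → []
```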
Configuration Key Components¶
- Orchestrator Section: Lists the agents available in your project, including the Evaluation Super Agent.
- Super Agents Section: Defines the Evaluation Super Agent and its configuration:
  - `agent_class`: Specifies the class name as "EvaluationSuperAgent"
  - `agent_name`: Custom name for the agent
  - `agent_description`: Description of the agent's function
  - `config`: The evaluation configuration, including:
    - `agent_list`: List of agents to evaluate
- Evaluation Configuration:
  - `metrics`: List of evaluation criteria with:
    - `metric_name`: Name of the metric
    - `rubric`: Description of what the metric measures
    - `scale`: Scale for measurement (e.g., "1-5")
  - `sample_queries`: List of test queries with:
    - `sample`: The query text
    - `ground_truth_answer`: The expected answer (optional)
    - `input_image`: URL of an input image (required for image-based queries)
    - `expected_output_type`: Expected output format, either `"text"` or `"image"`
- Utility Agents Section: Defines the agents to be evaluated, including their model configurations for image-based agents.
2. Creating the Python Script¶
Next, create a Python script to execute the evaluation using the AI Refinery SDK:
```python
import asyncio
import os
import traceback

from air import DistillerClient
from dotenv import load_dotenv

load_dotenv()  # loads your API_KEY from a .env file

# Authentication setup
API_KEY = os.getenv("API_KEY", "")


async def run_evaluation():
    # Create a distiller client
    print("Initializing DistillerClient...")
    distiller_client = DistillerClient(api_key=API_KEY)

    config_file = "evaluation_config.yaml"  # Your configuration file name
    project_name = "agent_evaluation"  # Your project name

    print(f"Creating project with config: {config_file}...")
    try:
        # Upload the evaluation config file to register a new project
        distiller_client.create_project(config_path=config_file, project=project_name)
        print(f"Project {project_name} created successfully.")
    except Exception as e:
        print(f"ERROR creating project: {e}")
        traceback.print_exc()
        return

    # Define any custom agents if needed
    executor_dict = {}

    print("Initializing client session...")
    async with distiller_client(
        project=project_name,
        uuid="evaluation_session",
        executor_dict=executor_dict,
    ) as dc:
        print("Sending query...")
        try:
            # Evaluate all agents: Search, Image Generation, and Image Understanding
            responses = await dc.query(
                query="Please evaluate the Search Agent, Image Generation Agent, and Image Understanding Agent."
            )
            print("Query sent successfully, waiting for responses...")

            # Print each response as it arrives, stripping the raw JSON
            # section appended to the report so it is not printed
            async for response in responses:
                text = response["content"]
                cutoff_index = text.find("## Raw JSON output")
                if cutoff_index == -1:
                    print(text)
                else:
                    print(text[:cutoff_index])
        except Exception as e:
            print(f"ERROR during query execution: {e}")
            traceback.print_exc()


if __name__ == "__main__":
    try:
        asyncio.run(run_evaluation())
    except Exception as e:
        print(f"CRITICAL ERROR: {e}")
        traceback.print_exc()
```
3. Running the Evaluation¶
After setting up your configuration and script:
- Save the YAML configuration as `evaluation_config.yaml`
- Save the Python script as `run_evaluation.py`
- Make sure your environment variables are set:
  - `API_KEY`: Your API key
- Run the script: `python run_evaluation.py`
The script will:
- Authenticate with AI Refinery
- Create a project using your configuration
- Send a request to evaluate the Search Agent, Image Generation Agent, and Image Understanding Agent
- Receive and display the evaluation results
4. Understanding the Evaluation Results¶
The evaluation results include:
- Per-Query Assessments: Each test query is individually evaluated against the metrics.
- Metrics Scoring: Scores for each metric (e.g., Relevance, Coherence, Visual Quality, Prompt Adherence).
- Detailed Feedback: Qualitative feedback explaining the scores.
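If you want a single headline number per metric, the per-query scores can be averaged once they are collected into plain dicts. The report layout below is a hypothetical simplification for illustration, not the SDK's exact output format:

```python
from collections import defaultdict

# Average per-metric scores across queries. The input is a hypothetical
# simplification of a parsed evaluation report: one score dict per query.
def average_scores(per_query_scores):
    totals, counts = defaultdict(float), defaultdict(int)
    for scores in per_query_scores:
        for metric, value in scores.items():
            totals[metric] += value
            counts[metric] += 1
    return {metric: totals[metric] / counts[metric] for metric in totals}

scores = [
    {"Relevance": 5, "Coherence": 4},
    {"Relevance": 4, "Coherence": 4},
]
print(average_scores(scores))  # → {'Relevance': 4.5, 'Coherence': 4.0}
```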
For image-based agents, input images provided via `input_image` URLs are automatically converted to base64 for the `ImageUnderstandingAgent`, and generated image data is extracted from `response.image.image_data` for the `ImageGenerationAgent`.
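The SDK performs that conversion for you; for reference, the underlying step is a standard-library one-liner. The `image_url_to_base64` helper below is an illustrative sketch (the fetch uses `urllib`), and the demo at the bottom shows the encoding itself on in-memory bytes so no network access is needed:

```python
import base64
from urllib.request import urlopen

def image_url_to_base64(url: str) -> str:
    """Download an image and return its contents as a base64 string."""
    with urlopen(url) as resp:  # network call; use cached bytes in tests
        data = resp.read()
    return base64.b64encode(data).decode("ascii")

# The encoding step itself, shown on in-memory bytes (no network needed)
fake_image_bytes = b"\x89PNG\r\n\x1a\n"  # PNG magic header only
encoded = base64.b64encode(fake_image_bytes).decode("ascii")
assert base64.b64decode(encoded) == fake_image_bytes  # round-trips losslessly
```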
Customization Options¶
Custom Metrics¶
You can define your own evaluation metrics by modifying the metrics section in the configuration file. Each metric requires:
- A name (`metric_name`)
- A rubric explaining what to evaluate
- A scale for measurement
Example of adding a custom "User Satisfaction" metric:
```yaml
metrics:
  - metric_name: "User Satisfaction"
    rubric: "Evaluate how likely a user would be satisfied with this response."
    scale: "1-10"
```
Custom Test Queries¶
You can define your own test queries in the `sample_queries` section. Adding ground truth answers helps the evaluation agent better assess accuracy. For image-based agents, provide an `input_image` URL and set `expected_output_type` accordingly.
Example of adding custom queries:
```yaml
sample_queries:
  - sample: "Explain quantum computing in simple terms."
    ground_truth_answer: null  # No specific ground truth
  - sample: "What year was the Declaration of Independence signed?"
    ground_truth_answer: "1776"
  - sample: "Describe the objects in this image."
    ground_truth_answer: null
    input_image: "https://example.com/sample.jpg"
    expected_output_type: "text"
```
Automatic Query Generation¶
If you don't specify `sample_queries`, the Evaluation Super Agent can automatically generate test queries based on the agent's description. This is useful when:
- You're not sure what to test
- You want a diverse set of test cases
- You want to avoid bias in your evaluation
To use automatic query generation, simply omit the `sample_queries` section in your configuration.
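For example, an agent entry like the following (metrics only, no `sample_queries`) leaves query generation to the evaluator; the fragment follows the same schema as the full configuration above:

```yaml
config:
  agent_list:
    - agent_name: "Search Agent"
      evaluation_config:
        metrics:
          - metric_name: "Relevance"
            rubric: "Assess whether the response directly answers the query."
            scale: "1-5"
        # no sample_queries: the Evaluation Super Agent generates its own
```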
Advanced Use Cases¶
Evaluating Multiple Agents¶
To evaluate multiple agents, simply add them to the `agent_list` in your configuration:
```yaml
config:
  agent_list:
    - agent_name: "Search Agent"
      evaluation_config:
        metrics: [...]
    - agent_name: "Image Generation Agent"
      evaluation_config:
        metrics: [...]
    - agent_name: "Image Understanding Agent"
      evaluation_config:
        metrics: [...]
```
Conclusion¶
The Evaluation Super Agent provides a powerful framework for assessing and improving your AI agents. By systematically evaluating performance across various metrics — including support for image generation and image understanding agents — you can identify strengths and weaknesses, make targeted improvements, and track progress over time.
For more detailed information, refer to the Evaluation Super Agent page in the Agent Library/super_agents documentation.