Evaluation Super Agent Tutorial¶
Objective¶
Use the AI Refinery SDK to create and run an evaluation system that assesses the performance of your utility agents. The Evaluation Super Agent provides a structured approach to measuring agent performance across various metrics and generating comprehensive performance reports.
What is the Evaluation Super Agent?¶
The Evaluation Super Agent is a specialized agent designed to evaluate the performance of utility agents within the AI Refinery framework. It works by:
- Generating or using predefined test queries tailored to the agent being evaluated
- Collecting responses from the agent for each query
- Evaluating those responses based on configurable metrics
- Providing detailed evaluation reports with scores, insights, and recommendations
This automated evaluation system helps identify strengths and weaknesses in your agent implementations, allowing for continuous improvement of your AI solutions. It supports both text-based and image-based agents, enabling multimodal evaluation workflows out of the box.
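Conceptually, the loop the Evaluation Super Agent runs is simple: for each test query, collect a response, then score it against each metric. The sketch below illustrates that flow in plain Python; `query_agent`, `score_response`, and the report layout are hypothetical stand-ins, not SDK APIs:

```python
# Hypothetical sketch of the evaluation loop; `query_agent` and
# `score_response` stand in for the real agent call and the LLM judge.
def evaluate_agent(queries, metrics, query_agent, score_response):
    report = []
    for q in queries:
        response = query_agent(q["sample"])  # collect the agent's answer
        scores = {
            m["metric_name"]: score_response(response, m["rubric"], m["scale"])
            for m in metrics
        }
        report.append({"query": q["sample"], "response": response, "scores": scores})
    return report

# Toy usage with stub callables standing in for a real agent and judge
queries = [{"sample": "What is the capital of France?"}]
metrics = [{"metric_name": "Relevance", "rubric": "...", "scale": "1-5"}]
report = evaluate_agent(
    queries,
    metrics,
    query_agent=lambda q: "Paris",
    score_response=lambda resp, rubric, scale: 5,
)
print(report[0]["scores"])  # → {'Relevance': 5}
```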
Steps¶
1. Creating the Configuration File¶
The first step is to create a YAML configuration file that defines:
- The orchestration setup
- The Evaluation Super Agent configuration
- The agents to be evaluated
- The evaluation metrics and sample queries
Here's a sample configuration file:
```yaml
orchestrator:
  agent_list:
    - agent_name: "Evaluation Super Agent"

super_agents:
  - agent_class: EvaluationSuperAgent
    agent_name: "Evaluation Super Agent"
    agent_description: "Evaluates the response quality of target utility agents based on predefined metrics, rubrics, and scales."
    config:
      agent_list:
        - agent_name: "Search Agent"
          evaluation_config:
            metrics:
              - metric_name: "Relevance"
                rubric: "Assess whether the response directly answers the query."
                scale: "1-5"
              - metric_name: "Coherence"
                rubric: "Check if the response is logically structured and understandable."
                scale: "1-5"
              - metric_name: "Accuracy"
                rubric: "Evaluate if the response provides factually correct information."
                scale: "1-5"
              - metric_name: "Conciseness"
                rubric: "Determine if the response is clear and to the point without unnecessary details."
                scale: "1-5"
              - metric_name: "Source Quality"
                rubric: "Evaluate the credibility and reliability of the sources cited in the response."
                scale: "1-5"
            sample_queries:
              - sample: "What is the capital of France?"
                ground_truth_answer: "Paris"
              - sample: "Who is the third president of the United States?"
                ground_truth_answer: "Thomas Jefferson"
        - agent_name: "Image Generation Agent"
          evaluation_config:
            metrics:
              - metric_name: "Visual Quality"
                rubric: "Assess the overall visual quality, clarity, and resolution of the generated image."
                scale: "1-5"
              - metric_name: "Prompt Adherence"
                rubric: "Evaluate how well the generated image matches the textual description provided."
                scale: "1-5"
              - metric_name: "Creativity"
                rubric: "Assess the artistic creativity and originality of the generated image."
                scale: "1-5"
              - metric_name: "Composition"
                rubric: "Evaluate the balance, framing, and overall composition of the image."
                scale: "1-5"
            sample_queries:
              - sample: "Generate a photorealistic sunset over mountains with orange and purple sky"
                ground_truth_answer: null
                expected_output_type: "image"
              - sample: "Create an image of a futuristic city with flying cars and tall skyscrapers"
                ground_truth_answer: null
                expected_output_type: "image"
              - sample: "Generate an abstract representation of joy using bright colors"
                ground_truth_answer: null
                expected_output_type: "image"
              - sample: "Transform this image into a watercolor painting style"
                ground_truth_answer: null
                input_image: "https://picsum.photos/id/237/640/480"
                expected_output_type: "image"
        - agent_name: "Image Understanding Agent"
          evaluation_config:
            metrics:
              - metric_name: "Accuracy"
                rubric: "Assess whether the description accurately identifies objects, scenes, and context in the image."
                scale: "1-5"
              - metric_name: "Completeness"
                rubric: "Evaluate whether all significant elements in the image are described."
                scale: "1-5"
              - metric_name: "Detail Level"
                rubric: "Assess the level of detail provided in the image description."
                scale: "1-5"
              - metric_name: "Relevance"
                rubric: "Evaluate whether the response focuses on relevant aspects based on the query."
                scale: "1-5"
            sample_queries:
              - sample: "Describe what you see in this image in detail"
                ground_truth_answer: null
                input_image: "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/640px-PNG_transparency_demonstration_1.png"
                expected_output_type: "text"
              - sample: "What objects and colors are present in this image?"
                ground_truth_answer: null
                input_image: "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/640px-Cat03.jpg"
                expected_output_type: "text"
              - sample: "Analyze the composition and mood of this scene"
                ground_truth_answer: null
                input_image: "https://upload.wikimedia.org/wikipedia/commons/thumb/0/0a/The_Great_Wave_off_Kanagawa.jpg/640px-The_Great_Wave_off_Kanagawa.jpg"
                expected_output_type: "text"

utility_agents:
  - agent_class: SearchAgent
    agent_name: "Search Agent"
    agent_description: "The agent provides answers based on online search results, retrieving information from the internet to respond to user queries."
  - agent_class: ImageGenerationAgent
    agent_name: "Image Generation Agent"
    agent_description: "This agent can help you generate an image from a prompt."
    config:
      rewriter_config: false
      text2image_config:
        model: flux_schnell/text2image
      image2image_config:
        model: flux_schnell/image2image
  - agent_class: ImageUnderstandingAgent
    agent_name: "Image Understanding Agent"
    agent_description: "This agent can help you understand and analyze an image."
    config:
      output_style: "conversational"
      vlm_config:
        model: "Qwen/Qwen3-VL-32B-Instruct"
```
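YAML indentation errors are easy to make, so it can help to sanity-check the parsed configuration before creating a project. Assuming the file has already been loaded into a dict (e.g. with PyYAML's `yaml.safe_load`; loading is omitted here), a minimal structural check might look like:

```python
# Minimal structural check over a parsed evaluation config (a plain dict).
# The dict shape mirrors the YAML `config` section above.
def check_evaluation_config(config: dict) -> list:
    problems = []
    agents = config.get("agent_list", [])
    if not agents:
        problems.append("config.agent_list is empty")
    for agent in agents:
        name = agent.get("agent_name", "<unnamed>")
        eval_cfg = agent.get("evaluation_config", {})
        # Every metric needs a name, a rubric, and a scale
        for metric in eval_cfg.get("metrics", []):
            for key in ("metric_name", "rubric", "scale"):
                if key not in metric:
                    problems.append(f"{name}: metric missing '{key}'")
        # Every sample query needs at least the query text itself
        for query in eval_cfg.get("sample_queries", []):
            if "sample" not in query:
                problems.append(f"{name}: sample query missing 'sample'")
    return problems

config = {
    "agent_list": [
        {
            "agent_name": "Search Agent",
            "evaluation_config": {
                "metrics": [
                    {"metric_name": "Relevance", "rubric": "...", "scale": "1-5"}
                ],
                "sample_queries": [{"sample": "What is the capital of France?"}],
            },
        }
    ]
}
print(check_evaluation_config(config))  # → []
```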
Configuration Key Components¶
- Orchestrator Section: Lists the agents available in your project, including the Evaluation Super Agent.
- Super Agents Section: Defines the Evaluation Super Agent and its configuration:
  - `agent_class`: Specifies the class name as "EvaluationSuperAgent"
  - `agent_name`: Custom name for the agent
  - `agent_description`: Description of the agent's function
  - `config`: The evaluation configuration, including:
    - `agent_list`: List of agents to evaluate
- Evaluation Configuration:
  - `metrics`: List of evaluation criteria with:
    - `metric_name`: Name of the metric
    - `rubric`: Description of what the metric measures
    - `scale`: Scale for measurement (e.g., "1-5")
  - `sample_queries`: List of test queries with:
    - `sample`: The query text
    - `ground_truth_answer`: The expected answer (optional)
    - `input_image`: URL of an input image (required for image-based queries)
    - `expected_output_type`: Expected output format, either `"text"` or `"image"`
- Utility Agents Section: Defines the agents to be evaluated, including their model configurations for image-based agents.
2. Creating the Python Script¶
Next, create a Python script to execute the evaluation using the AI Refinery SDK:
```python
import asyncio
import os
import traceback

from air import DistillerClient
from dotenv import load_dotenv

load_dotenv()  # loads your API_KEY from a .env file

# Authentication setup
API_KEY = os.getenv("API_KEY", "")


async def run_evaluation():
    # Create a distiller client
    print("Initializing DistillerClient...")
    distiller_client = DistillerClient(api_key=API_KEY)

    config_file = "evaluation_config.yaml"  # Your configuration file name
    project_name = "agent_evaluation"  # Your project name

    print(f"Creating project with config: {config_file}...")
    try:
        # Upload the evaluation config file to register a new project
        distiller_client.create_project(config_path=config_file, project=project_name)
        print(f"Project {project_name} created successfully.")
    except Exception as e:
        print(f"ERROR creating project: {e}")
        traceback.print_exc()
        return

    # Define any custom agents if needed
    executor_dict = {}

    print("Initializing client session...")
    async with distiller_client(
        project=project_name,
        uuid="evaluation_session",
        executor_dict=executor_dict,
    ) as dc:
        print("Sending query...")
        try:
            # Evaluate all agents: Search, Image Generation, and Image Understanding
            responses = await dc.query(
                query="Please evaluate the Search Agent, Image Generation Agent, and Image Understanding Agent."
            )
            print("Query sent successfully, waiting for responses...")

            # Print each response as it arrives, stripping the raw JSON
            # section appended to the report so it is not printed
            async for response in responses:
                text = response["content"]
                cutoff_index = text.find("## Raw JSON output")
                if cutoff_index == -1:
                    print(text)
                else:
                    print(text[:cutoff_index])
        except Exception as e:
            print(f"ERROR during query execution: {e}")
            traceback.print_exc()


if __name__ == "__main__":
    try:
        asyncio.run(run_evaluation())
    except Exception as e:
        print(f"CRITICAL ERROR: {e}")
        traceback.print_exc()
```
3. Running the Evaluation¶
After setting up your configuration and script:
- Save the YAML configuration as `evaluation_config.yaml`
- Save the Python script as `run_evaluation.py`
- Make sure your environment variables are set:
  - `API_KEY`: Your API key
- Run the script: `python run_evaluation.py`
The script will:
- Authenticate with AI Refinery
- Create a project using your configuration
- Send a request to evaluate the Search Agent, Image Generation Agent, and Image Understanding Agent
- Receive and display the evaluation results
4. Understanding the Evaluation Results¶
The evaluation results include:
- Per-Query Assessments: Each test query is individually evaluated against the metrics.
- Metrics Scoring: Scores for each metric (e.g., Relevance, Coherence, Visual Quality, Prompt Adherence).
- Detailed Feedback: Qualitative feedback explaining the scores.
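If you want a single headline number per metric, the per-query scores can be averaged once they are collected into plain dicts. The report layout below is a hypothetical simplification for illustration, not the SDK's exact output format:

```python
from collections import defaultdict

# Average per-metric scores across queries. The input is a hypothetical
# simplification of a parsed evaluation report: one score dict per query.
def average_scores(per_query_scores):
    totals, counts = defaultdict(float), defaultdict(int)
    for scores in per_query_scores:
        for metric, value in scores.items():
            totals[metric] += value
            counts[metric] += 1
    return {metric: totals[metric] / counts[metric] for metric in totals}

scores = [
    {"Relevance": 5, "Coherence": 4},
    {"Relevance": 4, "Coherence": 4},
]
print(average_scores(scores))  # → {'Relevance': 4.5, 'Coherence': 4.0}
```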
For image-based agents, input images provided via `input_image` URLs are automatically converted to base64 for the `ImageUnderstandingAgent`, and generated image data is extracted from `response.image.image_data` for the `ImageGenerationAgent`.
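The SDK performs that conversion for you; for reference, the underlying step is a standard-library one-liner. The `image_url_to_base64` helper below is an illustrative sketch (the fetch uses `urllib`), and the demo at the bottom shows the encoding itself on in-memory bytes so no network access is needed:

```python
import base64
from urllib.request import urlopen

def image_url_to_base64(url: str) -> str:
    """Download an image and return its contents as a base64 string."""
    with urlopen(url) as resp:  # network call; use cached bytes in tests
        data = resp.read()
    return base64.b64encode(data).decode("ascii")

# The encoding step itself, shown on in-memory bytes (no network needed)
fake_image_bytes = b"\x89PNG\r\n\x1a\n"  # PNG magic header only
encoded = base64.b64encode(fake_image_bytes).decode("ascii")
assert base64.b64decode(encoded) == fake_image_bytes  # round-trips losslessly
```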
Customization Options¶
Custom Metrics¶
You can define your own evaluation metrics by modifying the metrics section in the configuration file. Each metric requires:
- A name (`metric_name`)
- A rubric explaining what to evaluate
- A scale for measurement
Example of adding a custom "User Satisfaction" metric:
```yaml
metrics:
  - metric_name: "User Satisfaction"
    rubric: "Evaluate how likely a user would be satisfied with this response."
    scale: "1-10"
```
Custom Test Queries¶
You can define your own test queries in the `sample_queries` section. Adding ground truth answers helps the evaluation agent better assess accuracy. For image-based agents, provide an `input_image` URL and set `expected_output_type` accordingly.
Example of adding custom queries:
```yaml
sample_queries:
  - sample: "Explain quantum computing in simple terms."
    ground_truth_answer: null  # No specific ground truth
  - sample: "What year was the Declaration of Independence signed?"
    ground_truth_answer: "1776"
  - sample: "Describe the objects in this image."
    ground_truth_answer: null
    input_image: "https://example.com/sample.jpg"
    expected_output_type: "text"
```
Automatic Query Generation¶
If you don't specify `sample_queries`, the Evaluation Super Agent can automatically generate test queries based on the agent's description. This is useful when:
- You're not sure what to test
- You want a diverse set of test cases
- You want to avoid bias in your evaluation
To use automatic query generation, simply omit the `sample_queries` section in your configuration.
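For example, an agent entry like the following (metrics only, no `sample_queries`) leaves query generation to the evaluator; the fragment follows the same schema as the full configuration above:

```yaml
config:
  agent_list:
    - agent_name: "Search Agent"
      evaluation_config:
        metrics:
          - metric_name: "Relevance"
            rubric: "Assess whether the response directly answers the query."
            scale: "1-5"
        # no sample_queries: the Evaluation Super Agent generates its own
```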
Advanced Use Cases¶
Evaluating Multiple Agents¶
To evaluate multiple agents, simply add them to the `agent_list` in your configuration:
```yaml
config:
  agent_list:
    - agent_name: "Search Agent"
      evaluation_config:
        metrics: [...]
    - agent_name: "Image Generation Agent"
      evaluation_config:
        metrics: [...]
    - agent_name: "Image Understanding Agent"
      evaluation_config:
        metrics: [...]
```
Conclusion¶
The Evaluation Super Agent provides a powerful framework for assessing and improving your AI agents. By systematically evaluating performance across various metrics — including support for image generation and image understanding agents — you can identify strengths and weaknesses, make targeted improvements, and track progress over time.
For more detailed information, refer to the Evaluation Super Agent page in the Agent Library/super_agents documentation.