Evaluation Super Agent

The EvaluationSuperAgent in the AI Refinery SDK is designed to systematically assess the performance of utility agents based on predefined metrics and sample queries. This agent provides a structured approach to measuring and improving agent performance, enabling continuous enhancement of your AI systems.

Workflow Overview

The EvaluationSuperAgent is invoked by the orchestrator to evaluate the performance of specific utility agents. Upon invocation, the EvaluationSuperAgent workflow is structured around three essential components, each of which corresponds to part of the agent's configuration (see the sketch after this list):

  1. Evaluation Configuration: Defines metrics, rubrics, and scales used to evaluate agent responses.

  2. Query Generation: Either uses predefined sample queries or generates contextually relevant test queries based on the agent's description.

  3. Response Evaluation: Collects responses from utility agents for each query and evaluates them according to the defined metrics.
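The abbreviated sketch below shows where each component lives in the configuration; it reuses the Search Agent example from the Quickstart further down and is an illustration rather than a complete project file:

config:
  agent_list:
    - agent_name: "Search Agent"   # the utility agent under evaluation
      evaluation_config:
        metrics:                   # 1. Evaluation Configuration: metrics, rubrics, and scales
          - metric_name: "Relevance"
            rubric: "Assess whether the response directly answers the query."
            scale: "1-5"
        sample_queries:            # 2. Query Generation: predefined here; omit to auto-generate
          - sample: "What is the capital of France?"
            ground_truth_answer: "Paris"
        # 3. Response Evaluation has no dedicated key: at runtime the super agent
        #    queries each listed utility agent and scores the responses against the metrics above.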

Usage

Evaluation Super Agents can be easily integrated into your project by adding the necessary configurations to your project YAML file. Specifically, you need to:

  • List your super agents under the super_agents attribute in your project's YAML configuration.
  • Ensure the agent_name you choose for each super agent is listed in the agent_list under orchestrator.
  • Define the utility agents that will be evaluated in the utility_agents list.
  • Configure evaluation metrics and optional sample queries for each agent to be evaluated.

Quickstart

To quickly set up a project with an EvaluationSuperAgent, use the following YAML configuration. This quickstart example uses predefined sample queries for evaluation; you can also configure the EvaluationSuperAgent to generate sample queries automatically (see the Advanced Features section for more details). The configuration sets up a single evaluation super agent that assesses the performance of a Search Agent across five key metrics.

utility_agents:
  - agent_class: SearchAgent  # Must be "SearchAgent" for web or data search functionality
    agent_name: "Search Agent"  # A name you choose for your utility agent
    agent_description: "The agent provides answers based on online search results, retrieving information from the internet to respond to user queries."  # Optional description of the utility agent

super_agents:
  - agent_class: EvaluationSuperAgent  # Must be "EvaluationSuperAgent" for evaluation functionality
    agent_name: "Evaluation Super Agent"  # A name you choose for your evaluation super agent
    agent_description: "Evaluates the response quality of target utility agents based on predefined metrics, rubrics and scales."  # Optional description
    config:
      agent_list:  # Required. The list of utility agents to evaluate
        - agent_name: "Search Agent"  # Must match the name of a utility agent in your project
          evaluation_config:  # Configuration for evaluating this agent
            metrics:  # Define metrics for evaluation
              - metric_name: "Relevance"  # Required. Name of this metric
                rubric: "Assess whether the response directly answers the query."  # What this metric measures
                scale: "1-5"  # Defines the scale for measurement
              - metric_name: "Coherence"
                rubric: "Check if the response is logically structured and understandable."
                scale: "1-5"
              - metric_name: "Accuracy"
                rubric: "Evaluate if the response provides factually correct information."
                scale: "1-5"
              - metric_name: "Conciseness"
                rubric: "Determine if the response is clear and to the point without unnecessary details."
                scale: "1-5"
              - metric_name: "Source Quality"
                rubric: "Evaluate the credibility and reliability of the sources cited in the response."
                scale: "1-5"
            sample_queries:  # Optional list of queries used to test the utility agent’s response quality
              - sample: "What is the capital of France?"  # The query text
                ground_truth_answer: "Paris"  # Expected or correct answer
              - sample: "Who is the third president of United States?"
                ground_truth_answer: "Thomas Jefferson"  # Expected or correct answer

orchestrator:
  agent_list:
    - agent_name: "Evaluation Super Agent"  # Must match the name of your evaluation super agent above
    - agent_name: "Search Agent"  # Must match the name of the utility agent being evaluated

Template YAML Configuration of EvaluationSuperAgent

The EvaluationSuperAgent supports several configurable options. See the template YAML configuration below for all available settings.

agent_class: EvaluationSuperAgent # The class must be EvaluationSuperAgent
agent_name: <A name that you choose for your super agent.> # Required
agent_description: <Description of your super agent.> # Optional

config: 
  agent_list: # Required. The list of agents to be evaluated.
    - agent_name: <Name of agent 1>  # Required. Must be an agent in your project.
      evaluation_config: # Configuration for this agent's evaluation
        metrics: # Define metrics for evaluation
          - metric_name: <Name of metric>  # Required
            rubric: <Description of what this metric measures> # Required
            scale: <Scale for measurement, e.g., "1-5"> # Required
          - metric_name: <Name of another metric>
            rubric: <Description>
            scale: <Scale>

        sample_queries: # Optional. If not provided, queries will be auto-generated
          - sample: <Query text>
            ground_truth_answer: <Expected answer> # Optional
          - sample: <Another query>
            ground_truth_answer: <Another expected answer>

    - agent_name: <Name of agent 2>
      evaluation_config:
        metrics: [...]
        sample_queries: [...]

  output_format: "summary" # Optional. Format for evaluation results. Options: "summary" or "tabular". Default: "summary"
  truncate_length: 50 # Optional. Maximum length for text in tabular output before truncation. Default: 50

Key Components

Evaluation Configuration

Each agent to be evaluated can have its own evaluation configuration with:

  1. Metrics: Define what aspects of agent responses to evaluate:

    • metric_name: Name of the metric (e.g., "Relevance", "Accuracy")

    • rubric: Description of what the metric measures

    • scale: Scale for measurement (e.g., "1-5", "1-10")

  2. Sample Queries: Test queries used to evaluate the agent:

    • sample: The query text

    • ground_truth_answer: The expected answer (optional)

Output Formats

The EvaluationSuperAgent provides two output formats, selected with the optional output_format setting (see the sketch after this list):

  1. summary: Provides a narrative report with detailed evaluations for each agent.

  2. tabular: Presents results as a JSON-encoded table, suitable for further analysis or visualization.
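For example, to switch to the JSON table output and allow longer text per field, you could set the optional output options as in this sketch (the values shown are illustrative):

config:
  agent_list: [...]          # agents and their evaluation_config, as shown earlier
  output_format: "tabular"   # return results as a JSON table instead of a narrative summary
  truncate_length: 80        # allow up to 80 characters of text per field before truncation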

Advanced Features

Automatic Query Generation

If you don't specify sample_queries in your configuration, the EvaluationSuperAgent will automatically generate test queries based on the agent's description (see the sketch after this list). This is useful when:

  • You want a diverse set of test cases without manual specification
  • You want to avoid bias in your evaluation
  • You're not sure what queries would best test the agent's capabilities
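Enabling automatic generation only requires leaving sample_queries out of the agent's evaluation_config, as in this sketch reusing the Search Agent from the quickstart:

config:
  agent_list:
    - agent_name: "Search Agent"
      evaluation_config:
        metrics:
          - metric_name: "Relevance"
            rubric: "Assess whether the response directly answers the query."
            scale: "1-5"
        # no sample_queries: test queries are generated automatically
        # from the Search Agent's agent_description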

Multi-Agent Evaluation

You can evaluate multiple agents simultaneously by adding them to the agent_list in your configuration. This allows for direct comparison between different agent implementations.
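A sketch of an agent_list with two entries follows; the second agent name is a placeholder for another utility agent defined in your project:

config:
  agent_list:
    - agent_name: "Search Agent"
      evaluation_config:
        metrics:
          - metric_name: "Relevance"
            rubric: "Assess whether the response directly answers the query."
            scale: "1-5"
    - agent_name: <Name of another utility agent in your project>
      evaluation_config:
        metrics:
          - metric_name: "Accuracy"
            rubric: "Evaluate if the response provides factually correct information."
            scale: "1-5"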

Custom Metrics

You can define any number of custom metrics to evaluate aspects of agent performance that are important for your specific use case. Each metric should have a clear rubric explaining what to evaluate and a scale for measurement.
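For instance, a hypothetical "Tone" metric for a customer-facing agent could be defined just like the built-in examples above (the metric name, rubric, and scale here are purely illustrative):

metrics:
  - metric_name: "Tone"
    rubric: "Judge whether the response is polite and matches the expected customer-service tone."
    scale: "1-10"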