Evaluation Super Agent¶
The EvaluationSuperAgent in the AI Refinery SDK is designed to systematically assess the performance of utility agents based on predefined metrics and sample queries. This agent provides a structured approach to measuring and improving agent performance, enabling continuous enhancement of your AI systems.
Workflow Overview¶
The EvaluationSuperAgent is invoked by the orchestrator to evaluate the performance of specific utility agents. Upon invocation, its workflow is structured around three essential components:
- Evaluation Configuration: Defines the metrics, rubrics, and scales used to evaluate agent responses.
- Query Generation: Either uses predefined sample queries or generates contextually relevant test queries based on the agent's description.
- Response Evaluation: Collects responses from the utility agents for each query and evaluates them according to the defined metrics.
Usage¶
Evaluation Super Agents can be easily integrated into your project by adding the necessary configurations to your project YAML file. Specifically, you need to:
- List your super agents under the super_agents attribute in your project's YAML configuration.
- Ensure that the agent_name you choose for each of your super_agents is listed in the agent_list under orchestrator.
- Define the utility agents that will be evaluated in the utility_agents list.
- Configure evaluation metrics and optional sample queries for each agent to be evaluated.
Quickstart¶
To quickly set up a project with an EvaluationSuperAgent, use the following YAML configuration. In this quickstart example, we use predefined sample queries for evaluation. However, you can also configure the EvaluationSuperAgent to automatically generate sample queries; see the Advanced Features section for more details. This configuration sets up a single evaluation super agent that assesses the performance of a Search Agent across five key metrics.
utility_agents:
  - agent_class: SearchAgent # Must be "SearchAgent" for web or data search functionality
    agent_name: "Search Agent" # A name you choose for your utility agent
    agent_description: "The agent provides answers based on online search results, retrieving information from the internet to respond to user queries." # Optional description of the utility agent
super_agents:
  - agent_class: EvaluationSuperAgent # Must be "EvaluationSuperAgent" for evaluation functionality
    agent_name: "Evaluation Super Agent" # A name you choose for your evaluation super agent
    agent_description: "Evaluates the response quality of target utility agents based on predefined metrics, rubrics and scales." # Optional description
    config:
      agent_list: # Required. The list of utility agents to evaluate
        - agent_name: "Search Agent" # Must match the name of a utility agent in your project
          evaluation_config: # Configuration for evaluating this agent
            metrics: # Define metrics for evaluation
              - metric_name: "Relevance" # Required. Name of this metric
                rubric: "Assess whether the response directly answers the query." # What this metric measures
                scale: "1-5" # Defines the scale for measurement
              - metric_name: "Coherence"
                rubric: "Check if the response is logically structured and understandable."
                scale: "1-5"
              - metric_name: "Accuracy"
                rubric: "Evaluate if the response provides factually correct information."
                scale: "1-5"
              - metric_name: "Conciseness"
                rubric: "Determine if the response is clear and to the point without unnecessary details."
                scale: "1-5"
              - metric_name: "Source Quality"
                rubric: "Evaluate the credibility and reliability of the sources cited in the response."
                scale: "1-5"
            sample_queries: # Optional list of queries used to test the utility agent's response quality
              - sample: "What is the capital of France?" # The query text
                ground_truth_answer: "Paris" # Expected or correct answer
              - sample: "Who was the third president of the United States?"
                ground_truth_answer: "Thomas Jefferson" # Expected or correct answer
orchestrator:
  agent_list:
    - agent_name: "Evaluation Super Agent" # Must match the name of your evaluation super agent above
    - agent_name: "Search Agent" # Must match the name of the utility agent being evaluated
Template YAML Configuration of EvaluationSuperAgent¶
The EvaluationSuperAgent supports several configurable options. See the template YAML configuration below for all available settings.
agent_class: EvaluationSuperAgent # The class must be EvaluationSuperAgent
agent_name: <A name that you choose for your super agent.> # Required
agent_description: <Description of your super agent.> # Optional
config:
  agent_list: # Required. The list of agents to be evaluated.
    - agent_name: <Name of agent 1> # Required. Must be an agent in your project.
      evaluation_config: # Configuration for this agent's evaluation
        metrics: # Define metrics for evaluation
          - metric_name: <Name of metric> # Required
            rubric: <Description of what this metric measures> # Required
            scale: <Scale for measurement, e.g., "1-5"> # Required
          - metric_name: <Name of another metric>
            rubric: <Description>
            scale: <Scale>
        sample_queries: # Optional. If not provided, queries will be auto-generated
          - sample: <Query text>
            ground_truth_answer: <Expected answer> # Optional
          - sample: <Another query>
            ground_truth_answer: <Another expected answer>
    - agent_name: <Name of agent 2>
      evaluation_config:
        metrics: [...]
        sample_queries: [...]
  output_format: "summary" # Optional. Format for evaluation results. Options: "summary" or "tabular". Default: "summary"
  truncate_length: 50 # Optional. Maximum length for text in tabular output before truncation. Default: 50
Key Components¶
Evaluation Configuration¶
Each agent to be evaluated can have its own evaluation configuration with the following components (see the example after this list):
- Metrics: Define what aspects of agent responses to evaluate:
  - metric_name: Name of the metric (e.g., "Relevance", "Accuracy")
  - rubric: Description of what the metric measures
  - scale: Scale for measurement (e.g., "1-5", "1-10")
- Sample Queries: Test queries used to evaluate the agent:
  - sample: The query text
  - ground_truth_answer: The expected answer (optional)
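For reference, a compact evaluation_config combining both components might look like the following sketch; the metric name and queries here are illustrative placeholders, not values required by the SDK:

evaluation_config:
  metrics:
    - metric_name: "Helpfulness"            # Illustrative metric name
      rubric: "Assess whether the response helps the user accomplish their task."
      scale: "1-10"                         # "1-5" and "1-10" are typical scales
  sample_queries:
    - sample: "Summarize the key benefits of renewable energy." # Query without a ground truth answer
    - sample: "In what year did Apollo 11 land on the moon?"
      ground_truth_answer: "1969"           # ground_truth_answer is optional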
Output Formats¶
The EvaluationSuperAgent provides two output formats, selected via the optional output_format setting (see the example after this list):
- summary: Provides a narrative report with detailed evaluations for each agent.
- tabular: Presents results as JSON-formatted tables, suitable for further analysis or visualization.
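For example, a configuration that switches the results to JSON tables and raises the truncation limit might look like the sketch below; the truncate_length value of 80 is illustrative, and the metrics and sample queries are elided:

config:
  agent_list:
    - agent_name: "Search Agent"
      evaluation_config:
        metrics: [...]          # Metrics defined as in the quickstart
        sample_queries: [...]   # Optional sample queries
  output_format: "tabular"      # Emit results as JSON tables instead of a narrative summary
  truncate_length: 80           # Truncate long text fields at 80 characters (default: 50)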
Advanced Features¶
Automatic Query Generation¶
If you don't specify sample_queries in your configuration, the EvaluationSuperAgent will automatically generate test queries based on the agent's description (see the sketch after this list). This is useful when:
- You want a diverse set of test cases without manual specification
- You want to avoid bias in your evaluation
- You're not sure what queries would best test the agent's capabilities
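To enable automatic query generation, simply omit the sample_queries block for the agent and keep only its metrics, as in this minimal sketch:

config:
  agent_list:
    - agent_name: "Search Agent"
      evaluation_config:
        metrics:
          - metric_name: "Relevance"
            rubric: "Assess whether the response directly answers the query."
            scale: "1-5"
        # No sample_queries here: test queries are auto-generated from the agent's description.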
Multi-Agent Evaluation¶
You can evaluate multiple agents simultaneously by adding them to the agent_list in your configuration. This allows for direct comparison between different agent implementations.
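For example, to evaluate a second utility agent alongside the Search Agent, add another entry to agent_list. In this sketch, the "Research Agent" name and its metric are illustrative; any agent you list here must also be defined in your project's utility_agents list:

config:
  agent_list:
    - agent_name: "Search Agent"        # First agent to evaluate
      evaluation_config:
        metrics: [...]
        sample_queries: [...]
    - agent_name: "Research Agent"      # Second agent to evaluate (illustrative name)
      evaluation_config:
        metrics:
          - metric_name: "Depth"        # Illustrative metric
            rubric: "Assess how thoroughly the response covers the topic."
            scale: "1-5"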
Custom Metrics¶
You can define any number of custom metrics to evaluate aspects of agent performance that are important for your specific use case. Each metric should have a clear rubric explaining what to evaluate and a scale for measurement.
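For instance, a pair of custom metrics for a citation-focused use case could be defined like any other metric; the metric names and rubrics below are purely illustrative:

metrics:
  - metric_name: "Citation Completeness"   # Illustrative custom metric
    rubric: "Check that every factual claim in the response is supported by a cited source."
    scale: "1-5"
  - metric_name: "Tone Appropriateness"    # Another illustrative metric
    rubric: "Evaluate whether the response maintains a professional, neutral tone."
    scale: "1-10"                          # Scales such as "1-5" and "1-10" are both documented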