Analytics Agent

The AnalyticsAgent is a built-in utility agent in the AI Refinery SDK for handling queries that require data analysis. For instance, if you have a dataset with information on breast cancer, and your query is "What is the distribution of radius over benign vs malignant tumors?", the AnalyticsAgent will

  • generate a pandas/postgres command (depending on your data source)
  • send the command to the AI Refinery SDK backend to be executed
  • receive the results from the command execution
  • generate either a textual answer or a visualization plot (depending on your config) based on the results received

This documentation outlines the workflow and configurations needed to use the AnalyticsAgent.

Workflow Overview

The workflow of the AnalyticsAgent consists of the following components:

  1. Data hosting: You have the flexibility to store your dataset in CSV files or a PostgreSQL database. There is no need to upload your data to the AI Refinery service backend: all data analytics commands generated by AI Refinery are executed directly on the SDK backend on your end. You can also host your data on remote servers and databases; simply provide the necessary configuration details for access.

  2. Executor configuration: To utilize the AnalyticsAgent, you must configure either a Postgres or Pandas executor. These executors will process commands sent from the AI Refinery service (generated by the AnalyticsAgent) on the SDK backend. For both executor types, it is essential to define the database tables (for Postgres) or dataframes (for Pandas) that the executor can access. Additionally, you have the option to specify column names along with their descriptions. To set up your executor, you have two options:

    • Automatic Instantiation via Project YAML File: Include all required configuration details for your executor in the project's YAML file. This approach allows the system to automatically instantiate the executor based on the specified parameters.

    • Manual Instantiation: Alternatively, you can manually create an instance of the executor. Once instantiated, add it to the executor_dict to integrate it with your project.

  3. Query transformation: The AnalyticsAgent can transform the original user query into one or more refined queries using query_transformation_examples and the database or dataframe schema defined in the executor configuration. This provides more flexibility in customizing how the agent processes different types of user queries.

  4. Command generation: Using each transformed query, the AnalyticsAgent generates a command tailored to the configured executor (Pandas or Postgres) and sends it to the AI Refinery SDK backend for execution (see the illustrative sketch after this list).

  5. Command execution: The executor you configured receives the commands from the AnalyticsAgent and carries out the execution. Once completed, the results are returned to the AnalyticsAgent hosted on the AI Refinery service.

  6. Answer generation: The AnalyticsAgent will then interpret the results and generate a textual response to the original user query.

  7. Plot generation: If you enable visualization in your project config, instead of generating a textual response, the AnalyticsAgent will generate a JSON which can be plotted using Apache ECharts.
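
To make steps 4 and 5 more concrete, below is a minimal sketch of what command generation and execution might look like with a PandasExecutor. The column names (diagnosis, radius_mean) and the generated command are illustrative assumptions about the breast cancer dataset used later on this page, not actual output from the agent.

import pandas as pd

# Load the DataFrame that the executor is configured to expose (illustrative path).
df = pd.read_csv("analytics_data/BreastCancer.csv")

# A command of roughly this shape might be generated for the query
# "What is the distribution of radius over benign vs malignant tumors?".
# The exact command, and the column names used here, are assumptions.
command = "BreastCancer.groupby('diagnosis')['radius_mean'].describe()"

# The executor evaluates the command against the named DataFrame on your end
# and returns the result to the AnalyticsAgent for answer or plot generation.
result = eval(command, {"BreastCancer": df})
print(result)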

Usage

As a built-in utility agent in the AI Refinery SDK, AnalyticsAgent can be easily integrated into your project by adding the necessary configurations to your project YAML file. Specifically, ensure the following configurations are included:

  • Add a utility agent with agent_class: AnalyticsAgent under utility_agents.
  • Ensure the agent_name you chose for your AnalyticsAgent is listed in the agent_list under orchestrator.
  • Add the PandasExecutor or PostgresExecutor configurations under executor_config if you have not manually created the executors and added them to your executor_dict.

Quickstart

This section presents an example project with two different AnalyticsAgents: one utilizing a PandasExecutor for CSV files and another using a PostgresExecutor for PostgreSQL databases. Here is the YAML config for this project:

utility_agents:
  - agent_class: AnalyticsAgent
    agent_name: "Breast Cancer Analytics Agent" # A name that you choose for your AnalyticsAgent. This needs to be listed under orchestrator.
    agent_description: "An analytics agent capable of performing in-depth analyses on breast cancer data and providing insightful answers to related questions." # A description for your AnalyticsAgent.
    config: # Required. Configuration of your AnalyticsAgent
      contexts: # Optional. Recommended. Useful if your AnalyticsAgent needs to refer to previous history.
        - "date"
        - "chat_history"
      executor_config: # Required. Configuration of your executor.
        type: PandasExecutor # Required. Type of the executor.
        tables: # Required. The list of tables, i.e., DataFrames
          - name: "BreastCancer" # Required. The name of your DataFrame that you choose.
            file_path: "analytics_data/BreastCancer.csv" # Required. The file path where the file is hosted. Currently, both CSV and parquet files are supported. The path can be both local or a remote URL.

  - agent_class: AnalyticsAgent
    agent_name: "City Information Analytics Agent"  # A name that you choose for your AnalyticsAgent. This needs to be listed under orchestrator.
    agent_description: "An analytics agent designed to process and analyze city data, providing insightful answers to related questions." # A description for your AnalyticsAgent.
    config: # Required. Configuration of your AnalyticsAgent
      contexts: # Optional. Recommended. Useful if your AnalyticsAgent needs to refer to previous history.
        - "date"
        - "chat_history"
      executor_config: # Required. Configuration of your executor.
        type: PostgresExecutor 
        db_config: # Required if you do not manually create the PostgresExecutor and add it to your executor_dict
          "host": "localhost"
          "port": "5432"
          "user": "postgres"
          "password": "mypassword"
          "database": "city_information"
        tables: # Required. List of database tables with their schema_names
          - name: "world_cities" # Required. Name of your database table
            schema_name: "public" # Required. Schema name of the table
          - name: "city_mayors" # Required. Name of your database table
            schema_name: "public" # Required. Schema name of the table
          - name: "attractions" # Required. Name of your database table
            schema_name: "city_tourism" # Required. Schema name of the table
        query_transformation_examples:  # Optional. Examples of transforming the original user query into refined textual queries.
          - user_query: What is the highest population among cities? 
            query:  
              - Sort the cities by population and select the row with the highest population
              - Find both the name and population of the most populous city

orchestrator:
  agent_list:
    - agent_name: "Breast Cancer Analytics Agent"
    - agent_name: "City Information Analytics Agent"

If you prefer to create the executors yourself and avoid sharing your credentials, you can do so in a Python file and add them to your executor_dict. See an example below:

import asyncio
import os

from air import DistillerClient
from air.api import PandasAPI, PostgresAPI
from dotenv import load_dotenv

load_dotenv()  # loads your API_KEY from a .env file
api_key = str(os.getenv("API_KEY"))


# db_config for creating the PostgresExecutor. No need to add the db_config in your project configuration if you create your executor here.
analytics_db_config: dict = {
    "host": "localhost",
    "port": "5432",
    "user": "postgres",
    "password": "mypassword",
    "database": "city_information",
}
analytics_db_client = PostgresAPI(analytics_db_config)  # creating the database client

# Table information for creating the PandasExecutor
tables = [
    {
        "name": "BreastCancer",
        "file_path": "analytics_data/BreastCancer.csv",
    },
]

analytics_df_client = PandasAPI(tables)  # creating the dataframe client


async def analytics_demo():
    """
    Function for testing analytics agents
    """

    # create a distiller client
    distiller_client = DistillerClient(
        api_key=api_key
    )  # instantiating the DistillerClient

    # creating the project
    distiller_client.create_project(
        config_path="analytics.yaml", project="analytics_example"
    )

    executor_dict = {
        "Breast Cancer Analytics Agent": {
            "PandasExecutor": analytics_df_client.execute_query,  # the execute_query function of the dataframe client needs to be added as the PandasExecutor
        },
        "City Information Analytics Agent": {
            "PostgresExecutor": analytics_db_client.execute_query,  # the execute_query function of the database client needs to be added as the PostgresExecutor
        },
    }

    async with distiller_client(  # Open an async distiller client session
        project="analytics_example",  # Project name for context
        uuid="test_user",  # Unique user identifier
        executor_dict=executor_dict,  # your executor dictionary
    ) as dc:  # Alias the client as "dc"
        queries = [  # Define list of natural language queries
            "How many patients are benign?",  # Query 1
            "What is the distribution of radius over benign vs malignant tumors?",  # Query 2
        ]
        for query in queries:  # Loop through each query
            responses = await dc.query(query=query)  # Send query and await responses
            print(f"----\nQuery: {query}")  # Print separator and query text
            async for response in responses:  # Stream responses asynchronously
                print(f"Response: {response['content']}")  # Print the content of each response


if __name__ == "__main__":
    asyncio.run(analytics_demo())

Template YAML Configuration of AnalyticsAgent

In addition to the configurations used in the example above, the AnalyticsAgent supports several other configurable options. See the template YAML configuration below for all available settings.

agent_class: AnalyticsAgent
agent_name: <name of the agent> # A name that you choose for your analytics agent
agent_description: <description of the agent> #Optional
config:
  visualization: <True or False> # The agent generates a JSON to be plotted with Apache ECharts if True, textual response otherwise.
  plot_templates: <Custom plot templates> # A dictionary with plot types as keys and valid Apache ECharts plot templates as values, if you want to use your own templates.
  executor_config: # Required. Configuration of your executor.
    type: <Type of the executor. PandasExecutor or PostgresExecutor> # Required. 
    db_config: # Required ONLY for PostgresExecutor and ONLY if you do not manually create it and add it to your executor_dict in your python file
      "host": "<database_host>"  # Required. The address of your database server (e.g., "localhost" or IP address)  
      "port": "<database_port>"  # Required. The port number your database server is listening on (typically "5432" for PostgreSQL)  
      "user": "<database_user>"  # Required. The username you use to connect to the database  
      "password": "<database_password>" # Required. The password associated with the username  
      "database": "<database_name>"  # Required. The name of the specific database you want to connect to  
    tables: # Required. The list of tables  
      - name: <The name of your table.> # Required.
        desc: <A description of the table.> # Optional. The description is generated by the agents if you do not provide it.
        file_path: <The file path where the csv or parquet file is hosted.> # Required only for PandasExecutor.
        schema_name: <The name of schema under which the table exists> # Required only for PostgresExecutor.
        columns: # Optional. List of columns if you want to provide a description (for all columns or a subset)
          - name: <name of the column> 
            desc: <description of the column> # a description of the column to help the agents better understand your data. The description is generated by the agents if you do not provide it.
    query_transformation_examples:  # Optional. Examples of transforming the original user query into refined textual queries. Defaults are used if none are provided.
      - user_query: <example of original user query 1>
        query:  
          - <transformed query 1>
          - <transformed query 2>
      - user_query: <example of original user query 2>
        query:  
          - <transformed query 1>
          - <transformed query 2>

  output_style: <"markdown" or "conversational" or "html">  # Optional field
  contexts:  # Optional field
    - "date"
    - "chat_history" # the chat history upto a certain number of rounds
    - "env_variable"
    - "relevant_chat_history" # the chat history that is relevant to the current query
  llm_config:
  # Optional. Customized LLM config (if you want the analytics agent to use a different LLM than the one in your base config)
    model: <model_name>
  self_reflection_config:  # Optional. Configuration for self-reflection.
    self_reflection: <true or false>   # Whether to enable self-reflection for this agent. Default is false.
    max_attempts: <number>  # Maximum number of times the agent can perform self-reflection. Default is 3.
    response_selection_mode: <"best" | "aggregate" | "auto">  # Strategy used to generate the final response after reaching max_attempts. Default is "auto".
    return_internal_reflection_msg: <true or false>   # Whether to return internal reflection messages. Default is false.
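
When visualization is enabled, the agent returns a JSON payload intended for Apache ECharts instead of a textual answer. As a hedged sketch, assuming the response content is an ECharts option serialized as JSON (adapt the parsing to whatever your agent actually returns), you could render it to a standalone HTML page like this:

import json

def save_echarts_html(option_json: str, path: str = "plot.html") -> None:
    """Render an ECharts option (JSON string) into a standalone HTML file.

    Assumes option_json is the visualization JSON returned by the
    AnalyticsAgent; the exact response structure may differ in your setup.
    """
    option = json.loads(option_json)  # parse the JSON produced by the agent
    html = f"""<!DOCTYPE html>
<html>
  <body>
    <div id="chart" style="width: 800px; height: 500px;"></div>
    <script src="https://cdn.jsdelivr.net/npm/echarts@5/dist/echarts.min.js"></script>
    <script>
      var chart = echarts.init(document.getElementById("chart"));
      chart.setOption({json.dumps(option)});
    </script>
  </body>
</html>"""
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)

Open the resulting file in a browser to view the plot.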