Analyzing City Information Data with Analytics Agent¶
Overview¶
The Analytics Agent is a built-in utility agent within the AI Refinery platform, designed to perform data analysis based on user queries. The agent offers seamless analysis and interaction with data from CSV files or PostgreSQL databases, utilizing natural language queries to generate and refine commands. It also enhances metadata understanding and supports remote data sources, allowing analytics and execution without needing data uploads to the AI Refinery service.
Goals¶
In this tutorial, you'll learn how to leverage the Analytics Agent to streamline the analysis and visualization of data about cities around the world. We'll demonstrate:
- Configuring the Analytics Agent to interact with a PostgreSQL database.
- Configuring the agent to handle spreadsheets as Pandas DataFrames.
- Customizing query transformation using `query_transformation_examples` to refine how user queries are interpreted.
- Enabling the agent to generate plots using built-in Apache ECharts templates.
- Overriding the built-in visualization templates with valid Apache ECharts templates of your choice.
Data Description¶
The data that the Analytics Agent will leverage in this tutorial comprises three tables that provide comprehensive information about global cities, their leaders, and notable tourist attractions:
Table: world_cities
| Column | Type | Description |
| --- | --- | --- |
| id | INTEGER | A unique identifier for each city. |
| city_name | VARCHAR | The name of the city. |
| country | VARCHAR | The country in which the city is located. |
| population | INTEGER | The population count of the city. |
| area_km2 | FLOAT | The total area of the city in square kilometers. |
| gdp_billion_usd | FLOAT | The GDP of the city in billion U.S. dollars. |
Table: city_mayors
| Column | Type | Description |
| --- | --- | --- |
| mayor_id | INTEGER | A unique identifier for each mayor. |
| city_id | INTEGER | A reference to the city the mayor governs. |
| mayor_name | VARCHAR | The full name of the mayor. |
| term_start | DATE | The start date of the mayor's current term. |
| term_end | DATE | The end date of the mayor's current term. |
Table: attractions
| Column | Type | Description |
| --- | --- | --- |
| id | INTEGER | A unique identifier for each attraction. |
| name | VARCHAR | The name of the tourist attraction. |
| city_id | INTEGER | A reference to the city where the attraction is located. |
| description | TEXT | A detailed description of the attraction. |
| opening_hours | VARCHAR | The hours during which the attraction is open to visitors. |
| ticket_price | FLOAT | The cost of admission to the attraction. |
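If you would like to follow along end to end, you need these tables as actual data sources. Below is a minimal sketch that writes a few hypothetical sample rows to CSV files with pandas; the file paths match the ones used in the pandas examples later in this tutorial, while the rows themselves are placeholders to swap for your real data.

import os
import pandas as pd

os.makedirs("data", exist_ok=True) # Directory used by the pandas examples below

# Hypothetical sample rows -- replace with your real data.
pd.DataFrame([
    {"id": 1, "city_name": "Tokyo", "country": "Japan",
     "population": 13960000, "area_km2": 2194.1, "gdp_billion_usd": 1600.0},
    {"id": 2, "city_name": "Paris", "country": "France",
     "population": 2161000, "area_km2": 105.4, "gdp_billion_usd": 850.0},
]).to_csv("data/world_cities.csv", index=False)

pd.DataFrame([
    {"mayor_id": 1, "city_id": 1, "mayor_name": "Jane Doe",
     "term_start": "2022-04-01", "term_end": "2026-03-31"},
]).to_csv("data/city_mayors.csv", index=False)

pd.DataFrame([
    {"id": 1, "name": "Louvre Museum", "city_id": 2,
     "description": "World-famous art museum.",
     "opening_hours": "09:00-18:00", "ticket_price": 22.0},
]).to_csv("data/attractions.csv", index=False)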
Common Steps¶
To effectively use the Analytics Agent with both PostgreSQL and pandas, follow these common steps:

- **Configure the Agent**: Provide the necessary configuration in a YAML file. This includes details such as database tables, CSV file names, column descriptions, etc.
- **Create an AI Refinery Project**: Use a Python script to create an AI Refinery project that will utilize the configured agent.
- **Configure the Executor**: As outlined in the workflow below, all SQL/pandas commands generated by the agent are sent back to the SDK backend to be executed. An executor with access to the data sources must be configured and instantiated on the SDK backend. This can be done in two ways:
    - **Automatic Instantiation**: Provide the necessary details (e.g., database credentials) in the YAML config.
    - **Manual Instantiation**: Manually instantiate the executor in your Python script and add it to the `executor_dict`.

*Figure: Workflow of Analytics Agent*

Examples of both executor configuration approaches are shown in the following sections.
Analytics Agent with PostgreSQL¶
In this section, we configure and utilize an Analytics Agent to interact with a PostgreSQL database that contains the data above in different tables.
Automatic Executor Instantiation¶
To instantiate the executor automatically, we have to provide the database credentials in the YAML file. The Python script then does not need to include any executor details.
YAML Configuration Including DB Credentials¶
orchestrator:
agent_list:
- agent_name: "Analytics Agent"
utility_agents:
- agent_class: AnalyticsAgent
agent_name: "Analytics Agent"
agent_description: "An agent that performs data analytics"
config:
contexts: # Optional field
- "date"
- "chat_history"
executor_config:
type: PostgresExecutor # Must be PostgresExecutor if you want to interact with PostgreSQL databases.
db_config: # Required if you do not manually create the PostgresExecutor and add it to your executor_dict
"host": "localhost"
"port": "5432"
"user": "myuser"
"password": "mypassword"
"database": "city_information"
tables: # required
- name: "world_cities" # table name is required.
desc: "Global city data with demographics and economic statistics" # Optional table description. The description is generated by the agents if you do not provide it.
schema_name: "public" # schema name for this table
columns: #optional if the user wants to specify column details (for all columns or a subset). This is not an exhaustive list
- name: "city_name"
desc: "Name of the city"
- name: "area_km2"
- name: "city_mayors" # a table containing information about city mayors
schema_name: "public" # schema name for this table
- name: "attractions" # a table containing information about city attraction
schema_name: "city_tourism" # schema name for this table
Python Code Excluding any Executor Details¶
Since we provide the necessary details in the YAML config, we do not need to create an executor in our Python code. Based on the config, a `PostgresExecutor` will be created automatically in the background.
import asyncio
import os
from air import DistillerClient
from dotenv import load_dotenv
load_dotenv() # loads your API_KEY from your local '.env' file
api_key=str(os.getenv("API_KEY"))
async def analytics_test():
"""
Function for testing analytics agent
"""
# create a distiller client
distiller_client = DistillerClient(api_key=api_key)
# upload your config file to register a new distiller project
distiller_client.create_project(
config_path="analytics.yaml",
project="analytics_example", # assuming the config above is in a file named "analytics.yaml",
)
async with distiller_client( # Open an async distiller client session
project="analytics_example", # Project name that was created above
uuid="test_user", # Unique user identifier
) as dc: # Alias the client as "dc"
queries = [ # Define list of natural language queries
"Who is the mayor of the largest city in terms of area?", # Query 1
]
for query in queries: # Loop through each query
responses = await dc.query(query=query) # Send query and await responses
print(f"----\nQuery: {query}")
async for response in responses:
print(f"Response: {response['content']}") # Print the content of each response
if __name__ == "__main__":
asyncio.run(analytics_test())
Manual Executor Instantiation¶
If you prefer not to share the database credentials, you can omit the `db_config` from the YAML file and create an executor in your Python file instead.
YAML Configuration Excluding DB Credentials¶
orchestrator:
agent_list:
- agent_name: "Analytics Agent"
utility_agents:
- agent_class: AnalyticsAgent
agent_name: "Analytics Agent"
agent_description: "An agent that performs data analytics"
config:
contexts: # Optional field
- "date"
- "chat_history"
executor_config:
type: PostgresExecutor # Must be PostgresExecutor if you want to interact with PostgreSQL databases.
tables: # required
- name: "world_cities" # table name is required.
desc: "Global city data with demographics and economic statistics" # Optional table description. The description is generated by the agents if you do not provide it.
schema_name: "public"
columns: #optional if the user wants to specify column details (for all columns or a subset). This is not an exhaustive list
- name: "city_name"
desc: "Name of the city"
- name: "area_km2"
- name: "city_mayors" # a table containing information about city mayors
schema_name: "public"
- name: "attractions" # a table containing information about city attraction
schema_name: "city_tourism"
Python Code Including Executor Instantiation¶
Since we do not provide the `db_config` in the YAML configuration, we must instantiate an executor in our Python code and add it to the `executor_dict`.
import asyncio
import os
from air import DistillerClient
from air.api import PostgresAPI
from dotenv import load_dotenv
load_dotenv() # loads your API_KEY from your local '.env' file
api_key=str(os.getenv("API_KEY"))
# setting up database client
analytics_db_config: dict = {
"host": "localhost",
"port": "5432",
"user": "myuser",
"password": "mypassword",
"database": "city_information",
}
analytics_db_client = PostgresAPI(analytics_db_config)
async def analytics_test():
"""
Function for testing analytics agent
"""
# create a distiller client
distiller_client = DistillerClient(api_key=api_key)
# upload your config file to register a new distiller project
distiller_client.create_project(
config_path="analytics.yaml", project="analytics_example"
)
executor_dict = {
"Analytics Agent": { # This key must match the name of the agent in the YAML config.
"PostgresExecutor": analytics_db_client.execute_query, # The key must be `PostgresExecutor` for PostgreSQL command executor. The value must be a callable that executes the command and returns the results.
}
}
async with distiller_client( # Open an async distiller client session
project="analytics_example", # Project name that was created above
uuid="test_user", # Unique user identifier
executor_dict=executor_dict, # Pass the executor_dict to the call
) as dc: # Alias the client as "dc"
queries = [ # Define list of natural language queries
"Who is the mayor of the largest city in terms of area?", # Query 1
]
for query in queries: # Loop through each query
responses = await dc.query(query=query) # Send query and await responses
print(f"----\nQuery: {query}") # Print separator and query text
async for response in responses: # Stream responses asynchronously
print(f"Response: {response['content']}") # Print the content of each response
if __name__ == "__main__":
asyncio.run(analytics_test())
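Note that the value registered under `PostgresExecutor` just needs to be a callable that takes a generated SQL command and returns the results; `PostgresAPI` is the convenient built-in option, but not the only one. As a hypothetical sketch, a hand-rolled alternative using `psycopg2` might look like the following (the exact return format the agent expects is what `PostgresAPI.execute_query` produces, so prefer that client in practice):

import psycopg2

def execute_sql(sql: str):
    """Run a generated SQL command against the city_information database."""
    # A short-lived connection per query keeps the sketch simple;
    # use a connection pool in production.
    conn = psycopg2.connect(
        host="localhost",
        port=5432,
        user="myuser",
        password="mypassword",
        dbname="city_information",
    )
    try:
        with conn.cursor() as cur:
            cur.execute(sql) # Execute the agent-generated SQL
            return cur.fetchall() # Return all result rows
    finally:
        conn.close()

# Registered exactly like the PostgresAPI client above:
# executor_dict = {"Analytics Agent": {"PostgresExecutor": execute_sql}}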
Analytics Agent with Pandas¶
In this section, we configure and utilize an Analytics Agent to interact with our data (described above), stored in different CSV files.
Automatic Executor Instantiation¶
To automatically instantiate the executor, specify the data source details, like CSV file paths, in the YAML file. The Python script does not need to include any executor details.
YAML Configuration Including Data Source Details¶
orchestrator:
agent_list:
- agent_name: "Analytics Agent"
utility_agents:
- agent_class: AnalyticsAgent
agent_name: "Analytics Agent" # A name that you choose for your analytics agent
agent_description: "An agent that performs data analytics on city information" # Optional
config:
visualization: False # The agent generates a JSON to be plotted with Apache ECharts if True, textual response otherwise.
executor_config: # Required. Configuration of your executor.
type: PandasExecutor # Must be PandasExecutor if you want to interact with Pandas DataFrames.
tables: # Required. The list of tables
- name: "world_cities" # The name of your table.
desc: "Global city data with demographics and economic statistics" # Optional table description. The description is generated by the agents if you do not provide it.
file_path: "data/world_cities.csv" # The file path where the CSV file is hosted.
columns: # Optional. List of columns if you want to provide a description (for all columns or a subset)
- name: "city_name"
desc: "Name of the city"
- name: "area_km2"
- name: "city_mayors"
file_path: "data/city_mayors.csv"
- name: "attractions"
file_path: "data/attractions.csv"
output_style: "markdown" # Optional field
contexts: # Optional field
- "date"
- "chat_history" # The chat history up to a certain number of rounds
Python Code Excluding any Executor Details¶
Since we provide the necessary details in the YAML config, we do not need to create an executor in our Python code. Based on the config, a `PandasExecutor` will be created automatically in the background.
import asyncio
import os
from air import DistillerClient
from dotenv import load_dotenv
load_dotenv() # loads your API_KEY from your local '.env' file
api_key=str(os.getenv("API_KEY"))
async def analytics_test():
"""
Function for testing analytics agent
"""
# create a distiller client
distiller_client = DistillerClient(api_key=api_key)
# upload your config file to register a new distiller project
distiller_client.create_project(
config_path="analytics.yaml",
project="analytics_example", # assuming the config above is in a file named "analytics.yaml",
)
async with distiller_client( # Open an async distiller client session
project="analytics_example", # Project name that was created above
uuid="test_user", # Unique user identifier
) as dc: # Alias the client as "dc"
queries = [ # Define list of natural language queries
"Who is the mayor of the largest city in terms of area?", # Query 1
]
for query in queries: # Loop through each query
responses = await dc.query(query=query) # Send query and await responses
print(f"----\nQuery: {query}") # Print separator and query text
async for response in responses: # Stream responses asynchronously
print(f"Response: {response['content']}") # Print the content of each response
if __name__ == "__main__":
asyncio.run(analytics_test())
Manual Executor Instantiation¶
If you prefer not to specify the data source details (i.e., `file_path`) in the YAML file, you can create and configure the executor manually in your Python code.
YAML Configuration Excluding Data Source Details¶
orchestrator:
agent_list:
- agent_name: "Analytics Agent"
utility_agents:
- agent_class: AnalyticsAgent
agent_name: "Analytics Agent"
agent_description: "An agent that performs data analytics"
config:
contexts: # Optional field
- "date"
- "chat_history"
executor_config:
type: PandasExecutor
tables:
- name: "world_cities" # table name is required.
desc: "Global city data with demographics and economic statistics" # Optional table description. The description is generated by the agents if you do not provide it.
columns: #optional if the user wants to specify column details (for all columns or a subset)
- name: "city_name" # this is not an exhaustive list
desc: "Name of the city"
- name: "area_km2"
- name: "city_mayors"
- name: "attractions"
Python Code Including Executor Instantiation¶
import asyncio
import os
from air import DistillerClient
from air.api import PandasAPI
from dotenv import load_dotenv
load_dotenv() # loads your API_KEY from your local '.env' file
api_key=str(os.getenv("API_KEY"))
# Table information for creating the PandasExecutor
tables = [
{
"name": "world_cities",
"file_path": "data/world_cities.csv",
},
{
"name": "city_mayors",
"file_path": "data/city_mayors.csv",
},
{
"name": "attractions",
"file_path": "data/attractions.csv",
},
]
analytics_df_client = PandasAPI(tables) # creating the dataframe client
async def analytics_demo():
"""Function for testing analytics agents"""
# create a distiller client
distiller_client = DistillerClient(api_key=api_key)
# creating the project
distiller_client.create_project(
config_path="analytics.yaml", project="analytics_example"
)
executor_dict = {
"Analytics Agent": {
"PandasExecutor": analytics_df_client.execute_query, # the execute_query function of the dataframe client needs to be added as the PandasExecutor
},
}
async with distiller_client( # Open an async distiller client session
project="analytics_example", # Project name that was created above
uuid="test_user", # Unique user identifier
executor_dict=executor_dict, # Pass the executor_dict to the call
) as dc: # Alias the client as "dc"
queries = [ # Define list of natural language queries
"Who is the mayor of the largest city in terms of area?", # Query 1
]
for query in queries: # Loop through each query
responses = await dc.query(query=query) # Send query and await responses
print(f"----\nQuery: {query}") # Print separator and query text
async for response in responses: # Stream responses asynchronously
print(f"Response: {response['content']}") # Print the content of each response
if __name__ == "__main__":
asyncio.run(analytics_demo())
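As with PostgreSQL, the value registered under `PandasExecutor` only needs to be a callable that executes the generated command and returns the results. If you wanted to roll your own instead of using `PandasAPI`, a rough sketch might look like the following. This assumes the agent emits a single pandas expression that references the tables by name; the actual contract is whatever `PandasAPI.execute_query` implements, so prefer the built-in client in practice.

import pandas as pd

# Load each table into a DataFrame keyed by its table name.
dataframes = {
    "world_cities": pd.read_csv("data/world_cities.csv"),
    "city_mayors": pd.read_csv("data/city_mayors.csv"),
    "attractions": pd.read_csv("data/attractions.csv"),
}

def execute_pandas(command: str):
    """Evaluate a generated pandas expression with the table DataFrames in scope."""
    # WARNING: eval() runs arbitrary code; only use it with commands you trust.
    return eval(command, {"pd": pd, **dataframes})

# Registered exactly like the PandasAPI client above:
# executor_dict = {"Analytics Agent": {"PandasExecutor": execute_pandas}}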
Customizing Query Transformation¶
As described in the workflow overview, the Analytics Agent supports query transformation. By default, it uses built-in exemplars, but you can customize how the agent rewrites user queries by providing `query_transformation_examples` in the executor configuration.
This allows the agent to generate one or more refined queries based on the user’s input and the database or dataframe schema, giving you flexibility to control how different types of queries are processed.
YAML Configuration Including Query Transformation¶
orchestrator:
agent_list:
- agent_name: "Analytics Agent"
utility_agents:
- agent_class: AnalyticsAgent
agent_name: "Analytics Agent"
agent_description: "An agent that performs data analytics with query transformation"
config:
contexts: # Optional field
- "date"
- "chat_history"
executor_config:
type: PostgresExecutor # Must be PostgresExecutor if you want to interact with PostgreSQL databases.
db_config: # Required if you do not manually create the PostgresExecutor and add it to your executor_dict
host: "localhost"
port: "5432"
user: "myuser"
password: "mypassword"
database: "city_information"
tables: # required
- name: "world_cities" # table name is required.
desc: "Global city data with demographics and economic statistics" # Optional table description. The description is generated by the agents if you do not provide it.
schema_name: "public" # schema name for this table
columns: # optional if the user wants to specify column details (for all columns or a subset). This is not an exhaustive list
- name: "city_name"
desc: "Name of the city"
- name: "population"
desc: "Population of the city"
- name: "area_km2"
- name: "city_mayors" # a table containing information about city mayors
schema_name: "public" # schema name for this table
- name: "attractions" # a table containing information about city attraction
schema_name: "city_tourism" # schema name for this table
query_transformation_examples: # Optional. Exemplars for how to transform the original user query to refined queries, more suitable for generating the pandas/postgres commands. Built-in exemplars are used if you do not provide your own exemplars.
- user_query: "What is the largest population among cities?"
query:
- "Return the city name and its population for the city with the largest population."
- user_query: "Who is the mayor of the largest city?"
query:
- "Find the mayor and the city name for the city with the largest population."
- "Provide the name of the mayor along with the city that has the highest population."
Python Code to Run Queries¶
import asyncio
import os
from air import DistillerClient
from dotenv import load_dotenv
load_dotenv() # loads your API_KEY from '.env'
api_key = str(os.getenv("API_KEY"))
async def analytics_with_transformation():
"""
Example showing the use of query transformation
"""
distiller_client = DistillerClient(api_key=api_key)
# Create or register the project using YAML config
distiller_client.create_project(
config_path="analytics_transformation.yaml",
project="analytics_example_transformed",
)
async with distiller_client( # Open an async distiller client session
project="analytics_example_transformed", # Project name that was created above
uuid="test_user", # Unique user identifier
) as dc: # Alias the client as "dc"
queries = [ # Define list of natural language queries
"What is the largest population among cities?", # Query 1
"Who is the mayor of the largest city?" # Query 2
]
for query in queries: # Loop through each query
print(f"----\nQuery: {query}") # Print separator and query text
responses = await dc.query(query=query) # Send query and await responses
async for response in responses: # Stream responses asynchronously
print(f"Response: {response['content']}") # Print the content of each response
if __name__ == "__main__":
asyncio.run(analytics_with_transformation())
Visualization with Analytics Agent¶
You can enable the Analytics Agent to generate Apache ECharts plot configuration (in JSON format) instead of providing a textual response. The generated JSON can be used to create plots in your HTML file.
The Analytics Agent is pre-configured with the following plot templates to choose from: bar, line, scatter, histogram, pie, and box. However, you may provide your own templates, which will override the built-in ones.
The example below uses the same pandas-based Analytics Agent as before, but with visualization enabled.
Enabling Visualization in YAML Configuration¶
Simply set `visualization: True` in your agent configuration to enable the agent to generate visualization configurations instead of textual responses.
orchestrator:
agent_list:
- agent_name: "Analytics Agent"
utility_agents:
- agent_class: AnalyticsAgent
agent_name: "Analytics Agent" # A name that you choose for your analytics agent
agent_description: "An agent that performs data analytics on city information" # Optional
config:
visualization: True # Set this to True to enable visualization
executor_config: # Required. Configuration of your executor.
type: PandasExecutor # Must be PandasExecutor if you want to interact with Pandas DataFrames.
tables: # Required. The list of tables
- name: "world_cities" # The name of your table.
desc: "Global city data with demographics and economic statistics" # Optional table description. The description is generated by the agents if you do not provide it.
file_path: "data/world_cities.csv" # The file path where the CSV file is hosted.
columns: # Optional. List of columns if you want to provide a description (for all columns or a subset)
- name: "city_name"
desc: "Name of the city"
- name: "area_km2"
- name: "city_mayors"
file_path: "data/city_mayors.csv"
- name: "attractions"
file_path: "data/attractions.csv"
output_style: "markdown" # Optional field
contexts: # Optional field
- "date"
- "chat_history" # The chat history up to a certain number of rounds
Using Custom Plot Templates¶
You can customize the plot templates used by the Analytics Agent by setting the `plot_templates` attribute in the agent's YAML configuration. The `plot_templates` value must be a dictionary with plot types as keys and valid Apache ECharts plot templates as values.
orchestrator:
agent_list:
- agent_name: "Analytics Agent"
utility_agents:
- agent_class: AnalyticsAgent
agent_name: "Analytics Agent" # A name that you choose for your analytics agent
agent_description: "An agent that performs data analytics on city information" # Optional
config:
visualization: True # Set this to True to enable visualization
      # A dictionary with plot type as the key and a valid Apache ECharts plot template as its value
plot_templates: |
{
"line": {
"option": {
"title": {
"text": "Replace with your chart title, e.g., 'Company Information'",
"subtext": "Replace with your chart subtitle.",
"left": "Specify the position of the title, e.g., 'center'"
},
"xAxis": {
"type": "category",
"data": "Replace with your category labels",
"name": "Replace with an appropriate name for the x axis.",
"axisLabel": {
"interval": 0,
"rotate": 30
}
},
"yAxis": {
"type": "value",
"name": "Replace with an appropriate name for the y axis."
},
"series": [
{
"data": "Replace with your data values",
"type": "line",
"markPoint": {
"data": [
{
"name": "Replace with the appropriate name",
"type": "max"
}
]
}
}
]
}
}
}
executor_config: # Required. Configuration of your executor.
type: PandasExecutor # Must be PandasExecutor if you want to interact with Pandas DataFrames.
tables: # Required. The list of tables
- name: "world_cities" # The name of your table.
desc: "Global city data with demographics and economic statistics" # Optional table description. The description is generated by the agents if you do not provide it.
file_path: "data/world_cities.csv" # The file path where the CSV file is hosted.
columns: # Optional. List of columns if you want to provide a description (for all columns or a subset)
- name: "city_name"
desc: "Name of the city"
- name: "area_km2"
- name: "city_mayors"
file_path: "data/city_mayors.csv"
- name: "attractions"
file_path: "data/attractions.csv"
output_style: "markdown" # Optional field
contexts: # Optional field
- "date"
- "chat_history" # The chat history up to a certain number of rounds
Plotting with Generated Visualization Configuration¶
After the agent generates the configuration JSON for visualization, you can integrate it into your HTML file along with the required JavaScript imports to create the visualizations.
The following example demonstrates how to generate HTML files using these JSON configurations and save them to your local file system. You can then open these files in your browser to view the plots.
import asyncio
import json
import os
from ast import literal_eval
from air import DistillerClient
from dotenv import load_dotenv
load_dotenv() # loads your API_KEY from your local '.env' file
api_key=str(os.getenv("API_KEY"))
def generate_html_with_echarts(config, output_file="newchart.html"):
"""
A function that generates HTML files containing Apache ECharts
Visualization.
"""
option = json.loads(config)["option"]
option_str = json.dumps(option)
html_content = f"""
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Apache ECharts Example</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<!-- Load ECharts library from CDN -->
<script type="text/javascript" src="https://cdn.jsdelivr.net/npm/echarts@5.3.3/dist/echarts.min.js"></script>
</head>
<body>
<!-- Container for the chart -->
<div id="main" style="width: 600px; height: 400px;"></div>
<script type="text/javascript">
// Initialize the chart
var myChart = echarts.init(document.getElementById('main'));
// Specify the chart options
var option = {option_str};
// Use the specified options to generate the chart
myChart.setOption(option);
</script>
</body>
</html>
"""
# Write the HTML content to the output file
with open(output_file, "w", encoding="utf-8") as file:
file.write(html_content)
print(f"HTML file '{output_file}' has been generated.")
async def analytics_test():
"""
Function for testing analytics agent
"""
# create a distiller client
distiller_client = DistillerClient(api_key=api_key)
# upload your config file to register a new distiller project
distiller_client.create_project(
config_path="analytics.yaml",
project="analytics_example", # assuming the config above is in a file named "analytics.yaml",
)
    visualization_output_dir = "visualization_output"
    os.makedirs(visualization_output_dir, exist_ok=True) # Create the output directory if it does not already exist
async with distiller_client( # Open an async distiller client session
project="analytics_example", # Project name that was created above
uuid="test_user", # Unique user identifier
) as dc: # Alias the client as "dc"
queries = [ # Define list of natural language queries
"I want to see a line plot of the areas of the different cities.", # Query 1
]
for query_index, query in enumerate(queries): # Loop through each query with index
responses = await dc.query(query=query) # Send query and await responses
print(f"----\nQuery: {query}") # Print separator and query text
async for response in responses: # Stream responses asynchronously
print(f"Response: {response['content']}") # Return final results
if response["role"] == "Analytics Agent": # Check if response is from Analytics Agent
                    visualization_dicts = literal_eval(response["content"]) # Convert string to Python list of configs
                    for visualization_dict in visualization_dicts: # Loop through each visualization config
                        try: # Attempt to generate plots
                            generate_html_with_echarts(
                                visualization_dict, # Visualization configuration
                                output_file=visualization_output_dir # Output directory
                                + "/chart_" # File prefix
                                + str(query_index) # Include query index in filename
                                + ".html", # File extension
                            )
                        except Exception as e: # Catch any errors during plot generation
                            print(e) # Print error message
                            continue # Continue to the next visualization config
if __name__ == "__main__":
asyncio.run(analytics_test())