Analyzing City Information Data with Analytics Agent¶
Overview¶
The Analytics Agent is a built-in utility agent within the AI Refinery platform, designed to perform data analysis based on user queries. The agent offers seamless analysis and interaction with data from CSV files or PostgreSQL databases, utilizing natural language queries to generate and refine commands. It also enhances metadata understanding and supports remote data sources, allowing analytics and execution without needing data uploads to the AI Refinery service.
Goals¶
In this tutorial, you'll learn how to leverage the Analytics Agent to streamline data analysis and visualization of different information about cities around the world. We'll demonstrate:
- Configuring the Analytics Agent to interact with a PostgreSQL database.
- Configuring the agent to handle spreadsheets as Pandas DataFrames.
- Enabling the agent to generate plots using built-in Apache ECharts templates.
- Overriding the built-in visualization templates with valid Apache ECharts templates of your choice.
Data Description¶
The data that the Analytics Agent will leverage in this tutorial comprises three key tables that provide comprehensive data about global cities, their leaders, and notable tourist attractions:
Table: world_cities
Column | Type | Description |
---|---|---|
id | INTEGER | A unique identifier for each city. |
city_name | VARCHAR | The name of the city. |
country | VARCHAR | The country in which the city is located. |
population | INTEGER | The population count of the city. |
area_km2 | FLOAT | The total area of the city in square kilometers. |
gdp_billion_usd | FLOAT | The GDP of the city in billion U.S. dollars. |
Table: city_mayors
Column | Type | Description |
---|---|---|
mayor_id | INTEGER | A unique identifier for each mayor. |
city_id | INTEGER | A reference to the city the mayor governs. |
mayor_name | VARCHAR | The full name of the mayor. |
term_start | DATE | The start date of the mayor's current term. |
term_end | DATE | The end date of the mayor's current term. |
Table: attractions
Column | Type | Description |
---|---|---|
id | INTEGER | A unique identifier for each attraction. |
name | VARCHAR | The name of the tourist attraction. |
city_id | INTEGER | A reference to the city where the attraction is located. |
description | TEXT | A detailed description of the attraction. |
opening_hours | VARCHAR | The hours during which the attraction is open to visitors. |
ticket_price | FLOAT | The cost of admission to the attraction. |
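The relationships above (city_mayors.city_id and attractions.city_id reference world_cities.id) are what drive the joins the agent generates. As a self-contained illustration of the kind of SQL the agent might produce for a question like "Who is the mayor of the largest city in terms of area?", here is a sketch using Python's built-in sqlite3 with made-up sample rows (the city and mayor values are invented for this sketch, not part of the tutorial's dataset):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE world_cities (
    id INTEGER PRIMARY KEY, city_name TEXT, country TEXT,
    population INTEGER, area_km2 REAL, gdp_billion_usd REAL);
CREATE TABLE city_mayors (
    mayor_id INTEGER PRIMARY KEY, city_id INTEGER,
    mayor_name TEXT, term_start TEXT, term_end TEXT);
INSERT INTO world_cities VALUES
    (1, 'Tokyo', 'Japan', 13960000, 2194.0, 1600.0),
    (2, 'Paris', 'France', 2161000, 105.4, 850.0);
INSERT INTO city_mayors VALUES
    (1, 1, 'Example Mayor A', '2022-04-01', '2026-03-31'),
    (2, 2, 'Example Mayor B', '2020-07-01', '2026-06-30');
""")

# The join a query like "Who is the mayor of the largest city in terms of area?"
# would need: pick the city with the largest area_km2, then look up its mayor.
row = cur.execute("""
    SELECT m.mayor_name, c.city_name
    FROM world_cities AS c
    JOIN city_mayors AS m ON m.city_id = c.id
    ORDER BY c.area_km2 DESC
    LIMIT 1
""").fetchone()
print(row)  # ('Example Mayor A', 'Tokyo')
```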
Common Steps¶
To effectively use the Analytics Agent with both PostgreSQL and pandas, follow these common steps:
- Configure the Agent: Provide the necessary configuration in a YAML file. This includes details such as database tables, CSV file names, and column descriptions.
- Create an AI Refinery Project: Use a Python script to create an AI Refinery project that will utilize the configured agent.
- Configure the Executor: All SQL/pandas commands generated by the agent are sent back to the SDK backend to be executed, so an executor with access to the data sources must be configured and instantiated on the SDK backend. This can be done in two ways:
  - Automatic Instantiation: Provide the necessary details (e.g., database credentials) in the YAML config.
  - Manual Instantiation: Manually instantiate the executor in your Python script and add it to the executor_dict.

Examples of both executor configuration approaches are shown in the following sections.
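Before diving into those sections, it may help to see the general shape of a manually supplied executor_dict. The sketch below uses a placeholder callable (dummy_executor is hypothetical; a real executor must actually run the command against your data source):

```python
def dummy_executor(command: str):
    """Placeholder executor: a real one runs `command` against the data
    source and returns the results (e.g., rows from a database)."""
    return []

# Outer key: the agent_name from your YAML config.
# Inner key: "PostgresExecutor" or "PandasExecutor", matching executor_config.type.
executor_dict = {
    "Analytics Agent": {
        "PostgresExecutor": dummy_executor,
    }
}
print(list(executor_dict["Analytics Agent"]))  # ['PostgresExecutor']
```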
Analytics Agent with PostgreSQL¶
In this section, we configure and utilize an Analytics Agent to interact with a PostgreSQL database that contains the data above in different tables.
Automatic Executor Instantiation¶
To automatically instantiate the executor, we have to provide the database credentials in the YAML file. The Python script does not need to include any executor details.
YAML Configuration Including DB Credentials¶
orchestrator:
agent_list:
- agent_name: "Analytics Agent"
utility_agents:
- agent_class: AnalyticsAgent
agent_name: "Analytics Agent"
agent_description: "An agent that performs data analytics"
config:
contexts: # Optional field
- "date"
- "chat_history"
executor_config:
type: PostgresExecutor # Must be PostgresExecutor if you want to interact with PostgreSQL databases.
db_config: # Required if you do not manually create the PostgresExecutor and add it to your executor_dict
"host": "localhost"
"port": "5432"
"user": "myuser"
"password": "mypassword"
"database": "city_information"
tables: # required
- name: "world_cities" # table name is required.
schema_name: "public" # schema name for this table
columns: #optional if the user wants to specify column details (for all columns or a subset). This is not an exhaustive list
- name: "city_name"
desc: "Name of the city"
- name: "area_km2"
- name: "city_mayors" # a table containing information about city mayors
schema_name: "public" # schema name for this table
- name: "attractions" # a table containing information about city attraction
schema_name: "city_tourism" # schema name for this table
Python Code Excluding any Executor Details¶
Since we provide the necessary details in the YAML config, we do not need to create an executor in our Python code. Based on the config, a PostgresExecutor will be created automatically in the background.
import asyncio
import os
from air import DistillerClient, login
from dotenv import load_dotenv
load_dotenv() # loads your ACCOUNT and API_KEY from a .env file
login(
account=str(os.getenv("ACCOUNT")),
api_key=str(os.getenv("API_KEY")),
)
async def analytics_test():
"""
Function for testing analytics agent
"""
# create a distiller client
distiller_client = DistillerClient()
# upload your config file to register a new distiller project
distiller_client.create_project(
config_path="analytics.yaml",
project="analytics_example", # assuming the config above is in a file named "analytics.yaml",
)
async with distiller_client(
project="analytics_example",
uuid="test_user",
) as dc:
queries = [
"Who is the mayor of the largest city in terms of area?",
]
for query in queries:
responses = await dc.query(query=query)
print(f"----\nQuery: {query}")
async for response in responses:
print(f"Response: {response['content']}")
if __name__ == "__main__":
asyncio.run(analytics_test())
Manual Executor Instantiation¶
If you prefer not to share the database credentials, you can omit the db_config from the YAML file and create an executor in your Python file.
YAML Configuration Excluding DB Credentials¶
orchestrator:
agent_list:
- agent_name: "Analytics Agent"
utility_agents:
- agent_class: AnalyticsAgent
agent_name: "Analytics Agent"
agent_description: "An agent that performs data analytics"
config:
contexts: # Optional field
- "date"
- "chat_history"
executor_config:
type: PostgresExecutor # Must be PostgresExecutor if you want to interact with PostgreSQL databases.
tables: # required
- name: "world_cities" # table name is required.
schema_name: "public"
columns: #optional if the user wants to specify column details (for all columns or a subset). This is not an exhaustive list
- name: "city_name"
desc: "Name of the city"
- name: "area_km2"
- name: "city_mayors" # a table containing information about city mayors
schema_name: "public"
- name: "attractions" # a table containing information about city attraction
schema_name: "city_tourism"
Python Code Including Executor Instantiation¶
Since we do not provide the db_config in the YAML configuration, we must instantiate an executor in our Python code and add it to the executor_dict.
import asyncio
import os
from air import DistillerClient, login
from air.api import PostgresAPI
from dotenv import load_dotenv
load_dotenv() # loads your ACCOUNT and API_KEY from a .env file
login(
account=str(os.getenv("ACCOUNT")),
api_key=str(os.getenv("API_KEY")),
)
# setting up database client
analytics_db_config: dict = {
"host": "localhost",
"port": "5432",
"user": "myuser",
"password": "mypassword",
"database": "city_information",
}
analytics_db_client = PostgresAPI(analytics_db_config)
async def analytics_test():
"""
Function for testing analytics agent
"""
# create a distiller client
distiller_client = DistillerClient()
# upload your config file to register a new distiller project
distiller_client.create_project(
config_path="analytics.yaml", project="analytics_example"
)
executor_dict = {
"Analytics Agent": { # This key must match the name of the agent in the YAML config.
"PostgresExecutor": analytics_db_client.execute_query, # The key must be `PostgresExecutor` for PostgreSQL command executor. The value must be a callable that executes the command and returns the results.
}
}
async with distiller_client(
project="analytics_example",
uuid="test_user",
executor_dict=executor_dict, # passing the executor_dict to the call
) as dc:
queries = [
"Who is the mayor of the largest city in terms of area?",
]
for query in queries:
responses = await dc.query(query=query)
print(f"----\nQuery: {query}")
async for response in responses:
print(f"Response: {response['content']}")
if __name__ == "__main__":
asyncio.run(analytics_test())
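The executor value can be any callable that takes a SQL string and returns the results; PostgresAPI.execute_query is one such callable. As a rough sketch of that contract (using Python's built-in sqlite3 as a stand-in so the example runs anywhere; with a real PostgreSQL database you would use a driver such as psycopg2 instead):

```python
import sqlite3

# In-memory database as a stand-in for PostgreSQL, with one sample row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE world_cities (city_name TEXT, area_km2 REAL)")
conn.execute("INSERT INTO world_cities VALUES ('Tokyo', 2194.0)")

def run_query(sql: str):
    """Hypothetical executor: execute one SQL command, return all rows."""
    return conn.execute(sql).fetchall()

rows = run_query("SELECT city_name FROM world_cities")
print(rows)  # [('Tokyo',)]

# Such a callable could then be registered in place of
# analytics_db_client.execute_query:
# executor_dict = {"Analytics Agent": {"PostgresExecutor": run_query}}
```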
Analytics Agent with Pandas¶
In this section, we configure and utilize an Analytics Agent to interact with our data (described above) stored in different CSV files.
Automatic Executor Instantiation¶
To automatically instantiate the executor, specify the data source details, like CSV file paths, in the YAML file. The Python script does not need to include any executor details.
YAML Configuration Including Data Source Details¶
orchestrator:
agent_list:
- agent_name: "Analytics Agent"
utility_agents:
- agent_class: AnalyticsAgent
agent_name: "Analytics Agent" # A name that you choose for your analytics agent
agent_description: "An agent that performs data analytics on city information" # Optional
config:
visualization: False # The agent generates a JSON to be plotted with Apache ECharts if True, textual response otherwise.
executor_config: # Required. Configuration of your executor.
type: PandasExecutor # Must be PandasExecutor if you want to interact with Pandas DataFrames.
tables: # Required. The list of tables
- name: "world_cities" # The name of your table.
file_path: "data/world_cities.csv" # The file path where the CSV file is hosted.
columns: # Optional. List of columns if you want to provide a description (for all columns or a subset)
- name: "city_name"
desc: "Name of the city"
- name: "area_km2"
- name: "city_mayors"
file_path: "data/city_mayors.csv"
- name: "attractions"
file_path: "data/attractions.csv"
output_style: "markdown" # Optional field
contexts: # Optional field
- "date"
- "chat_history" # The chat history up to a certain number of rounds
Python Code Excluding any Executor Details¶
Since we provide the necessary details in the YAML config, we do not need to create an executor in our Python code. Based on the config, a PandasExecutor will be created automatically in the background.
import asyncio
import os
from air import DistillerClient, login
from dotenv import load_dotenv
load_dotenv() # loads your ACCOUNT and API_KEY from a .env file
login(
account=str(os.getenv("ACCOUNT")),
api_key=str(os.getenv("API_KEY")),
)
async def analytics_test():
"""
Function for testing analytics agent
"""
# create a distiller client
distiller_client = DistillerClient()
# upload your config file to register a new distiller project
distiller_client.create_project(
config_path="analytics.yaml",
project="analytics_example", # assuming the config above is in a file named "analytics.yaml",
)
async with distiller_client(
project="analytics_example",
uuid="test_user",
) as dc:
queries = [
"Who is the mayor of the largest city in terms of area?",
]
for query in queries:
responses = await dc.query(query=query)
print(f"----\nQuery: {query}")
async for response in responses:
print(f"Response: {response['content']}")
if __name__ == "__main__":
asyncio.run(analytics_test())
Manual Executor Instantiation¶
If you prefer not to specify the data source details (i.e., file_path) in the YAML file, you can create and configure the executor manually in your Python code.
YAML Configuration Excluding Data Source Details¶
orchestrator:
agent_list:
- agent_name: "Analytics Agent"
utility_agents:
- agent_class: AnalyticsAgent
agent_name: "Analytics Agent"
agent_description: "An agent that performs data analytics"
config:
contexts: # Optional field
- "date"
- "chat_history"
executor_config:
type: PandasExecutor
tables:
- name: "world_cities" # table name is required.
columns: #optional if the user wants to specify column details (for all columns or a subset)
- name: "city_name" # this is not an exhaustive list
desc: "Name of the city"
- name: "area_km2"
- name: "city_mayors"
- name: "attractions"
Python Code Including Executor Instantiation¶
import asyncio
import os
from air import DistillerClient, login
from air.api import PandasAPI
from dotenv import load_dotenv
load_dotenv() # loads your ACCOUNT and API_KEY from a .env file
login(
account=str(os.getenv("ACCOUNT")),
api_key=str(os.getenv("API_KEY")),
)
# Table information for creating the PandasExecutor
tables = [
{
"name": "world_cities",
"file_path": "data/world_cities.csv",
},
{
"name": "city_mayors",
"file_path": "data/city_mayors.csv",
},
{
"name": "attractions",
"file_path": "data/attractions.csv",
},
]
analytics_df_client = PandasAPI(tables) # creating the dataframe client
async def analytics_demo():
"""Function for testing analytics agents"""
# create a distiller client
distiller_client = DistillerClient()
# creating the project
distiller_client.create_project(
config_path="analytics.yaml", project="analytics_example"
)
executor_dict = {
"Analytics Agent": {
"PandasExecutor": analytics_df_client.execute_query, # the execute_query function of the dataframe client needs to be added as the PandasExecutor
},
}
async with distiller_client(
project="analytics_example",
uuid="test_user",
executor_dict=executor_dict,
) as dc:
queries = [
"Who is the mayor of the largest city in terms of area?",
]
for query in queries:
responses = await dc.query(query=query)
print(f"----\nQuery: {query}")
async for response in responses:
print(f"Response: {response['content']}")
if __name__ == "__main__":
asyncio.run(analytics_demo())
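Conceptually, the PandasExecutor callable receives a pandas command as a string and returns the result of evaluating it against the loaded DataFrames. The exact command format PandasAPI expects is internal to the SDK; the following is only a rough, hypothetical sketch of the pattern (note that eval on untrusted strings is unsafe and used here purely for illustration):

```python
import pandas as pd

# Stand-in for the CSV files that PandasAPI would load; values are invented.
world_cities = pd.DataFrame(
    {"city_name": ["Tokyo", "Paris"], "area_km2": [2194.0, 105.4]}
)
tables = {"world_cities": world_cities}

def run_pandas(command: str):
    """Hypothetical executor: evaluate a pandas expression against the tables."""
    return eval(command, {"pd": pd}, dict(tables))  # sketch only; eval is unsafe

result = run_pandas(
    "world_cities.loc[world_cities['area_km2'].idxmax(), 'city_name']"
)
print(result)  # Tokyo
```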
Visualization with Analytics Agent¶
You can enable the Analytics Agent to generate Apache ECharts plot configuration (in JSON format) instead of providing a textual response. The generated JSON can be used to create plots in your HTML file.
The Analytics Agent is pre-configured with the following plot templates to pick from: bar, line, scatter, histogram, pie, and box. However, you may provide your own templates, which will override the built-in ones.
The example below uses the same pandas-based Analytics Agent, but with visualization enabled.
Enabling Visualization in YAML Configuration¶
Simply set visualization: True in your agent configuration to make the agent generate visualization configs instead of textual responses.
orchestrator:
agent_list:
- agent_name: "Analytics Agent"
utility_agents:
- agent_class: AnalyticsAgent
agent_name: "Analytics Agent" # A name that you choose for your analytics agent
agent_description: "An agent that performs data analytics on city information" # Optional
config:
visualization: True # Set this to True to enable visualization
executor_config: # Required. Configuration of your executor.
type: PandasExecutor # Must be PandasExecutor if you want to interact with Pandas DataFrames.
tables: # Required. The list of tables
- name: "world_cities" # The name of your table.
file_path: "data/world_cities.csv" # The file path where the CSV file is hosted.
columns: # Optional. List of columns if you want to provide a description (for all columns or a subset)
- name: "city_name"
desc: "Name of the city"
- name: "area_km2"
- name: "city_mayors"
file_path: "data/city_mayors.csv"
- name: "attractions"
file_path: "data/attractions.csv"
output_style: "markdown" # Optional field
contexts: # Optional field
- "date"
- "chat_history" # The chat history up to a certain number of rounds
Using Custom Plot Templates¶
You can customize the plot templates used by the Analytics Agent by modifying the plot_templates attribute in the agent's configuration. The plot_templates value must be a dictionary with plot types as keys and valid Apache ECharts plot templates as values.
orchestrator:
agent_list:
- agent_name: "Analytics Agent"
utility_agents:
- agent_class: AnalyticsAgent
agent_name: "Analytics Agent" # A name that you choose for your analytics agent
agent_description: "An agent that performs data analytics on city information" # Optional
config:
visualization: True # Set this to True to enable visualization
# A dictionary with plot type as the key and a valid Apache EChart plot template as value
plot_templates: |
{
"line": {
"option": {
"title": {
"text": "Replace with your chart title, e.g., 'Company Information'",
"subtext": "Replace with your chart subtitle.",
"left": "Specify the position of the title, e.g., 'center'"
},
"xAxis": {
"type": "category",
"data": "Replace with your category labels",
"name": "Replace with an appropriate name for the x axis.",
"axisLabel": {
"interval": 0,
"rotate": 30
}
},
"yAxis": {
"type": "value",
"name": "Replace with an appropriate name for the y axis."
},
"series": [
{
"data": "Replace with your data values",
"type": "line",
"markPoint": {
"data": [
{
"name": "Replace with the appropriate name",
"type": "max"
}
]
}
}
]
}
}
}
executor_config: # Required. Configuration of your executor.
type: PandasExecutor # Must be PandasExecutor if you want to interact with Pandas DataFrames.
tables: # Required. The list of tables
- name: "world_cities" # The name of your table.
file_path: "data/world_cities.csv" # The file path where the CSV file is hosted.
columns: # Optional. List of columns if you want to provide a description (for all columns or a subset)
- name: "city_name"
desc: "Name of the city"
- name: "area_km2"
- name: "city_mayors"
file_path: "data/city_mayors.csv"
- name: "attractions"
file_path: "data/attractions.csv"
output_style: "markdown" # Optional field
contexts: # Optional field
- "date"
- "chat_history" # The chat history up to a certain number of rounds
Plotting with Generated Visualization Configuration¶
After the agent generates the configuration JSON for visualization, you can integrate it into your HTML file along with the required JavaScript imports to create the visualizations.
The following example demonstrates how to generate HTML files using these JSON configurations and save them to your local file system. You can then open these files in your browser to view the plots.
import asyncio
import json
import os
from ast import literal_eval
from air import DistillerClient, login
from dotenv import load_dotenv
load_dotenv() # loads your ACCOUNT and API_KEY from a .env file
login(
account=str(os.getenv("ACCOUNT")),
api_key=str(os.getenv("API_KEY")),
)
def generate_html_with_echarts(config, output_file="newchart.html"):
"""
A function that generates HTML files containing Apache ECharts
Visualization.
"""
option = json.loads(config)["option"]
option_str = json.dumps(option)
html_content = f"""
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Apache ECharts Example</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<!-- Load ECharts library from CDN -->
<script type="text/javascript" src="https://cdn.jsdelivr.net/npm/echarts@5.3.3/dist/echarts.min.js"></script>
</head>
<body>
<!-- Container for the chart -->
<div id="main" style="width: 600px; height: 400px;"></div>
<script type="text/javascript">
// Initialize the chart
var myChart = echarts.init(document.getElementById('main'));
// Specify the chart options
var option = {option_str};
// Use the specified options to generate the chart
myChart.setOption(option);
</script>
</body>
</html>
"""
# Write the HTML content to the output file
with open(output_file, "w", encoding="utf-8") as file:
file.write(html_content)
print(f"HTML file '{output_file}' has been generated.")
async def analytics_test():
"""
Function for testing analytics agent
"""
# create a distiller client
distiller_client = DistillerClient()
# upload your config file to register a new distiller project
distiller_client.create_project(
config_path="analytics.yaml",
project="analytics_example", # assuming the config above is in a file named "analytics.yaml",
)
    visualization_output_dir = "visualization_output"
    os.makedirs(visualization_output_dir, exist_ok=True)  # ensure the output directory exists
    async with distiller_client(
        project="analytics_example",
        uuid="test_user",
    ) as dc:
        queries = [
            "I want to see a line plot of the areas of the different cities.",
        ]
        for query_index, query in enumerate(queries):
            responses = await dc.query(query=query)
            print(f"----\nQuery: {query}")
            async for response in responses:
                print(f"Response: {response['content']}")  # print final results
                if response["role"] == "Analytics Agent":
                    visualization_dicts = literal_eval(response["content"])
                    for visualization_dict in visualization_dicts:
                        # generate plots for data visualization
                        try:
                            generate_html_with_echarts(
                                visualization_dict,
                                output_file=visualization_output_dir
                                + "/chart_"
                                + str(query_index)
                                + ".html",
                            )
                        except Exception as e:
                            print(e)
                            continue
if __name__ == "__main__":
asyncio.run(analytics_test())