Analyzing City Information Data with Analytics Agent¶
Overview¶
The Analytics Agent is a built-in utility agent within the AI Refinery platform, designed to perform data analysis based on user queries. The agent offers seamless analysis and interaction with data from CSV files or PostgreSQL databases, utilizing natural language queries to generate and refine commands. It also enhances metadata understanding and supports remote data sources, allowing analytics and execution without needing data uploads to the AI Refinery service.
Goals¶
In this tutorial, you'll learn how to leverage the Analytics Agent to streamline the analysis and visualization of data about cities around the world. We'll demonstrate:
- Configuring the Analytics Agent to interact with a PostgreSQL database.
- Configuring the agent to handle spreadsheets as Pandas DataFrames.
- Customizing query transformation using `query_transformation_examples` to refine how user queries are interpreted.
- Enabling the agent to generate plots using built-in Apache ECharts templates.
- Overriding the built-in visualization templates with valid Apache ECharts templates of your choice.
Data Description¶
The data that the Analytics Agent will leverage in this tutorial comprises three tables that provide comprehensive information about global cities, their leaders, and notable tourist attractions:
Table: world_cities
| Column | Type | Description |
| --- | --- | --- |
| id | INTEGER | A unique identifier for each city. |
| city_name | VARCHAR | The name of the city. |
| country | VARCHAR | The country in which the city is located. |
| population | INTEGER | The population count of the city. |
| area_km2 | FLOAT | The total area of the city in square kilometers. |
| gdp_billion_usd | FLOAT | The GDP of the city in billion U.S. dollars. |
Table: city_mayors
| Column | Type | Description |
| --- | --- | --- |
| mayor_id | INTEGER | A unique identifier for each mayor. |
| city_id | INTEGER | A reference to the city the mayor governs. |
| mayor_name | VARCHAR | The full name of the mayor. |
| term_start | DATE | The start date of the mayor's current term. |
| term_end | DATE | The end date of the mayor's current term. |
Table: attractions
| Column | Type | Description |
| --- | --- | --- |
| id | INTEGER | A unique identifier for each attraction. |
| name | VARCHAR | The name of the tourist attraction. |
| city_id | INTEGER | A reference to the city where the attraction is located. |
| description | TEXT | A detailed description of the attraction. |
| opening_hours | VARCHAR | The hours during which the attraction is open to visitors. |
| ticket_price | FLOAT | The cost of admission to the attraction. |
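If you would like to follow along end to end, you need these tables as actual data sources. Below is a minimal sketch that writes a few hypothetical sample rows to CSV files with pandas; the file paths match the ones used in the pandas examples later in this tutorial, while the rows themselves are placeholders to swap for your real data.

import os
import pandas as pd

os.makedirs("data", exist_ok=True) # Directory used by the pandas examples below

# Hypothetical sample rows -- replace with your real data.
pd.DataFrame([
    {"id": 1, "city_name": "Tokyo", "country": "Japan",
     "population": 13960000, "area_km2": 2194.1, "gdp_billion_usd": 1600.0},
    {"id": 2, "city_name": "Paris", "country": "France",
     "population": 2161000, "area_km2": 105.4, "gdp_billion_usd": 850.0},
]).to_csv("data/world_cities.csv", index=False)

pd.DataFrame([
    {"mayor_id": 1, "city_id": 1, "mayor_name": "Jane Doe",
     "term_start": "2022-04-01", "term_end": "2026-03-31"},
]).to_csv("data/city_mayors.csv", index=False)

pd.DataFrame([
    {"id": 1, "name": "Louvre Museum", "city_id": 2,
     "description": "World-famous art museum.",
     "opening_hours": "09:00-18:00", "ticket_price": 22.0},
]).to_csv("data/attractions.csv", index=False)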
Common Steps¶
To effectively use the Analytics Agent with both PostgreSQL and pandas, follow these common steps:

- **Configure the Agent**: Provide the necessary configuration in a YAML file. This includes details such as database tables, CSV file names, column descriptions, etc.
- **Create an AI Refinery Project**: Use a Python script to create an AI Refinery project that will utilize the configured agent.
- **Configure the Executor**: As outlined in the workflow below, all SQL/pandas commands generated by the agent are sent back to the SDK backend to be executed. An executor with access to the data sources must be configured and instantiated on the SDK backend. This can be done in two ways:
    - **Automatic Instantiation**: Provide the necessary details (e.g., database credentials) in the YAML config.
    - **Manual Instantiation**: Manually instantiate the executor in your Python script and add it to the `executor_dict`.

*Figure: Workflow of Analytics Agent*

Examples of both executor configuration approaches are shown in the following sections.
Analytics Agent with PostgreSQL¶
In this section, we configure and utilize an Analytics Agent to interact with a PostgreSQL database that contains the data above in different tables.
Automatic Executor Instantiation¶
To instantiate the executor automatically, we have to provide the database credentials in the YAML file. The Python script then does not need to include any executor details.
YAML Configuration Including DB Credentials¶
orchestrator:
agent_list:
- agent_name: "Analytics Agent"
utility_agents:
- agent_class: AnalyticsAgent
agent_name: "Analytics Agent"
agent_description: "An agent that performs data analytics"
config:
contexts: # Optional field
- "date"
- "chat_history"
executor_config:
type: PostgresExecutor # Must be PostgresExecutor if you want to interact with PostgreSQL databases.
db_config: # Required if you do not manually create the PostgresExecutor and add it to your executor_dict
"host": "localhost"
"port": "5432"
"user": "myuser"
"password": "mypassword"
"database": "city_information"
tables: # required
- name: "world_cities" # table name is required.
desc: "Global city data with demographics and economic statistics" # Optional table description. The description is generated by the agents if you do not provide it.
schema_name: "public" # schema name for this table
columns: #optional if the user wants to specify column details (for all columns or a subset). This is not an exhaustive list
- name: "city_name"
desc: "Name of the city"
- name: "area_km2"
- name: "city_mayors" # a table containing information about city mayors
schema_name: "public" # schema name for this table
- name: "attractions" # a table containing information about city attraction
schema_name: "city_tourism" # schema name for this table
Python Code Excluding any Executor Details¶
Since we provide the necessary details in the YAML config, we do not need to create an executor in our Python code. Based on the config, a `PostgresExecutor` will be created automatically in the background.
import asyncio
import os
from air import DistillerClient
from dotenv import load_dotenv
load_dotenv() # loads your API_KEY from your local '.env' file
api_key=str(os.getenv("API_KEY"))
async def analytics_test():
"""
Function for testing analytics agent
"""
# create a distiller client
distiller_client = DistillerClient(api_key=api_key)
# upload your config file to register a new distiller project
distiller_client.create_project(
config_path="analytics.yaml",
project="analytics_example", # assuming the config above is in a file named "analytics.yaml",
)
async with distiller_client( # Open an async distiller client session
project="analytics_example", # Project name that was created above
uuid="test_user", # Unique user identifier
) as dc: # Alias the client as "dc"
queries = [ # Define list of natural language queries
"Who is the mayor of the largest city in terms of area?", # Query 1
]
for query in queries: # Loop through each query
responses = await dc.query(query=query) # Send query and await responses
print(f"----\nQuery: {query}")
async for response in responses:
print(f"Response: {response['content']}") # Print the content of each response
if __name__ == "__main__":
asyncio.run(analytics_test())
Manual Executor Instantiation¶
If you prefer not to share the database credentials, you can omit the `db_config` from the YAML file and create an executor in your Python file instead.
YAML Configuration Excluding DB Credentials¶
orchestrator:
agent_list:
- agent_name: "Analytics Agent"
utility_agents:
- agent_class: AnalyticsAgent
agent_name: "Analytics Agent"
agent_description: "An agent that performs data analytics"
config:
contexts: # Optional field
- "date"
- "chat_history"
executor_config:
type: PostgresExecutor # Must be PostgresExecutor if you want to interact with PostgreSQL databases.
tables: # required
- name: "world_cities" # table name is required.
desc: "Global city data with demographics and economic statistics" # Optional table description. The description is generated by the agents if you do not provide it.
schema_name: "public"
columns: #optional if the user wants to specify column details (for all columns or a subset). This is not an exhaustive list
- name: "city_name"
desc: "Name of the city"
- name: "area_km2"
- name: "city_mayors" # a table containing information about city mayors
schema_name: "public"
- name: "attractions" # a table containing information about city attraction
schema_name: "city_tourism"
Python Code Including Executor Instantiation¶
Since we do not provide the `db_config` in the YAML configuration, we must instantiate an executor in our Python code and add it to the `executor_dict`.
import asyncio
import os
from air import DistillerClient
from air.api import PostgresAPI
from dotenv import load_dotenv
load_dotenv() # loads your API_KEY from your local '.env' file
api_key=str(os.getenv("API_KEY"))
# setting up database client
analytics_db_config: dict = {
"host": "localhost",
"port": "5432",
"user": "myuser",
"password": "mypassword",
"database": "city_information",
}
analytics_db_client = PostgresAPI(analytics_db_config)
async def analytics_test():
"""
Function for testing analytics agent
"""
# create a distiller client
distiller_client = DistillerClient(api_key=api_key)
# upload your config file to register a new distiller project
distiller_client.create_project(
config_path="analytics.yaml", project="analytics_example"
)
executor_dict = {
"Analytics Agent": { # This key must match the name of the agent in the YAML config.
"PostgresExecutor": analytics_db_client.execute_query, # The key must be `PostgresExecutor` for PostgreSQL command executor. The value must be a callable that executes the command and returns the results.
}
}
async with distiller_client( # Open an async distiller client session
project="analytics_example", # Project name that was created above
uuid="test_user", # Unique user identifier
executor_dict=executor_dict, # Pass the executor_dict to the call
) as dc: # Alias the client as "dc"
queries = [ # Define list of natural language queries
"Who is the mayor of the largest city in terms of area?", # Query 1
]
for query in queries: # Loop through each query
responses = await dc.query(query=query) # Send query and await responses
print(f"----\nQuery: {query}") # Print separator and query text
async for response in responses: # Stream responses asynchronously
print(f"Response: {response['content']}") # Print the content of each response
if __name__ == "__main__":
asyncio.run(analytics_test())
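Note that the value registered under `PostgresExecutor` just needs to be a callable that takes a generated SQL command and returns the results; `PostgresAPI` is the convenient built-in option, but not the only one. As a hypothetical sketch, a hand-rolled alternative using `psycopg2` might look like the following (the exact return format the agent expects is what `PostgresAPI.execute_query` produces, so prefer that client in practice):

import psycopg2

def execute_sql(sql: str):
    """Run a generated SQL command against the city_information database."""
    # A short-lived connection per query keeps the sketch simple;
    # use a connection pool in production.
    conn = psycopg2.connect(
        host="localhost",
        port=5432,
        user="myuser",
        password="mypassword",
        dbname="city_information",
    )
    try:
        with conn.cursor() as cur:
            cur.execute(sql) # Execute the agent-generated SQL
            return cur.fetchall() # Return all result rows
    finally:
        conn.close()

# Registered exactly like the PostgresAPI client above:
# executor_dict = {"Analytics Agent": {"PostgresExecutor": execute_sql}}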
Analytics Agent with Pandas¶
In this section, we configure and utilize an Analytics Agent to interact with our data (described above), stored in different CSV files.
Automatic Executor Instantiation¶
To automatically instantiate the executor, specify the data source details, like CSV file paths, in the YAML file. The Python script does not need to include any executor details.
YAML Configuration Including Data Source Details¶
orchestrator:
agent_list:
- agent_name: "Analytics Agent"
utility_agents:
- agent_class: AnalyticsAgent
agent_name: "Analytics Agent" # A name that you choose for your analytics agent
agent_description: "An agent that performs data analytics on city information" # Optional
config:
visualization: False # The agent generates a JSON to be plotted with Apache ECharts if True, textual response otherwise.
executor_config: # Required. Configuration of your executor.
type: PandasExecutor # Must be PandasExecutor if you want to interact with Pandas DataFrames.
tables: # Required. The list of tables
- name: "world_cities" # The name of your table.
desc: "Global city data with demographics and economic statistics" # Optional table description. The description is generated by the agents if you do not provide it.
file_path: "data/world_cities.csv" # The file path where the CSV file is hosted.
columns: # Optional. List of columns if you want to provide a description (for all columns or a subset)
- name: "city_name"
desc: "Name of the city"
- name: "area_km2"
- name: "city_mayors"
file_path: "data/city_mayors.csv"
- name: "attractions"
file_path: "data/attractions.csv"
output_style: "markdown" # Optional field
contexts: # Optional field
- "date"
- "chat_history" # The chat history up to a certain number of rounds
Python Code Excluding any Executor Details¶
Since we provide the necessary details in the YAML config, we do not need to create an executor in our Python code. Based on the config, a `PandasExecutor` will be created automatically in the background.
import asyncio
import os
from air import DistillerClient
from dotenv import load_dotenv
load_dotenv() # loads your API_KEY from your local '.env' file
api_key=str(os.getenv("API_KEY"))
async def analytics_test():
"""
Function for testing analytics agent
"""
# create a distiller client
distiller_client = DistillerClient(api_key=api_key)
# upload your config file to register a new distiller project
distiller_client.create_project(
config_path="analytics.yaml",
project="analytics_example", # assuming the config above is in a file named "analytics.yaml",
)
async with distiller_client( # Open an async distiller client session
project="analytics_example", # Project name that was created above
uuid="test_user", # Unique user identifier
) as dc: # Alias the client as "dc"
queries = [ # Define list of natural language queries
"Who is the mayor of the largest city in terms of area?", # Query 1
]
for query in queries: # Loop through each query
responses = await dc.query(query=query) # Send query and await responses
print(f"----\nQuery: {query}") # Print separator and query text
async for response in responses: # Stream responses asynchronously
print(f"Response: {response['content']}") # Print the content of each response
if __name__ == "__main__":
asyncio.run(analytics_test())
Manual Executor Instantiation¶
If you prefer not to specify the data source details (i.e., `file_path`) in the YAML file, you can create and configure the executor manually in your Python code.
YAML Configuration Excluding Data Source Details¶
orchestrator:
agent_list:
- agent_name: "Analytics Agent"
utility_agents:
- agent_class: AnalyticsAgent
agent_name: "Analytics Agent"
agent_description: "An agent that performs data analytics"
config:
contexts: # Optional field
- "date"
- "chat_history"
executor_config:
type: PandasExecutor
tables:
- name: "world_cities" # table name is required.
desc: "Global city data with demographics and economic statistics" # Optional table description. The description is generated by the agents if you do not provide it.
columns: #optional if the user wants to specify column details (for all columns or a subset)
- name: "city_name" # this is not an exhaustive list
desc: "Name of the city"
- name: "area_km2"
- name: "city_mayors"
- name: "attractions"
Python Code Including Executor Instantiation¶
import asyncio
import os
from air import DistillerClient
from air.api import PandasAPI
from dotenv import load_dotenv
load_dotenv() # loads your API_KEY from your local '.env' file
api_key=str(os.getenv("API_KEY"))
# Table information for creating the PandasExecutor
tables = [
{
"name": "world_cities",
"file_path": "data/world_cities.csv",
},
{
"name": "city_mayors",
"file_path": "data/city_mayors.csv",
},
{
"name": "attractions",
"file_path": "data/attractions.csv",
},
]
analytics_df_client = PandasAPI(tables) # creating the dataframe client
async def analytics_demo():
"""Function for testing analytics agents"""
# create a distiller client
distiller_client = DistillerClient(api_key=api_key)
# creating the project
distiller_client.create_project(
config_path="analytics.yaml", project="analytics_example"
)
executor_dict = {
"Analytics Agent": {
"PandasExecutor": analytics_df_client.execute_query, # the execute_query function of the dataframe client needs to be added as the PandasExecutor
},
}
async with distiller_client( # Open an async distiller client session
project="analytics_example", # Project name that was created above
uuid="test_user", # Unique user identifier
executor_dict=executor_dict, # Pass the executor_dict to the call
) as dc: # Alias the client as "dc"
queries = [ # Define list of natural language queries
"Who is the mayor of the largest city in terms of area?", # Query 1
]
for query in queries: # Loop through each query
responses = await dc.query(query=query) # Send query and await responses
print(f"----\nQuery: {query}") # Print separator and query text
async for response in responses: # Stream responses asynchronously
print(f"Response: {response['content']}") # Print the content of each response
if __name__ == "__main__":
asyncio.run(analytics_demo())
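As with PostgreSQL, the value registered under `PandasExecutor` only needs to be a callable that executes the generated command and returns the results. If you wanted to roll your own instead of using `PandasAPI`, a rough sketch might look like the following. This assumes the agent emits a single pandas expression that references the tables by name; the actual contract is whatever `PandasAPI.execute_query` implements, so prefer the built-in client in practice.

import pandas as pd

# Load each table into a DataFrame keyed by its table name.
dataframes = {
    "world_cities": pd.read_csv("data/world_cities.csv"),
    "city_mayors": pd.read_csv("data/city_mayors.csv"),
    "attractions": pd.read_csv("data/attractions.csv"),
}

def execute_pandas(command: str):
    """Evaluate a generated pandas expression with the table DataFrames in scope."""
    # WARNING: eval() runs arbitrary code; only use it with commands you trust.
    return eval(command, {"pd": pd, **dataframes})

# Registered exactly like the PandasAPI client above:
# executor_dict = {"Analytics Agent": {"PandasExecutor": execute_pandas}}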
Customizing Query Transformation¶
As described in the workflow overview, the Analytics Agent supports query transformation. By default, it uses built-in exemplars, but you can customize how the agent rewrites user queries by providing `query_transformation_examples` in the executor configuration.
This allows the agent to generate one or more refined queries based on the user’s input and the database or dataframe schema, giving you flexibility to control how different types of queries are processed.
YAML Configuration Including Query Transformation¶
orchestrator:
agent_list:
- agent_name: "Analytics Agent"
utility_agents:
- agent_class: AnalyticsAgent
agent_name: "Analytics Agent"
agent_description: "An agent that performs data analytics with query transformation"
config:
contexts: # Optional field
- "date"
- "chat_history"
executor_config:
type: PostgresExecutor # Must be PostgresExecutor if you want to interact with PostgreSQL databases.
db_config: # Required if you do not manually create the PostgresExecutor and add it to your executor_dict
host: "localhost"
port: "5432"
user: "myuser"
password: "mypassword"
database: "city_information"
tables: # required
- name: "world_cities" # table name is required.
desc: "Global city data with demographics and economic statistics" # Optional table description. The description is generated by the agents if you do not provide it.
schema_name: "public" # schema name for this table
columns: # optional if the user wants to specify column details (for all columns or a subset). This is not an exhaustive list
- name: "city_name"
desc: "Name of the city"
- name: "population"
desc: "Population of the city"
- name: "area_km2"
- name: "city_mayors" # a table containing information about city mayors
schema_name: "public" # schema name for this table
- name: "attractions" # a table containing information about city attraction
schema_name: "city_tourism" # schema name for this table
query_transformation_examples: # Optional. Exemplars for how to transform the original user query to refined queries, more suitable for generating the pandas/postgres commands. Built-in exemplars are used if you do not provide your own exemplars.
- user_query: "What is the largest population among cities?"
query:
- "Return the city name and its population for the city with the largest population."
- user_query: "Who is the mayor of the largest city?"
query:
- "Find the mayor and the city name for the city with the largest population."
- "Provide the name of the mayor along with the city that has the highest population."
Python Code to Run Queries¶
import asyncio
import os
from air import DistillerClient
from dotenv import load_dotenv
load_dotenv() # loads your API_KEY from '.env'
api_key = str(os.getenv("API_KEY"))
async def analytics_with_transformation():
"""
Example showing the use of query transformation
"""
distiller_client = DistillerClient(api_key=api_key)
# Create or register the project using YAML config
distiller_client.create_project(
config_path="analytics_transformation.yaml",
project="analytics_example_transformed",
)
async with distiller_client( # Open an async distiller client session
project="analytics_example_transformed", # Project name that was created above
uuid="test_user", # Unique user identifier
) as dc: # Alias the client as "dc"
queries = [ # Define list of natural language queries
"What is the largest population among cities?", # Query 1
"Who is the mayor of the largest city?" # Query 2
]
for query in queries: # Loop through each query
print(f"----\nQuery: {query}") # Print separator and query text
responses = await dc.query(query=query) # Send query and await responses
async for response in responses: # Stream responses asynchronously
print(f"Response: {response['content']}") # Print the content of each response
if __name__ == "__main__":
asyncio.run(analytics_with_transformation())
Visualization with Analytics Agent¶
You can enable the Analytics Agent to generate Apache ECharts plot configuration (in JSON format) instead of providing a textual response. The generated JSON can be used to create plots in your HTML file.
The Analytics Agent is pre-configured with the following plot templates to choose from: bar, line, scatter, histogram, pie, and box. However, you may provide your own templates, which will override the built-in ones.
The example below uses the same pandas-based Analytics Agent as before, but with visualization enabled.
Enabling Visualization in YAML Configuration¶
Simply set `visualization: True` in your agent configuration to enable the agent to generate visualization configurations instead of textual responses.
orchestrator:
agent_list:
- agent_name: "Analytics Agent"
utility_agents:
- agent_class: AnalyticsAgent
agent_name: "Analytics Agent" # A name that you choose for your analytics agent
agent_description: "An agent that performs data analytics on city information" # Optional
config:
visualization: True # Set this to True to enable visualization
executor_config: # Required. Configuration of your executor.
type: PandasExecutor # Must be PandasExecutor if you want to interact with Pandas DataFrames.
tables: # Required. The list of tables
- name: "world_cities" # The name of your table.
desc: "Global city data with demographics and economic statistics" # Optional table description. The description is generated by the agents if you do not provide it.
file_path: "data/world_cities.csv" # The file path where the CSV file is hosted.
columns: # Optional. List of columns if you want to provide a description (for all columns or a subset)
- name: "city_name"
desc: "Name of the city"
- name: "area_km2"
- name: "city_mayors"
file_path: "data/city_mayors.csv"
- name: "attractions"
file_path: "data/attractions.csv"
output_style: "markdown" # Optional field
contexts: # Optional field
- "date"
- "chat_history" # The chat history up to a certain number of rounds
Using Custom Plot Templates¶
You can customize the plot templates used by the Analytics Agent by setting the `plot_templates` attribute in the agent's YAML configuration. The `plot_templates` value must be a dictionary with plot types as keys and valid Apache ECharts plot templates as values.
orchestrator:
agent_list:
- agent_name: "Analytics Agent"
utility_agents:
- agent_class: AnalyticsAgent
agent_name: "Analytics Agent" # A name that you choose for your analytics agent
agent_description: "An agent that performs data analytics on city information" # Optional
config:
visualization: True # Set this to True to enable visualization
      # A dictionary with plot type as the key and a valid Apache ECharts plot template as its value
plot_templates: |
{
"line": {
"option": {
"title": {
"text": "Replace with your chart title, e.g., 'Company Information'",
"subtext": "Replace with your chart subtitle.",
"left": "Specify the position of the title, e.g., 'center'"
},
"xAxis": {
"type": "category",
"data": "Replace with your category labels",
"name": "Replace with an appropriate name for the x axis.",
"axisLabel": {
"interval": 0,
"rotate": 30
}
},
"yAxis": {
"type": "value",
"name": "Replace with an appropriate name for the y axis."
},
"series": [
{
"data": "Replace with your data values",
"type": "line",
"markPoint": {
"data": [
{
"name": "Replace with the appropriate name",
"type": "max"
}
]
}
}
]
}
}
}
executor_config: # Required. Configuration of your executor.
type: PandasExecutor # Must be PandasExecutor if you want to interact with Pandas DataFrames.
tables: # Required. The list of tables
- name: "world_cities" # The name of your table.
desc: "Global city data with demographics and economic statistics" # Optional table description. The description is generated by the agents if you do not provide it.
file_path: "data/world_cities.csv" # The file path where the CSV file is hosted.
columns: # Optional. List of columns if you want to provide a description (for all columns or a subset)
- name: "city_name"
desc: "Name of the city"
- name: "area_km2"
- name: "city_mayors"
file_path: "data/city_mayors.csv"
- name: "attractions"
file_path: "data/attractions.csv"
output_style: "markdown" # Optional field
contexts: # Optional field
- "date"
- "chat_history" # The chat history up to a certain number of rounds
Plotting with Generated Visualization Configuration¶
After the agent generates the configuration JSON for visualization, you can integrate it into your HTML file along with the required JavaScript imports to create the visualizations.
The following example demonstrates how to generate HTML files using these JSON configurations and save them to your local file system. You can then open these files in your browser to view the plots.
import asyncio
import json
import os
from ast import literal_eval
from air import DistillerClient
from dotenv import load_dotenv
load_dotenv() # loads your API_KEY from your local '.env' file
api_key=str(os.getenv("API_KEY"))
def generate_html_with_echarts(config, output_file="newchart.html"):
"""
A function that generates HTML files containing Apache ECharts
Visualization.
"""
option = json.loads(config)["option"]
option_str = json.dumps(option)
html_content = f"""
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Apache ECharts Example</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<!-- Load ECharts library from CDN -->
<script type="text/javascript" src="https://cdn.jsdelivr.net/npm/echarts@5.3.3/dist/echarts.min.js"></script>
</head>
<body>
<!-- Container for the chart -->
<div id="main" style="width: 600px; height: 400px;"></div>
<script type="text/javascript">
// Initialize the chart
var myChart = echarts.init(document.getElementById('main'));
// Specify the chart options
var option = {option_str};
// Use the specified options to generate the chart
myChart.setOption(option);
</script>
</body>
</html>
"""
# Write the HTML content to the output file
with open(output_file, "w", encoding="utf-8") as file:
file.write(html_content)
print(f"HTML file '{output_file}' has been generated.")
async def analytics_test():
"""
Function for testing analytics agent
"""
# create a distiller client
distiller_client = DistillerClient(api_key=api_key)
# upload your config file to register a new distiller project
distiller_client.create_project(
config_path="analytics.yaml",
project="analytics_example", # assuming the config above is in a file named "analytics.yaml",
)
    visualization_output_dir = "visualization_output"
    os.makedirs(visualization_output_dir, exist_ok=True) # Create the output directory if it does not already exist
async with distiller_client( # Open an async distiller client session
project="analytics_example", # Project name that was created above
uuid="test_user", # Unique user identifier
) as dc: # Alias the client as "dc"
queries = [ # Define list of natural language queries
"I want to see a line plot of the areas of the different cities.", # Query 1
]
for query_index, query in enumerate(queries): # Loop through each query with index
responses = await dc.query(query=query) # Send query and await responses
print(f"----\nQuery: {query}") # Print separator and query text
async for response in responses: # Stream responses asynchronously
print(f"Response: {response['content']}") # Return final results
if response["role"] == "Analytics Agent": # Check if response is from Analytics Agent
                    visualization_dicts = literal_eval(response["content"]) # Convert string to Python list of configs
                    for visualization_dict in visualization_dicts: # Loop through each visualization config
                        try: # Attempt to generate plots
                            generate_html_with_echarts(
                                visualization_dict, # Visualization configuration
                                output_file=visualization_output_dir # Output directory
                                + "/chart_" # File prefix
                                + str(query_index) # Include query index in filename
                                + ".html", # File extension
                            )
                        except Exception as e: # Catch any errors during plot generation
                            print(e) # Print error message
                            continue # Continue to the next visualization config
if __name__ == "__main__":
asyncio.run(analytics_test())