
Explore the Capabilities of the Image Understanding Agent

Overview

The Image Understanding Agent is a utility agent designed to fulfill user requests by interpreting the contents of provided images. It can perform tasks such as natural image description, chart reading, Optical Character Recognition (OCR), and more. This extends the scope of agentic frameworks beyond text-based applications.

Goals

The goals of this tutorial are to demonstrate some of the agent's capabilities and illustrate how different agents interact to solve user queries within a user-defined agentic framework. By the end, you will know how to configure your own agentic framework, consisting of custom and default agents, including the Image Understanding Agent, to solve simplified tasks involving images.

Configuration

To use the Image Understanding Agent, define its configuration in a YAML file, here example.yaml. The configuration specifies the agent's settings; the default model is Llama 3.2-90B-Vision-Instruct. This tutorial uses four agents to show how the Image Understanding Agent works on its own and alongside other agents to handle user queries. Each agent is described in the YAML configuration below:

orchestrator:
  agent_list:
    - agent_name: "Search Agent"
    - agent_name: "Image Understanding Agent"
    - agent_name: "Story Teller Agent"
    - agent_name: "Markdown Agent"

utility_agents:
  - agent_class: UtilityAgent
    agent_name: "Story Teller Agent" # This agent will create stories as requested by the user
    agent_description: "This is capable of writing a story"
    config:
      magic_prompt: "You are a master of enchanting stories for children. Your story must begin with the timeless phrase, 'Once upon a time...'\nUser query:\n{query}"
      contexts:  # Optional field
        - "date" # This will add a date stamp to the agent's output, which can be leveraged later.
        - "chat_history" # This enables the agent to utilize the previous chat history to fulfill the user's query

  - agent_class: SearchAgent
    agent_name: "Search Agent" # This agent will fulfill the user's query by web search
    config:
      contexts: # Optional field
        - "date"
        - "chat_history"

  - agent_class: ImageUnderstandingAgent
    agent_name: Image Understanding Agent # This agent handles queries related to images
    agent_description: This agent can help you understand and analyze an image.
    config: {}

  - agent_class: ImageUnderstandingAgent
    agent_name: Markdown Agent # This is a sub-agent that specializes in converting tables to Markdown
    agent_description: This agent can convert a table in an image into Markdown format.
    config:
      output_style: "markdown"

Note that the Story Teller Agent and Search Agent take chat_history because they will use the previous conversation history to perform the requested tasks in the examples below. The Markdown Agent inherits from the default agent, ImageUnderstandingAgent, because it requires image understanding to extract a table from an image. We set the output_style to markdown.
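
Before uploading a configuration like the one above, it can help to confirm that every name in the orchestrator's agent_list has a matching entry under utility_agents. The sketch below is only an illustration: it assumes the names are expected to match exactly, it mirrors the relevant fields as a plain Python dict instead of parsing example.yaml, and undefined_agents is a hypothetical helper, not part of the SDK.

```python
# Relevant fields of example.yaml mirrored as a plain dict (assumption:
# in practice you would load the file with a YAML parser instead).
config = {
    "orchestrator": {
        "agent_list": [
            {"agent_name": "Search Agent"},
            {"agent_name": "Image Understanding Agent"},
            {"agent_name": "Story Teller Agent"},
            {"agent_name": "Markdown Agent"},
        ]
    },
    "utility_agents": [
        {"agent_class": "UtilityAgent", "agent_name": "Story Teller Agent"},
        {"agent_class": "SearchAgent", "agent_name": "Search Agent"},
        {"agent_class": "ImageUnderstandingAgent", "agent_name": "Image Understanding Agent"},
        {"agent_class": "ImageUnderstandingAgent", "agent_name": "Markdown Agent"},
    ],
}


def undefined_agents(cfg):
    """Return orchestrator agent names that have no utility_agents entry."""
    defined = {agent["agent_name"] for agent in cfg["utility_agents"]}
    listed = [entry["agent_name"] for entry in cfg["orchestrator"]["agent_list"]]
    return [name for name in listed if name not in defined]


# Hypothetical helper: an empty list means every listed agent is defined.
missing = undefined_agents(config)
print(missing)
```

A check like this catches typos in agent names locally, before the project is registered.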

We also define the magic_prompt for the Story Teller Agent. The magic_prompt can be used for various purposes such as providing instructions.
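
The magic_prompt above contains a {query} placeholder. How the framework fills it in is not shown in this tutorial, so the following is only a sketch of the idea: it assumes a plain textual substitution, and fill_prompt is a hypothetical helper introduced for illustration.

```python
# The template text is copied from the Story Teller Agent's magic_prompt.
MAGIC_PROMPT = (
    "You are a master of enchanting stories for children. "
    "Your story must begin with the timeless phrase, 'Once upon a time...'\n"
    "User query:\n{query}"
)


def fill_prompt(template: str, query: str) -> str:
    """Substitute the user's query for the {query} placeholder.

    Assumption: simple replacement. str.replace is used instead of
    str.format so that braces inside the user's query cannot raise errors.
    """
    return template.replace("{query}", query)


prompt = fill_prompt(MAGIC_PROMPT, "Tell me a story about a blue car")
print(prompt)
```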

Python Files

The code snippet below queries the framework with the question "What's in the image?" for the image located at the following URL: https://www.wikipedia.org/portal/wikipedia.org/assets/img/Wikipedia-logo-v2@2x.png. You can apply this code snippet to any (query, image) pair from the example use cases provided in the next subsection.

import os
import asyncio

from air import login, DistillerClient
from air import utils


auth = login(
    account=str(os.getenv("ACCOUNT")),
    api_key=str(os.getenv("API_KEY")),
)

async def image_understanding():
    # create a distiller client
    distiller_client = DistillerClient()

    # upload your config file to register a new distiller project
    distiller_client.create_project(config_path="example.yaml", project="example")

    async with distiller_client(
        project="example",
        uuid="test_user",
    ) as dc:
        responses = await dc.query(
            query="what's in the image?",
            image=utils.image_to_base64(
                "https://www.wikipedia.org/portal/wikipedia.org/assets/img/Wikipedia-logo-v2@2x.png" # This argument can be either a url or a local file path.
            ),
        )
        async for response in responses:
            print(response["content"])  # Extract and print the 'content' field


if __name__ == "__main__":
    print("Image Understanding")
    asyncio.run(image_understanding())

Replace the arguments to query and image with your own query and image. The image argument accepts either a URL or a local file path.
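
To give a feel for what base64-encoding an image involves, here is a standard-library sketch for a local file. It is not the implementation of utils.image_to_base64, whose exact output format (for example, whether it adds a data-URI prefix) is not described in this tutorial; local_image_to_base64 is a hypothetical name used only for illustration.

```python
import base64
from pathlib import Path


def local_image_to_base64(path: str) -> str:
    """Read a local image file and return its raw bytes as a base64 string.

    Illustration only: the SDK helper may return a different format.
    """
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")
```

Decoding the returned string with base64.b64decode yields the original file bytes, which makes the round trip easy to verify.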

Example Usage

Below are example use cases for the Image Understanding Agent:

  • Chart reading
  • General image reading
  • Synthetic image reading
  • Table-to-text (OCR) conversion
  • Table-to-Markdown conversion

Single Agent Call

The following examples each use a single call to the Image Understanding Agent because the queries are simple enough for this agent to answer alone. They demonstrate individual capabilities that can later be combined with other agents.

Chart Reading

This example demonstrates the agent's ability to read and comprehend the chart provided in the image.

Query:

What's in this image?

chart image

Output:

Below is the model-generated output for the user query. The agent extracted the important information from the chart.

The image presents a bar graph illustrating the population of selected European countries in 2013 and 2021. The graph features two sets of bars for each country, with yellow representing the population in 2013 and green representing the population in 2021.

Countries and Populations:

Germany: 80.6 million (2013), 83.2 million (2021)
Spain: 46.6 million (2013), 47.4 million (2021)
France: 65.8 million (2013), 67.7 million (2021)
Croatia: 4.3 million (2013), 4 million (2021)
Italy: 60.2 million (2013), 59.1 million (2021)
Poland: 38 million (2013), 37.7 million (2021)
Portugal: 10.5 million (2013), 10.3 million (2021)

Observations:

The population of Germany increased by 2.6 million from 2013 to 2021.
The population of Spain increased by 0.8 million from 2013 to 2021.
The population of France increased by 1.9 million from 2013 to 2021.
The population of Croatia decreased by 0.3 million from 2013 to 2021.
The population of Italy decreased by 1.1 million from 2013 to 2021.
The population of Poland decreased by 0.3 million from 2013 to 2021.
The population of Portugal decreased by 0.2 million from 2013 to 2021.

Conclusion:

The graph provides a clear visual representation of the population changes in these European countries between 2013 and 2021. It allows for easy comparison of the population trends across different countries, making it a useful tool for understanding demographic shifts in Europe during this period.
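
Because the agent's observations are derived from the figures it read off the chart, they can be checked with simple arithmetic. The sketch below recomputes each change from the reported 2013 and 2021 values; it only tests the internal consistency of the output above, not whether the values match the original chart.

```python
# Populations in millions as reported by the agent: (2013, 2021).
populations = {
    "Germany": (80.6, 83.2),
    "Spain": (46.6, 47.4),
    "France": (65.8, 67.7),
    "Croatia": (4.3, 4.0),
    "Italy": (60.2, 59.1),
    "Poland": (38.0, 37.7),
    "Portugal": (10.5, 10.3),
}

# Changes stated in the agent's observations (negative means a decrease).
reported_change = {
    "Germany": 2.6,
    "Spain": 0.8,
    "France": 1.9,
    "Croatia": -0.3,
    "Italy": -1.1,
    "Poland": -0.3,
    "Portugal": -0.2,
}


def change(before: float, after: float) -> float:
    """Difference in millions, rounded to one decimal like the chart labels."""
    return round(after - before, 1)


# Collect any country whose recomputed change differs from the stated one.
mismatches = {
    country: (change(*values), reported_change[country])
    for country, values in populations.items()
    if change(*values) != reported_change[country]
}
print(mismatches)
```

An empty result means every stated increase or decrease agrees with the listed populations.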

General Image Reading

This example demonstrates the agent's ability to understand the image in the context of a recent event.

Query:

Describe the image

general image reading

Output: Response from the Image Understanding Agent,

The image features the Olympic rings logo prominently displayed in front of the Eiffel Tower, with a crowd of people gathered around it. The Olympic rings are depicted in their signature colors: blue, yellow, black, green, and red. In the background, the iconic Eiffel Tower stands tall against a blue sky dotted with white clouds. The overall atmosphere suggests that the image was taken during the 2024 Summer Olympics in Paris, France.

The output shows that the agent not only reads the contents of the image but also recognizes the recent event, the Paris 2024 Olympics.

Synthetic Image Reading

The Image Understanding Agent can also interpret synthetic images generated by image generation models, such as Flux.1-schnell. The image below was generated using Stable Diffusion 1.5.

We provide the image with the prompt:

Query:

Describe the image

synthetic image reading

Output:

The image shows a vintage light blue car parked in front of an old stone building. The car has a distinctive design, featuring a long hood and a short rear deck, with a spare tire mounted on the trunk lid. A red dog statue is prominently displayed on the car's hood, adding a playful touch to the scene.

  **Key Features:**

  * **Car:** The car is the main focus of the image, with its light blue color and classic design making it stand out.
  * **Dog Statue:** The red dog statue on the hood of the car adds a whimsical element to the scene.
  * **Stone Building:** The old stone building in the background provides a sense of history and context to the image.
  * **Cobblestone Street:** The cobblestone street in front of the car adds texture and depth to the scene.

  **Overall Impression:**

  The image evokes a sense of nostalgia and charm, with the vintage car and old stone building creating a timeless atmosphere. The red dog statue adds a playful touch, making the image feel more lighthearted and fun.

The Image Understanding Agent accurately describes the image with detailed observations, demonstrating its potential for use with image generation models. The agent captures not only the physical elements of the scene but also the emotional essence, emphasizing the nostalgia and charm.

Table to Text (OCR)

The Image Understanding Agent exhibits strong performance in recognizing text within an image. While the agent is capable of performing complex tasks beyond text recognition, in this example, we ask it to extract the text from an image and summarize it.

Query:

Describe the image and summary the text in each row

table to text

Output:

The image presents a table with four columns and six rows, each representing a distinct category for evaluating discussion rubrics. The categories are: Criteria, Exemplary Performance, Satisfactory Performance, and Needs Improvement.

**Criteria:**

*   Participation
*   Relevance
*   Insight
*   Support

**Exemplary Performance:**

*   Makes an original post and replies to at least two other classmates in the discussion.
*   The posting directly addresses key issues, questions, or problems related to the text and the discussion activity. The posting applies course concepts well.
*   The posting offers original or thoughtful insight, analysis, or observation that demonstrates a strong grasp of concepts and ideas pertaining to the discussion topic.
*   The posting supports all claims and opinions with either rational argument or evidence.

**Satisfactory Performance:**

*   Makes an original post and replies to one other classmate in the discussion.
*   The posting addresses key issues, questions, or problems related to the text and the discussion activity, but in some cases, only indirectly. It does not always apply course concepts fully.
*   The posting does offer some insight, analysis, or observation to the topic but may not demonstrate a full understanding or knowledge of concepts and ideas pertaining to the discussion topic.
*   The posting generally supports claims and opinions with evidence or argument but may leave some gaps where unsupported opinions still appear.

**Needs Improvement:**

*   Makes an original post but doesn't reply to others within the discussion.
*   The posting does not directly address the question or problem posed by the discussion activity.
*   The posting does not offer any significant insight, analysis, or observation related to the topic. No knowledge or understanding is demonstrated regarding concepts and ideas pertaining to the discussion topic.
*   The posting does not support its claims with either evidence or argument. The posting contains largely unsupported opinion.

This table provides a comprehensive framework for evaluating discussion rubrics across various criteria, offering a clear structure for assessing performance levels.

Table to Markdown

The Image Understanding Agent can also convert tables or text from an image into plain text in Markdown format, which is convenient for tasks such as creating web or documentation content, organizing notes, sharing text with formatting, and ensuring portability across platforms. While the default Image Understanding Agent can perform this conversion task, we have created a specialized agent called the Markdown Agent, which inherits from the Image Understanding Agent.

Let us use the same table image from the previous example. The prompt and image provided to the agent are as follows:

Query:

Convert the table shown in the image into markdown format

table to text

Output: Response from the Markdown Agent,

# Discussion Rubric

| Criteria | Exemplary Performance | Satisfactory Performance | Needs Improvement |
| --- | --- | --- | --- |
| **Participation** | Makes an original post and replies to at least two other classmates in the discussion. | Makes an original post and replies to one other classmate in the discussion. | Makes an original post but doesn't reply to others within the discussion. |
| **Relevance** | The posting directly addresses key issues, questions, or problems related to the text and the discussion activity. The posting applies course concepts well. | The posting addresses key issues, questions, or problems related to the text and the discussion activity, but in some cases, only indirectly. It does not always apply course concepts fully. | The posting does not directly address the question or problem posed by the discussion activity. |
| **Insight** | The posting offers original or thoughtful insight, analysis, or observation that demonstrates a strong grasp of concepts and ideas pertaining to the discussion topic. | The posting does offer some insight, analysis, or observation to the topic but may not demonstrate a full understanding or knowledge of concepts and ideas pertaining to the discussion topic. | The posting does not offer any significant insight, analysis, or observation related to the topic. No knowledge or understanding is demonstrated regarding concepts and ideas pertaining to the discussion topic. |
| **Support** | The posting supports all claims and opinions with either rational argument or evidence. | The posting generally supports claims and opinions with evidence or argument but may leave some gaps where unsupported opinions still appear. | The posting does not support its claims with either evidence or argument. The posting contains largely unsupported opinion. |
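
One reason Markdown output is convenient is that it is easy to post-process. As a rough sketch, the function below turns a pipe-delimited Markdown table into a list of dictionaries keyed by column header. It assumes each row sits on a single line and that no cell contains a literal pipe character; parse_markdown_table is a hypothetical helper, not part of the framework.

```python
def parse_markdown_table(text: str) -> list[dict[str, str]]:
    """Parse a simple pipe-delimited Markdown table into row dictionaries."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Keep only lines that look like table rows: | cell | cell |
        if not (line.startswith("|") and line.endswith("|")):
            continue
        rows.append([cell.strip() for cell in line[1:-1].split("|")])
    if len(rows) < 2:
        return []
    # rows[0] is the header and rows[1] is the --- separator line.
    header, body = rows[0], rows[2:]
    return [dict(zip(header, row)) for row in body]


sample = """| Criteria | Needs Improvement |
| --- | --- |
| **Participation** | Makes an original post but doesn't reply to others within the discussion. |"""

records = parse_markdown_table(sample)
print(records[0]["Criteria"])
```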

Multi-Agent Call

The following examples leverage multiple agents, including the Image Understanding Agent, to solve the user queries. The orchestrator knows that the query can be better answered using two agents, so it decomposes the query and asks each agent in sequence.

Example 1: The History Behind the Provided Image

For the following example, the orchestrator will ask the Image Understanding Agent to describe the image and the Search Agent to find the history from the web.

Query:

Describe the image and explain the history of the provided image

general image reading

Output:

Response from the Image Understanding Agent,

The image showcases the Olympic rings, a symbol synonymous with the Olympic Games, set against the backdrop of the iconic Eiffel Tower in Paris, France. The Olympic rings are five interconnected rings of equal size, colored blue, yellow, black, green, and red, respectively, positioned on a white background. This arrangement represents the five continents of the world united by Olympism.

In the background, the Eiffel Tower stands tall, its iron latticework gleaming in the sunlight. The tower's four main pillars support the central column, which rises to a height of 324 meters. The image captures the tower's intricate details, including the staircases, elevators, and observation decks.

The sky above is a brilliant blue, dotted with white, fluffy clouds that add a touch of serenity to the scene. The overall atmosphere of the image is one of excitement and anticipation, as if the viewer is about to embark on an adventure or witness a momentous event.

The image appears to be a photograph taken during the day, with the sun shining brightly overhead. The lighting is natural, with no artificial illumination visible. The overall effect is one of warmth and energy, capturing the essence of the Olympic spirit and the beauty of the Eiffel Tower.

Response from the Search Agent,

History of the Provided Image
The image provided does not have a specific history, as it was not explicitly mentioned in the query. However, based on the context and the information available, it can be inferred that the image is related to the Olympic Games and the Eiffel Tower in Paris, France.

The Olympic rings, a symbol of the Olympic Games, have a rich history dating back to 1912 when they were designed by Pierre de Coubertin, the founder of the International Olympic Committee (IOC). The rings represent the five continents of the world united by Olympism, and they have been an integral part of the Olympic Games ever since.

The Eiffel Tower, on the other hand, has a history dating back to 1889 when it was built for the World's Fair in Paris, France. The tower was designed by Gustave Eiffel and his engineering company, and it was initially intended to be a temporary structure. However, it became an instant icon of Paris and a symbol of French culture and engineering ingenuity.

The image of the Olympic rings against the backdrop of the Eiffel Tower is likely a representation of the Olympic Games held in Paris, France. The most recent Olympic Games held in Paris were the 2024 Summer Olympics, which took place from July 26 to August 11, 2024.

References
[1] https://www.olympic.org/ [2] https://www.toureiffel.paris/en [3] https://en.wikipedia.org/wiki/2024_Summer_Olympics

Since we enabled chat_history for the Search Agent, the agent is aware of the previous output generated by the Image Understanding Agent, searches the relevant contents from the web, and provides an answer.
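
The effect of chat_history can be pictured with a toy model in which agents run in sequence and each one receives the outputs produced so far. This is not the orchestrator's actual implementation, which the tutorial does not describe; the two stub agents and run_in_sequence are hypothetical stand-ins used only to show the flow of context.

```python
from typing import Callable

# A toy agent takes the user query plus prior outputs and returns text.
Agent = Callable[[str, list[str]], str]


def describe_image(query: str, history: list[str]) -> str:
    """Stub for an image agent: returns a fixed description."""
    return "Olympic rings in front of the Eiffel Tower"


def search_history(query: str, history: list[str]) -> str:
    """Stub for a search agent: builds its search from the previous output."""
    context = history[-1] if history else "no prior context"
    return f"Searching the web for the history of: {context}"


def run_in_sequence(query: str, agents: list[Agent]) -> list[str]:
    """Call each agent in order, sharing all earlier outputs as history."""
    history: list[str] = []
    for agent in agents:
        history.append(agent(query, history))
    return history


outputs = run_in_sequence(
    "Describe the image and explain the history of the provided image",
    [describe_image, search_history],
)
print(outputs[1])
```

In this toy model, the second agent would have nothing to search for without the shared history, which mirrors why chat_history is enabled for the Search Agent in the config.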

Example 2: A Story Inspired by the Provided Image

Let's create an enchanting story inspired by the 'nostalgic' image we used before. For the following query, the orchestrator will decompose the task into two subtasks: first, it will call the Image Understanding Agent, followed by the Story Teller Agent.

Query:

Describe the image and create an interesting story based on the description

synthetic image reading

Output: Response from the Image Understanding Agent,

The image depicts a vintage car parked in front of a building. The car is light blue with a black stripe running along the side, and it has a distinctive curved shape. The building behind the car appears to be made of stone or brick, with three arched windows visible. A red object, possibly a shoe, is hanging on the wall to the right of the windows. The overall atmosphere of the image suggests a nostalgic or retro feel, with the vintage car and old-fashioned building evoking a sense of the past. The image may be intended to evoke a sense of nostalgia or to highlight the beauty of older vehicles and architecture.

Output: Response from the Story Teller Agent,

Once upon a time, in a world not so far away, there was a magical little town called Willowdale. It was a place where time stood still, and the air was sweet with the scent of nostalgia. In the heart of Willowdale, on a charming street lined with ancient trees, stood a beautiful stone building with three arched windows that sparkled like diamonds in the sunlight. And parked right in front of this building was a stunning light blue vintage car with a sleek black stripe running along its side.

This was no ordinary car, for it had a mind of its own and a heart full of adventure. Its name was Luna, and she had been a faithful companion to the town's beloved inventor, Professor Pocketwatch. Together, they had traveled far and wide, exploring the world and collecting curious treasures.

One day, while the Professor was busy tinkering in his workshop, a mischievous little creature snuck into the building and hid a bright red shoe on the wall, just to the right of the windows. The shoe belonged to a tiny fairy named Rosie, who loved to play tricks on the townspeople. But as she flew away, she left behind a trail of sparkles that caught Luna's attention.

Luna's engine purred to life, and she began to wonder about the mysterious shoe. She decided to embark on a quest to find Rosie and learn the secret behind the shoe. As she drove through the winding streets of Willowdale, the townspeople waved and smiled, for they knew that Luna was on a mission.

With a honk of her horn and a flash of her headlights, Luna set off into the sunset, ready to uncover the magic that lay hidden in the heart of the town. And as she disappeared into the distance, the people of Willowdale whispered to each other, "Luna's on an adventure! Who knows what wonders she'll discover?"

The story starts with the timeless phrase 'Once upon a time', as requested by the magic_prompt in the YAML config. The story does evoke a nostalgic feeling.

Remark

In this tutorial, we explored some capabilities of the Image Understanding Agent and experimented with its interaction alongside other agents. You can create as many interesting and specialized agents as you like. Depending on the complexity of your request, the orchestrator automatically decomposes the task and assigns the subtasks to the corresponding agents, which then work together to fulfill your request.