Explore the Capabilities of the Image Understanding Agent

Overview

The Image Understanding Agent is a utility agent designed to fulfill user requests by interpreting the contents of provided images. It can perform tasks such as natural image description, chart reading, Optical Character Recognition (OCR), and more. This extends the scope of agentic frameworks beyond text-based applications.

Configuration

To use the Image Understanding Agent, define its configuration in a YAML file. The configuration specifies the agent's settings; the default model is Llama 3.2-90B-Vision-Instruct. For simplicity, the example below creates a framework that includes only the Image Understanding Agent.

orchestrator:
  agent_list:
    - agent_name: "Image Understanding Agent"

utility_agents:
  - agent_class: ImageUnderstandingAgent
    agent_name: Image Understanding Agent
    agent_description: This agent can help you understand and analyze an image.
    config: {}
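
In the configuration above, the name listed under `orchestrator.agent_list` is the same as the `agent_name` of the utility agent. The following is a minimal sketch of a consistency check for that pairing, assuming the two names are expected to match. The dictionary is a hand-written Python mirror of the YAML (a real project would load the file with a YAML parser), and `undefined_agents` is a hypothetical helper, not part of the SDK.

```python
# Hand-written Python mirror of the YAML configuration shown above.
# Assumption: a real project would load example.yaml with a YAML parser.
config = {
    "orchestrator": {
        "agent_list": [{"agent_name": "Image Understanding Agent"}],
    },
    "utility_agents": [
        {
            "agent_class": "ImageUnderstandingAgent",
            "agent_name": "Image Understanding Agent",
            "agent_description": (
                "This agent can help you understand and analyze an image."
            ),
            "config": {},
        }
    ],
}


def undefined_agents(cfg: dict) -> list[str]:
    """Return orchestrator agent names with no matching utility agent.

    Hypothetical helper: it only compares names and does not validate
    any other part of the configuration.
    """
    defined = {agent["agent_name"] for agent in cfg.get("utility_agents", [])}
    listed = [entry["agent_name"] for entry in cfg["orchestrator"]["agent_list"]]
    return [name for name in listed if name not in defined]


print("agents missing a definition:", undefined_agents(config))
```

An empty list means every agent referenced by the orchestrator has a definition with the same name.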

Python File

The following script asks the framework to describe the image at this URL: https://www.wikipedia.org/portal/wikipedia.org/assets/img/Wikipedia-logo-v2@2x.png

import os
import asyncio

from air import login, DistillerClient
from air import utils


# authenticate using credentials read from environment variables
auth = login(
    account=str(os.getenv("ACCOUNT")),
    api_key=str(os.getenv("API_KEY")),
)

async def image_understanding():
    # create a distiller client
    distiller_client = DistillerClient()

    # upload your config file to register a new distiller project
    distiller_client.create_project(config_path="example.yaml", project="example")

    async with distiller_client(
        project="example",
        uuid="test_user",
    ) as dc:
        responses = await dc.query(
            query="what's in the image?",
            image=utils.image_to_base64(
                "https://www.wikipedia.org/portal/wikipedia.org/assets/img/Wikipedia-logo-v2@2x.png"
            ),
        )
        async for response in responses:
            print(response["content"])  # Extract and print the 'content' field


if __name__ == "__main__":
    print("Image Understanding")
    asyncio.run(image_understanding())

Replace the arguments to query and image with your own query and image.
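
The script passes the image to the query as a Base64 string produced by `utils.image_to_base64`. To illustrate what Base64 encoding of image bytes involves, here is a minimal sketch that uses only the Python standard library. It is not the SDK's implementation: the helper names `encode_image_bytes` and `encode_image_file` are hypothetical, and any extra formatting the SDK helper may apply to its output is not covered here.

```python
import base64
from pathlib import Path


def encode_image_bytes(data: bytes) -> str:
    """Encode raw image bytes as a Base64 ASCII string."""
    return base64.b64encode(data).decode("ascii")


def encode_image_file(path: str) -> str:
    """Read a local image file and return its contents as Base64 text."""
    return encode_image_bytes(Path(path).read_bytes())


# The first four bytes of every PNG file are 0x89 followed by "PNG".
print(encode_image_bytes(b"\x89PNG"))  # prints iVBORw==
```

In the script itself, keep using `utils.image_to_base64` so the image is encoded the way the framework expects.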

Example Usage

Below are example use cases for the Image Understanding Agent:

Chart reading
General image reading
Synthetic image reading
Table-to-text (OCR) conversion
Table-to-Markdown conversion

For each example, save the image locally, then replace the query and image parameters in the Python file above with the corresponding query and the local image path.
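
One way to keep the five examples organized is a small table of cases that pairs each query with an image path. The sketch below assumes that layout. The queries follow the examples in this section, while the file names and the `ExampleCase` and `query_arguments` names are hypothetical placeholders to replace with your own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExampleCase:
    name: str
    query: str
    image_path: str  # hypothetical local file name, replace with your own


EXAMPLES = [
    ExampleCase("Chart reading", "What's in this image?", "chart.png"),
    ExampleCase("General image reading", "Describe the image", "olympics.png"),
    ExampleCase("Synthetic image reading", "Describe the image", "synthetic.png"),
    ExampleCase(
        "Table-to-text (OCR) conversion",
        "Describe the image and summarize the text in each row",
        "rubric_table.png",
    ),
    ExampleCase(
        "Table-to-Markdown conversion",
        "Convert the table shown in the image into markdown format",
        "rubric_table.png",  # same table image as the OCR example
    ),
]


def query_arguments(case: ExampleCase) -> dict:
    """Return the two values to substitute into the Python file above."""
    return {"query": case.query, "image_path": case.image_path}


for case in EXAMPLES:
    print(case.name, "->", query_arguments(case))
```

Each dictionary holds the query string and the path to pass through `utils.image_to_base64` when calling `dc.query`.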

Chart Reading

This example demonstrates the agent's ability to read and comprehend the chart provided in the image.

Query:

What's in this image?

Output:

The model-generated output for this query is shown below. The agent was able to extract the important information from the chart.

Based on the horizontal bar plot, I can describe the graph as follows:

**Title**: The title of the graph is "Rate of Change in Hourly Bill Rates as Negotiation Rounds Increase (Top 3 Roles)".

**Groups**: There are three groups of bars, each representing a different category. The groups are labeled as "Rate Change Type".

**Colors**: The bars in each group are colored differently. The colors used are blue, brown, and a darker blue.

**X-axis**: The x-axis represents the roles, which are labeled as "DEVELOPER/ENGINEER", "PROJECT MANAGER", and "TECHNICAL CONSULTANT".

**Y-axis**: The y-axis represents the rate of change, with major ticks at -60, -40, -20, 0, 20, 40, and 60. There are also minor ticks between the major ticks.

**Bars**: Each group has three bars, one for each role. The length of each bar represents the rate of change for that role.

**Observations**:

* The blue bars (first group) have values ranging from around -10 to 60.
* The brown bars (second group) have values ranging from around -20 to 10.
* The darker blue bars (third group) have values ranging from around -30 to 10.
* The bars for "DEVELOPER/ENGINEER" and "TECHNICAL CONSULTANT" have similar lengths across all three groups.
* The bars for "PROJECT MANAGER" have longer lengths compared to the other two roles in the second and third groups.

General Image Reading

This example demonstrates the agent's ability to understand the image in the context of a recent event.

Query:

Describe the image

Output:

The image features the Olympic rings logo prominently displayed in front of the Eiffel Tower, with a crowd of people gathered around it. The Olympic rings are depicted in their signature colors: blue, yellow, black, green, and red. In the background, the iconic Eiffel Tower stands tall against a blue sky dotted with white clouds. The overall atmosphere suggests that the image was taken during the 2024 Summer Olympics in Paris, France.

The output shows that the agent not only reads the contents of the image but also recognizes the recent event it depicts, the Paris 2024 Olympics.

Synthetic Image Reading

The Image Understanding Agent can also interpret synthetic images produced by image generation models such as Flux.1-schnell. The image used in this example was generated with Stable Diffusion 1.5 and is provided to the agent together with the following prompt:

Query:

Describe the image

Output:

The image shows a vintage light blue car parked in front of an old stone building. The car has a distinctive design, featuring a long hood and a short rear deck, with a spare tire mounted on the trunk lid. A red dog statue is prominently displayed on the car's hood, adding a playful touch to the scene.

  **Key Features:**

  * **Car:** The car is the main focus of the image, with its light blue color and classic design making it stand out.
  * **Dog Statue:** The red dog statue on the hood of the car adds a whimsical element to the scene.
  * **Stone Building:** The old stone building in the background provides a sense of history and context to the image.
  * **Cobblestone Street:** The cobblestone street in front of the car adds texture and depth to the scene.

  **Overall Impression:**

  The image evokes a sense of nostalgia and charm, with the vintage car and old stone building creating a timeless atmosphere. The red dog statue adds a playful touch, making the image feel more lighthearted and fun.

The Image Understanding Agent accurately describes the image with detailed observations, demonstrating its potential for use with image generation models.

Table to Text (OCR)

The Image Understanding Agent exhibits strong performance in recognizing text within an image. While the agent is capable of performing complex tasks beyond text recognition, in this example, we ask it to extract the text from an image and summarize it.

Query:

Describe the image and summarize the text in each row

Output:

The image presents a table with four columns and six rows, each representing a distinct category for evaluating discussion rubrics. The categories are: Criteria, Exemplary Performance, Satisfactory Performance, and Needs Improvement.

**Criteria:**

*   Participation
*   Relevance
*   Insight
*   Support

**Exemplary Performance:**

*   Makes an original post and replies to at least two other classmates in the discussion.
*   The posting directly addresses key issues, questions, or problems related to the text and the discussion activity. The posting applies course concepts well.
*   The posting offers original or thoughtful insight, analysis, or observation that demonstrates a strong grasp of concepts and ideas pertaining to the discussion topic.
*   The posting supports all claims and opinions with either rational argument or evidence.

**Satisfactory Performance:**

*   Makes an original post and replies to one other classmate in the discussion.
*   The posting addresses key issues, questions, or problems related to the text and the discussion activity, but in some cases, only indirectly. It does not always apply course concepts fully.
*   The posting does offer some insight, analysis, or observation to the topic but may not demonstrate a full understanding or knowledge of concepts and ideas pertaining to the discussion topic.
*   The posting generally supports claims and opinions with evidence or argument but may leave some gaps where unsupported opinions still appear.

**Needs Improvement:**

*   Makes an original post but doesn't reply to others within the discussion.
*   The posting does not directly address the question or problem posed by the discussion activity.
*   The posting does not offer any significant insight, analysis, or observation related to the topic. No knowledge or understanding is demonstrated regarding concepts and ideas pertaining to the discussion topic.
*   The posting does not support its claims with either evidence or argument. The posting contains largely unsupported opinion.

This table provides a comprehensive framework for evaluating discussion rubrics across various criteria, offering a clear structure for assessing performance levels.

Table to Markdown

The Image Understanding Agent can also convert tables or text from an image into plain text in Markdown format, which is convenient for tasks such as creating web or documentation content, organizing notes, sharing text with formatting, and ensuring portability across platforms. Let us use the same table image from the previous example. The prompt and image provided to the agent are:

Query:

Convert the table shown in the image into markdown format

Output:

# Discussion Rubric

| Criteria | Exemplary Performance | Satisfactory Performance | Needs Improvement |
| --- | --- | --- | --- |
| **Participation** | Makes an original post and replies to at least two other classmates in the discussion. | Makes an original post and replies to one other classmate in the discussion. | Makes an original post but doesn't reply to others within the discussion. |
| **Relevance** | The posting directly addresses key issues, questions, or problems related to the text and the discussion activity. The posting applies course concepts well. | The posting addresses key issues, questions, or problems related to the text and the discussion activity, but in some cases, only indirectly. It does not always apply course concepts fully. | The posting does not directly address the question or problem posed by the discussion activity. |
| **Insight** | The posting offers original or thoughtful insight, analysis, or observation that demonstrates a strong grasp of concepts and ideas pertaining to the discussion topic. | The posting does offer some insight, analysis, or observation to the topic but may not demonstrate a full understanding or knowledge of concepts and ideas pertaining to the discussion topic. | The posting does not offer any significant insight, analysis, or observation related to the topic. No knowledge or understanding is demonstrated regarding concepts and ideas pertaining to the discussion topic. |
| **Support** | The posting supports all claims and opinions with either rational argument or evidence. | The posting generally supports claims and opinions with evidence or argument but may leave some gaps where unsupported opinions still appear. | The posting does not support its claims with either evidence or argument. The posting contains largely unsupported opinion. |
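
Markdown output like the table above can be post-processed in code, for example to load the rows into another tool. The following is a minimal sketch of a parser for simple pipe tables. `parse_markdown_table` is a hypothetical helper that assumes one table row per line, a separator row directly after the header, and no escaped `|` characters inside cells. The sample input is a two-column excerpt of the table above.

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Parse a simple pipe table into a list of {header: cell} dictionaries."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip headings, blank lines, and other non-table text
        rows.append([cell.strip() for cell in line.strip("|").split("|")])
    if len(rows) < 2:
        return []  # no header plus separator, so nothing to parse
    header, body = rows[0], rows[2:]  # rows[1] is the | --- | separator row
    return [dict(zip(header, cells)) for cells in body]


sample = """# Discussion Rubric

| Criteria | Needs Improvement |
| --- | --- |
| **Participation** | Makes an original post but doesn't reply to others within the discussion. |
"""

for row in parse_markdown_table(sample):
    print(row["Criteria"], "->", row["Needs Improvement"])
```

Bold markers such as `**Participation**` are kept as-is, so strip them separately if plain text is needed.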