
Image Understanding Agent

This documentation provides an overview of the ImageUnderstandingAgent class, its configuration, and example usage.

The ImageUnderstandingAgent class is a utility agent within the AI Refinery SDK designed to help with the analysis and understanding of images provided by the user. The user can supply an image either as a base64-encoded string or as a direct image URL, and ask questions such as, "Can you analyze this image? What is the history of this image and its role in the world today?"

Usage

As a built-in utility agent in the AI Refinery SDK, you can easily integrate ImageUnderstandingAgent into your project by updating your project YAML file with the following configurations:

  • Add a utility agent with agent_class: ImageUnderstandingAgent under utility_agents.
  • Ensure the agent_name you chose for your ImageUnderstandingAgent is listed in the agent_list under orchestrator.

For a tutorial on this agent, visit this link.

Quickstart

To quickly set up a project with an ImageUnderstandingAgent, use the following YAML configuration. You can add more agents and retrievers as needed. Refer to the next section for a detailed overview of the configurable options for ImageUnderstandingAgent.

utility_agents:
  - agent_class: ImageUnderstandingAgent
    agent_name: "Image Understanding Agent"
    agent_description: "This agent can help you understand and analyze your image."
    config:
      vlm_config:
        api_type: "openai"
        model: meta-llama/Llama-3.2-90B-Vision-Instruct

orchestrator:
  agent_list:
    - agent_name: "Image Understanding Agent" # The name you chose for your ImageUnderstandingAgent above.
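The orchestrator only routes requests to agents whose names appear in its agent_list, so the agent_name under utility_agents must match the entry under orchestrator exactly. A minimal sketch of that consistency check, operating on the configuration as an already-parsed Python dict (e.g. loaded with a YAML parser such as PyYAML, not shown here; the helper is illustrative, not part of the SDK):

```python
def check_agent_names(config: dict) -> list[str]:
    """Return orchestrator agent names with no matching utility agent definition."""
    defined = {a["agent_name"] for a in config.get("utility_agents", [])}
    listed = [a["agent_name"]
              for a in config.get("orchestrator", {}).get("agent_list", [])]
    return [name for name in listed if name not in defined]

# Dict mirror of the quickstart YAML above.
project = {
    "utility_agents": [
        {"agent_class": "ImageUnderstandingAgent",
         "agent_name": "Image Understanding Agent"},
    ],
    "orchestrator": {
        "agent_list": [{"agent_name": "Image Understanding Agent"}],
    },
}

check_agent_names(project)  # an empty list means every listed agent is defined
```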

Template YAML Configuration of ImageUnderstandingAgent

In addition to the configurations mentioned for the example above, the ImageUnderstandingAgent supports several other configurable options. See the template YAML configuration below for all available settings.

agent_class: ImageUnderstandingAgent
agent_name: <name of the agent> # A name that you choose for your ImageUnderstandingAgent
agent_description: <description of the agent> # Optional
config: # Optional configurations for ImageUnderstandingAgent
  output_style: <"markdown" or "conversational" or "html"> # Optional field
  contexts: # Optional field
    - "date"
    - "chat_history"
    - "chat_summary"
  vlm_config: # Optional. Customized vlm config (if you want the image understanding agent to use a different vision language model than the one in your base config)
    model: <model_name>
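If you generate this configuration programmatically, you may want to catch invalid option values before deploying the project. A minimal client-side sketch based on the allowed values shown in the template above (the SDK performs its own validation; this helper is purely illustrative and not part of the SDK):

```python
ALLOWED_OUTPUT_STYLES = {"markdown", "conversational", "html"}

def validate_image_agent_config(config: dict) -> None:
    """Raise ValueError if optional ImageUnderstandingAgent settings are malformed."""
    style = config.get("output_style")
    if style is not None and style not in ALLOWED_OUTPUT_STYLES:
        raise ValueError(
            f"output_style must be one of {sorted(ALLOWED_OUTPUT_STYLES)}, got {style!r}"
        )
    contexts = config.get("contexts", [])
    if not all(isinstance(c, str) for c in contexts):
        raise ValueError("contexts must be a list of strings")

validate_image_agent_config({"output_style": "markdown", "contexts": ["date"]})
```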