Image Understanding Agent¶
This documentation provides an overview of the ImageUnderstandingAgent class, its configuration, and example usage.

The ImageUnderstandingAgent class is a utility agent within the AI Refinery SDK designed to help with the analysis and understanding of an image provided by the user. The user can supply an image, either converted to a base64 string or referenced by a direct image URL, and ask questions such as, "Can you analyze this image? What is the history of this image and its role in the world today?"
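For instance, a local image file can be converted to a base64 string with Python's standard library before being passed to the agent. This is a minimal sketch of the encoding step only; the surrounding SDK call is omitted, and the file path is illustrative:

```python
import base64


def image_to_base64(path: str) -> str:
    """Read an image file and return its raw bytes as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


# Example (hypothetical file name): encode a local image before
# sending it to the ImageUnderstandingAgent.
# image_b64 = image_to_base64("photo.png")
```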
Usage¶
As a built-in utility agent in the AI Refinery SDK, you can easily integrate ImageUnderstandingAgent into your project by updating your project YAML file with the following configurations:
- Add a utility agent with `agent_class: ImageUnderstandingAgent` under `utility_agents`.
- Ensure the `agent_name` you chose for your ImageUnderstandingAgent is listed in the `agent_list` under `orchestrator`.
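The two requirements above amount to a simple consistency rule: every `agent_name` in the orchestrator's `agent_list` must refer to an agent defined under `utility_agents`. A minimal sketch of that check, using a parsed config represented as plain Python dicts for illustration (the dict shapes mirror the YAML, but the validation helper is not part of the SDK):

```python
# Hypothetical parsed project YAML, represented as Python dicts.
config = {
    "utility_agents": [
        {
            "agent_class": "ImageUnderstandingAgent",
            "agent_name": "Image Understanding Agent",
        }
    ],
    "orchestrator": {
        "agent_list": [
            {"agent_name": "Image Understanding Agent"}
        ]
    },
}


def orchestrator_names_valid(cfg: dict) -> bool:
    """Return True if every agent_name in orchestrator.agent_list
    matches an agent defined under utility_agents."""
    defined = {a["agent_name"] for a in cfg["utility_agents"]}
    listed = {a["agent_name"] for a in cfg["orchestrator"]["agent_list"]}
    return listed <= defined
```

A mismatched name (for example, a typo in `agent_list`) would make the check fail, which is exactly the misconfiguration the second bullet warns about.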
For a tutorial on this agent, visit this link.
Quickstart¶
To quickly set up a project with an ImageUnderstandingAgent, use the following YAML configuration. You can add more agents and retrievers as needed. Refer to the next section for a detailed overview of the configurable options for ImageUnderstandingAgent.
```yaml
utility_agents:
  - agent_class: ImageUnderstandingAgent
    agent_name: "Image Understanding Agent"
    agent_description: "This agent can help you understand and analyze your image."
    config:
      vlm_config:
        api_type: "openai"
        model: meta-llama/Llama-3.2-90B-Vision-Instruct

orchestrator:
  agent_list:
    - agent_name: "Image Understanding Agent" # The name you chose for your ImageUnderstandingAgent above.
```
Template YAML Configuration of ImageUnderstandingAgent¶
In addition to the configurations shown in the example above, the ImageUnderstandingAgent supports several other configurable options. See the template YAML configuration below for all available settings.
```yaml
agent_class: ImageUnderstandingAgent
agent_name: <name of the agent> # A name that you choose for your ImageUnderstandingAgent
agent_description: <description of the agent> # Optional
config: # Optional configurations for ImageUnderstandingAgent
  output_style: <"markdown" or "conversational" or "html"> # Optional field
  contexts: # Optional field
    - "date"
    - "chat_history"
    - "chat_summary"
  vlm_config: # Optional. A customized vlm config (if you want the image understanding agent to use a different vision language model than the one in your base config)
    model: <model_name>
```