Utilize the Image Generation Agent¶

Overview¶

The Image Generation Agent is a utility agent designed to generate an image based on user queries. Users can provide either:

a textual description of the image they want to generate, or
an image to use as a reference, along with a textual description of the desired image.

The former is referred to as text-to-image, and the latter as text-guided image-to-image. In this tutorial, we show how to leverage the agent to create a concept design.

Goals¶

The goals of this tutorial are to demonstrate some of the agent's capabilities and to illustrate how different agents interact to solve user queries within a user-defined agentic framework. By the end, you will know how to configure your own agentic framework, consisting of custom and default agents (including the Image Generation Agent), to solve simplified tasks involving images.

Steps¶

1. Configuration¶

You need to define the configuration in a YAML file. The configuration is as follows:

orchestrator:
  agent_list:
    - agent_name: "Report Agent"
    - agent_name: "Search Agent"
    - agent_name: "Image Understanding Agent"
    - agent_name: "Image Generation Agent"
    - agent_name: "Story Teller Agent"

utility_agents:
  - agent_class: UtilityAgent
    agent_name: "Report Agent" # This agent will write a report based on the content generated by other agents and the user's request
    agent_description: "This is capable of writing a report"
    config:
      magic_prompt: "You are writing a report based on user query. Format your report in Markdown format.\nUser query:\n{query}"
      output_style: "markdown"
      contexts:  # Optional field
        - "date" # This will add a date stamp to the agent's output, which can be leveraged later.
        - "chat_history" # This enables the agent to utilize the previous chat history to fulfill the user's query

  - agent_class: UtilityAgent
    agent_name: "Story Teller Agent" # This agent will create stories as requested by the user
    agent_description: "This is capable of writing a story"
    config:
      magic_prompt: "You are a master of enchanting stories for children. Your story must begin with the timeless phrase, 'Once upon a time...'\nUser query:\n{query}"
      contexts:  # Optional field
        - "date"
        - "chat_history"

  - agent_class: SearchAgent
    agent_name: "Search Agent" # This agent will fulfill the user's query by web search
    config:
      contexts: # Optional field
        - "date"
        - "chat_history"

  - agent_class: ImageUnderstandingAgent
    agent_name: "Image Understanding Agent" # This agent can answer queries related to images
    config: {}

  - agent_class: ImageGenerationAgent
    agent_name: "Image Generation Agent" # This agent generates an image based on text input and/or image input
    config:
      rewriter_config: True
      contexts:
        - "date"
        - "chat_history"

The rewriter_config option enables automatic enhancement of your input query for image-to-image generation. It refines the prompt, making it more descriptive based on the provided image, which can lead to improved image generation results. This feature is designed to assist developers in creating more detailed and accurate prompts for image-to-image generation.

In this tutorial, we will test the agent with and without rewriter_config enabled and compare the results.

Note that some of the agents use chat_history because they rely on the previous conversation history to perform the requested tasks, as shown in the examples below. We did not provide the agent_description for the default agents (i.e., Search Agent and Image Understanding/Generation Agents). If the agent_description is not provided, the default description will be used.

The Report Agent uses the output_style parameter set to "markdown" to generate a structured output that is directly usable for reports.
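Before uploading the configuration, it can help to confirm that every agent named under orchestrator.agent_list also has an entry under utility_agents. The following is a minimal sketch of such a check, not part of the SDK. It assumes the names are expected to match exactly, and it uses a plain Python dictionary that mirrors only the relevant fields of the YAML above. The helper name undefined_agents is hypothetical.

```python
from typing import List

# Plain-dict mirror of the YAML above (only the fields needed for this check).
config = {
    "orchestrator": {
        "agent_list": [
            {"agent_name": "Report Agent"},
            {"agent_name": "Search Agent"},
            {"agent_name": "Image Understanding Agent"},
            {"agent_name": "Image Generation Agent"},
            {"agent_name": "Story Teller Agent"},
        ]
    },
    "utility_agents": [
        {"agent_class": "UtilityAgent", "agent_name": "Report Agent"},
        {"agent_class": "UtilityAgent", "agent_name": "Story Teller Agent"},
        {"agent_class": "SearchAgent", "agent_name": "Search Agent"},
        {"agent_class": "ImageUnderstandingAgent", "agent_name": "Image Understanding Agent"},
        {"agent_class": "ImageGenerationAgent", "agent_name": "Image Generation Agent"},
    ],
}


def undefined_agents(cfg: dict) -> List[str]:
    """Return orchestrator agent names that have no matching utility_agents entry."""
    defined = {agent["agent_name"] for agent in cfg.get("utility_agents", [])}
    listed = [a["agent_name"] for a in cfg.get("orchestrator", {}).get("agent_list", [])]
    return [name for name in listed if name not in defined]


missing = undefined_agents(config)
print("Undefined agents:", missing)
```

An empty list means every agent the orchestrator can route to is defined. A misspelled agent_name would show up here.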

2. Python file¶

Next, ask the framework to generate an image of a Wikipedia soccer ball concept design, using the Wikipedia logo at this URL as the reference image: https://www.wikipedia.org/portal/wikipedia.org/assets/img/Wikipedia-logo-v2@2x.png. The Python script containing the request and the image is as follows:

import os
import asyncio

from openai import AsyncOpenAI

from air import login, DistillerClient
from air import utils


auth = login(
    account=str(os.getenv("ACCOUNT")),
    api_key=str(os.getenv("API_KEY")),
)

async def image_generation():
    # create a distiller client
    distiller_client = DistillerClient()

    # upload your config file to register a new distiller project
    distiller_client.create_project(config_path="example.yaml", project="example")

    async with distiller_client(
        project="example",
        uuid="test_user",
    ) as dc:
        # For text-to-image, remove the image parameter; otherwise, pass your image as a base64 string or a URL
        responses = await dc.query(
            query="Generate an image of a wikipedia soccer ball concept design",
            image=utils.image_to_base64(
                "https://www.wikipedia.org/portal/wikipedia.org/assets/img/Wikipedia-logo-v2@2x.png"
            ),
        )

        async for response in responses:

            if (response["role"] == "Image Generation Agent") and (response["image"]):
                generated_base64_image = response["image"]["image_data"]
                utils.save_base64_image(
                    generated_base64_image,
                    "<CHANGE_THIS_TO_THE_FILENAME>",
                )

            else:
                print(response)


if __name__ == "__main__":
    print("Image Generation")
    asyncio.run(image_generation())

Replace <CHANGE_THIS_TO_THE_FILENAME> with the local path and filename where you want to save the generated image.
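The response check in the script above indexes response["image"] directly. As a hedged alternative, the sketch below wraps that check in a small helper that uses dict.get, so messages without an "image" key are skipped rather than raising a KeyError. The "role", "image", and "image_data" keys come from the script above. The helper name extract_generated_image and the sample messages (including their "content" key) are hypothetical.

```python
from typing import Optional


def extract_generated_image(
    response: dict, role: str = "Image Generation Agent"
) -> Optional[str]:
    """Return the base64 image data if this response carries a generated image, else None."""
    if response.get("role") != role:
        return None
    image = response.get("image") or {}  # tolerate a missing key or a None value
    return image.get("image_data")


# Hypothetical messages shaped like the ones handled in the script above.
samples = [
    {"role": "Search Agent", "content": "some text"},
    {"role": "Image Generation Agent", "image": None},
    {"role": "Image Generation Agent", "image": {"image_data": "aGVsbG8="}},
]
found = [extract_generated_image(r) for r in samples]
print(found)
```

Inside the async for loop, the returned string (when not None) can be passed to utils.save_base64_image in the same way as in the script.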

Example Usage with Rewriter¶

The following examples are based on rewriter_config: True. We will later show the output based on rewriter_config: False for comparison. For all Text-to-Image examples, remove the image parameter from the dc.query() call.
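One way to switch between text-to-image and image-to-image without editing the call site is to build the keyword arguments separately and leave out the image entry when no reference image is given. This is a minimal sketch. The helper name build_query_kwargs and the placeholder string are hypothetical, and only the query and image parameters shown in the script above are used.

```python
from typing import Optional


def build_query_kwargs(query: str, image: Optional[str] = None) -> dict:
    """Build keyword arguments for dc.query(); the image key is omitted for text-to-image."""
    kwargs = {"query": query}
    if image is not None:
        kwargs["image"] = image
    return kwargs


# Text-to-image: no image entry at all.
text_only = build_query_kwargs(
    "Generate an image of a wikipedia soccer ball concept design"
)

# Image-to-image: the placeholder stands in for a base64 string or URL.
with_image = build_query_kwargs(
    "Generate an image of a wikipedia soccer ball concept design inspired by the provided image",
    image="<BASE64_OR_URL>",
)
print(text_only)
print(with_image)
```

The resulting dictionary can then be unpacked into the call from the script, for example responses = await dc.query(**text_only).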

Example 1 (Text-to-Image)¶

Input text query:

Generate an image of a wikipedia soccer ball concept design

Output:

Example 2 (Image-to-Image)¶

Input text query:

Generate an image of a wikipedia soccer ball concept design inspired by the provided image

Input image

Output:

Observation:

The generated image successfully captured the details of the original Wikipedia image, including the puzzle pieces in the ball. By providing a reference image, the agent was able to generate a significantly improved result.

Example 3 (Image-to-Text)¶

This example demonstrates a potential use case that combines the Image Generation, Image Understanding, and Story Teller Agents. We first generate an image using the Image Generation Agent, then ask the framework to describe the image and create a story based on it. To run the example, save the image generated by the Image Generation Agent on your local machine and send it as the input image.

Input text query:

What is interesting about the provided image? Create a story based on the interesting details you observe in the image.

Input image

Output:

Response from the Image Understanding Agent:

The image depicts a 3D globe with a white surface and black text, featuring various words and logos. The globe is positioned against a bright green background, which serves to accentuate its features.

Upon closer inspection, it becomes apparent that the globe is not a traditional representation of the Earth, but rather a symbolic representation of the internet. The words and logos on the globe's surface are likely references to various websites, companies, and organizations that are prominent online. The use of a globe as a symbol for the internet is a clever play on the idea of the internet as a global network that connects people and information from all over the world.

The bright green background adds a pop of color to the image and helps to draw attention to the globe. It also creates a sense of contrast between the cool, calming tone of the green and the more dynamic, energetic feel of the globe.

Overall, the image is an interesting and thought-provoking representation of the internet as a global network. It encourages the viewer to think about the ways in which the internet connects people and information, and how it has become an integral part of our daily lives.

The Story Teller Agent follows and creates a story based on the description of the image. The response from the Story Teller Agent:

Once upon a time, in a world not so different from our own, there was a magical globe that floated in a bright green sky. This was no ordinary globe, for it was a map of the internet, a mystical realm where information and ideas flowed like rivers and connected people from all corners of the world.

The globe was covered in words and logos, each one a doorway to a new and wondrous place. There was a great, glowing "G" that led to a vast library of knowledge, where answers to any question could be found. Next to it was a colorful logo that opened the doors to a fantastical marketplace, where people from all over the world could buy and sell the most curious things.

As you turned the globe, you might stumble upon a brave knight's shield, emblazoned with the words "Social Media." This was a place where people gathered to share their thoughts and ideas, and to connect with one another in ways both grand and small.

But beware, for not all was sunshine and rainbows in this magical realm. There were also dark corners and hidden dangers, where malicious creatures lurked and sought to do harm. The wise and brave explorers of the internet knew to be cautious, and to always keep their wits about them as they navigated the twists and turns of the digital world.

One day, a young adventurer named Lily stumbled upon the magical globe. She had always been fascinated by the internet, and she spent hours exploring its many wonders. As she turned the globe, she discovered new and exciting things, and she began to realize just how connected the world truly was.

Lily's journey took her to the farthest reaches of the internet, where she met all manner of strange and wonderful creatures. She met a wise old owl who taught her about the importance of online safety, and a mischievous imp who showed her the secrets of coding and computer magic.

As she explored, Lily began to realize that the internet was not just a tool, but a community. It was a place where people came together to share and learn, to create and inspire. And she knew that she wanted to be a part of it, to use her own skills and talents to make the internet an even brighter and more wondrous place.

And so, Lily's journey continued, as she explored the magical globe and all its secrets. She became a brave and clever explorer of the internet, and she helped to make the digital world a brighter and more wonderful place, one click at a time.

As instructed in the magic_prompt, the Story Teller Agent starts the story with the timeless phrase, 'Once upon a time'.
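This example relies on saving the generated image locally and sending it back as the input image. The sketch below shows one way to turn a saved file into a base64 string using only the Python standard library. It is an assumption that dc.query accepts a plain base64 string without a data-URI prefix. The script above only states that the image can be passed as base64 or as a URL. The helper name file_to_base64 is hypothetical, and the temporary file stands in for a real saved image.

```python
import base64
import tempfile
from pathlib import Path


def file_to_base64(path: str) -> str:
    """Read a local image file and return its contents as a base64 string."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


# Demonstration with placeholder bytes standing in for a saved image.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "generated.png"
    sample.write_bytes(b"\x89PNG placeholder bytes")
    encoded = file_to_base64(str(sample))

# Decoding the string recovers the original bytes.
round_trip = base64.b64decode(encoded)
print(len(encoded), round_trip == b"\x89PNG placeholder bytes")
```

The encoded string could then be supplied as the image argument in place of the utils.image_to_base64 call in the script.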

Example 4 (Image-to-Text)¶

This example demonstrates another use case: the Image Understanding Agent describes the generated image, and the Report Agent writes a report based on that description.

Input text query:

Here is the image I created for the Wikipedia soccer ball concept design. Conduct a market analysis on potential consumer reactions to this image and write a brief report based on your findings.

Input image

Output:

The Image Understanding Agent responds first:

The image you've shared appears to be a concept design for a Wikipedia soccer ball. The ball is predominantly white, featuring a unique design that incorporates various elements related to Wikipedia. Here's a breakdown of the key components:

Wikipedia Logo: The Wikipedia logo is prominently displayed on the ball, indicating its connection to the online encyclopedia.

Language Codes: Scattered across the ball are various language codes, such as "en" for English, "fr" for French, and "es" for Spanish. These codes represent the different language versions of Wikipedia.

Other Elements: The ball also features other elements, including what appears to be a globe, possibly symbolizing the global reach of Wikipedia, and a puzzle piece design, which could represent the collaborative effort behind the creation and maintenance of Wikipedia's content.

Color Scheme: The primary color of the ball is white, with the text and logos in black. This color scheme is consistent with Wikipedia's branding.

Background: The background of the image is a solid lime green color, which provides a striking contrast to the white ball and helps the design elements stand out.

Overall, the image effectively combines Wikipedia's branding with a creative soccer ball design, highlighting the platform's global presence and collaborative nature.

Then the Report Agent writes a report about the potential consumer reactions to the image based on the description. The response from the Report Agent:

Wikipedia Soccer Ball Concept Design: Potential Consumer Reactions
## Overview
Based on the analysis of the Wikipedia soccer ball concept design, potential consumer reactions can be inferred. The design effectively incorporates Wikipedia's branding, global presence, and collaborative nature, which may elicit various responses from consumers.

## Positive Reactions
**Brand Recognition and Loyalty**: Fans of Wikipedia may appreciate the creative incorporation of the platform's logo and language codes, fostering brand loyalty and recognition.

**Unique Design**: The combination of a soccer ball with Wikipedia's branding and design elements may appeal to consumers looking for a distinctive and conversation-starting product.

**Global Appeal**: The inclusion of various language codes and a globe may resonate with consumers who value global connectivity and cultural diversity.

## Negative Reactions
**Lack of Traditional Soccer Ball Design**: Some consumers may find the design too unconventional, preferring traditional soccer ball designs that focus on team logos or national flags.

**Overemphasis on Branding**: The prominent display of the Wikipedia logo and language codes may be seen as overly promotional, potentially deterring consumers who prefer more subtle branding.

**Color Scheme**: The lime green background may not appeal to all consumers, potentially clashing with personal preferences or team colors.

## Neutral Reactions
**Novelty Factor**: Some consumers may view the Wikipedia soccer ball as a novelty item, interesting to look at but not necessarily something they would use or purchase.

**Limited Appeal**: The design may appeal primarily to Wikipedia enthusiasts or individuals with a strong interest in language and culture, limiting its broader market appeal.

Overall, consumer reactions to the Wikipedia soccer ball concept design will likely be varied, with some appreciating its unique blend of branding and design elements, while others may find it too unconventional or promotional.

Note that the output is in Markdown format as instructed in the YAML configuration file.

Example 5 (Text-to-Image)¶

This example demonstrates how you can combine the Search Agent with the Image Generation Agent.

Input:

Accenture is consulting for a children's toy company. Search the web to identify the popular toys currently trending and the features they share. Based on these common features, create an image of a trendy toy concept design that encourages purchases.

Output:

The Search Agent responds first:

Popular Children's Toys Currently Trending and Their Common Features
Currently trending children's toys showcase a blend of innovation, interactivity, and nostalgia. Key features among these popular toys include:

Interactive and Educational Elements: Many trending toys, such as the GeoSafari Jr. Talking Telescope and Microscope, offer interactive learning experiences. These toys combine fun with education, making them appealing to both children and parents.
STEM and Science Kits: Toys like the National Geographic Mega Slime Kit & Putty Lab and the ThinkFun Gravity Maze Marble Run Brain Game are popular for their focus on science, technology, engineering, and mathematics (STEM). These kits encourage critical thinking and creativity.
Nostalgia and Revivals: There's a significant trend towards nostalgic toys, with brands reviving classic toys from the 1990s and Y2K era. This includes the return of popular franchises and characters from the past, appealing to both children and adults who are nostalgic for their childhood.
Customization and Personalization: Toys that allow for customization, such as the Bitzee Interactive Digital Pet and the WowWee Fingerlings Interactive Baby Monkey, are trending. These toys offer unique experiences tailored to each child's preferences.
Imaginative Play and Role-Playing: Toys that encourage imaginative play, like the Beast Lab: Shark Beast Creator and the Little Live Pets My Baby Monkey, are popular. These toys foster creativity and role-playing skills in children.
Collectibility: Toys that come in sets or have collectible elements, such as the Thames & Kosmos Gecko Run: Marble Run Starter Set and the FurReal Peanut, are trending. These toys encourage children to collect and trade items, promoting social interaction and a sense of accomplishment.
Common Features Among Trending Toys
Technology Integration: Many popular toys incorporate technology, such as sensors, apps, and interactive features, to enhance play and learning experiences.
Sustainability and Environmental Awareness: There's a growing interest in eco-friendly toys and those that promote environmental awareness, reflecting a broader societal trend towards sustainability.
Cross-Generational Appeal: Toys that appeal to both children and adults, often through nostalgia or shared play experiences, are trending. This includes toys that allow for family game nights or cooperative play.
Emphasis on Creativity and Imagination: Toys that foster creativity, imagination, and problem-solving skills are highly valued. These toys often encourage open-ended play and allow children to express themselves.
References
[1] CNET - Best toys for kids
[2] Good Housekeeping - Best New Toys of 2023
[3] TODAY - Hot Holiday Toys 2024
[4] Exploding Topics - Toy Industry Trends
[5] The Toy Association - Trend Spotting
[6] Trend Hunter - Toys

The Image Generation Agent follows and generates the image based on the common features found by the Search Agent. This is the generated output:

Example Usage without Rewriter¶

The rewriter is a VLM-based prompt rewriter for image-to-image tasks. It modifies your prompt to better align the query with the provided image, resulting in a more refined output. In the following examples, we generate images without using the rewriter functionality (i.e., rewriter_config: False in the YAML configuration) for comparison.
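To compare the two settings side by side, one option is to keep a single base definition of the Image Generation Agent and derive a variant for each value of rewriter_config. The sketch below does this on a plain dictionary that mirrors the YAML entry from the configuration step. The helper name with_rewriter is hypothetical, and writing each variant back to its own YAML file is left out.

```python
import copy

# Plain-dict mirror of the Image Generation Agent entry in the YAML configuration.
base_agent = {
    "agent_class": "ImageGenerationAgent",
    "agent_name": "Image Generation Agent",
    "config": {"rewriter_config": True, "contexts": ["date", "chat_history"]},
}


def with_rewriter(agent: dict, enabled: bool) -> dict:
    """Return a deep copy of the agent entry with rewriter_config set to `enabled`."""
    variant = copy.deepcopy(agent)  # leave the base entry untouched
    variant["config"]["rewriter_config"] = enabled
    return variant


rewriter_on = with_rewriter(base_agent, True)
rewriter_off = with_rewriter(base_agent, False)
print(rewriter_on["config"]["rewriter_config"], rewriter_off["config"]["rewriter_config"])
```

Because the copy is deep, the two variants share no nested state, so changing one cannot silently affect the other.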

Example 6 (Image-to-Image)¶

Let's use the same image and same query as Example 2.

Input text query:

Generate an image of a wikipedia soccer ball concept design inspired by the provided image

Input image

Output:

Observation:

The generated image lacks the details of the provided image.

Remark¶

In this tutorial, we explored some capabilities of the Image Generation Agent and experimented with its interaction alongside other agents. You can create as many interesting and specialized agents as you like. Depending on the complexity of your request, the orchestrator agent automatically decomposes the task and assigns the subtasks to the corresponding agents, which then work together to fulfill your request.
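When several agents respond to one request, as in Examples 3 to 5, it can be convenient to collect the text each agent produced instead of printing every message as it arrives. The following is a minimal sketch under one stated assumption: that text responses carry their text under a "content" key. That key is not confirmed by this tutorial, which only shows the "role" and "image" keys. The helper name group_text_by_role and the sample messages are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_text_by_role(responses: Iterable[dict]) -> Dict[str, List[str]]:
    """Collect text messages per agent role, preserving arrival order."""
    grouped = defaultdict(list)
    for response in responses:
        text = response.get("content")  # assumed key for text output
        if text:
            grouped[response.get("role", "unknown")].append(text)
    return dict(grouped)


# Hypothetical messages in the order the agents might respond.
messages = [
    {"role": "Image Understanding Agent", "content": "The image shows a white ball."},
    {"role": "Report Agent", "content": "## Overview"},
    {"role": "Report Agent", "content": "## Positive Reactions"},
    {"role": "Image Generation Agent", "image": {"image_data": "aGVsbG8="}},
]
by_role = group_text_by_role(messages)
print(by_role)
```

Messages without text, such as the image response in the sample, are skipped here and can be handled separately.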