Realtime Voice with Flow Super Agent (Push to Talk)

Overview

Realtime voice interaction extends your Flow Super Agent with speech capabilities, enabling natural conversation flows through bidirectional audio streaming. This adds voice processing to multi-agent workflows while maintaining all existing coordination functionality.

Key capabilities:

  • Voice input: Process spoken queries for complex workflow execution
  • Voice output: Receive audio responses from coordinated agents
  • Multi-voice support: Different voices for different agents in the workflow
  • Push-to-talk: Control when to capture and send voice input

Objective

This tutorial will guide you through adding speech interaction to your Flow Super Agent. You will:

  • Create or modify a YAML configuration file with input and output speech settings
  • Set different voice preferences for orchestrator, super agent, and individual agents
  • Implement push-to-talk voice interaction for multi-agent workflows
  • Test voice-enabled workflow execution with coordinated agent responses

Steps

1. Configuration file

The YAML configuration has several key sections for realtime functionality:

audio_config (Required)

  • Configures the Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) models
  • The TTS model specified here serves as the default for all voice outputs

  • ASR Parameters (a minimal example follows this list):

    • model (Required): Model ID of the ASR model used for realtime speech transcription
    • prefix_padding_ms (Optional): Lead-in audio (in milliseconds) retained before detected speech
    • silence_duration_ms (Optional): Duration of trailing silence (in milliseconds) that marks the end of an audio chunk
    • language (Optional): Language to detect and transcribe (default: "en-US")
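
For reference, here is a minimal sketch of an audio_config block that sets every ASR parameter above. The model IDs match the full configuration later in this tutorial; the prefix_padding_ms value is illustrative:

audio_config:
  asr:
    model: "Azure/AI-Transcription"
    prefix_padding_ms: 300      # illustrative lead-in padding
    silence_duration_ms: 3500
    language: "en-US"           # the documented default
  tts:
    model: "Azure/AI-Speech"    # default TTS model for all voice outputs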

speech_config (Optional)

  • Should only be added to agents where you want voice output
  • Can be configured at three levels:

    • Orchestrator level: Sets the default voice for the orchestrator
    • Individual agent level: Overrides the default voice for specific agents
    • Super agent level: Sets the voice for the Flow Super Agent's responses
  • Parameters (an annotated example follows this list):

    • model (Required): Model ID used to generate the speech
    • language (Optional): Language ID for speech synthesis
    • voice (Optional): Voice ID for speech synthesis
    • speed (Optional): Speech speed multiplier, from 0.25 to 4.0 (Default: 1.0)
    • enable_speech (Optional): Bool to enable or disable agent speech synthesis (Default: True)
    • normalize_text (Optional): Bool to clean up formatting symbols and references so they aren't read aloud (Default: True)
    • sample_rate (Optional): Sample rate of the output audio in Hz. Supported values: 8000, 16000, 22050, 24000, 44100, 48000 (Default: 16000)
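
As an annotated sketch, a speech_config block with every parameter spelled out might look like the following at any of the three levels. The values shown are the documented defaults, plus a model and voice reused from the example configuration below:

speech_config:
  model: 'Azure/AI-Speech'     # required TTS model ID
  language: 'en-US'            # optional language ID
  voice: 'en-US-JennyNeural'   # optional voice ID
  speed: 1.0                   # 0.25 to 4.0
  enable_speech: true          # set to false to keep this agent's output text-only
  normalize_text: true         # strip formatting symbols before synthesis
  sample_rate: 16000           # 8000, 16000, 22050, 24000, 44100, or 48000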

In this example configuration, notice how different agents use different voices (Australian, Canadian, British, and US voices), and how the Risk Assessment Planner sets enable_speech: False so its responses are not spoken.

Save the following configuration as flow_superagent_realtime.yaml.

memory_config:
  save_config:
    auto_load: false

orchestrator:
  agent_list:
    - agent_name: "Investment Strategy Advisor"
  speech_config:
    model: 'Azure/AI-Speech'
    language: 'en-AU'
    voice: 'en-AU-WilliamNeural'

utility_agents:
  - agent_class: PlanningAgent
    agent_name: "Stock Planner"
    agent_description: "Create a detailed plan to hedge losses against stock price variance."
    config:
      output_style: "conversational"

  - agent_class: PlanningAgent
    agent_name: "Currency Planner"
    agent_description: "Create a plan to hedge losses against currency price variance."
    config:
      output_style: "conversational"
    speech_config:
      model: 'Azure/AI-Speech'
      language: 'en-CA'
      voice: 'en-CA-LiamNeural'
      normalize_text: true

  - agent_class: PlanningAgent
    agent_name: "Risk Assessment Planner"
    agent_description: "Analyze portfolio risk metrics, volatility, correlation analysis, and stress testing scenarios."
    config:
      output_style: "conversational"
    speech_config:
      model: 'Azure/AI-Speech'
      language: 'en-GB'
      voice: 'en-GB-LibbyNeural'
      enable_speech: false
      normalize_text: false

super_agents:
  - agent_class: FlowSuperAgent
    agent_name: "Investment Strategy Advisor"
    agent_description: "Provides investment insights based on stock and finance research."
    config:
      goal: "Generate investment recommendations."

      agent_list:
        - agent_name: "Stock Planner"
          next_step:  # agents that run after Stock Planner completes
            - "Currency Planner"
            - "Risk Assessment Planner"

        - agent_name: "Currency Planner"
        - agent_name: "Risk Assessment Planner"
    speech_config:
      model: 'Azure/AI-Speech'
      language: 'en-US'
      voice: 'en-US-JennyNeural'

audio_config:
  asr:
    model: "Azure/AI-Transcription"
    silence_duration_ms: 3500
  tts:
    model: "Azure/AI-Speech"

2. Python file

Now you can build your assistant with the following script:

import asyncio
import logging
import os
import random
import string
import traceback

from air import AsyncAIRefinery
from air.utils.async_helper import async_input
from dotenv import load_dotenv

load_dotenv()  # loads your API_KEY from your local '.env' file
api_key = str(os.getenv("API_KEY"))


async def test_voice_async():
    """Test the flowsuperagent with real-time."""
    # Generate a unique session identifier for this conversation
    test_uuid = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
    project = "example"

    try:
        # Initialize the AI Refinery client with authentication
        client = AsyncAIRefinery(api_key=api_key)

        # Create/update the project with the realtime configuration
        client.realtime_distiller.create_project(
            config_path="flow_superagent_realtime.yaml", project=project
        )

        # Establish WebSocket connection for realtime voice streaming
        async with client.realtime_distiller(
            project=project,
            uuid=test_uuid + "_voice",
        ) as vc:
            print("Voice endpoint connected successfully!")

            while True:

                # Display example query for user reference
                queries = [
                    "I am expecting my currency and stock investments to lose value. How can I protect myself against this?",
                ]
                print("Example query you can try and speak:")
                for i, query in enumerate(queries, 1):
                    print(f"  {i}. {query}")
                print("Press <ENTER> to record, or 'q' + <ENTER> to quit:")
                cmd = (await async_input("")).strip().lower()
                if cmd == "q":
                    break

                # Capture audio from microphone, process through workflow, and play coordinated responses
                await vc.listen_and_respond(sample_rate=16000)
                print("Speech Response playback completed\n")
                print("Query Completed\n")

            print("Session Closed")

    except Exception as e:

        # Log any errors that occur during voice interaction
        logging.error(f"Voice endpoint failed: {e}")
        traceback.print_exc()
        raise


if __name__ == "__main__":

    # Test voice endpoint
    asyncio.run(test_voice_async())
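
The script reads your credentials via load_dotenv(), so it expects a .env file next to the script that defines your API key, for example (placeholder value):

API_KEY=your_api_key_here

With the .env file in place, run the script, press <ENTER>, and speak one of the example queries.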

Realtime Wrapper Methods

The example above uses vc.listen_and_respond(), a high-level method that handles the complete voice interaction loop, including microphone capture, server communication, and audio playback.
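
For instance, inside the interaction loop you could request a different rate. This is a sketch that assumes the wrapper accepts the same rates listed for sample_rate in speech_config:

# Hypothetical variation: use 24 kHz audio instead of the 16 kHz used above
await vc.listen_and_respond(sample_rate=24000)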

For full details on parameters and behavior, see the Realtime Wrapper Methods section in the API documentation.