Realtime Voice with Flow Super Agent (Push to Talk)

Overview

Realtime voice interaction extends your Flow Super Agent with speech capabilities, enabling natural conversation flows through bidirectional audio streaming. This adds voice processing to multi-agent workflows while maintaining all existing coordination functionality.

Key capabilities:

  • Voice input: Process spoken queries for complex workflow execution
  • Voice output: Receive audio responses from coordinated agents
  • Multi-voice support: Different voices for different agents in the workflow
  • Push-to-talk: Control when to capture and send voice input

Objective

This tutorial will guide you through adding speech interaction to your Flow Super Agent. You will:

  • Create or modify a YAML configuration file with input and output speech settings
  • Set different voice preferences for orchestrator, super agent, and individual agents
  • Implement push-to-talk voice interaction for multi-agent workflows
  • Test voice-enabled workflow execution with coordinated agent responses

Steps

1. Configuration file

The YAML configuration has several key sections for realtime functionality:

audio_config (Required)

  • Configures the Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) models
  • The TTS model specified here serves as the default for all voice outputs

  • ASR Parameters (a minimal example follows this list):

    • model (Required): Model ID of the ASR model used for realtime speech transcription
    • prefix_padding_ms (Optional): Lead-in audio (in milliseconds) retained before detected speech
    • silence_duration_ms (Optional): Duration of trailing silence (in milliseconds) that marks the end of an audio chunk
    • language (Optional): Language to detect and transcribe (default: "en-US")
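
For reference, here is a minimal sketch of an audio_config block that sets every ASR parameter above. The model IDs match the full configuration later in this tutorial; the prefix_padding_ms value is illustrative:

audio_config:
  asr:
    model: "Azure/AI-Transcription"
    prefix_padding_ms: 300      # illustrative lead-in padding
    silence_duration_ms: 3500
    language: "en-US"           # the documented default
  tts:
    model: "Azure/AI-Speech"    # default TTS model for all voice outputs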

speech_config (Optional)

  • Should only be added to agents where you want voice output
  • Can be configured at three levels:

    • Orchestrator level: Sets the default voice for the orchestrator
    • Individual agent level: Overrides the default voice for specific agents
    • Super agent level: Sets the voice for the Flow Super Agent's responses
  • Parameters (an annotated example follows this list):

    • model (Required): Model ID used to generate the speech
    • language (Optional): Language ID for speech synthesis
    • voice (Optional): Voice ID for speech synthesis
    • speed (Optional): Speech speed multiplier, from 0.25 to 4.0 (Default: 1.0)
    • enable_speech (Optional): Bool to enable or disable agent speech synthesis (Default: True)
    • normalize_text (Optional): Bool to clean up formatting symbols and references so they aren't read aloud (Default: True)
    • sample_rate (Optional): Sample rate of the output audio in Hz. Supported values: 8000, 16000, 22050, 24000, 44100, 48000 (Default: 16000)
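
As an annotated sketch, a speech_config block with every parameter spelled out might look like the following at any of the three levels. The values shown are the documented defaults, plus a model and voice reused from the example configuration below:

speech_config:
  model: 'Azure/AI-Speech'     # required TTS model ID
  language: 'en-US'            # optional language ID
  voice: 'en-US-JennyNeural'   # optional voice ID
  speed: 1.0                   # 0.25 to 4.0
  enable_speech: true          # set to false to keep this agent's output text-only
  normalize_text: true         # strip formatting symbols before synthesis
  sample_rate: 16000           # 8000, 16000, 22050, 24000, 44100, or 48000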

In this example configuration, notice how different agents use different voices (Australian, Canadian, British, and US voices), and how the Risk Assessment Planner sets enable_speech: False so its responses are not spoken.

Save the following configuration as flow_superagent_realtime.yaml.

memory_config:
  save_config:
    auto_load: false

orchestrator:
  agent_list:
    - agent_name: "Investment Strategy Advisor"
  speech_config:
    model: 'Azure/AI-Speech'
    language: 'en-AU'
    voice: 'en-AU-WilliamNeural'

utility_agents:
  - agent_class: PlanningAgent
    agent_name: "Stock Planner"
    agent_description: "Create a detailed plan to hedge losses against stock price variance."
    config:
      output_style: "conversational"

  - agent_class: PlanningAgent
    agent_name: "Currency Planner"
    agent_description: "Create a plan to hedge losses against currency price variance."
    config:
      output_style: "conversational"
    speech_config:
      model: 'Azure/AI-Speech'
      language: 'en-CA'
      voice: 'en-CA-LiamNeural'
      normalize_text: true

  - agent_class: PlanningAgent
    agent_name: "Risk Assessment Planner"
    agent_description: "Analyze portfolio risk metrics, volatility, correlation analysis, and stress testing scenarios."
    config:
      output_style: "conversational"
    speech_config:
      model: 'Azure/AI-Speech'
      language: 'en-GB'
      voice: 'en-GB-LibbyNeural'
      enable_speech: false
      normalize_text: false

super_agents:
  - agent_class: FlowSuperAgent
    agent_name: "Investment Strategy Advisor"
    agent_description: "Provides investment insights based on stock and finance research."
    config:
      goal: "Generate investment recommendations."

      agent_list:
        - agent_name: "Stock Planner"
          next_step:  # agents that run after Stock Planner completes
            - "Currency Planner"
            - "Risk Assessment Planner"

        - agent_name: "Currency Planner"
        - agent_name: "Risk Assessment Planner"
    speech_config:
      model: 'Azure/AI-Speech'
      language: 'en-US'
      voice: 'en-US-JennyNeural'

audio_config:
  asr:
    model: "Azure/AI-Transcription"
    silence_duration_ms: 3500
  tts:
    model: "Azure/AI-Speech"

2. Python file

Now you can build your assistant with the following script:

import asyncio
import logging
import os
import random
import string
import traceback

from air import AsyncAIRefinery
from air.utils.async_helper import async_input
from dotenv import load_dotenv

load_dotenv()  # loads your API_KEY from your local '.env' file
api_key = str(os.getenv("API_KEY"))


async def test_voice_async():
    """Test the flowsuperagent with real-time."""
    # Generate a unique session identifier for this conversation
    test_uuid = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
    project = "example"

    try:
        # Initialize the AI Refinery client with authentication
        client = AsyncAIRefinery(api_key=api_key)

        # Create/update the project with the realtime configuration
        client.realtime_distiller.create_project(
            config_path="flow_superagent_realtime.yaml", project=project
        )

        # Establish WebSocket connection for realtime voice streaming
        async with client.realtime_distiller(
            project=project,
            uuid=test_uuid + "_voice",
        ) as vc:
            print("Voice endpoint connected successfully!")

            while True:

                # Display example query for user reference
                queries = [
                    "I am expecting my currency and stock investments to lose value. How can I protect myself against this?",
                ]
                print("Example query you can try and speak:")
                for i, query in enumerate(queries, 1):
                    print(f"  {i}. {query}")
                print("Press <ENTER> to record, or 'q' + <ENTER> to quit:")
                cmd = (await async_input("")).strip().lower()
                if cmd == "q":
                    break

                # Capture audio from microphone, process through workflow, and play coordinated responses
                await vc.listen_and_respond(sample_rate=16000)
                print("Speech Response playback completed\n")
                print("Query Completed\n")

            print("Session Closed")

    except Exception as e:

        # Log any errors that occur during voice interaction
        logging.error(f"Voice endpoint failed: {e}")
        traceback.print_exc()
        raise


if __name__ == "__main__":

    # Test voice endpoint
    asyncio.run(test_voice_async())
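
The script reads your credentials via load_dotenv(), so it expects a .env file next to the script that defines your API key, for example (placeholder value):

API_KEY=your_api_key_here

With the .env file in place, run the script, press <ENTER>, and speak one of the example queries.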

Realtime Wrapper Methods

The example above uses vc.listen_and_respond(), a high-level method that handles the complete voice interaction loop, including microphone capture, server communication, and audio playback.
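
For instance, inside the interaction loop you could request a different rate. This is a sketch that assumes the wrapper accepts the same rates listed for sample_rate in speech_config:

# Hypothetical variation: use 24 kHz audio instead of the 16 kHz used above
await vc.listen_and_respond(sample_rate=24000)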

For full details on parameters and behavior, see the Realtime Wrapper Methods section in the API documentation.