Realtime Voice with Tool Use Agent (Push to Talk)

Overview

Realtime voice interaction extends your existing Tool Use Agent with speech capabilities, enabling natural conversation flows through bidirectional audio streaming. This adds voice processing to your agent workflows while maintaining all existing functionality.

Key capabilities:

  • Voice input: Process spoken commands for tool execution
  • Voice output: Receive audio responses with transcription feedback
  • Text input with voice output: Submit text queries and receive audio responses
  • Push-to-talk: Control when to capture and send voice input

Objective

This tutorial will guide you through configuring and using realtime voice features to add speech interaction to your Tool Use Agent. You will:

  • Create or modify a YAML configuration file with input and output speech settings
  • Set voice preferences at orchestrator and agent levels
  • Build a push-to-talk interface with dual input modes (text and voice)
  • Test voice-enabled tool execution with your custom agents

Steps

1. Configuration file

The YAML configuration has several key sections for realtime functionality:

audio_config (Required)

  • Configures the Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) models
  • The TTS model specified here serves as the default for all voice outputs

  • ASR Parameters:

    • model (Required): Model ID of the ASR model used for realtime speech transcription
    • prefix_padding_ms (Optional): Lead-in audio (0–5000 ms) retained before detected speech
    • silence_duration_ms (Optional): Duration of trailing silence (0–5000 ms) that ends an audio chunk
    • language (Optional): Language to detect and transcribe (default: "en-US")
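
For reference, a minimal audio_config block using these parameters looks like the following (the values mirror the full example later in this tutorial):

audio_config:
  asr:
    model: "Azure/AI-Transcription"
    prefix_padding_ms: 1000
    silence_duration_ms: 3500
    language: "en-US"
  tts:
    model: "Azure/AI-Speech"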

speech_config (Optional)

  • Should only be added to agents where you want voice output
  • Can be configured at two levels:

    • Orchestrator level: Sets the default voice for the orchestrator
    • Individual agent level: Overrides the default voice for specific agents
  • Parameters:

    • model (Required): Model ID used to generate the speech
    • language (Optional): Language ID for speech synthesis
    • voice (Optional): Voice ID for speech synthesis
    • speed (Optional): Speech speed multiplier, 0.25 to 4.0 (Default: 1.0)
    • enable_speech (Optional): Bool to enable or disable agent speech synthesis (Default: True)
    • normalize_text (Optional): Bool to clean up formatting symbols and references so they aren't read aloud (Default: True)
    • sample_rate (Optional): Sample rate of the output audio in Hz. Supported values: 8000, 16000, 22050, 24000, 44100, 48000 (Default: 16000)
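
In miniature, the two configuration levels look like this (the voices shown are the ones used in the full example below):

orchestrator:
  speech_config:        # default voice for the orchestrator
    model: 'Azure/AI-Speech'
    voice: 'en-AU-WilliamNeural'

utility_agents:
  - agent_class: CustomAgent
    agent_name: "Recommender Agent"
    speech_config:      # overrides the default for this agent
      model: 'Azure/AI-Speech'
      voice: 'en-US-JennyNeural'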

In the example configuration below, the orchestrator sets a default voice, and both the Recommender Agent and the Tool Use Agent override it with their own speech_config blocks.

Save the following configuration as example_realtime.yaml.

orchestrator:
  agent_list:
    - agent_name: "Recommender Agent"
    - agent_name: "Tool Use Agent"

  speech_config:
    model: 'Azure/AI-Speech'
    language: 'en-AU'
    voice: 'en-AU-WilliamNeural'

utility_agents:
  - agent_class: CustomAgent
    agent_name: "Recommender Agent"
    agent_description: |
      The Recommender Agent is a specialist in item recommendations. For instance,
      it can provide users with costume recommendations, items to purchase, food,
      decorations, and so on. 
    config:
      output_style: "conversational"
    speech_config:
      model: 'Azure/AI-Speech' # (string, Required): Model ID used to generate the speech.
      language: 'en-US' # (string, Optional): Language ID for speech synthesis
      voice: 'en-US-JennyNeural' # (string, Optional): Voice ID for speech synthesis.
      speed: 2.0 # (number, Optional): Speech speed multiplier (0.25 to 4.0). (Default: 1.0)
      enable_speech: True # (bool, Optional) bool to enable or disable agent speech synthesis (Default: True)
      normalize_text: True # (bool, Optional) bool to clean up formatting symbols so they aren't read aloud  (Default: True)
      sample_rate: 16000 # (number, Optional): Sample rate of the output audio in Hz. Supported values: 8000,16000,22050,24000,44100,48000 (Default: 16000).


  - agent_class: ToolUseAgent
    agent_name: "Tool Use Agent"
    agent_description: "An agent that performs function calling using provided tools."
    config:
      wait_time: 120
      output_style: "conversational"
      enable_interpreter: true
      builtin_tools:
        - "calculate_expression"
      custom_tools:
        - |
          {
            "type": "function",
            "function": {
              "name": "convert_temperature",
              "description": "Convert temperature between Celsius and Fahrenheit.",
              "parameters": {
                "type": "object",
                "properties": {
                  "value": {
                    "type": "float",
                    "description": "The temperature value to convert."
                  },
                  "to_scale": {
                    "type": "string",
                    "description": "The scale to convert the temperature to ('Celsius' or 'Fahrenheit').",
                    "enum": ["Celsius", "Fahrenheit"],
                    "default": "Celsius"
                  }
                },
                "required": ["value"]
              }
            }
          }
    speech_config:
      model: 'Azure/AI-Speech'
      language: 'en-AU'
      voice: 'en-AU-WilliamNeural'
      enable_speech: True
      normalize_text: False

audio_config:
  # defaults
  asr:
    model: "Azure/AI-Transcription" # (string, Required): Model ID of the ASR model used for realtime speech transcription
    prefix_padding_ms: 1000 # (integer, 0–5000 ms, Optional): Lead-in audio retained before detected speech.
    silence_duration_ms: 3500 # (integer, 0–5000 ms, Optional): Trailing silence duration to end a chunk.
    language: "en-US" # (string, Optional): Language to detect and transcribe. (default: "en-US").
  tts:
    model: "Azure/AI-Speech" 

2. Python file

Common Setup Functions

The examples below use a custom agent executor and a custom tool function, which can be defined as follows:

async def recommender_agent(query: str):
    """Basic agent that provides recommendations."""
    prompt = """Given the query below, your task is to provide the user with a useful and cool
       recommendation followed by a one-sentence justification.\n\nQUERY: {query}"""
    prompt = prompt.format(query=query)
    airefinery_client = AsyncAIRefinery(api_key=api_key)
    response = await airefinery_client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model="meta-llama/Llama-3.1-70B-Instruct",
    )
    return response.choices[0].message.content

def convert_temperature(value: float, to_scale: str = "Celsius") -> float:
    """Convert temperature between Celsius and Fahrenheit."""
    if to_scale not in ("Celsius", "Fahrenheit"):
        raise ValueError("to_scale must be 'Celsius' or 'Fahrenheit'.")
    if to_scale == "Celsius":
        # Input is Fahrenheit; convert to Celsius
        return (value - 32) * 5 / 9
    # Input is Celsius; convert to Fahrenheit
    return (value * 9 / 5) + 32
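
You can verify the tool function locally before wiring it into an agent session:

# Quick local checks, independent of the realtime session
assert convert_temperature(32.0, to_scale="Celsius") == 0.0
assert convert_temperature(100.0, to_scale="Fahrenheit") == 212.0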

Interactive Push-to-Talk with Dual Input Modes

This example demonstrates interactive voice input with the option to switch between text and voice input modes within the same session.

Note: Include the functions from "Common Setup Functions" above in your Python file before running the main example.

import asyncio
import logging
import os
import random
import string
import traceback

from air import AsyncAIRefinery
from air.utils.async_helper import async_input
from dotenv import load_dotenv

load_dotenv() # loads your API_KEY from your local '.env' file
api_key = str(os.getenv("API_KEY"))


async def test_voice_async():
    """Test realtime voice with custom agent and tool use agent."""
    # Generate a unique session identifier for this conversation
    test_uuid = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
    project = "example"

    # Map agent names and tool names to their implementations
    # The keys must match the agent_name in YAML and function names in custom_tools
    executor_dict = {
        "convert_temperature": convert_temperature,
        "Recommender Agent": recommender_agent,
    }

    try:
        # Initialize the AI Refinery client with authentication
        client = AsyncAIRefinery(api_key=api_key)

        # Create/update the project with the realtime configuration
        client.realtime_distiller.create_project(
            config_path="example_realtime.yaml", project=project
        )

        # Establish WebSocket connection for realtime voice streaming
        async with client.realtime_distiller(
            project=project,
            uuid=test_uuid + "_voice",
            executor_dict=executor_dict,
        ) as vc:
            print("Voice endpoint connected successfully!")

            while True:

                # Display example queries for user reference
                queries = [
                    "whats 123 times 12 times 3 minus 3",
                    "convert 23 F into celsius",
                    "recommend a summer activity",
                ]
                print("Example queries you can try and speak:")
                for i, query in enumerate(queries, 1):
                    print(f"  {i}. {query}")

                # Present input mode menu
                print("\nChoose input method:")
                print("1. Press 't' + <ENTER> for text input")
                print("2. Press 'a' + <ENTER> for audio input")
                print("3. Press 'q' + <ENTER> to quit")

                cmd = (await async_input("Enter choice: ")).strip().lower()
                if cmd == "q":
                    break

                if cmd == "t":

                    # Text input mode: Send typed query and receive voice response
                    text_query = await async_input("Enter your text query: ")
                    if text_query.strip():
                        print(f"Sending text query: {text_query}")

                        # Process text query and stream audio response back
                        await vc.send_text_and_respond(
                            text=text_query, sample_rate=16000
                        )
                        print("Query completed\n")

                elif cmd == "a":

                    # Audio input mode: Record from microphone and receive voice response
                    print("Press <ENTER> to start recording...")
                    await async_input("")

                    # Capture audio, transcribe, process, and play response
                    await vc.listen_and_respond(sample_rate=16000)
                    print("Speech Response playback completed\n")
                    print("Query completed\n")

                else:
                    print("Invalid choice. Please try again.")
                    continue

            print("Session Closed")

    except Exception as e:

        # Log any errors that occur during voice interaction
        logging.error(f"Voice endpoint failed: {e}")
        traceback.print_exc()
        raise


if __name__ == "__main__":
    # Test voice endpoint
    asyncio.run(test_voice_async())

Realtime Wrapper Methods

The examples above use vc.listen_and_respond() and vc.send_text_and_respond(), high-level methods that handle the complete voice interaction loop, including microphone capture, server communication, and audio playback.
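
For quick reference, both calls are made inside an active realtime session, i.e. within the async with client.realtime_distiller(...) as vc: block shown earlier; parameters beyond sample_rate are covered in the API documentation:

# Inside: async with client.realtime_distiller(...) as vc:

# Text in, audio out: sends a typed query and plays the spoken response
await vc.send_text_and_respond(text="convert 23 F into celsius", sample_rate=16000)

# Audio in, audio out: records from the microphone, transcribes, and plays the response
await vc.listen_and_respond(sample_rate=16000)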

For full details on parameters and behavior, see the Realtime Wrapper Methods section in the API documentation.