Realtime Voice with Tool Use Agent (Push to Talk)¶
Overview¶
Realtime voice interaction extends your existing Tool Use Agent with speech capabilities, enabling natural conversation flows through bidirectional audio streaming. This adds voice processing to your agent workflows while maintaining all existing functionality.
Key capabilities:

- Voice input: Process spoken commands for tool execution
- Voice output: Receive audio responses with transcription feedback
- Text input with voice output: Submit text queries and receive audio responses
- Push-to-talk: Control when to capture and send voice input
Objective¶
This tutorial will guide you through configuring and using realtime voice features to add speech interaction to your Tool Use Agent. You will:
- Create or modify a YAML configuration file with input and output speech settings
- Set voice preferences at orchestrator and agent levels
- Build a push-to-talk interface with dual input modes (text and voice)
- Test voice-enabled tool execution with your custom agents
Steps¶
1. Configuration file¶
The YAML configuration has several key sections for realtime functionality:
audio_config (Required)

- Configures the Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) models
- The TTS model specified here serves as the default for all voice outputs
- ASR parameters:
    - model (Required): Model ID of the ASR model used for realtime speech transcription
    - prefix_padding_ms (Optional): Lead-in audio (in milliseconds) retained before detected speech
    - silence_duration_ms (Optional): Trailing silence (in milliseconds) after audio input that ends a chunk
    - language (Optional): Language to detect and transcribe (default: "en-US")
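Taken together, a minimal audio_config block looks like this (the values match the full example configuration below):

audio_config:
  asr:
    model: "Azure/AI-Transcription"
    prefix_padding_ms: 1000
    silence_duration_ms: 3500
    language: "en-US"
  tts:
    model: "Azure/AI-Speech"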
speech_config (Optional)

- Add this only to agents that should produce voice output
- Can be configured at two levels:
    - Orchestrator level: Sets the default voice for the orchestrator
    - Individual agent level: Overrides the default voice for specific agents
- Parameters:
    - model (Required): Model ID used to generate the speech
    - language (Optional): Language ID for speech synthesis
    - voice (Optional): Voice ID for speech synthesis
    - speed (Optional): Speech speed multiplier (0.25 to 4.0) (default: 1.0)
    - enable_speech (Optional): Bool to enable or disable agent speech synthesis (default: true)
    - normalize_text (Optional): Bool to clean up formatting symbols and references so they aren't read aloud (default: true)
    - sample_rate (Optional): Sample rate of the output audio in Hz; supported values: 8000, 16000, 22050, 24000, 44100, 48000 (default: 16000)
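For instance, the orchestrator can set a default voice that an individual agent then overrides. The following is a trimmed sketch of the full configuration below:

orchestrator:
  speech_config:            # default voice for the orchestrator
    model: 'Azure/AI-Speech'
    voice: 'en-AU-WilliamNeural'

utility_agents:
  - agent_class: CustomAgent
    agent_name: "Recommender Agent"
    config:
      speech_config:        # per-agent override
        model: 'Azure/AI-Speech'
        voice: 'en-US-JennyNeural'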
In this example configuration, both the Recommender Agent and the Tool Use Agent define their own speech_config for voice output, overriding the orchestrator-level default voice.
Save the following configuration as example_realtime.yaml.
orchestrator:
  agent_list:
    - agent_name: "Recommender Agent"
    - agent_name: "Tool Use Agent"
  speech_config:
    model: 'Azure/AI-Speech'
    language: 'en-AU'
    voice: 'en-AU-WilliamNeural'

utility_agents:
  - agent_class: CustomAgent
    agent_name: "Recommender Agent"
    agent_description: |
      The Recommender Agent is a specialist in item recommendations. For instance,
      it can provide users with costume recommendations, items to purchase, food,
      decorations, and so on.
    config:
      output_style: "conversational"
      speech_config:
        model: 'Azure/AI-Speech' # (string, Required): Model ID used to generate the speech.
        language: 'en-US' # (string, Optional): Language ID for speech synthesis.
        voice: 'en-US-JennyNeural' # (string, Optional): Voice ID for speech synthesis.
        speed: 2.0 # (number, Optional): Speech speed multiplier (0.25 to 4.0). (Default: 1.0)
        enable_speech: true # (bool, Optional): Enable or disable agent speech synthesis. (Default: true)
        normalize_text: true # (bool, Optional): Clean up formatting symbols so they aren't read aloud. (Default: true)
        sample_rate: 16000 # (number, Optional): Sample rate of the output audio in Hz. Supported values: 8000, 16000, 22050, 24000, 44100, 48000. (Default: 16000)

  - agent_class: ToolUseAgent
    agent_name: "Tool Use Agent"
    agent_description: "An agent that performs function calling using provided tools."
    config:
      wait_time: 120
      output_style: "conversational"
      enable_interpreter: true
      builtin_tools:
        - "calculate_expression"
      custom_tools:
        - |
          {
            "type": "function",
            "function": {
              "name": "convert_temperature",
              "description": "Convert temperature between Celsius and Fahrenheit.",
              "parameters": {
                "type": "object",
                "properties": {
                  "value": {
                    "type": "number",
                    "description": "The temperature value to convert."
                  },
                  "to_scale": {
                    "type": "string",
                    "description": "The scale to convert the temperature to ('Celsius' or 'Fahrenheit').",
                    "enum": ["Celsius", "Fahrenheit"],
                    "default": "Celsius"
                  }
                },
                "required": ["value"]
              }
            }
          }
      speech_config:
        model: 'Azure/AI-Speech'
        language: 'en-AU'
        voice: 'en-AU-WilliamNeural'
        enable_speech: true
        normalize_text: false

audio_config:
  # defaults
  asr:
    model: "Azure/AI-Transcription" # (string, Required): Model ID of the ASR model used for realtime speech transcription.
    prefix_padding_ms: 1000 # (integer, 0–5000 ms, Optional): Lead-in audio retained before detected speech.
    silence_duration_ms: 3500 # (integer, 0–5000 ms, Optional): Trailing silence duration to end a chunk.
    language: "en-US" # (string, Optional): Language to detect and transcribe. (Default: "en-US")
  tts:
    model: "Azure/AI-Speech"
2. Python file¶
Common Setup Functions¶
The examples below rely on two local implementations: the recommender_agent function that backs the Recommender Agent, and the convert_temperature tool called by the Tool Use Agent. Define them as follows:
async def recommender_agent(query: str):
    """Basic agent that gives a recommendation."""
    prompt = """Given the query below, your task is to provide the user with a useful and cool
    recommendation followed by a one-sentence justification.\n\nQUERY: {query}"""
    prompt = prompt.format(query=query)
    airefinery_client = AsyncAIRefinery(api_key=api_key)
    response = await airefinery_client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model="meta-llama/Llama-3.1-70B-Instruct",
    )
    return response.choices[0].message.content


def convert_temperature(value: float, to_scale: str = "Celsius") -> float:
    """Convert temperature between Celsius and Fahrenheit."""
    if to_scale not in ["Celsius", "Fahrenheit"]:
        raise ValueError("to_scale must be 'Celsius' or 'Fahrenheit'.")
    if to_scale == "Celsius":
        return (value - 32) * 5 / 9
    return (value * 9 / 5) + 32
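As a quick, optional sanity check (not part of the tutorial code), you can verify the conversion math before wiring the tool into the agent:

# 212 °F boils water, so expect 100 °C; 0 °C freezes it, so expect 32 °F
assert round(convert_temperature(212, "Celsius"), 1) == 100.0
assert round(convert_temperature(0, "Fahrenheit"), 1) == 32.0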
Interactive Push-to-Talk with Dual Input Modes¶
This example demonstrates interactive voice input with the option to switch between text and voice input modes within the same session.
Note: Include the functions from "Common Setup Functions" above in your Python file before running the main example.
import asyncio
import logging
import os
import random
import string
import traceback

from air import AsyncAIRefinery
from air.utils.async_helper import async_input
from dotenv import load_dotenv

load_dotenv()  # loads your API_KEY from your local '.env' file
api_key = str(os.getenv("API_KEY"))


async def test_voice_async():
    """Test realtime voice with a custom agent and a tool use agent."""
    # Generate a unique session identifier for this conversation
    test_uuid = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
    project = "example"

    # Map agent names and tool names to their implementations.
    # The keys must match the agent_name entries in the YAML and the
    # function names declared in custom_tools.
    executor_dict = {
        "convert_temperature": convert_temperature,
        "Recommender Agent": recommender_agent,
    }

    try:
        # Initialize the AI Refinery client with authentication
        client = AsyncAIRefinery(api_key=api_key)

        # Create/update the project with the realtime configuration
        client.realtime_distiller.create_project(
            config_path="example_realtime.yaml", project=project
        )

        # Establish a WebSocket connection for realtime voice streaming
        async with client.realtime_distiller(
            project=project,
            uuid=test_uuid + "_voice",
            executor_dict=executor_dict,
        ) as vc:
            print("Voice endpoint connected successfully!")

            while True:
                # Display example queries for user reference
                queries = [
                    "whats 123 times 12 times 3 minus 3",
                    "convert 23 F into celsius",
                    "recommend a summer activity",
                ]
                print("Example queries you can try and speak:")
                for i, query in enumerate(queries, 1):
                    print(f"  {i}. {query}")

                # Present the input mode menu
                print("\nChoose input method:")
                print("1. Press 't' + <ENTER> for text input")
                print("2. Press 'a' + <ENTER> for audio input")
                print("3. Press 'q' + <ENTER> to quit")

                cmd = (await async_input("Enter choice: ")).strip().lower()
                if cmd == "q":
                    break
                if cmd == "t":
                    # Text input mode: send a typed query and receive a voice response
                    text_query = await async_input("Enter your text query: ")
                    if text_query.strip():
                        print(f"Sending text query: {text_query}")
                        # Process the text query and stream the audio response back
                        await vc.send_text_and_respond(
                            text=text_query, sample_rate=16000
                        )
                        print("Query completed\n")
                elif cmd == "a":
                    # Audio input mode: record from the microphone and receive a voice response
                    print("Press <ENTER> to start recording...")
                    await async_input("")
                    # Capture audio, transcribe, process, and play the response
                    await vc.listen_and_respond(sample_rate=16000)
                    print("Speech response playback completed\n")
                    print("Query completed\n")
                else:
                    print("Invalid choice. Please try again.")
                    continue

            print("Session Closed")

    except Exception as e:
        # Log any errors that occur during voice interaction
        logging.error(f"Voice endpoint failed: {e}")
        traceback.print_exc()
        raise


if __name__ == "__main__":
    # Test the voice endpoint
    asyncio.run(test_voice_async())
Realtime Wrapper Methods¶
The examples above use vc.listen_and_respond() and vc.send_text_and_respond(), high-level methods that handle the complete voice interaction loop, including microphone capture, server communication, and audio playback.
For full details on parameters and behavior, see the Realtime Wrapper Methods section in the API documentation.