Realtime Voice with Barge-In¶
Overview¶
Barge-in enables users to interrupt AI audio responses by speaking during playback — just like in a real conversation. The server continuously monitors the incoming audio stream using Voice Activity Detection (VAD) and sends a response.interrupted event when user speech is detected.
Key capabilities:
- Natural interruption: Speak at any time during AI playback to interrupt
- Server-side VAD: Speech detection runs on the server — no client-side processing needed
- Configurable sensitivity: Tune detection threshold, lockout period, and grace period
Barge-in requires Acoustic Echo Cancellation (AEC) on the client side — without it, the microphone picks up the AI's voice from the speaker, causing false interruptions. Common AEC options include:
- Browser WebRTC: `echoCancellation: true` in `getUserMedia` (used in this tutorial)
- Telephony hardware: Phone systems and VoIP gateways provide built-in AEC
- Native libraries: `webrtc-audio-processing`, `speexdsp`, or platform APIs (e.g., Apple Voice Processing I/O)
- Headphones: Physical isolation — no speaker audio reaches the microphone
Objective¶
This tutorial will guide you through:
- Creating a YAML configuration with a `vad` block to enable barge-in
- Setting up and running a barge-in voice demo
- Testing voice interruption of AI responses
Prerequisites¶
- Familiarity with the Realtime Voice with Flow Super Agent (Push to Talk) tutorial
Steps¶
1. Configuration file¶
Start with the Flow Super Agent realtime configuration and add a vad block under audio_config to enable barge-in. For details on speech_config parameters, see the push-to-talk tutorial.
Save the following configuration as bargein_flow_superagent_realtime.yaml.
memory_config:
save_config:
auto_load: false
orchestrator:
agent_list:
- agent_name: "Investment Strategy Advisor"
speech_config:
model: 'Azure/AI-Speech'
language: 'en-AU'
voice: 'en-AU-WilliamNeural'
utility_agents:
- agent_class: PlanningAgent
agent_name: "Stock Planner"
agent_description: "Create a detailed plan to hedge losses against stock price variance."
config:
output_style: "conversational"
- agent_class: PlanningAgent
agent_name: "Currency Planner"
agent_description: "Create a plan to hedge losses against currency price variance."
config:
output_style: "conversational"
speech_config:
model: 'Azure/AI-Speech'
language: 'en-CA'
voice: 'en-CA-LiamNeural'
normalize_text: true
summarize_config:
enable_summarize: True
- agent_class: PlanningAgent
agent_name: "Risk Assessment Planner"
agent_description: "Analyze portfolio risk metrics, volatility, correlation analysis, and stress testing scenarios."
config:
output_style: "conversational"
speech_config:
model: 'Azure/AI-Speech'
language: 'en-GB'
voice: 'en-GB-LibbyNeural'
enable_speech: False
normalize_text: false
summarize_config:
enable_summarize: True
super_agents:
- agent_class: FlowSuperAgent
agent_name: "Investment Strategy Advisor"
agent_description: "Provides investment insights based on stock and finance research."
config:
goal: "Generate investment recommendations."
agent_list:
- agent_name: "Stock Planner"
next_step:
- "Currency Planner"
- "Risk Assessment Planner"
- agent_name: "Currency Planner"
- agent_name: "Risk Assessment Planner"
speech_config:
model: 'Azure/AI-Speech'
language: 'en-US'
voice: 'en-US-JennyNeural'
summarize_config:
enable_summarize: True
audio_config:
asr:
model: "Azure/AI-Transcription"
silence_duration_ms: 3500
tts:
model: "Azure/AI-Speech"
vad:
enable_barge_in: true # Activate server-side VAD
frame_ms: 30 # Analyze audio in 30ms frames
threshold: 15 # Speech frames needed to trigger (lower = more sensitive)
window_size: 24 # Sliding window size for frame counting
lockout_seconds: 0.5 # Wait after TTS starts before allowing interrupts
grace_period_seconds: 1.0 # Keep detecting after last audio sent
silero_min_samples_seconds: 1.0
The only difference from the push-to-talk configuration is the vad block. Without it, the system operates in push-to-talk mode.
VAD Parameters:
| Parameter | Default | Description |
|---|---|---|
| `enable_barge_in` | `false` | Enable VAD-based barge-in detection |
| `threshold` | `30` | Speech-positive frames needed to trigger. Lower = more sensitive |
| `window_size` | `48` | Sliding window size (in frames) for counting speech |
| `lockout_seconds` | `1.0` | Delay after TTS starts before allowing interrupts |
| `grace_period_seconds` | `3.0` | Time after last audio sent to keep detecting |
| `frame_ms` | `30` | VAD frame duration in milliseconds |
| `silero_min_samples_seconds` | `1.5` | Minimum audio buffer before running VAD inference |
Note: Tuning sensitivity

- Too many false positives (AI interrupted too easily): increase `threshold` and `lockout_seconds`.
- Too few interruptions (AI hard to interrupt): decrease `threshold` and `window_size`.
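When tuning, it helps to estimate how long a user must speak before barge-in can fire: the detector needs `threshold` speech-positive frames, each `frame_ms` long. A back-of-envelope sketch (this is a tuning heuristic inferred from the parameters above, not the server's exact algorithm):

```python
def min_trigger_latency_s(threshold: int, frame_ms: int) -> float:
    """Rough lower bound on how long continuous speech must last before
    barge-in triggers: `threshold` speech frames of `frame_ms` each."""
    return threshold * frame_ms / 1000.0

# With the tutorial's values (threshold=15, frame_ms=30), roughly 0.45s
# of speech is needed; the defaults (threshold=30) need about 0.9s.
print(min_trigger_latency_s(15, 30))
print(min_trigger_latency_s(30, 30))
```

Raising `threshold` therefore trades responsiveness for robustness against short noises.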
2. Bridge file¶
The bridge connects the browser frontend to AI Refinery via WebSocket. It uses the SDK's RealtimeVoiceBridge class which handles audio routing, event forwarding, and periodic audio commits.
Save the following as bargein_realtime_voice_bridge.py.
"""Barge-in voice bridge — connects a browser frontend to AI Refinery."""
import asyncio
import logging
import os
import time
from air import AsyncAIRefinery
from air.distiller.utils.realtime_helper import RealtimeVoiceBridge
# Logging — shows bridge events and debug messages
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
logging.getLogger("air.distiller.utils.realtime_helper").setLevel(logging.DEBUG)
api_key = os.getenv("API_KEY", "")
async def main():
"""Create the AI Refinery project and start the WebSocket bridge."""
client = AsyncAIRefinery(api_key=api_key)
# Register the project with barge-in configuration
client.realtime_distiller.create_project(
config_path="bargein_flow_superagent_realtime.yaml", project="example"
)
# Start the WebSocket bridge on port 8000
# The browser connects here; the bridge forwards to AI Refinery
uuid = f"websocket_user_{int(time.time())}"
bridge = RealtimeVoiceBridge(
client.realtime_distiller, project="example", uuid=uuid
)
await bridge.serve(port=8000)
if __name__ == "__main__":
asyncio.run(main())
This mirrors the push-to-talk example pattern: set configuration, call the SDK, run. The RealtimeVoiceBridge class handles all WebSocket server management, audio routing, and event forwarding internally.
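The browser and bridge exchange small JSON envelopes; the audio message type (`input_audio_buffer.append` with a base64 `chunk`) appears in the frontend code below. If you want to drive the bridge from a script instead of a browser, the envelope can be built like this (a sketch; `build_append_message` is a hypothetical helper, not part of the SDK):

```python
import base64
import json

def build_append_message(pcm16_chunk: bytes) -> str:
    """Wrap a raw PCM16 chunk in the JSON envelope the bridge expects.
    Hypothetical helper mirroring what the browser frontend sends."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "data": {"chunk": base64.b64encode(pcm16_chunk).decode("ascii")},
    })

# 480 samples of silence = one 30 ms frame at 16 kHz (2 bytes per sample)
msg = build_append_message(b"\x00\x00" * 480)
```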
3. HTML frontend file¶
The frontend captures microphone audio, streams it to the bridge, plays AI audio responses, and handles barge-in interrupts. It uses the browser's built-in echo cancellation to prevent the AI's voice from triggering false interruptions.
Save the following as bargein_ui.html in the same directory.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Barge-In Voice Demo</title>
<style>
body { font-family: sans-serif; max-width: 600px; margin: 40px auto; background: #121212; color: #fff; }
button { padding: 12px 24px; font-size: 16px; border: none; border-radius: 4px; cursor: pointer; color: #fff; }
button:disabled { background: #6c757d; cursor: not-allowed; opacity: 0.7; }
.recording { background: #dc3545; }
.ready { background: #28a745; }
#transcript { margin-top: 20px; padding: 12px; background: #1e1e1e; border-radius: 4px; min-height: 60px; white-space: pre-wrap; }
#status { margin-top: 10px; font-size: 14px; color: #aaa; }
</style>
</head>
<body>
<h1>Barge-In Voice Demo</h1>
<button id="btn" disabled>Connecting...</button>
<div id="status"></div>
<div id="transcript"></div>
<script>
const SAMPLE_RATE = 16000;
const WS_URL = 'ws://localhost:8000';
let ws, audioCtx, mediaStream, workletNode;
let sourceNodes = [], playTime = 0, interrupted = false, recording = false;
const btn = document.getElementById('btn');
const status = document.getElementById('status');
const transcript = document.getElementById('transcript');
// --- WebSocket connection to bridge ---
ws = new WebSocket(WS_URL);
ws.onopen = () => setStatus('WebSocket connected, waiting for session...');
ws.onerror = () => setStatus('WebSocket error');
ws.onclose = () => { setStatus('Disconnected'); btn.disabled = true; btn.textContent = 'Connecting...'; };
ws.onmessage = (evt) => {
const data = JSON.parse(evt.data);
switch (data.type) {
case 'session.created':
btn.disabled = false; btn.textContent = 'Start Recording'; btn.className = 'ready';
setStatus('Ready — click to start recording');
break;
case 'text.delta':
transcript.textContent += data.text || ''; interrupted = false; break;
case 'response.audio_transcript.done':
transcript.textContent = data.text || ''; break;
case 'response.audio.delta':
// Reset interrupt flag when starting a new response
if (sourceNodes.length === 0) interrupted = false;
if (interrupted || !audioCtx) return;
if (sourceNodes.length === 0) {
ws.send(JSON.stringify({ type: 'playback.started' }));
setStatus('AI speaking... (speak to interrupt)');
}
playAudioChunk(data.delta); break;
case 'response.interrupted':
interrupted = true;
sourceNodes.forEach(s => { try { s.stop(0); s.disconnect(); } catch {} });
sourceNodes = [];
if (audioCtx) playTime = audioCtx.currentTime;
ws.send(JSON.stringify({ type: 'response.cancel' }));
setStatus('Interrupted — listening for new input');
break;
}
};
// --- Decode base64 PCM16 audio and play through speaker ---
function playAudioChunk(b64) {
const binary = atob(b64);
const len = binary.length / 2;
const buf = new ArrayBuffer(binary.length);
const view = new DataView(buf);
for (let i = 0; i < binary.length; i++) view.setUint8(i, binary.charCodeAt(i));
const float32 = new Float32Array(len);
for (let i = 0; i < len; i++) float32[i] = view.getInt16(i * 2, true) / 0x7FFF;
const audioBuffer = audioCtx.createBuffer(1, len, SAMPLE_RATE);
audioBuffer.getChannelData(0).set(float32);
const src = audioCtx.createBufferSource();
src.buffer = audioBuffer;
src.connect(audioCtx.destination);
src.onended = () => {
sourceNodes = sourceNodes.filter(s => s !== src);
if (!interrupted && sourceNodes.length === 0) {
ws.send(JSON.stringify({ type: 'playback.stopped' }));
setStatus('AI finished — speak again or stop recording');
}
};
const now = audioCtx.currentTime;
if (playTime < now) playTime = now;
sourceNodes.push(src);
src.start(playTime);
playTime += audioBuffer.duration;
}
// --- Mic capture with inline AudioWorklet (30ms PCM16 chunks at 16kHz) ---
async function startRecording() {
audioCtx = new AudioContext({ sampleRate: SAMPLE_RATE });
const workletCode = `
class P extends AudioWorkletProcessor {
constructor() { super(); this._buf = []; }
process(inputs) {
const input = inputs[0][0];
if (input) {
this._buf.push(...input);
while (this._buf.length >= 480) {
const chunk = this._buf.splice(0, 480);
const int16 = new Int16Array(480);
for (let i = 0; i < 480; i++) int16[i] = Math.max(-1, Math.min(1, chunk[i])) * 0x7FFF;
this.port.postMessage(new Uint8Array(int16.buffer));
}
}
return true;
}
}
registerProcessor('pcm', P);
`;
await audioCtx.audioWorklet.addModule(URL.createObjectURL(new Blob([workletCode], { type: 'application/javascript' })));
mediaStream = await navigator.mediaDevices.getUserMedia({
audio: { sampleRate: SAMPLE_RATE, channelCount: 1, echoCancellation: true }
});
const source = audioCtx.createMediaStreamSource(mediaStream);
workletNode = new AudioWorkletNode(audioCtx, 'pcm');
workletNode.port.onmessage = (e) => {
if (ws.readyState === WebSocket.OPEN && e.data.length > 0) {
ws.send(JSON.stringify({ type: 'input_audio_buffer.append', data: { chunk: btoa(String.fromCharCode(...e.data)) } }));
}
};
// Connect audio — chunks start flowing to the bridge immediately
source.connect(workletNode);
workletNode.connect(audioCtx.destination);
recording = true; btn.textContent = 'Stop Recording'; btn.className = 'recording';
transcript.textContent = ''; setStatus('Recording... speak your query');
}
function stopRecording() {
if (mediaStream) { mediaStream.getTracks().forEach(t => t.stop()); mediaStream = null; }
if (workletNode) { workletNode.disconnect(); workletNode = null; }
ws.send(JSON.stringify({ type: 'stop_recording' }));
recording = false; btn.textContent = 'Start Recording'; btn.className = 'ready';
setStatus('Stopped — click to record again');
}
btn.onclick = () => recording ? stopRecording() : startRecording();
function setStatus(msg) { status.textContent = msg; }
</script>
</body>
</html>
What the frontend does:
- Mic capture: Uses an inline AudioWorklet to capture 30ms PCM16 chunks at 16kHz — matching the server's VAD frame size
- Audio streaming: Sends base64-encoded audio chunks to the bridge via WebSocket
- TTS playback: Decodes incoming audio and plays it through `AudioBufferSourceNode` with synchronized timing
- Playback signals: Sends `playback.started`/`playback.stopped` to the server for accurate VAD timing
- Barge-in handling: On `response.interrupted`, immediately stops all audio and sends `response.cancel`
- Echo cancellation: `echoCancellation: true` in `getUserMedia` prevents the AI's voice from triggering false interrupts
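The frontend's PCM conversion (float32 to int16 on capture, int16 back to float32 on playback) can be mirrored in Python for offline testing of audio pipelines. A minimal standard-library sketch (illustrative only, not part of the SDK):

```python
import struct

def float32_to_pcm16(samples: list[float]) -> bytes:
    """Clamp to [-1, 1] and scale to int16, as the AudioWorklet does."""
    ints = [int(max(-1.0, min(1.0, s)) * 0x7FFF) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)

def pcm16_to_float32(data: bytes) -> list[float]:
    """Inverse conversion, as the playback path does."""
    ints = struct.unpack(f"<{len(data) // 2}h", data)
    return [i / 0x7FFF for i in ints]

# A round trip survives within int16 quantization error
out = pcm16_to_float32(float32_to_pcm16([0.0, 0.5, -1.0, 1.0]))
```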
4. Running the demo¶
You need two terminal windows:
Terminal 1 — Start the bridge:
cd air-sdk/example/realtime_voice
source /path/to/your/venv/bin/activate
export API_KEY="your_api_key_here"
python bargein_realtime_voice_bridge.py
You should see startup log messages from the bridge (INFO/DEBUG output from `air.distiller.utils.realtime_helper`) indicating it is serving on port 8000.
Terminal 2 — Serve the HTML frontend:
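Any static file server on port 3030 will do. One simple option, assuming `python3` is on your PATH (run it from the directory containing `bargein_ui.html`):

```shell
python3 -m http.server 3030
```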
Open your browser to http://localhost:3030/bargein_ui.html
Using the demo:
- Wait for the button to change from "Connecting..." to "Start Recording"
- Click Start Recording and speak your query (e.g., "How can I protect my investments?")
- The AI responds with audio from each agent in sequence
- Speak at any time to interrupt — the AI stops immediately
- Click Stop Recording when done
Troubleshooting Tips
- "Connecting..." stays forever: Check that the bridge is running on port 8000 and AI Refinery is reachable
- AI doesn't interrupt when you speak: Lower `threshold` and `lockout_seconds` in the YAML
- AI interrupts too easily: Increase `threshold` (e.g., 25-30). Background noise or speaker echo can cause false triggers
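For example, a less sensitive variant of the tutorial's `vad` block (values chosen per the guidance above; tune for your environment):

```yaml
vad:
  enable_barge_in: true
  frame_ms: 30
  threshold: 25        # more speech frames required before triggering
  window_size: 48
  lockout_seconds: 1.0 # longer delay after TTS starts
  grace_period_seconds: 1.0
  silero_min_samples_seconds: 1.0
```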
How It Works¶
┌────────────────┐ WebSocket  ┌────────────────┐   SDK    ┌────────────────┐
│    Browser     │◄──────────►│     Bridge     │◄────────►│   AIRefinery   │
│   (HTML+JS)    │ port 8000  │    (Python)    │          │     Server     │
│                │            │                │          │                │
│  Mic capture   │            │  Routes audio  │          │  ASR → LLM     │
│  TTS playback  │            │  Routes events │          │  → TTS         │
│  Interrupts    │            │  API key       │          │  VAD (server)  │
└────────────────┘            └────────────────┘          └────────────────┘
Barge-in event flow:
Browser                        Bridge                         AIRefinery
   │                           │                              │
   │── audio chunks ──────────►│── audio chunks ─────────────►│  (ASR + VAD)
   │                           │◄── response.audio.delta ─────│  (AI speaking)
   │◄── response.audio.delta ──│                              │
   │── playback.started ──────►│── playback.started ─────────►│
   │                           │                              │
   │── audio chunks ──────────►│── audio chunks ─────────────►│  (user speaks!)
   │                           │                              │  (VAD triggers)
   │                           │◄── response.interrupted ─────│
   │◄── response.interrupted ──│                              │  (TTS stopped)
   │── response.cancel ───────►│── response.cancel ──────────►│
   │                           │◄── response.done ────────────│
   │◄── response.done ─────────│                              │
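The client side of this flow reduces to a small event handler, mirroring the frontend's `switch` statement. A sketch (the event names come from the flow above; the handler itself is illustrative, not an SDK API):

```python
def handle_event(event: dict, state: dict) -> list[dict]:
    """Map an incoming server event to outgoing client messages,
    mirroring the browser frontend's barge-in handling."""
    out = []
    if event["type"] == "response.audio.delta":
        # First audio chunk of a response: notify the server playback began
        if not state.get("playing") and not state.get("interrupted"):
            state["playing"] = True
            out.append({"type": "playback.started"})
    elif event["type"] == "response.interrupted":
        # VAD fired on the server: stop local playback, cancel the response
        state["interrupted"] = True
        state["playing"] = False
        out.append({"type": "response.cancel"})
    return out
```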
Barge-In vs. Push-to-Talk¶
| Behavior | Push-to-Talk | Barge-In |
|---|---|---|
| Audio during AI response | Blocked | Continuous streaming |
| Interruption trigger | Client-initiated (spacebar) | Server-initiated (VAD) |
| User action to interrupt | Press spacebar | Just speak |
| Configuration | No `vad` block | `vad.enable_barge_in: true` |
| Frontend | Python terminal | Client with AEC (browser, telephony, etc.) |
| Extra dependency | None | None |
API Reference¶
For full details on barge-in events, playback notifications, and VAD parameters: