Realtime Voice with Barge-In¶
Overview¶
Barge-in enables users to interrupt AI audio responses by speaking during playback — just like in a real conversation. The server continuously monitors the incoming audio stream using Voice Activity Detection (VAD) and sends a response.interrupted event when user speech is detected.
Key capabilities:
- Natural interruption: Speak at any time during AI playback to interrupt
- Server-side VAD: Speech detection runs on the server — no client-side processing needed
- Configurable sensitivity: Tune detection threshold, lockout period, and grace period
Barge-in requires Acoustic Echo Cancellation (AEC) on the client side — without it, the microphone picks up the AI's voice from the speaker, causing false interruptions. Common AEC options include:
- Browser WebRTC: `echoCancellation: true` in `getUserMedia` (used in this tutorial)
- Telephony hardware: Phone systems and VoIP gateways provide built-in AEC
- Native libraries: `webrtc-audio-processing`, `speexdsp`, or platform APIs (e.g., Apple Voice Processing I/O)
- Headphones: Physical isolation — no speaker audio reaches the microphone
Objective¶
This tutorial will guide you through:
- Creating a YAML configuration with a `vad` block to enable barge-in
- Setting up and running a barge-in voice demo
- Testing voice interruption of AI responses
Prerequisites¶
- Familiarity with the Realtime Voice with Flow Super Agent (Push to Talk) tutorial
Steps¶
1. Configuration file¶
Start with the Flow Super Agent realtime configuration and add a vad block under audio_config to enable barge-in. For details on speech_config parameters, see the push-to-talk tutorial.
Save the following configuration as bargein_flow_superagent_realtime.yaml.
memory_config:
save_config:
auto_load: false
orchestrator:
agent_list:
- agent_name: "Investment Strategy Advisor"
speech_config:
model: 'Azure/AI-Speech'
language: 'en-AU'
voice: 'en-AU-WilliamNeural'
utility_agents:
- agent_class: PlanningAgent
agent_name: "Stock Planner"
agent_description: "Create a detailed plan to hedge losses against stock price variance."
config:
output_style: "conversational"
- agent_class: PlanningAgent
agent_name: "Currency Planner"
agent_description: "Create a plan to hedge losses against currency price variance."
config:
output_style: "conversational"
speech_config:
model: 'Azure/AI-Speech'
language: 'en-CA'
voice: 'en-CA-LiamNeural'
normalize_text: true
summarize_config:
enable_summarize: True
- agent_class: PlanningAgent
agent_name: "Risk Assessment Planner"
agent_description: "Analyze portfolio risk metrics, volatility, correlation analysis, and stress testing scenarios."
config:
output_style: "conversational"
speech_config:
model: 'Azure/AI-Speech'
language: 'en-GB'
voice: 'en-GB-LibbyNeural'
enable_speech: False
normalize_text: false
summarize_config:
enable_summarize: True
super_agents:
- agent_class: FlowSuperAgent
agent_name: "Investment Strategy Advisor"
agent_description: "Provides investment insights based on stock and finance research."
config:
goal: "Generate investment recommendations."
agent_list:
- agent_name: "Stock Planner"
next_step:
- "Currency Planner"
- "Risk Assessment Planner"
- agent_name: "Currency Planner"
- agent_name: "Risk Assessment Planner"
speech_config:
model: 'Azure/AI-Speech'
language: 'en-US'
voice: 'en-US-JennyNeural'
summarize_config:
enable_summarize: True
audio_config:
asr:
model: "Azure/AI-Transcription"
silence_duration_ms: 3500
tts:
model: "Azure/AI-Speech"
vad:
enable_barge_in: true # Activate server-side VAD
frame_ms: 30 # Analyze audio in 30ms frames
threshold: 15 # Speech frames needed to trigger (lower = more sensitive)
window_size: 24 # Sliding window size for frame counting
lockout_seconds: 0.5 # Wait after TTS starts before allowing interrupts
grace_period_seconds: 1.0 # Keep detecting after last audio sent
silero_min_samples_seconds: 1.0
The only difference from the push-to-talk configuration is the vad block. Without it, the system operates in push-to-talk mode.
VAD Parameters:
| Parameter | Default | Description |
|---|---|---|
| `enable_barge_in` | `false` | Enable VAD-based barge-in detection |
| `threshold` | `30` | Speech-positive frames needed to trigger. Lower = more sensitive |
| `window_size` | `48` | Sliding window size (in frames) for counting speech |
| `lockout_seconds` | `1.0` | Delay after TTS starts before allowing interrupts |
| `grace_period_seconds` | `3.0` | Time after last audio sent to keep detecting |
| `frame_ms` | `30` | VAD frame duration in milliseconds |
| `silero_min_samples_seconds` | `1.5` | Minimum audio buffer before running VAD inference |
Note: Tuning sensitivity

- Too many false positives (AI interrupted too easily): increase `threshold` and `lockout_seconds`.
- Too few interruptions (AI hard to interrupt): decrease `threshold` and `window_size`.
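When tuning, it helps to estimate how long a user must speak before barge-in can fire: the detector needs `threshold` speech-positive frames, each `frame_ms` long. A back-of-envelope sketch (this is a tuning heuristic inferred from the parameters above, not the server's exact algorithm):

```python
def min_trigger_latency_s(threshold: int, frame_ms: int) -> float:
    """Rough lower bound on how long continuous speech must last before
    barge-in triggers: `threshold` speech frames of `frame_ms` each."""
    return threshold * frame_ms / 1000.0

# With the tutorial's values (threshold=15, frame_ms=30), roughly 0.45s
# of speech is needed; the defaults (threshold=30) need about 0.9s.
print(min_trigger_latency_s(15, 30))
print(min_trigger_latency_s(30, 30))
```

Raising `threshold` therefore trades responsiveness for robustness against short noises.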
2. Bridge file¶
The bridge connects the browser frontend to AI Refinery via WebSocket. It uses the SDK's RealtimeVoiceBridge class which handles audio routing, event forwarding, and periodic audio commits.
Save the following as bargein_realtime_voice_bridge.py.
"""Barge-in voice bridge — connects a browser frontend to AI Refinery."""
import asyncio
import logging
import os
import time
from air import AsyncAIRefinery
from air.distiller.utils.realtime_helper import RealtimeVoiceBridge
# Logging — shows bridge events and debug messages
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
logging.getLogger("air.distiller.utils.realtime_helper").setLevel(logging.DEBUG)
api_key = os.getenv("API_KEY", "")
async def main():
"""Create the AI Refinery project and start the WebSocket bridge."""
client = AsyncAIRefinery(api_key=api_key)
# Register the project with barge-in configuration
client.realtime_distiller.create_project(
config_path="bargein_flow_superagent_realtime.yaml", project="example"
)
# Start the WebSocket bridge on port 8000
# The browser connects here; the bridge forwards to AI Refinery
uuid = f"websocket_user_{int(time.time())}"
bridge = RealtimeVoiceBridge(
client.realtime_distiller, project="example", uuid=uuid
)
await bridge.serve(port=8000)
if __name__ == "__main__":
asyncio.run(main())
This mirrors the push-to-talk example pattern: set configuration, call the SDK, run. The RealtimeVoiceBridge class handles all WebSocket server management, audio routing, and event forwarding internally.
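The browser and bridge exchange small JSON envelopes; the audio message type (`input_audio_buffer.append` with a base64 `chunk`) appears in the frontend code below. If you want to drive the bridge from a script instead of a browser, the envelope can be built like this (a sketch; `build_append_message` is a hypothetical helper, not part of the SDK):

```python
import base64
import json

def build_append_message(pcm16_chunk: bytes) -> str:
    """Wrap a raw PCM16 chunk in the JSON envelope the bridge expects.
    Hypothetical helper mirroring what the browser frontend sends."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "data": {"chunk": base64.b64encode(pcm16_chunk).decode("ascii")},
    })

# 480 samples of silence = one 30 ms frame at 16 kHz (2 bytes per sample)
msg = build_append_message(b"\x00\x00" * 480)
```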
3. HTML frontend file¶
The frontend captures microphone audio, streams it to the bridge, plays AI audio responses, and handles barge-in interrupts. It uses the browser's built-in echo cancellation to prevent the AI's voice from triggering false interruptions.
Save the following as bargein_ui.html in the same directory.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Barge-In Voice Demo</title>
<style>
body { font-family: sans-serif; max-width: 600px; margin: 40px auto; background: #121212; color: #fff; }
button { padding: 12px 24px; font-size: 16px; border: none; border-radius: 4px; cursor: pointer; color: #fff; }
button:disabled { background: #6c757d; cursor: not-allowed; opacity: 0.7; }
.recording { background: #dc3545; }
.ready { background: #28a745; }
#transcript { margin-top: 20px; padding: 12px; background: #1e1e1e; border-radius: 4px; min-height: 60px; white-space: pre-wrap; }
#status { margin-top: 10px; font-size: 14px; color: #aaa; }
</style>
</head>
<body>
<h1>Barge-In Voice Demo</h1>
<button id="btn" disabled>Connecting...</button>
<div id="status"></div>
<div id="transcript"></div>
<script>
const SAMPLE_RATE = 16000;
const WS_URL = 'ws://localhost:8000';
let ws, audioCtx, mediaStream, workletNode;
let sourceNodes = [], playTime = 0, interrupted = false, recording = false;
const btn = document.getElementById('btn');
const status = document.getElementById('status');
const transcript = document.getElementById('transcript');
// --- WebSocket connection to bridge ---
ws = new WebSocket(WS_URL);
ws.onopen = () => setStatus('WebSocket connected, waiting for session...');
ws.onerror = () => setStatus('WebSocket error');
ws.onclose = () => { setStatus('Disconnected'); btn.disabled = true; btn.textContent = 'Connecting...'; };
ws.onmessage = (evt) => {
const data = JSON.parse(evt.data);
switch (data.type) {
case 'session.created':
btn.disabled = false; btn.textContent = 'Start Recording'; btn.className = 'ready';
setStatus('Ready — click to start recording');
break;
case 'text.delta':
transcript.textContent += data.text || ''; interrupted = false; break;
case 'response.audio_transcript.done':
transcript.textContent = data.text || ''; break;
case 'response.audio.delta':
// Reset interrupt flag when starting a new response
if (sourceNodes.length === 0) interrupted = false;
if (interrupted || !audioCtx) return;
if (sourceNodes.length === 0) {
ws.send(JSON.stringify({ type: 'playback.started' }));
setStatus('AI speaking... (speak to interrupt)');
}
playAudioChunk(data.delta); break;
case 'response.interrupted':
interrupted = true;
sourceNodes.forEach(s => { try { s.stop(0); s.disconnect(); } catch {} });
sourceNodes = [];
if (audioCtx) playTime = audioCtx.currentTime;
ws.send(JSON.stringify({ type: 'response.cancel' }));
setStatus('Interrupted — listening for new input');
break;
}
};
// --- Decode base64 PCM16 audio and play through speaker ---
function playAudioChunk(b64) {
const binary = atob(b64);
const len = binary.length / 2;
const buf = new ArrayBuffer(binary.length);
const view = new DataView(buf);
for (let i = 0; i < binary.length; i++) view.setUint8(i, binary.charCodeAt(i));
const float32 = new Float32Array(len);
for (let i = 0; i < len; i++) float32[i] = view.getInt16(i * 2, true) / 0x7FFF;
const audioBuffer = audioCtx.createBuffer(1, len, SAMPLE_RATE);
audioBuffer.getChannelData(0).set(float32);
const src = audioCtx.createBufferSource();
src.buffer = audioBuffer;
src.connect(audioCtx.destination);
src.onended = () => {
sourceNodes = sourceNodes.filter(s => s !== src);
if (!interrupted && sourceNodes.length === 0) {
ws.send(JSON.stringify({ type: 'playback.stopped' }));
setStatus('AI finished — speak again or stop recording');
}
};
const now = audioCtx.currentTime;
if (playTime < now) playTime = now;
sourceNodes.push(src);
src.start(playTime);
playTime += audioBuffer.duration;
}
// --- Mic capture with inline AudioWorklet (30ms PCM16 chunks at 16kHz) ---
async function startRecording() {
audioCtx = new AudioContext({ sampleRate: SAMPLE_RATE });
const workletCode = `
class P extends AudioWorkletProcessor {
constructor() { super(); this._buf = []; }
process(inputs) {
const input = inputs[0][0];
if (input) {
this._buf.push(...input);
while (this._buf.length >= 480) {
const chunk = this._buf.splice(0, 480);
const int16 = new Int16Array(480);
for (let i = 0; i < 480; i++) int16[i] = Math.max(-1, Math.min(1, chunk[i])) * 0x7FFF;
this.port.postMessage(new Uint8Array(int16.buffer));
}
}
return true;
}
}
registerProcessor('pcm', P);
`;
await audioCtx.audioWorklet.addModule(URL.createObjectURL(new Blob([workletCode], { type: 'application/javascript' })));
mediaStream = await navigator.mediaDevices.getUserMedia({
audio: { sampleRate: SAMPLE_RATE, channelCount: 1, echoCancellation: true }
});
const source = audioCtx.createMediaStreamSource(mediaStream);
workletNode = new AudioWorkletNode(audioCtx, 'pcm');
workletNode.port.onmessage = (e) => {
if (ws.readyState === WebSocket.OPEN && e.data.length > 0) {
ws.send(JSON.stringify({ type: 'input_audio_buffer.append', data: { chunk: btoa(String.fromCharCode(...e.data)) } }));
}
};
// Connect audio — chunks start flowing to the bridge immediately
source.connect(workletNode);
workletNode.connect(audioCtx.destination);
recording = true; btn.textContent = 'Stop Recording'; btn.className = 'recording';
transcript.textContent = ''; setStatus('Recording... speak your query');
}
function stopRecording() {
if (mediaStream) { mediaStream.getTracks().forEach(t => t.stop()); mediaStream = null; }
if (workletNode) { workletNode.disconnect(); workletNode = null; }
ws.send(JSON.stringify({ type: 'stop_recording' }));
recording = false; btn.textContent = 'Start Recording'; btn.className = 'ready';
setStatus('Stopped — click to record again');
}
btn.onclick = () => recording ? stopRecording() : startRecording();
function setStatus(msg) { status.textContent = msg; }
</script>
</body>
</html>
What the frontend does:
- Mic capture: Uses an inline AudioWorklet to capture 30ms PCM16 chunks at 16kHz — matching the server's VAD frame size
- Audio streaming: Sends base64-encoded audio chunks to the bridge via WebSocket
- TTS playback: Decodes incoming audio and plays it through `AudioBufferSourceNode` with synchronized timing
- Playback signals: Sends `playback.started`/`playback.stopped` to the server for accurate VAD timing
- Barge-in handling: On `response.interrupted`, immediately stops all audio and sends `response.cancel`
- Echo cancellation: `echoCancellation: true` in `getUserMedia` prevents the AI's voice from triggering false interrupts
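The frontend's PCM conversion (float32 to int16 on capture, int16 back to float32 on playback) can be mirrored in Python for offline testing of audio pipelines. A minimal standard-library sketch (illustrative only, not part of the SDK):

```python
import struct

def float32_to_pcm16(samples: list[float]) -> bytes:
    """Clamp to [-1, 1] and scale to int16, as the AudioWorklet does."""
    ints = [int(max(-1.0, min(1.0, s)) * 0x7FFF) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)

def pcm16_to_float32(data: bytes) -> list[float]:
    """Inverse conversion, as the playback path does."""
    ints = struct.unpack(f"<{len(data) // 2}h", data)
    return [i / 0x7FFF for i in ints]

# A round trip survives within int16 quantization error
out = pcm16_to_float32(float32_to_pcm16([0.0, 0.5, -1.0, 1.0]))
```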
4. Running the demo¶
You need two terminal windows:
Terminal 1 — Start the bridge:
cd air-sdk/example/realtime_voice
source /path/to/your/venv/bin/activate
export API_KEY="your_api_key_here"
python bargein_realtime_voice_bridge.py
You should see startup log messages from the bridge (INFO/DEBUG output from `air.distiller.utils.realtime_helper`) indicating it is serving on port 8000.
Terminal 2 — Serve the HTML frontend:
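Any static file server on port 3030 will do. One simple option, assuming `python3` is on your PATH (run it from the directory containing `bargein_ui.html`):

```shell
python3 -m http.server 3030
```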
Open your browser to http://localhost:3030/bargein_ui.html
Using the demo:
- Wait for the button to change from "Connecting..." to "Start Recording"
- Click Start Recording and speak your query (e.g., "How can I protect my investments?")
- The AI responds with audio from each agent in sequence
- Speak at any time to interrupt — the AI stops immediately
- Click Stop Recording when done
Troubleshooting Tips
- "Connecting..." stays forever: Check that the bridge is running on port 8000 and AI Refinery is reachable
- AI doesn't interrupt when you speak: Lower `threshold` and `lockout_seconds` in the YAML
- AI interrupts too easily: Increase `threshold` (e.g., 25-30). Background noise or speaker echo can cause false triggers
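For example, a less sensitive variant of the tutorial's `vad` block (values chosen per the guidance above; tune for your environment):

```yaml
vad:
  enable_barge_in: true
  frame_ms: 30
  threshold: 25        # more speech frames required before triggering
  window_size: 48
  lockout_seconds: 1.0 # longer delay after TTS starts
  grace_period_seconds: 1.0
  silero_min_samples_seconds: 1.0
```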
How It Works¶
┌────────────────┐ WebSocket  ┌────────────────┐   SDK    ┌────────────────┐
│    Browser     │◄──────────►│     Bridge     │◄────────►│   AIRefinery   │
│   (HTML+JS)    │ port 8000  │    (Python)    │          │     Server     │
│                │            │                │          │                │
│  Mic capture   │            │  Routes audio  │          │  ASR → LLM     │
│  TTS playback  │            │  Routes events │          │  → TTS         │
│  Interrupts    │            │  API key       │          │  VAD (server)  │
└────────────────┘            └────────────────┘          └────────────────┘
Barge-in event flow:
Browser                        Bridge                         AIRefinery
   │                           │                              │
   │── audio chunks ──────────►│── audio chunks ─────────────►│  (ASR + VAD)
   │                           │◄── response.audio.delta ─────│  (AI speaking)
   │◄── response.audio.delta ──│                              │
   │── playback.started ──────►│── playback.started ─────────►│
   │                           │                              │
   │── audio chunks ──────────►│── audio chunks ─────────────►│  (user speaks!)
   │                           │                              │  (VAD triggers)
   │                           │◄── response.interrupted ─────│
   │◄── response.interrupted ──│                              │  (TTS stopped)
   │── response.cancel ───────►│── response.cancel ──────────►│
   │                           │◄── response.done ────────────│
   │◄── response.done ─────────│                              │
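The client side of this flow reduces to a small event handler, mirroring the frontend's `switch` statement. A sketch (the event names come from the flow above; the handler itself is illustrative, not an SDK API):

```python
def handle_event(event: dict, state: dict) -> list[dict]:
    """Map an incoming server event to outgoing client messages,
    mirroring the browser frontend's barge-in handling."""
    out = []
    if event["type"] == "response.audio.delta":
        # First audio chunk of a response: notify the server playback began
        if not state.get("playing") and not state.get("interrupted"):
            state["playing"] = True
            out.append({"type": "playback.started"})
    elif event["type"] == "response.interrupted":
        # VAD fired on the server: stop local playback, cancel the response
        state["interrupted"] = True
        state["playing"] = False
        out.append({"type": "response.cancel"})
    return out
```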
Barge-In vs. Push-to-Talk¶
| Behavior | Push-to-Talk | Barge-In |
|---|---|---|
| Audio during AI response | Blocked | Continuous streaming |
| Interruption trigger | Client-initiated (spacebar) | Server-initiated (VAD) |
| User action to interrupt | Press spacebar | Just speak |
| Configuration | No `vad` block | `vad.enable_barge_in: true` |
| Frontend | Python terminal | Client with AEC (browser, telephony, etc.) |
| Extra dependency | None | None |
API Reference¶
For full details on barge-in events, playback notifications, and VAD parameters: