Automatic Speech Recognition (ASR) Transcription API

The Automatic Speech Recognition (ASR) transcription API generates text transcriptions of an input audio file using the AIRefinery or the AsyncAIRefinery client.

This API supports two modes: batch inference, which processes the complete audio file and returns the final transcription once processing finishes, and streaming, which returns transcription results incrementally as the audio is processed.

Asynchronous Transcription

AsyncAIRefinery.audio.transcriptions.create()

This method asynchronously generates the text transcription of an input audio file.

Parameters
Parameter Type Description
model string (required) Model ID of the ASR model used to generate the transcription.
file IO[bytes] (required) Open file-like object containing the audio to transcribe, in WAV or PCM format.
chunking_strategy string | ChunkingStrategy (optional) Configures server-side VAD and chunking. Accepts "auto" or a ChunkingStrategy object. (default: "auto")
language string (optional) Language to detect and transcribe. (default: "en-US")
response_format string (optional) Desired output format. Supported values: "json", "verbose_json". (default: "json")
timestamp_granularities List[string] (optional) Timestamp types to include in the response. Supported values: "segment", "word". Requires response_format="verbose_json".
stream boolean (optional) If True, enables streaming transcription output. (default: False)
extra_headers map (optional) Additional HTTP headers to include with the request.
extra_body map (optional) Additional fields to merge with or override top-level request parameters.
timeout integer (optional) Request timeout in seconds. (default: 60)

Chunking Strategy (ChunkingStrategy)

Field Type Description
type string ("server_vad") Enables server-side voice activity detection (VAD)–based chunking.
prefix_padding_ms integer (0–5000 ms, optional) Lead-in audio retained before detected speech. Recommended value: ≥4000 ms.
silence_duration_ms integer (0–5000 ms, optional) Trailing silence duration that marks the end of a chunk. Recommended value: 5000 ms.
threshold float (0.0–1.0, optional) VAD sensitivity threshold. Currently ignored.

Note
For audio files with initial silence, set prefix_padding_ms to at least 4000 ms to avoid premature cutoff of detected speech.
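
For example, the chunking configuration can be passed as a ChunkingStrategy object. The sketch below (using the air.types.audio.ChunkingStrategy import shown in the examples further down) tunes server-side VAD for audio that begins with silence:

from air.types.audio import ChunkingStrategy

# Server-side VAD chunking tuned for audio that starts with silence
chunking_strategy = ChunkingStrategy(
    type="server_vad",
    prefix_padding_ms=4000,    # keep 4 s of lead-in audio before detected speech
    silence_duration_ms=5000,  # end a chunk after 5 s of trailing silence
)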


Returns:
Batch Inference

The entire audio file is uploaded and processed as a single request, and the final transcription is returned only after processing is complete.

  • ASRResponse

    In this mode (stream=False, the default), when timestamp_granularities is not specified, the API returns an ASRResponse object.

    Field Type Description
    text string | null Transcription of the audio file. null if no text was produced.
  • TranscriptionVerbose

    When timestamp_granularities is included in the request (together with response_format="verbose_json"), the API returns a TranscriptionVerbose object.

    TranscriptionVerbose

    Field Type Description
    task string ("transcribe") Type of task performed. Always "transcribe".
    language string Detected or specified language code (e.g., en-US, fr-FR).
    duration float Total duration of the audio in seconds.
    text string Complete transcribed text aggregated from all segments.
    segments List[Segment] Segment-level transcription results. Included when "segment" is requested in timestamp_granularities.
    words List[Word] (optional) Word-level timing and confidence data. Included when "word" is requested in timestamp_granularities.
    speakers List[string] (optional) List of unique speaker identifiers detected in the audio.

    Segment (TranscriptionVerbose.Segment)

    Field Type Description
    id integer Unique identifier for the segment.
    seek float Offset indicating where the segment starts in the original audio.
    start float Start time of the segment in seconds.
    end float End time of the segment in seconds.
    text string Transcribed text for this segment.
    avg_logprob float Average log probability of word-level confidence scores within the segment.
    compression_ratio float Average characters-per-word compression ratio for the segment.
    speaker_id string (optional) Speaker label (e.g., "Guest-1", "Guest-2", …, "Guest-N" or "Unknown").

    Word (TranscriptionVerbose.Word)

    Field Type Description
    word string Transcribed word text.
    start float Start time of the word in seconds.
    end float End time of the word in seconds.
    confidence float (0.0–1.0, optional) Word-level confidence score.
    segment integer (optional) ID of the segment this word belongs to.
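
A verbose batch result can be consumed roughly as follows. This is a minimal sketch, assuming a TranscriptionVerbose object returned with both "segment" and "word" granularities requested (as in the detailed example further below); fields that were not requested may be absent.

def print_verbose_result(transcription):
    # transcription: TranscriptionVerbose with timestamp_granularities=["segment", "word"]
    print(f"Language: {transcription.language}, duration: {transcription.duration}s")

    # Word-level timing and confidence (present when "word" is requested)
    for word in transcription.words or []:
        print(f"{word.word} [{word.start:.2f}s - {word.end:.2f}s] (confidence: {word.confidence})")

    # Segment-level results with optional speaker attribution (present when "segment" is requested)
    for segment in transcription.segments or []:
        print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.speaker_id}: {segment.text}")
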
Streaming

Transcription results are returned incrementally as the audio is processed, enabling display of partial transcription results before the full transcription is complete.

In this mode (stream=True), the API returns an AsyncStream[TranscriptionStreamEvent] object, which yields:

  • TranscriptionTextDeltaEvent

    Represents an incremental transcription update emitted during streaming. Provides a newly transcribed text segment (“delta”) as it becomes available, enabling display of partial results.

    Field Type Description
    delta string Newly transcribed text segment emitted as a partial update.
    type string ("transcript.text.delta") Event type identifier. Always "transcript.text.delta".
    logprobs array | null Optional token-level log probabilities associated with the delta.
  • TranscriptionTextDoneEvent

    Represents the final transcription result emitted at the end of audio processing. Marks the completion of the transcription stream and contains the full transcribed text.

    Field Type Description
    text string Complete transcription of the audio input.
    type string ("transcript.text.done") Event type identifier. Always "transcript.text.done".
    logprobs array | null Optional token-level log probabilities for the final transcription.
  • TranscriptionWordEvent

    Represents a real-time word-level transcription event with timing and confidence.

    This event provides detailed word-level information as it becomes available during streaming transcription, including precise timing and confidence scores.

    Emitted only when "word" is included in timestamp_granularities.

    Field Type Description
    word string Transcribed word text.
    start float Start time of the word in seconds.
    end float End time of the word in seconds.
    confidence float (0.0–1.0) Confidence score for the word.
    segment integer Segment ID the word belongs to.
    type string ("transcript.word") Event type identifier. Always "transcript.word".
  • TranscriptionSegmentEvent

    Represents a real-time segment-level transcription event with timing and metadata.

    This event provides detailed segment-level information as it becomes available during streaming transcription, including timing, confidence statistics, and speaker attribution.

    Emitted only when "segment" is included in timestamp_granularities.

    Field Type Description
    segment TranscriptionVerbose.Segment Complete segment data with timing and metadata.
    type string ("transcript.segment") Event type identifier. Always "transcript.segment".

Example Usage:
Batch Inference (Basic response - text transcription only)
import asyncio
import os
from air import AsyncAIRefinery
from dotenv import load_dotenv

# Load environment variables from .env file (contains API_KEY)
load_dotenv()
api_key = str(os.getenv("API_KEY"))

async def generate_transcription(file_name):
    # Initialize the async client with your API key
    client = AsyncAIRefinery(api_key=api_key)

    # Open audio file in binary read mode (supports WAV or PCM format)
    audio_file = open(file_name, "rb")

    # Send transcription request and wait for complete result (batch mode)
    # Returns an ASRResponse; the transcribed text is available as .text
    transcription = await client.audio.transcriptions.create(
        model="Azure/AI-Transcription",  # ASR model ID
        file=audio_file,
    )

    # Access the transcribed text from the response
    print(transcription.text)
    return transcription.text

if __name__ == "__main__":
    asyncio.run(generate_transcription("audio/sample1.wav"))
Batch Inference (Detailed response - Transcription with Timestamps)
import asyncio
import os
from air import AsyncAIRefinery
from air.types.audio import ChunkingStrategy
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()
api_key = str(os.getenv("API_KEY"))

async def generate_verbose_transcription(file_name):
    client = AsyncAIRefinery(api_key=api_key)
    audio_file = open(file_name, "rb")

    # Request verbose transcription with segment and word-level timestamps
    # Returns TranscriptionVerbose with detailed timing and speaker info
    transcription = await client.audio.transcriptions.create(
        model="Azure/AI-Transcription",
        file=audio_file,
        response_format="verbose_json",  # Required for timestamp data
        timestamp_granularities=["segment", "word"],  # Request both segment and word timestamps
        # Configure Voice Activity Detection (VAD) for chunking
        chunking_strategy=ChunkingStrategy(
            type="server_vad",  # Use server-side VAD
            prefix_padding_ms=4000,  # Keep 4s of audio before detected speech
            silence_duration_ms=5000,  # End chunk after 5s of silence
            threshold=1,  # VAD sensitivity (currently ignored by server)
        ),
    )

    # Access aggregated transcription text and total audio duration
    print(f"Full text: {transcription.text}")
    print(f"Duration: {transcription.duration}s")

    # Iterate through segments with timing and speaker attribution
    for segment in transcription.segments:
        print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.speaker_id}: {segment.text}")

    return transcription

if __name__ == "__main__":
    asyncio.run(generate_verbose_transcription("audio/sample1.wav"))
Streaming Inference (Basic response - text transcription only)
import asyncio
import os
from air import AsyncAIRefinery
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()
api_key = str(os.getenv("API_KEY"))

async def generate_transcription(file_name):
    client = AsyncAIRefinery(api_key=api_key)
    audio_file = open(file_name, "rb")

    # Enable streaming mode to receive transcription results incrementally
    # Returns AsyncStream[TranscriptionStreamEvent] for real-time processing
    transcription_stream = await client.audio.transcriptions.create(
        model="Azure/AI-Transcription",
        file=audio_file,
        stream=True,  # Enable streaming mode
    )

    print("\n[Streaming Transcription Output]")
    # Iterate over stream events as they arrive
    # Events: TranscriptionTextDeltaEvent (partial) and TranscriptionTextDoneEvent (final)
    async for event in transcription_stream:
        print(event)

if __name__ == "__main__":
    asyncio.run(generate_transcription("audio/sample1.wav"))
Streaming Inference (Detailed response - Transcription with Timestamps)
import asyncio
import os
from air import AsyncAIRefinery
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()
api_key = str(os.getenv("API_KEY"))

async def generate_streaming_transcription_with_timestamps(file_name):
    client = AsyncAIRefinery(api_key=api_key)
    audio_file = open(file_name, "rb")

    # Combine streaming with verbose output for real-time timestamps
    # Emits word and segment events as audio is processed
    transcription_stream = await client.audio.transcriptions.create(
        model="Azure/AI-Transcription",
        file=audio_file,
        response_format="verbose_json",  # Required for timestamp events
        stream=True,  # Enable streaming mode
        timestamp_granularities=["segment", "word"],  # Request both granularities
    )

    print("\n[Streaming Transcription with Timestamps]")

    # Process each event based on its type
    async for event in transcription_stream:
        if hasattr(event, "type"):
            event_type = event.type

            # TranscriptionTextDeltaEvent: incremental text updates
            if event_type == "transcript.text.delta":
                delta = getattr(event, "delta", "")
                print(f"Delta: {delta}")

            # TranscriptionWordEvent: word-level timing and confidence
            elif event_type == "transcript.word":
                word = getattr(event, "word", "")
                start = getattr(event, "start", 0)
                end = getattr(event, "end", 0)
                confidence = getattr(event, "confidence", 0)
                print(f"Word: {word} [{start:.2f}s - {end:.2f}s] (confidence: {confidence:.2f})")

            # TranscriptionSegmentEvent: segment with speaker attribution
            elif event_type == "transcript.segment":
                segment = getattr(event, "segment", None)
                if segment is not None:
                    # segment is a TranscriptionVerbose.Segment, so use attribute access
                    speaker_id = segment.speaker_id or "Unknown"
                    print(f"Segment: [{segment.start:.2f}s - {segment.end:.2f}s] {speaker_id}: {segment.text}")

            # TranscriptionTextDoneEvent: final complete transcription
            elif event_type == "transcript.text.done":
                text = getattr(event, "text", "")
                print(f"\nFinal text: {text}")

if __name__ == "__main__":
    asyncio.run(generate_streaming_transcription_with_timestamps("audio/sample1.wav"))
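
Note that the examples above open the audio file without explicitly closing it. In longer-running applications you may prefer a standard context manager so the handle is released once the request has been sent; a minimal variation of the first example:

async def generate_transcription(file_name):
    client = AsyncAIRefinery(api_key=api_key)

    # The file handle is closed automatically when the with-block exits
    with open(file_name, "rb") as audio_file:
        transcription = await client.audio.transcriptions.create(
            model="Azure/AI-Transcription",
            file=audio_file,
        )

    print(transcription.text)
    return transcription.text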

Synchronous Transcription

AIRefinery.audio.transcriptions.create()

This method synchronously generates the text transcription of an input audio file. It supports the same parameters and return structure as the asynchronous method.

Example Usage:
Batch Inference (Basic response - text transcription only)
import os
from air import AIRefinery
from dotenv import load_dotenv

# Load environment variables from .env file (contains API_KEY)
load_dotenv()
api_key = str(os.getenv("API_KEY"))

def generate_transcription(file_name):
    # Initialize the synchronous client with your API key
    client = AIRefinery(api_key=api_key)

    # Open audio file in binary read mode (supports WAV or PCM format)
    audio_file = open(file_name, "rb")

    # Send transcription request and wait for complete result (batch mode)
    # Returns an ASRResponse; the transcribed text is available as .text
    transcription = client.audio.transcriptions.create(
        model="Azure/AI-Transcription",  # ASR model ID
        file=audio_file,
    )

    # Access the transcribed text from the response
    print(transcription.text)
    return transcription.text

if __name__ == "__main__":
    generate_transcription("audio/sample1.wav")
Batch Inference (Detailed response - Transcription with Timestamps)
import os
from air import AIRefinery
from air.types.audio import ChunkingStrategy
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()
api_key = str(os.getenv("API_KEY"))

def generate_verbose_transcription(file_name):
    client = AIRefinery(api_key=api_key)
    audio_file = open(file_name, "rb")

    # Request verbose transcription with segment and word-level timestamps
    # Returns TranscriptionVerbose with detailed timing and speaker info
    transcription = client.audio.transcriptions.create(
        model="Azure/AI-Transcription",
        file=audio_file,
        response_format="verbose_json",  # Required for timestamp data
        timestamp_granularities=["segment", "word"],  # Request both segment and word timestamps
        # Configure Voice Activity Detection (VAD) for chunking
        chunking_strategy=ChunkingStrategy(
            type="server_vad",  # Use server-side VAD
            prefix_padding_ms=4000,  # Keep 4s of audio before detected speech
            silence_duration_ms=5000,  # End chunk after 5s of silence
            threshold=1,  # VAD sensitivity (currently ignored by server)
        ),
    )

    # Access aggregated transcription text and total audio duration
    print(f"Full text: {transcription.text}")
    print(f"Duration: {transcription.duration}s")

    # Iterate through segments with timing and speaker attribution
    for segment in transcription.segments:
        print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.speaker_id}: {segment.text}")

    return transcription

if __name__ == "__main__":
    generate_verbose_transcription("audio/sample1.wav")
Streaming Inference (Basic response - text transcription only)
import os
from air import AIRefinery
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()
api_key = str(os.getenv("API_KEY"))

def generate_transcription(file_name):
    client = AIRefinery(api_key=api_key)
    audio_file = open(file_name, "rb")

    # Enable streaming mode to receive transcription results incrementally
    # Returns Stream[TranscriptionStreamEvent] for real-time processing
    transcription_stream = client.audio.transcriptions.create(
        model="Azure/AI-Transcription",
        file=audio_file,
        stream=True,  # Enable streaming mode
    )

    # Iterate over stream events as they arrive
    # Events: TranscriptionTextDeltaEvent (partial) and TranscriptionTextDoneEvent (final)
    for event in transcription_stream:
        print(event)

if __name__ == "__main__":
    generate_transcription("audio/sample1.wav")
Streaming Inference (Detailed response - Transcription with Timestamps)
import os
from air import AIRefinery
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()
api_key = str(os.getenv("API_KEY"))

def generate_streaming_transcription_with_timestamps(file_name):
    client = AIRefinery(api_key=api_key)
    audio_file = open(file_name, "rb")

    # Combine streaming with verbose output for real-time timestamps
    # Emits word and segment events as audio is processed
    transcription_stream = client.audio.transcriptions.create(
        model="Azure/AI-Transcription",
        file=audio_file,
        response_format="verbose_json",  # Required for timestamp events
        stream=True,  # Enable streaming mode
        timestamp_granularities=["segment", "word"],  # Request both granularities
    )

    print("\n[Streaming Transcription with Timestamps]")

    # Process each event based on its type
    for event in transcription_stream:
        if hasattr(event, "type"):
            event_type = event.type

            # TranscriptionTextDeltaEvent: incremental text updates
            if event_type == "transcript.text.delta":
                delta = getattr(event, "delta", "")
                print(f"Delta: {delta}")

            # TranscriptionWordEvent: word-level timing and confidence
            elif event_type == "transcript.word":
                word = getattr(event, "word", "")
                start = getattr(event, "start", 0)
                end = getattr(event, "end", 0)
                confidence = getattr(event, "confidence", 0)
                print(f"Word: {word} [{start:.2f}s - {end:.2f}s] (confidence: {confidence:.2f})")

            # TranscriptionSegmentEvent: segment with speaker attribution
            elif event_type == "transcript.segment":
                segment = getattr(event, "segment", None)
                if segment is not None:
                    # segment is a TranscriptionVerbose.Segment, so use attribute access
                    speaker_id = segment.speaker_id or "Unknown"
                    print(f"Segment: [{segment.start:.2f}s - {segment.end:.2f}s] {speaker_id}: {segment.text}")

            # TranscriptionTextDoneEvent: final complete transcription
            elif event_type == "transcript.text.done":
                text = getattr(event, "text", "")
                print(f"\nFinal text: {text}")

if __name__ == "__main__":
    generate_streaming_transcription_with_timestamps("audio/sample1.wav")