Azure/AI-Transcription

Model Information

Azure/AI-Transcription is an automatic speech recognition (ASR) / speech-to-text (STT) service that enables applications, tools, and devices to convert audio into text transcriptions.

  • Model Developer: Microsoft Azure
  • Service Type: Cloud-based ASR API
  • Model Release Date: November 2024
  • Supported Modes: Batch and streaming text transcription
  • Audio Input:
    • Formats: RAW PCM, WAV
    • Sampling Rates: 8 kHz, 16 kHz, 24 kHz, 48 kHz
    • Bits Per Sample: 16 bits
    • Channels: 1
  • Languages: 70+ languages, 140+ distinct locales (see Azure Supported Languages)
  • Applicable License: Microsoft Online Services License
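
The audio input constraints above can be checked locally before a file is sent for transcription. The following is a minimal sketch using only Python's standard `wave` module; the helper names `check_wav` and `make_silent_wav` are hypothetical and not part of any Azure SDK. It covers WAV files only, since headerless RAW PCM carries no metadata to inspect.

```python
import io
import wave

# Constraints copied from the Audio Input list above.
SUPPORTED_RATES_HZ = {8000, 16000, 24000, 48000}
SUPPORTED_SAMPLE_WIDTH_BYTES = 2   # 16 bits per sample
SUPPORTED_CHANNELS = 1             # mono


def check_wav(source):
    """Return a list of problems; an empty list means the file matches the listed constraints.

    `source` is a path or a binary file-like object accepted by wave.open.
    """
    problems = []
    with wave.open(source, "rb") as wav:
        if wav.getframerate() not in SUPPORTED_RATES_HZ:
            problems.append(f"unsupported sampling rate: {wav.getframerate()} Hz")
        if wav.getsampwidth() != SUPPORTED_SAMPLE_WIDTH_BYTES:
            problems.append(f"unsupported sample width: {8 * wav.getsampwidth()} bits")
        if wav.getnchannels() != SUPPORTED_CHANNELS:
            problems.append(f"unsupported channel count: {wav.getnchannels()}")
    return problems


def make_silent_wav(rate_hz, channels, sample_width_bytes, seconds=0.1):
    """Build a small in-memory WAV of silence for trying out check_wav."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width_bytes)
        wav.setframerate(rate_hz)
        frames = int(rate_hz * seconds)
        wav.writeframes(b"\x00" * frames * channels * sample_width_bytes)
    buffer.seek(0)
    return buffer


print(check_wav(make_silent_wav(16000, 1, 2)))  # prints []
print(check_wav(make_silent_wav(44100, 2, 2)))  # reports the rate and the channel count
```

Returning a list of problems rather than raising on the first mismatch lets a caller report every issue with a file in one pass.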

Model Architecture

Microsoft has not publicly released detailed architectural specifications of Azure AI models.


Parameters

Azure/AI-Transcription supports the following configurable parameters, which can be set for the audio inference API and the realtime distiller.

| Parameter | Type | Description |
|---|---|---|
| language | string | Language code of the speech segment. For a list of supported languages, refer to Azure AI Speech Languages. |
| prefix_padding_ms | integer (0–5000 ms) | Lead-in audio retained before detected speech. |
| silence_duration_ms | integer (0–5000 ms) | Trailing silence duration that marks the end of a chunk. |
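
As an illustration of the types and ranges listed above, here is a minimal client-side validation sketch. The `validate_params` helper and the plain-dict layout are hypothetical; the source does not describe how the parameters are actually passed to the service, and the example values (including the `en-US` code) are assumptions chosen for illustration.

```python
# Ranges copied from the parameter list above; the helper and dict layout are hypothetical.
MS_RANGE = range(0, 5001)  # 0 to 5000 ms inclusive


def validate_params(params):
    """Raise ValueError if a parameter falls outside the documented type or range."""
    language = params.get("language")
    if language is not None and not isinstance(language, str):
        raise ValueError("language must be a string language code")
    for name in ("prefix_padding_ms", "silence_duration_ms"):
        value = params.get(name)
        if value is None:
            continue  # parameter not set; no default is assumed here
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} must be an integer")
        if value not in MS_RANGE:
            raise ValueError(f"{name} must be between 0 and 5000 ms, got {value}")
    return params


# Illustrative values only, not recommended settings.
params = validate_params(
    {"language": "en-US", "prefix_padding_ms": 300, "silence_duration_ms": 500}
)
print(params)
```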

Benchmark Scores

The following latency performance data shows Azure/AI-Transcription's response time in streaming mode. In this context, a token is a unit of text (typically an individual word) that the ASR model outputs, while a segment is a section of audio bytes that the model processes at once.

Time to First Token was benchmarked using 1-second audio segments containing a single word; all other metrics were measured on long-form, multi-sentence audio samples. Average Ratio measures real-time performance by comparing each segment’s length to the time required to process it.
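
A small sketch of how an Average Ratio of this kind could be computed. It assumes the ratio is segment length divided by processing time, averaged over segments; the source does not state the exact formula or the direction of the ratio, and the sample timings below are invented for illustration, not measurements.

```python
def average_ratio(segments):
    """Mean of (audio length / processing time) over (length_s, processing_s) pairs."""
    ratios = [length_s / processing_s for length_s, processing_s in segments]
    return sum(ratios) / len(ratios)


# Invented (segment length, processing time) pairs in seconds.
sample = [(10.0, 5.0), (8.0, 4.0), (9.0, 6.0)]
print(round(average_ratio(sample), 2))
```

Under this assumed definition, a value above 1 would mean that segments are processed faster than their audio duration.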

| Category | Metric | Result |
|---|---|---|
| Quality | Word Error Rate (WER) | 0.19 |
| Quality | Match Error Rate (MER) | 0.19 |
| Quality | Word Information Loss (WIL) | 0.25 |
| Latency | Time to First Token (Streaming) | 0.87 seconds |
| Latency | Time to First Segment (Streaming) | 6.48 seconds |
| Latency | Average Segment Length | 9.42 seconds |
| Latency | Average Ratio | 1.81 |
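
For readers unfamiliar with the quality metrics, the sketch below computes WER, MER, and WIL from a word-level alignment using their standard definitions. It assumes whitespace tokenization, no text normalization, and non-empty transcripts; the source does not say which tooling or normalization produced the reported scores, so this is an illustration of the metrics rather than the benchmark procedure.

```python
def align_counts(reference, hypothesis):
    """Return (hits, substitutions, deletions, insertions) for a minimum-edit word alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # row[j] holds (cost, hits, subs, dels, ins) for ref[:i] aligned with hyp[:j]
    row = [(j, 0, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        new_row = [(i, 0, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            c, h, s, d, n = row[j - 1]          # diagonal: match or substitute
            if ref[i - 1] == hyp[j - 1]:
                diag = (c, h + 1, s, d, n)
            else:
                diag = (c + 1, h, s + 1, d, n)
            c, h, s, d, n = row[j]              # up: reference word deleted
            delete = (c + 1, h, s, d + 1, n)
            c, h, s, d, n = new_row[j - 1]      # left: hypothesis word inserted
            insert = (c + 1, h, s, d, n + 1)
            new_row.append(min(diag, delete, insert, key=lambda t: t[0]))
        row = new_row
    _, hits, subs, dels, ins = row[-1]
    return hits, subs, dels, ins


def quality_metrics(reference, hypothesis):
    """Compute WER, MER, and WIL; both transcripts must contain at least one word."""
    hits, subs, dels, ins = align_counts(reference, hypothesis)
    errors = subs + dels + ins
    n_ref = hits + subs + dels   # words in the reference
    n_hyp = hits + subs + ins    # words in the hypothesis
    return {
        "wer": errors / n_ref,
        "mer": errors / (hits + errors),
        "wil": 1 - (hits * hits) / (n_ref * n_hyp),
    }


print(quality_metrics("the cat sat on the mat", "the cat sat on mat"))
```

WER divides the error count by the reference length, so it can exceed 1 when there are many insertions, whereas MER and WIL stay between 0 and 1.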

References