Azure/AI-Transcription¶

Model Information¶

Azure/AI-Transcription is an automatic speech recognition (ASR), or speech-to-text (STT), service that enables applications, tools, and devices to convert audio into text transcriptions.

Model Developer: Microsoft Azure
Service Type: Cloud-based ASR API
Model Release Date: November 2024
Supported Modes: Batch and streaming text transcription
Audio Input:
- Formats: RAW PCM, WAV
- Sampling Rates: 8 kHz, 16 kHz, 24 kHz, 48 kHz
- Bits Per Sample: 16 bits
- Channels: 1
Languages: 70+ languages, 140+ distinct locales (see Azure Supported Languages)
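
The audio input limits listed above can be checked locally before a file is sent for transcription. The following is a minimal sketch using Python's standard `wave` module; the function name `check_wav` and the constant names are hypothetical, and the check only mirrors the values listed on this page rather than any validation the service itself performs.

```python
import wave

# Limits copied from the Audio Input list on this page.
SUPPORTED_RATES_HZ = {8000, 16000, 24000, 48000}
REQUIRED_SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample
REQUIRED_CHANNELS = 1            # mono


def check_wav(source):
    """Return a list of problems; an empty list means the file fits the listed limits.

    `source` is a file name (str) or a binary file-like object, as accepted
    by wave.open. Hypothetical helper, not part of any Azure SDK.
    """
    problems = []
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        channels = wav.getnchannels()
    if rate not in SUPPORTED_RATES_HZ:
        problems.append(f"unsupported sampling rate: {rate} Hz")
    if width != REQUIRED_SAMPLE_WIDTH_BYTES:
        problems.append(f"unsupported sample width: {8 * width} bits")
    if channels != REQUIRED_CHANNELS:
        problems.append(f"unsupported channel count: {channels}")
    return problems
```

An empty result from `check_wav("recording.wav")` means the file matches the listed sampling rates, bit depth, and channel count. The `wave` module reads only uncompressed PCM WAV data and raises `wave.Error` for other encodings.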

Model Architecture¶

Microsoft has not publicly released detailed architectural specifications of Azure AI models.

Benchmark Scores¶

The following table reports transcription quality and streaming-mode latency for Azure/AI-Transcription. In this context, a token is a unit of text (typically an individual word) that the ASR model outputs, while a segment is a section of audio bytes that the model processes at once.

Time to First Token was benchmarked using 1-second audio segments containing a single word; all other metrics were measured on long-form, multi-sentence audio samples. Average Ratio measures real-time performance by comparing each segment's length to the time required to process it.
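
As an illustration of the Average Ratio metric, the sketch below averages a per-segment ratio over a list of measurements. It assumes the ratio is segment audio length divided by processing time, so that values above 1 indicate faster-than-real-time processing; this page does not state the direction of the ratio, and the function name `average_ratio` is hypothetical.

```python
def average_ratio(segments):
    """Mean of (audio seconds / processing seconds) over all segments.

    `segments` is an iterable of (audio_seconds, processing_seconds) pairs.
    ASSUMPTION: the ratio is audio length divided by processing time;
    the benchmark description does not spell out the direction.
    """
    ratios = [audio / processing for audio, processing in segments]
    if not ratios:
        raise ValueError("need at least one segment")
    return sum(ratios) / len(ratios)
```

For example, two segments of 10 s and 9 s processed in 5 s and 3 s give per-segment ratios of 2.0 and 3.0, so `average_ratio([(10.0, 5.0), (9.0, 3.0)])` returns 2.5.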

Category	Metric	Result
Quality	Word Error Rate (WER)	0.19
Quality	Match Error Rate (MER)	0.19
Quality	Word Information Loss (WIL)	0.25
Latency	Time to First Token (Streaming)	0.87 seconds
Latency	Time to First Segment (Streaming)	6.48 seconds
Latency	Average Segment Length	9.42 seconds
Latency	Average Ratio	1.81
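
The three quality metrics in the table can be computed from a word-level alignment between a reference transcript and the model output. The sketch below uses the commonly cited definitions, with H hits, S substitutions, D deletions, and I insertions: WER = (S + D + I) / (H + S + D), MER = (S + D + I) / (H + S + D + I), and WIL = 1 - H² / ((H + S + D)(H + S + I)). This page does not say which tooling or text normalization produced the reported scores, so the function names and the whitespace tokenization here are assumptions for illustration only.

```python
def align_counts(reference, hypothesis):
    """Count hits, substitutions, deletions, insertions along one
    minimum-edit-distance word alignment (hypothetical helper)."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Walk back from the bottom-right corner to classify each step.
    hits = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (not same):
            if same:
                hits += 1
            else:
                subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1  # reference word missing from the hypothesis
            i -= 1
        else:
            ins += 1   # extra word in the hypothesis
            j -= 1
    return hits, subs, dels, ins


def error_metrics(reference, hypothesis):
    """Return WER, MER, and WIL as fractions (not percentages)."""
    hits, subs, dels, ins = align_counts(reference, hypothesis)
    errors = subs + dels + ins
    ref_len = hits + subs + dels
    hyp_len = hits + subs + ins
    if ref_len == 0 or hyp_len == 0:
        raise ValueError("reference and hypothesis must both contain words")
    return {
        "wer": errors / ref_len,
        "mer": errors / (hits + errors),
        "wil": 1 - (hits * hits) / (ref_len * hyp_len),
    }
```

With the reference "the quick brown fox" and the hypothesis "the quick brown dog" there are three hits and one substitution, so WER and MER are both 0.25 and WIL is 1 - 9/16 = 0.4375. WER and MER differ only when insertions are present, since insertions enlarge the MER denominator but not the WER denominator.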
