Azure/AI-Transcription
Model Information
Azure/AI-Transcription is an automatic speech recognition (ASR), or speech-to-text (STT), service that enables applications, tools, or devices to convert audio into text transcriptions.
- Model Developer: Microsoft Azure
- Service Type: Cloud-based ASR API
- Model Release Date: November 2024
- Supported Modes: Batch and streaming text transcription
- Audio Input:
  - Formats: RAW PCM, WAV
  - Sampling Rates: 8 kHz, 16 kHz, 24 kHz, 48 kHz
  - Bits Per Sample: 16 bits
  - Channels: 1 (mono)
- Languages: 70+ languages, 140+ distinct locales (see Azure Supported Languages)
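The audio input constraints listed above can be checked locally before a file is sent for transcription. The following is a minimal sketch using only Python's standard `wave` module; the function name `check_wav` and the constant names are hypothetical and are not part of any Azure SDK. It applies to WAV files only, since RAW PCM carries no header to inspect.

```python
import wave

# Values taken from the "Audio Input" list above.
SUPPORTED_RATES_HZ = {8000, 16000, 24000, 48000}
SUPPORTED_SAMPLE_WIDTH_BYTES = 2   # 16 bits per sample
SUPPORTED_CHANNELS = 1             # mono


def check_wav(source):
    """Return a list of problems; an empty list means the WAV header
    matches the listed input constraints.

    `source` may be a file path or a binary file-like object.
    """
    problems = []
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        channels = wav.getnchannels()

    if rate not in SUPPORTED_RATES_HZ:
        problems.append(f"unsupported sampling rate: {rate} Hz")
    if width != SUPPORTED_SAMPLE_WIDTH_BYTES:
        problems.append(f"unsupported sample width: {width * 8} bits")
    if channels != SUPPORTED_CHANNELS:
        problems.append(f"unsupported channel count: {channels}")
    return problems
```

A call such as `check_wav("recording.wav")` (the file name is illustrative) returns an empty list for a 16 kHz, 16-bit mono file and one message per violated constraint otherwise.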
Model Architecture
Microsoft has not publicly released detailed architectural specifications of Azure AI models.
Benchmark Scores
The following data shows the transcription quality of Azure/AI-Transcription and its response time in streaming mode. In this context, a token refers to a unit of text (typically an individual word) that the ASR model outputs, while a segment is a section of audio bytes that the model processes at once.
Time to First Token was benchmarked using 1-second audio segments containing a single word; all other metrics were measured on long-form, multi-sentence audio samples. Average Ratio measures real-time performance by comparing each segment's length to the time required to process it.
Category | Metric | Result |
---|---|---|
Quality | Word Error Rate (WER) | 0.19 |
Quality | Match Error Rate (MER) | 0.19 |
Quality | Word Information Loss (WIL) | 0.25 |
Latency | Time to First Token (Streaming) | 0.87 seconds |
Latency | Time to First Segment (Streaming) | 6.48 seconds |
Latency | Average Segment Length | 9.42 seconds |
Latency | Average Ratio | 1.81 |
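For readers who want to reproduce comparable figures on their own data, the sketch below shows one common way to compute the metrics in the table. It is not the benchmark code behind these results, which the source does not describe. It assumes the widely used definitions: WER = (S + D + I) / N_ref, MER = (S + D + I) / (H + S + D + I), and WIL = 1 - H² / (N_ref · N_hyp), where H, S, D, and I are the hits, substitutions, deletions, and insertions in a minimum-edit-distance word alignment. The direction of Average Ratio (audio length divided by processing time) is also an assumption, and all function names are hypothetical.

```python
def align_counts(ref_words, hyp_words):
    """Return (hits, subs, dels, ins) for one minimum-edit-distance alignment.

    Each cell holds (cost, hits, subs, dels, ins) for ref[:i] vs hyp[:j].
    When several alignments share the same cost, the diagonal move is
    preferred; other tools may break such ties differently.
    """
    rows, cols = len(ref_words) + 1, len(hyp_words) + 1
    dp = [[None] * cols for _ in range(rows)]
    dp[0][0] = (0, 0, 0, 0, 0)
    for i in range(1, rows):
        dp[i][0] = (i, 0, 0, i, 0)          # only deletions
    for j in range(1, cols):
        dp[0][j] = (j, 0, 0, 0, j)          # only insertions
    for i in range(1, rows):
        for j in range(1, cols):
            c, h, s, d, n = dp[i - 1][j - 1]
            if ref_words[i - 1] == hyp_words[j - 1]:
                diagonal = (c, h + 1, s, d, n)       # hit
            else:
                diagonal = (c + 1, h, s + 1, d, n)   # substitution
            c, h, s, d, n = dp[i - 1][j]
            deletion = (c + 1, h, s, d + 1, n)
            c, h, s, d, n = dp[i][j - 1]
            insertion = (c + 1, h, s, d, n + 1)
            dp[i][j] = min(diagonal, deletion, insertion, key=lambda t: t[0])
    return dp[-1][-1][1:]


def quality_metrics(reference, hypothesis):
    """Compute WER, MER, and WIL for two whitespace-tokenized strings."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref or not hyp:
        raise ValueError("reference and hypothesis must both be non-empty")
    hits, subs, dels, ins = align_counts(ref, hyp)
    errors = subs + dels + ins
    return {
        "wer": errors / len(ref),
        "mer": errors / (hits + errors),
        "wil": 1 - (hits * hits) / (len(ref) * len(hyp)),
    }


def average_ratio(segments):
    """Mean of audio_seconds / processing_seconds over all segments.

    Assumed direction: values above 1 mean faster than real time.
    """
    ratios = [audio / processing for audio, processing in segments]
    return sum(ratios) / len(ratios)
```

For example, `quality_metrics("the quick brown fox jumps", "the quick brown dog jumps high")` aligns to four hits, one substitution, and one insertion, giving a WER of 0.4. Real evaluations usually normalize punctuation and casing more carefully than the simple `lower().split()` used here.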