Azure/AI-Transcription

Model Information

Azure/AI-Transcription is an automatic speech recognition (ASR), or speech-to-text (STT), service that enables applications, tools, or devices to convert audio into text transcriptions.

  • Model Developer: Microsoft Azure
  • Service Type: Cloud-based ASR API
  • Model Release Date: November 2024
  • Supported Modes: Batch and streaming text transcription
  • Audio Input:
    • Formats: RAW PCM, WAV
    • Sampling Rates: 8 kHz, 16 kHz, 24 kHz, 48 kHz
    • Bits Per Sample: 16 bits
    • Channels: 1
  • Languages: 70+ languages across 140+ distinct locales (see Azure Supported Languages)
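
The audio input constraints listed above can be checked before a file is submitted. The following is a minimal sketch using only Python's standard `wave` module; the function name `check_wav` and the constant names are illustrative assumptions and are not part of any Azure SDK. It covers WAV files only, not headerless RAW PCM.

```python
import wave

# Limits taken from the "Audio Input" list above.
SUPPORTED_SAMPLE_RATES_HZ = {8000, 16000, 24000, 48000}
SUPPORTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample
SUPPORTED_CHANNELS = 1            # mono


def check_wav(source):
    """Return a list of problems found in a WAV file (empty if none).

    `source` is a file path (str) or a binary file-like object.
    Hypothetical helper for illustration only.
    """
    problems = []
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        channels = wav.getnchannels()

    if rate not in SUPPORTED_SAMPLE_RATES_HZ:
        problems.append(f"unsupported sampling rate: {rate} Hz")
    if width != SUPPORTED_SAMPLE_WIDTH_BYTES:
        problems.append(f"unsupported sample width: {width * 8} bits")
    if channels != SUPPORTED_CHANNELS:
        problems.append(f"unsupported channel count: {channels}")
    return problems
```

An empty list means the file matches the listed format; otherwise each entry names one mismatch, so the audio can be resampled or downmixed before transcription.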

Model Architecture

Microsoft has not publicly released detailed architectural specifications of Azure AI models.


Benchmark Scores

The table below reports Azure/AI-Transcription's transcription quality and its response time in streaming mode. In this context, a token is a unit of text (typically an individual word) that the ASR model outputs, while a segment is a section of audio that the model processes at once.

Time to First Token was benchmarked using 1-second audio segments containing a single word; all other metrics were measured on long-form, multi-sentence audio samples. Average Ratio measures real-time performance by comparing each segment’s length to the time required to process it.
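
As an illustration of the Average Ratio, the sketch below averages a per-segment ratio over a list of measurements. It assumes the ratio is audio duration divided by processing time, so that values above 1 mean faster than real time; the source does not state the exact formula or direction, and the function name `average_ratio` is hypothetical.

```python
from statistics import mean


def average_ratio(segments):
    """Mean of (audio seconds / processing seconds) over all segments.

    `segments` is a list of (audio_seconds, processing_seconds) pairs.
    Assumed definition: the benchmark's exact formula is not published.
    """
    ratios = []
    for audio_seconds, processing_seconds in segments:
        if processing_seconds <= 0:
            raise ValueError("processing time must be positive")
        ratios.append(audio_seconds / processing_seconds)
    return mean(ratios)
```

Under this assumption, a 10-second segment processed in 5 seconds and a 9-second segment processed in 6 seconds give ratios of 2.0 and 1.5, for an average of 1.75.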

Category   Metric                                 Result
Quality    Word Error Rate (WER)                  0.19
Quality    Match Error Rate (MER)                 0.19
Quality    Word Information Loss (WIL)            0.25
Latency    Time to First Token (Streaming)        0.87 seconds
Latency    Time to First Segment (Streaming)      6.48 seconds
Latency    Average Segment Length                 9.42 seconds
Latency    Average Ratio                          1.81
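
The three quality metrics in the table are conventionally derived from a word-level alignment between a reference transcript and the model's hypothesis, counting hits (H), substitutions (S), deletions (D), and insertions (I). The sketch below uses the commonly cited definitions WER = (S+D+I)/(H+S+D), MER = (S+D+I)/(H+S+D+I), and WIL = 1 - H²/((H+S+D)(H+S+I)). It is an assumption that the benchmark used these definitions; the source does not describe its scoring tool or text normalization, and the function names are illustrative.

```python
def align_counts(reference, hypothesis):
    """Return (hits, substitutions, deletions, insertions) at word level."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)

    # cost[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back through the table to classify each alignment step.
    hits = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] == hyp[j - 1]:
                hits += 1
            else:
                subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return hits, subs, dels, ins


def error_metrics(reference, hypothesis):
    """Return a dict with WER, MER, and WIL for one transcript pair."""
    hits, subs, dels, ins = align_counts(reference, hypothesis)
    errors = subs + dels + ins
    ref_words = hits + subs + dels
    hyp_words = hits + subs + ins
    if ref_words == 0:
        raise ValueError("reference transcript must not be empty")
    wer = errors / ref_words
    mer = errors / (hits + errors)
    # With an empty hypothesis no information is preserved, so WIL is 1.
    wil = 1.0 if hyp_words == 0 else 1.0 - (hits / ref_words) * (hits / hyp_words)
    return {"wer": wer, "mer": mer, "wil": wil}
```

For the reference "the quick brown fox jumps" and the hypothesis "the quick brown dog jumps high", the alignment has four hits, one substitution, and one insertion, giving a WER of 0.4, an MER of about 0.33, and a WIL of about 0.47. Real evaluations typically normalize case and punctuation first, which this sketch omits.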

References