Azure/AI-Speech

Model Information

Azure/AI-Speech is a text-to-speech (TTS) service that enables applications, tools, or devices to convert text into human-like synthesized speech.

  • Model Developer: Microsoft
  • Model Release Date: May 2018
  • Supported Languages: 140+ languages and locales with 500+ voices
    • Primary Coverage: English (US/UK/AU/CA/IN/etc.), Spanish, French, German, Italian, Portuguese, Japanese, Korean, Chinese (Mandarin), Hindi, Arabic, Russian
    • Recent Additions: Albanian, Arabic (Lebanon/Oman), Azerbaijani, Bosnian, Georgian, Mongolian, Nepali, Tamil (Malaysia)
  • Audio Output:
    • Sampling Rates: 8 kHz, 16 kHz, 24 kHz, 48 kHz (high-fidelity)
    • Formats: RAW PCM, RIFF, MP3, Opus, OGG, WEBM, AMR-WB, G.722
  • Voice Types: Standard neural voices, High-Definition (HD) voices with emotion detection, custom professional voices, personal voices, and multilingual voices
  • Applicable License: Microsoft Online Services License

Model Architecture

Microsoft has not publicly released detailed architectural specifications of Azure/AI-Speech.


Parameters

Azure/AI-Speech supports configurable parameters that can be set through the inference API or the realtime distiller.

Parameter | Type | Description
language | string | Language code for speech synthesis (e.g., "en-US"). For a list of supported languages, refer to Azure AI Speech Language and Voice.
voice | string | Voice name for speech synthesis (e.g., "en-US-JennyNeural"). For a list of supported voices, refer to Azure AI Speech Language and Voice.
speed | number (0.25–4.0) | Speech rate multiplier controlling synthesis speed.
sample_rate | integer | Sampling rate of the generated audio, in Hz. Supported values: 8000, 16000, 24000, 48000.
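
The parameter constraints listed above can be checked on the client before a request is sent. The following is a minimal sketch, not the service's own client: the `SynthesisParams` class, the `build_payload` helper, the `text` field name, and the default values for `speed` and `sample_rate` are all assumptions made for illustration. Only the parameter names, the speed range, and the supported sample rates come from the table.

```python
from dataclasses import dataclass, asdict

# Values taken from the parameter table.
SUPPORTED_SAMPLE_RATES = (8000, 16000, 24000, 48000)
SPEED_MIN, SPEED_MAX = 0.25, 4.0


@dataclass(frozen=True)
class SynthesisParams:
    """Hypothetical container for the four documented parameters."""
    language: str = "en-US"
    voice: str = "en-US-JennyNeural"
    speed: float = 1.0          # assumed default, not documented
    sample_rate: int = 24000    # assumed default, not documented

    def __post_init__(self):
        # Reject values outside the documented range or set.
        if not SPEED_MIN <= self.speed <= SPEED_MAX:
            raise ValueError(
                f"speed must be between {SPEED_MIN} and {SPEED_MAX}, got {self.speed}"
            )
        if self.sample_rate not in SUPPORTED_SAMPLE_RATES:
            raise ValueError(
                f"sample_rate must be one of {SUPPORTED_SAMPLE_RATES}, got {self.sample_rate}"
            )


def build_payload(text: str, params: SynthesisParams) -> dict:
    """Combine input text and parameters into a dict.

    The "text" key is a placeholder; the real request schema may differ.
    """
    return {"text": text, **asdict(params)}


if __name__ == "__main__":
    payload = build_payload("Hello, world.", SynthesisParams(speed=1.25, sample_rate=16000))
    print(payload)
```

Validating locally in this way surfaces an out-of-range `speed` or an unsupported `sample_rate` as a `ValueError` before any network call is made.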

Benchmark Scores

The following data reports the response time from text input to the first synthesized speech segment in streaming mode. A token is an individual word, and a segment is a complete sentence ending with punctuation.

Token Count | Time to First Segment (Streaming)
100 | 0.16 seconds
200 | 0.18 seconds
300 | 0.17 seconds
400 | 0.20 seconds
500 | 0.17 seconds
600 | 0.19 seconds
700 | 0.18 seconds
800 | 0.16 seconds
900 | 0.16 seconds
1000 | 0.18 seconds

Time to first segment stays between 0.16 and 0.20 seconds for inputs of 100 to 1000 tokens and shows no upward trend as the token count grows.
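
The summary above can be reproduced directly from the benchmark figures. The sketch below copies the ten measurements from the table and computes their minimum, maximum, mean, and spread. The `summarize` helper and the variable names are illustrative choices, not part of any published tooling.

```python
from statistics import mean

# Time to first segment (seconds) keyed by input token count,
# copied from the benchmark table.
LATENCY_BY_TOKENS = {
    100: 0.16, 200: 0.18, 300: 0.17, 400: 0.20, 500: 0.17,
    600: 0.19, 700: 0.18, 800: 0.16, 900: 0.16, 1000: 0.18,
}


def summarize(latencies: dict) -> dict:
    """Return basic statistics over the latency values, rounded to 3 decimals."""
    values = list(latencies.values())
    return {
        "min": min(values),
        "max": max(values),
        "mean": round(mean(values), 3),
        "spread": round(max(values) - min(values), 3),
    }


if __name__ == "__main__":
    print(summarize(LATENCY_BY_TOKENS))
```

For the figures in the table this gives a minimum of 0.16 s, a maximum of 0.20 s, and a mean of 0.175 s, so the whole range spans 0.04 s while the input length grows tenfold.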


References