Azure/AI-Speech

Model Information

Azure/AI-Speech is a text-to-speech (TTS) service that enables applications, tools, or devices to convert text into human-like synthesized speech.

  • Model Developer: Microsoft
  • Model Release Date: May 2018
  • Supported Languages: 140+ languages and locales with 500+ voices
    • Primary Coverage: English (US/UK/AU/CA/IN/etc.), Spanish, French, German, Italian, Portuguese, Japanese, Korean, Chinese (Mandarin), Hindi, Arabic, Russian
    • Recent Additions: Albanian, Arabic (Lebanon/Oman), Azerbaijani, Bosnian, Georgian, Mongolian, Nepali, Tamil (Malaysia)
  • Audio Output:
    • Sampling Rates: 8 kHz, 16 kHz, 24 kHz, 48 kHz (high-fidelity)
    • Formats: RAW PCM, RIFF, MP3, Opus, OGG, WEBM, AMR-WB, G.722
  • Voice Types: Standard neural voices, High-Definition (HD) voices with emotion detection, custom professional voices, personal voices, and multilingual voices (see the synthesis sketch after this list)
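
For illustration, the following is a minimal synthesis sketch using the Python Speech SDK (azure-cognitiveservices-speech), selecting one of the neural voices and a 24 kHz RIFF/PCM output format. The subscription key, region, voice name, and output filename are placeholder assumptions, and the exact configuration for your deployment may differ.

```python
# Minimal text-to-speech sketch with the Azure Speech SDK for Python.
# The key, region, voice, and output path below are placeholders.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="YOUR_SPEECH_KEY",   # placeholder credential
    region="YOUR_SPEECH_REGION",      # e.g. "eastus"
)

# Choose a neural voice and a 24 kHz, 16-bit mono RIFF (WAV) output format.
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"
speech_config.set_speech_synthesis_output_format(
    speechsdk.SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm
)

# Write the synthesized audio to a local WAV file.
audio_config = speechsdk.audio.AudioOutputConfig(filename="output.wav")
synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config, audio_config=audio_config
)

result = synthesizer.speak_text_async("Hello from Azure AI Speech.").get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Synthesis finished; audio written to output.wav")
else:
    print("Synthesis did not complete:", result.reason)
```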

Model Architecture

Microsoft has not publicly released detailed architectural specifications of Azure/AI-Speech.


Benchmark Scores

The following measurements capture latency from text input to the first synthesized speech segment during streaming synthesis. Here, tokens are individual words, and a segment is a complete sentence ending with punctuation.

Token Count    Time to First Segment (Streaming)
100            0.16 seconds
200            0.18 seconds
300            0.17 seconds
400            0.20 seconds
500            0.17 seconds
600            0.19 seconds
700            0.18 seconds
800            0.16 seconds
900            0.16 seconds
1000           0.18 seconds

Time to first segment remains consistently low across input lengths, staying between 0.16 and 0.20 seconds regardless of token count.
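
As a rough illustration of how such latency numbers could be measured, the sketch below times the gap between a synthesis request and the first streamed audio chunk delivered through the SDK's synthesizing event, which approximates time to first segment. The key, region, voice, and input construction are placeholder assumptions, not the benchmark harness actually used above.

```python
# Rough sketch: measure latency from request to first streamed audio chunk.
# Key, region, and voice are placeholders; results will vary by network and region.
import time
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="YOUR_SPEECH_KEY", region="YOUR_SPEECH_REGION"
)
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

# audio_config=None: audio chunks arrive via the synthesizing event instead of
# being played back or written to a file automatically.
synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config, audio_config=None
)

first_chunk_latency = None
start = None

def on_synthesizing(evt):
    # Record the delay between the request and the first streamed audio chunk.
    global first_chunk_latency
    if first_chunk_latency is None:
        first_chunk_latency = time.perf_counter() - start

synthesizer.synthesizing.connect(on_synthesizing)

text = "word " * 100  # stand-in for a 100-token input
start = time.perf_counter()
synthesizer.speak_text_async(text).get()

if first_chunk_latency is not None:
    print(f"Time to first audio chunk: {first_chunk_latency:.2f} seconds")
```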


References