Azure/AI-Speech

Model Information

Azure/AI-Speech is a text-to-speech (TTS) service that enables applications, tools, or devices to convert text into human-like synthesized speech.

Model Developer: Microsoft
Model Release Date: May 2018
Supported Languages: 140+ languages and locales with 500+ voices
- Primary Coverage: English (US/UK/AU/CA/IN/etc.), Spanish, French, German, Italian, Portuguese, Japanese, Korean, Chinese (Mandarin), Hindi, Arabic, Russian
- Recent Additions: Albanian, Arabic (Lebanon/Oman), Azerbaijani, Bosnian, Georgian, Mongolian, Nepali, Tamil (Malaysia)
Audio Output:
- Sampling Rates: 8 kHz, 16 kHz, 24 kHz, 48 kHz (high-fidelity)
- Formats: RAW PCM, RIFF, MP3, Opus, OGG, WEBM, AMR-WB, G.722
Voice Types: Standard neural voices, High-Definition (HD) voices with emotion detection, custom professional voices, personal voices, and multilingual voices
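To give a feel for what the listed sampling rates mean in practice, the following sketch estimates the size of uncompressed PCM output at each rate. The 16-bit sample depth and mono channel count are assumptions made for illustration; this model card does not state them, and `pcm_bytes` is a hypothetical helper rather than part of any Azure SDK.

```python
# Sampling rates listed in the model information, in Hz.
SAMPLING_RATES_HZ = [8_000, 16_000, 24_000, 48_000]


def pcm_bytes(sample_rate_hz: int, seconds: float,
              bits_per_sample: int = 16, channels: int = 1) -> int:
    """Return the size in bytes of uncompressed PCM audio.

    bits_per_sample=16 and channels=1 are illustrative assumptions,
    not values taken from the model card.
    """
    bytes_per_sample = bits_per_sample // 8
    return int(sample_rate_hz * seconds * bytes_per_sample * channels)


if __name__ == "__main__":
    for rate in SAMPLING_RATES_HZ:
        kib = pcm_bytes(rate, seconds=10) / 1024
        print(f"{rate / 1000:g} kHz: {kib:.1f} KiB for 10 s of raw PCM")
```

Under these assumptions, raw size grows linearly with the sampling rate, which is one reason a compressed format such as MP3 or Opus may be preferable when bandwidth matters.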

Model Architecture

Microsoft has not publicly released detailed architectural specifications of Azure/AI-Speech.

Benchmark Scores

The following table reports latency in streaming mode, measured from text submission to the first synthesized speech segment. A token is an individual word, and a segment is a complete sentence ending with punctuation.

Token Count	Time to First Segment (Streaming)
100	0.16 seconds
200	0.18 seconds
300	0.17 seconds
400	0.20 seconds
500	0.17 seconds
600	0.19 seconds
700	0.18 seconds
800	0.16 seconds
900	0.16 seconds
1000	0.18 seconds
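A measurement of this kind could be reproduced roughly as follows. This is a minimal sketch, not the procedure used to produce the table: `fake_streaming_tts` is a hypothetical stand-in for a real streaming client, and the word and sentence splitting simply follows the token and segment definitions given with the table.

```python
import re
import time
from typing import Callable, Iterable, Iterator, List


def count_tokens(text: str) -> int:
    """Count tokens, where a token is an individual word."""
    return len(text.split())


def split_segments(text: str) -> List[str]:
    """Split text into segments: complete sentences ending with punctuation."""
    return [s.strip() for s in re.findall(r"[^.!?]+[.!?]+", text)]


def fake_streaming_tts(text: str) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS client (not an Azure API).

    Yields one placeholder chunk per segment instead of real audio.
    """
    for segment in split_segments(text):
        yield segment.encode("utf-8")


def time_to_first_segment(synthesize: Callable[[str], Iterable[bytes]],
                          text: str) -> float:
    """Return seconds elapsed from submitting text until the first chunk arrives."""
    start = time.perf_counter()
    stream = iter(synthesize(text))
    if next(stream, None) is None:
        raise ValueError("text produced no complete segment")
    return time.perf_counter() - start


if __name__ == "__main__":
    sample = "Hello world. How are you today?"
    print(count_tokens(sample), "tokens,", len(split_segments(sample)), "segments")
    print(f"{time_to_first_segment(fake_streaming_tts, sample):.6f} s to first segment")
```

Replacing `fake_streaming_tts` with a function that wraps an actual streaming synthesis call would time the real service; the stand-in only demonstrates the timing logic.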

Time to first segment stays between 0.16 and 0.20 seconds across all tested input lengths (100 to 1,000 tokens), with no upward trend as the token count grows.
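The figures in the table can be summarized with a few lines of Python. The sketch below copies the table values verbatim and computes the range, the mean, and a least-squares slope as a rough check for any trend with input length; the function names are illustrative only.

```python
from statistics import mean
from typing import Dict, Tuple

# Token count -> time to first segment in seconds, copied from the table.
LATENCY_S: Dict[int, float] = {
    100: 0.16, 200: 0.18, 300: 0.17, 400: 0.20, 500: 0.17,
    600: 0.19, 700: 0.18, 800: 0.16, 900: 0.16, 1000: 0.18,
}


def summarize(latency: Dict[int, float]) -> Tuple[float, float, float]:
    """Return (minimum, maximum, mean) latency in seconds."""
    values = list(latency.values())
    return min(values), max(values), mean(values)


def slope_per_token(latency: Dict[int, float]) -> float:
    """Return the least-squares slope of latency versus token count (s/token)."""
    xs = list(latency.keys())
    ys = list(latency.values())
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    return covariance / variance


if __name__ == "__main__":
    low, high, avg = summarize(LATENCY_S)
    print(f"min={low:.2f} s, max={high:.2f} s, mean={avg:.3f} s")
    print(f"slope={slope_per_token(LATENCY_S):.2e} s per token")
```

A slope close to zero is consistent with latency that does not depend on input length, although ten data points are too few to draw a firm statistical conclusion.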
