Azure/AI-Speech
Model Information
Azure/AI-Speech is a text-to-speech (TTS) service that enables applications, tools, or devices to convert text into human-like synthesized speech.
- Model Developer: Microsoft
- Model Release Date: May 2018
- Supported Languages: 140+ languages and locales with 500+ voices
  - Primary Coverage: English (US/UK/AU/CA/IN/etc.), Spanish, French, German, Italian, Portuguese, Japanese, Korean, Chinese (Mandarin), Hindi, Arabic, Russian
  - Recent Additions: Albanian, Arabic (Lebanon/Oman), Azerbaijani, Bosnian, Georgian, Mongolian, Nepali, Tamil (Malaysia)
- Audio Output:
  - Sampling Rates: 8 kHz, 16 kHz, 24 kHz, 48 kHz (high-fidelity)
  - Formats: RAW PCM, RIFF, MP3, Opus, OGG, WEBM, AMR-WB, G.722
- Voice Types: Standard neural voices, High-Definition (HD) voices with emotion detection, custom professional voices, personal voices, and multilingual voices
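To make the voice and output-format options above concrete, here is a minimal sketch that builds (but does not send) a synthesis request using only the Python standard library. It assumes the REST endpoint pattern, header names, voice name (`en-US-JennyNeural`), and output-format identifier (`riff-24khz-16bit-mono-pcm`) from Microsoft's public Speech service documentation; none of these appear in this page, so check them against the current documentation before relying on them. The helper name `build_tts_request` is hypothetical.

```python
import urllib.request
from xml.sax.saxutils import escape, quoteattr


def build_tts_request(text, key, region,
                      voice="en-US-JennyNeural",
                      output_format="riff-24khz-16bit-mono-pcm"):
    """Build an HTTP POST request for speech synthesis without sending it.

    Assumption: endpoint, headers, voice, and format names follow the
    publicly documented Speech REST API and may change over time.
    """
    # SSML body: the text is XML-escaped so characters like & stay valid.
    ssml = (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice>"
        "</speak>"
    )
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": key,           # resource key (assumed auth mode)
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": output_format,  # selects sampling rate and codec
        "User-Agent": "tts-sketch",                 # placeholder client name
    }
    return urllib.request.Request(
        url, data=ssml.encode("utf-8"), headers=headers, method="POST"
    )


if __name__ == "__main__":
    # Placeholder credentials; nothing is sent over the network here.
    req = build_tts_request("Hello, world.", key="YOUR_KEY", region="westus")
    print(req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the returned object to `urllib.request.urlopen` would perform the call and return audio bytes in the requested format, given a valid key and region. Changing `output_format` is how a caller would pick a different sampling rate or codec from the list above.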
Model Architecture
Microsoft has not publicly released detailed architectural specifications of Azure/AI-Speech.
Benchmark Scores
The table below reports streaming latency: the time from submitting text to receiving the first synthesized speech segment. A token is a single word, and a segment is a complete sentence ending with punctuation.
| Token Count | Time to First Segment (Streaming) |
|---|---|
| 100 | 0.16 seconds |
| 200 | 0.18 seconds |
| 300 | 0.17 seconds |
| 400 | 0.20 seconds |
| 500 | 0.17 seconds |
| 600 | 0.19 seconds |
| 700 | 0.18 seconds |
| 800 | 0.16 seconds |
| 900 | 0.16 seconds |
| 1000 | 0.18 seconds |
Time to first segment stays between 0.16 and 0.20 seconds across all tested input lengths (100 to 1,000 tokens) and shows no upward trend as token count grows.
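As a quick check on that summary, the sketch below recomputes the minimum, maximum, and mean latency from the table values. The dictionary simply transcribes the table; the function name `summarize` is illustrative.

```python
from statistics import mean

# Time to first segment (seconds), keyed by input token count,
# transcribed from the benchmark table.
LATENCY_S = {
    100: 0.16, 200: 0.18, 300: 0.17, 400: 0.20, 500: 0.17,
    600: 0.19, 700: 0.18, 800: 0.16, 900: 0.16, 1000: 0.18,
}


def summarize(latencies):
    """Return min, max, and mean latency (rounded to 3 decimals)."""
    values = list(latencies.values())
    return {
        "min": min(values),
        "max": max(values),
        "mean": round(mean(values), 3),
    }


if __name__ == "__main__":
    print(summarize(LATENCY_S))
```

The spread between the fastest and slowest measurement is 0.04 seconds, and the 1,000-token input is only 0.02 seconds slower than the 100-token input, which is consistent with latency being driven by the first sentence rather than by total input length.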