Azure/AI-Speech
Model Information
Azure/AI-Speech is a text-to-speech (TTS) service that enables applications, tools, or devices to convert text into human-like synthesized speech.
- Model Developer: Microsoft
- Model Release Date: May 2018
- Supported Languages: 140+ languages and locales with 500+ voices
- Primary Coverage: English (US/UK/AU/CA/IN/etc.), Spanish, French, German, Italian, Portuguese, Japanese, Korean, Chinese (Mandarin), Hindi, Arabic, Russian
- Recent Additions: Albanian, Arabic (Lebanon/Oman), Azerbaijani, Bosnian, Georgian, Mongolian, Nepali, Tamil (Malaysia)
- Audio Output:
  - Sampling Rates: 8 kHz, 16 kHz, 24 kHz, 48 kHz (high-fidelity)
  - Formats: RAW PCM, RIFF, MP3, Opus, OGG, WEBM, AMR-WB, G.722
- Voice Types: Standard neural voices, High-Definition (HD) voices with emotion detection, custom professional voices, personal voices, and multilingual voices
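Requests that select a specific voice and locale are commonly expressed in SSML (Speech Synthesis Markup Language). The following is a minimal sketch, not the service's official client code: it only builds and validates an SSML string locally. The helper name `build_ssml` is hypothetical, and the voice name `en-US-JennyNeural` is an illustrative assumption, not a value taken from this page.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# W3C SSML namespace used on the root <speak> element.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text: str, voice: str = "en-US-JennyNeural", lang: str = "en-US") -> str:
    """Wrap plain text in a minimal single-voice SSML document.

    `voice` and `lang` defaults are illustrative assumptions; substitute
    any voice and locale that the service actually offers.
    """
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice>"
        "</speak>"
    )


ssml = build_ssml("Tom & Jerry say hello.")

# Parsing confirms the string is well-formed XML before it is sent anywhere.
ET.fromstring(ssml)
print(ssml)
```

Escaping the text matters because characters such as `&` and `<` would otherwise produce malformed XML.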
Model Architecture
Microsoft has not publicly released detailed architectural specifications of Azure/AI-Speech.
Benchmark Scores
The following table reports the latency from text input to the first synthesized speech segment in streaming mode. A token is an individual word, and a segment is a complete sentence ending with punctuation.
Token Count | Time to First Segment (Streaming) |
---|---|
100 | 0.16 seconds |
200 | 0.18 seconds |
300 | 0.17 seconds |
400 | 0.20 seconds |
500 | 0.17 seconds |
600 | 0.19 seconds |
700 | 0.18 seconds |
800 | 0.16 seconds |
900 | 0.16 seconds |
1000 | 0.18 seconds |
Time to first segment stays between 0.16 and 0.20 seconds across all tested input lengths (100 to 1,000 tokens) and shows no clear dependence on token count.
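The figures in the table can be summarized with a short script. This sketch simply copies the reported values into a dictionary and computes the minimum, maximum, mean, and spread; the variable names are hypothetical and nothing here calls the service.

```python
from statistics import mean

# Time to first segment (seconds) by token count, copied from the table above.
ttfs_seconds = {
    100: 0.16, 200: 0.18, 300: 0.17, 400: 0.20, 500: 0.17,
    600: 0.19, 700: 0.18, 800: 0.16, 900: 0.16, 1000: 0.18,
}

values = list(ttfs_seconds.values())

summary = {
    "min": min(values),
    "max": max(values),
    "mean": round(mean(values), 3),
    # Spread between the slowest and fastest measurement.
    "spread": round(max(values) - min(values), 2),
}

print(summary)
```

The spread of 0.04 seconds is small relative to the mean of about 0.175 seconds, which is consistent with latency that does not grow with input length over this range.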