Text-to-Speech (TTS) API¶
The Text-to-Speech (TTS) API generates spoken audio from text input using the AIRefinery or the AsyncAIRefinery client.
This API supports two modes: batch synthesis mode, which waits for complete synthesis before returning all audio data at once, and streaming mode, which yields audio chunks as they're produced during synthesis.
Asynchronous TTS¶
The AsyncAIRefinery client asynchronously generates speech from input text.
Batch and Streaming Methods¶
- `audio.speech.create()`: returns the complete audio after synthesis (batch synthesis mode)
- `audio.speech.with_streaming_response.create()`: returns audio chunks during synthesis (streaming mode)
Parameters:¶
| Parameter | Type | Description |
|---|---|---|
| `model` | string (required) | Model ID used to generate the speech. |
| `input` | string (required) | The text to convert to speech. |
| `voice` | string (required) | Voice name for speech synthesis (e.g., `"en-US-JennyNeural"`). |
| `response_format` | string (optional) | Audio format for output. Supported values: `"wav"`, `"mp3"`, `"pcm"`, `"opus"`. (default: `"wav"`) |
| `speed` | number (optional) | Speech speed multiplier (0.25 to 4.0). (default: 1.0) |
| `timeout` | number (optional) | Request timeout in seconds. |
| `extra_headers` | map (optional) | Additional HTTP headers to include with the request. |
| `extra_body` | map (optional) | Additional parameters for speech synthesis. See the Extra Body Parameters table below. |
Extra Body Parameters (extra_body dict):
These parameters should be passed as a dictionary to the extra_body parameter:
| Parameter | Type | Description |
|---|---|---|
| `speech_synthesis_language` | string (optional) | Language code for speech synthesis (e.g., `"en-US"`, `"fr-FR"`). |
| `sample_rate` | integer (optional) | Audio sampling rate in Hz (e.g., 16000, 24000, 48000). See the Supported Sampling Rates table. |
| `enable_word_boundary` | boolean (optional) | If true, returns word timing metadata alongside the audio. (default: false) |
| `boundary_types` | List[string] (optional) | Filters which boundary types to include. Supported values: `"word"`, `"punctuation"`, `"sentence"`. Omit to receive all three types. Cannot be an empty array. |
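Because `boundary_types` cannot be an empty array, it can be worth validating the `extra_body` dict client-side before sending a request. A minimal sketch of such a check (the `validate_extra_body` helper is hypothetical, not part of the client):

```python
def validate_extra_body(extra_body: dict) -> dict:
    """Hypothetical client-side check for common extra_body mistakes."""
    allowed = {"word", "punctuation", "sentence"}
    boundary_types = extra_body.get("boundary_types")
    if boundary_types is not None:
        if len(boundary_types) == 0:
            # The API rejects an empty list; omit the key to receive all types
            raise ValueError("boundary_types cannot be an empty array; omit it instead")
        unknown = set(boundary_types) - allowed
        if unknown:
            raise ValueError(f"unsupported boundary types: {sorted(unknown)}")
    return extra_body


# Valid: restricts the returned events to word boundaries only
validate_extra_body({"enable_word_boundary": True, "boundary_types": ["word"]})
```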
Returns:¶
Batch Synthesis¶
The entire text input is processed in a single request, and the complete synthesized audio is returned only after generation is finished.
In this mode, the API returns a TTSResponse object with the following fields/methods:
| Field/Method | Type | Description |
|---|---|---|
| `content` | bytes | Raw audio bytes of the synthesized speech. |
| `word_boundaries` | List[TTSWordBoundaryEvent] (optional) | List of word timing metadata. Only present when `enable_word_boundary=True`. |
| `write_to_file(file)` | method | Saves the audio content to the specified file. |
| `stream_to_file(file, chunk_size)` | method | Streams the audio to a file in chunks. |
| `iter_bytes(chunk_size)` | method | Iterates over the audio in byte chunks. |
| `aiter_bytes(chunk_size)` | method | Asynchronously iterates over the audio in byte chunks. |
Streaming¶
Synthesized audio is returned incrementally in chunks as it is generated, allowing playback to begin before the full audio is ready.
In this mode, the API returns a StreamingResponse object with the following fields/methods:
| Field/Method | Type | Description |
|---|---|---|
| `iter(stream_generator())` | iterator | Iterator over bytes chunks (or mixed bytes/TTSWordBoundaryEvent when word boundaries are enabled). |
| `stream_generator.__aiter__()` | async iterator | Async iterator over bytes chunks (or mixed types when word boundaries are enabled). |
| `stream_to_file(file_path)` | method | Saves the full streamed audio content to the specified file. Automatically handles sync or async behavior depending on `is_async`. |
Supported Audio Formats¶
Different use cases prioritize different trade-offs: fidelity, size, compatibility, or streaming efficiency. Supporting multiple formats ensures the API can serve everything from phone-based IVR to high-quality media production.
| Format | Type | Characteristics | Typical Use Cases |
|---|---|---|---|
| WAV / PCM | Uncompressed | Highest fidelity, large files | Studio recording, audio processing |
| MP3 | Lossy compression | Small file size, universally supported | Web playback, mobile apps, archival |
| Ogg Opus | Modern codec | Excellent quality at low bitrates, efficient streaming | Real-time communication, low-bandwidth scenarios |
Supported Sampling Rates¶
| Sampling Rate (Hz) | Typical Use |
|---|---|
| 8000 | Telephony / IVR |
| 16000 | Wide-band speech |
| 22050 / 24000 | High-quality voice assistants |
| 44100 / 48000 | Broadcast / studio quality |
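For raw `pcm` output, the sampling rate determines data size directly: 16-bit mono audio yields `sample_rate * 2` bytes per second. A small sketch of that arithmetic, useful for sizing buffers or checking the duration of received audio (the helper names are illustrative, not part of the client):

```python
def pcm_bytes_per_second(sample_rate: int, channels: int = 1, sample_width: int = 2) -> int:
    """Bytes of raw PCM audio per second (sample_width=2 means 16-bit samples)."""
    return sample_rate * channels * sample_width


def pcm_duration_seconds(num_bytes: int, sample_rate: int,
                         channels: int = 1, sample_width: int = 2) -> float:
    """Playback duration implied by a raw PCM byte count."""
    return num_bytes / pcm_bytes_per_second(sample_rate, channels, sample_width)


# 16 kHz, 16-bit mono: 32,000 bytes per second
print(pcm_bytes_per_second(16000))         # 32000
print(pcm_duration_seconds(64000, 16000))  # 2.0
```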
Word Boundary Events (Azure-specific)¶
When enable_word_boundary is set to true in extra_body, the API returns timing metadata for words, punctuation, and sentences during synthesis.
TTSWordBoundaryEvent fields:
| Field | Type | Description |
|---|---|---|
| `type` | string | Event type identifier. Always `"word_boundary"`. |
| `text` | string | The word or punctuation text. |
| `audio_offset_ms` | float | Time offset in milliseconds from the start of the audio. |
| `duration_ms` | float | Duration of the word in milliseconds. |
| `text_offset` | integer | Character offset in the original input text. |
| `word_length` | integer | Length of the word in characters. |
| `boundary_type` | string | Type of boundary. Supported values: `"word"`, `"punctuation"`, `"sentence"`. |
Batch mode response (with word boundaries): Returns JSON containing audio (base64-encoded) and word_boundaries array.
Streaming mode response (with word boundaries): Returns NDJSON stream with mixed {"type": "audio", "data": "..."} and {"type": "word_boundary", ...} events.
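Based on the wire format described above, a streaming consumer can demultiplex each NDJSON line into either decoded audio bytes or a boundary event. A hedged sketch of that parsing (the client normally does this for you; fields beyond `type` and `data` follow the event table above, and actual payloads may differ):

```python
import base64
import json


def parse_ndjson_line(line: str):
    """Split one NDJSON event into ('audio', bytes) or ('word_boundary', dict)."""
    event = json.loads(line)
    if event["type"] == "audio":
        # Audio payloads arrive base64-encoded
        return "audio", base64.b64decode(event["data"])
    return event["type"], event


audio_line = json.dumps({"type": "audio", "data": base64.b64encode(b"\x00\x01").decode()})
kind, payload = parse_ndjson_line(audio_line)
print(kind, payload)  # audio b'\x00\x01'
```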
Example Usage:¶
Batch Synthesis¶
```python
import os
import asyncio

from air import AsyncAIRefinery
from dotenv import load_dotenv

load_dotenv()  # loads your API_KEY from your local '.env' file

api_key = str(os.getenv("API_KEY"))


async def tts_synthesis_async():
    # Initialize the AI Refinery client
    client = AsyncAIRefinery(api_key=api_key)

    # Generate speech from text (batch mode, async).
    # Speech synthesis language and sample rate can be
    # specified using the `extra_body` parameter.
    # Speed can be adjusted from 0.25x (very slow) to 4.0x (very fast).
    response = await client.audio.speech.create(
        model="Azure/AI-Speech",  # Specify the model to generate audio
        input="Hello, this is a test of text-to-speech synthesis.",
        voice="en-US-JennyNeural",  # Specify the voice used for speech synthesis
        response_format="wav",
        speed=1.0,  # e.g. speed=0.75 results in slow speech, speed=1.5 in fast speech
        extra_body={
            "speech_synthesis_language": "en-US",
            "sample_rate": 24000,
        },
    )

    # Save the audio to a file
    response.write_to_file("output.wav")
    print(f"Audio saved! Size: {len(response.content)} bytes")


# Run the example
if __name__ == "__main__":
    asyncio.run(tts_synthesis_async())
```
Streaming¶
```python
import os
import asyncio
import wave

from air import AsyncAIRefinery
from dotenv import load_dotenv

load_dotenv()  # loads your API_KEY from your local '.env' file

api_key = str(os.getenv("API_KEY"))


async def tts_synthesis_async():
    # Initialize the AsyncAIRefinery client
    client = AsyncAIRefinery(api_key=api_key)

    # Generate speech from text (streaming mode, async).
    # Speech synthesis language and sample rate can be
    # specified using the `extra_body` parameter.
    # Speed can be adjusted from 0.25x (very slow) to 4.0x (very fast).
    async with await client.audio.speech.with_streaming_response.create(
        model="Azure/AI-Speech",  # Specify the model to generate audio chunks
        input="Hello, this is a test of text-to-speech synthesis.",
        voice="en-US-JennyNeural",  # Specify the voice used for speech synthesis
        response_format="pcm",
        speed=1.0,  # e.g. speed=0.75 results in slow speech, speed=1.5 in fast speech
        extra_body={
            "speech_synthesis_language": "en-US",
            "sample_rate": 16000,
        },
    ) as response:
        # Collect audio chunks as they stream in
        audio_data = await response._collect_chunks_async()

        # Convert raw PCM to WAV format to save the audio to a file
        with wave.open("streaming_output.wav", "wb") as wav_file:
            wav_file.setnchannels(1)  # Mono audio
            wav_file.setsampwidth(2)  # 16-bit audio (2 bytes per sample)
            wav_file.setframerate(16000)  # Match the sample rate from extra_body
            wav_file.writeframes(audio_data)

        print(f"Audio saved! Size: {len(audio_data)} bytes")


# Run the example
if __name__ == "__main__":
    asyncio.run(tts_synthesis_async())
```
Batch Synthesis with Word Boundaries¶
```python
import os
import asyncio

from air import AsyncAIRefinery
from dotenv import load_dotenv

load_dotenv()  # loads your API_KEY from your local '.env' file

api_key = str(os.getenv("API_KEY"))


async def tts_with_word_boundaries():
    # Initialize the AsyncAIRefinery client
    client = AsyncAIRefinery(api_key=api_key)

    # Generate speech from text (batch mode, async).
    # Enable word boundary events via `extra_body` to get
    # timing metadata (offset, duration) for words and punctuation.
    # Use `boundary_types` to filter: "word", "punctuation", "sentence".
    response = await client.audio.speech.create(
        model="Azure/AI-Speech",
        input="Hello, this is a test.",
        voice="en-US-JennyNeural",
        response_format="wav",
        extra_body={
            "speech_synthesis_language": "en-US",
            "sample_rate": 24000,
            "enable_word_boundary": True,
            "boundary_types": ["word", "punctuation", "sentence"],
        },
    )

    response.write_to_file("output.wav")

    # Access word boundary events
    for event in response.word_boundaries or []:
        print(
            f"[{event.boundary_type:>11}] '{event.text}' "
            f"@ {event.audio_offset_ms:.0f}ms "
            f"(duration: {event.duration_ms:.0f}ms)"
        )


# Run the example
if __name__ == "__main__":
    asyncio.run(tts_with_word_boundaries())
```
Streaming with Word Boundaries¶
```python
import os
import asyncio

from air import AsyncAIRefinery
from dotenv import load_dotenv

load_dotenv()  # loads your API_KEY from your local '.env' file

api_key = str(os.getenv("API_KEY"))


async def tts_streaming_with_word_boundaries():
    # Initialize the AsyncAIRefinery client
    client = AsyncAIRefinery(api_key=api_key)

    # Generate speech from text (streaming mode, async).
    # Enable word boundary events via `extra_body` to get
    # timing metadata (offset, duration) for words and punctuation.
    async with await client.audio.speech.with_streaming_response.create(
        model="Azure/AI-Speech",
        input="Hello, this is a test.",
        voice="en-US-JennyNeural",
        response_format="pcm",
        extra_body={
            "speech_synthesis_language": "en-US",
            "sample_rate": 16000,
            "enable_word_boundary": True,
        },
    ) as response:
        async for chunk in response:
            if isinstance(chunk, bytes):
                # Handle the audio chunk; process_audio is a placeholder
                # for your own playback or buffering function
                process_audio(chunk)
            else:
                # Handle the word boundary event
                print(
                    f"[{chunk.boundary_type:>11}] '{chunk.text}' "
                    f"@ {chunk.audio_offset_ms:.0f}ms "
                    f"(duration: {chunk.duration_ms:.0f}ms)"
                )


# Run the example
if __name__ == "__main__":
    asyncio.run(tts_streaming_with_word_boundaries())
```
Synchronous TTS¶
The AIRefinery client generates speech from text synchronously. This method supports the same parameters, batch and streaming modes, and return structure as the asynchronous method.
Example Usage:¶
Batch Synthesis¶
```python
import os

from air import AIRefinery
from dotenv import load_dotenv

load_dotenv()  # loads your API_KEY from your local '.env' file

api_key = str(os.getenv("API_KEY"))


def tts_synthesis_sync():
    # Initialize the AI Refinery client
    client = AIRefinery(api_key=api_key)

    # Generate speech from text (batch mode, sync).
    # Speech synthesis language and sample rate can be
    # specified using the `extra_body` parameter.
    # Speed can be adjusted from 0.25x (very slow) to 4.0x (very fast).
    response = client.audio.speech.create(
        model="Azure/AI-Speech",  # Specify the model to generate audio
        input="Hello, this is a synchronous text-to-speech example.",
        voice="en-US-JennyNeural",  # Specify the voice used for speech synthesis
        response_format="wav",
        speed=1.0,  # e.g. speed=0.75 results in slow speech, speed=1.5 in fast speech
        extra_body={
            "speech_synthesis_language": "en-US",
            "sample_rate": 22050,
        },
    )

    # Save the audio to a file
    response.write_to_file("sync_output.wav")
    print(f"Audio saved! Size: {len(response.content)} bytes")


# Run the example
if __name__ == "__main__":
    tts_synthesis_sync()
```
Streaming¶
```python
import os
import wave

from air import AIRefinery
from dotenv import load_dotenv

load_dotenv()  # loads your API_KEY from your local '.env' file

api_key = str(os.getenv("API_KEY"))


def tts_synthesis_sync():
    # Initialize the AI Refinery client
    client = AIRefinery(api_key=api_key)

    # Generate speech from text (streaming mode, sync).
    # Speech synthesis language and sample rate can be
    # specified using the `extra_body` parameter.
    # Speed can be adjusted from 0.25x (very slow) to 4.0x (very fast).
    with client.audio.speech.with_streaming_response.create(
        model="Azure/AI-Speech",  # Specify the model to generate audio chunks
        input="Hello, this is a test of text-to-speech synthesis.",
        voice="en-US-JennyNeural",  # Specify the voice used for speech synthesis
        response_format="pcm",
        speed=1.0,  # e.g. speed=0.75 results in slow speech, speed=1.5 in fast speech
        extra_body={
            "speech_synthesis_language": "en-US",
            "sample_rate": 16000,
        },
    ) as response:
        # Collect audio chunks as they stream in
        audio_data = response._collect_chunks_sync()

        # Convert raw PCM to WAV format to save the audio to a file
        with wave.open("streaming_output.wav", "wb") as wav_file:
            wav_file.setnchannels(1)  # Mono audio
            wav_file.setsampwidth(2)  # 16-bit audio (2 bytes per sample)
            wav_file.setframerate(16000)  # Match the sample rate from extra_body
            wav_file.writeframes(audio_data)

        print(f"Audio saved! Size: {len(audio_data)} bytes")


# Run the example
if __name__ == "__main__":
    tts_synthesis_sync()
```