Text-to-Speech (TTS) API

The Text-to-Speech (TTS) API generates spoken audio from text input using either the AIRefinery (synchronous) or the AsyncAIRefinery (asynchronous) client.

This API supports two modes: batch synthesis mode, which waits for complete synthesis before returning all audio data at once, and streaming mode, which yields audio chunks as they're produced during synthesis.

Asynchronous TTS

The AsyncAIRefinery client asynchronously generates speech from input text.

Batch and Streaming Methods

  • audio.speech.create() - Returns complete audio after synthesis (batch synthesis mode)
  • audio.speech.with_streaming_response.create() - Returns audio chunks during synthesis (streaming mode)
Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| model | string (required) | Model ID used to generate the speech. |
| input | string (required) | The text to convert to speech. |
| voice | string (required) | Voice name for speech synthesis (e.g., "en-US-JennyNeural"). |
| response_format | string (optional) | Audio format of the output. Supported values: "wav", "mp3", "pcm", "opus". Default: "wav". |
| speed | number (optional) | Speech speed multiplier (0.25 to 4.0). Default: 1.0. |
| timeout | number (optional) | Request timeout in seconds. |
| extra_headers | map (optional) | Additional HTTP headers to include with the request. |
| extra_body | map (optional) | Additional parameters for speech synthesis. See the Extra Body Parameters table below. |

Extra Body Parameters (extra_body dict):

These parameters should be passed as a dictionary to the extra_body parameter:

| Parameter | Type | Description |
| --- | --- | --- |
| speech_synthesis_language | string (optional) | Language code for speech synthesis (e.g., "en-US", "fr-FR"). |
| sample_rate | integer (optional) | Audio sampling rate in Hz (e.g., 16000, 24000, 48000). See the Supported Sampling Rates table. |
| enable_word_boundary | boolean (optional) | If true, returns word timing metadata alongside the audio. Default: false. |
| boundary_types | List[string] (optional) | Filters which boundary types to include. Supported values: "word", "punctuation", "sentence". Omit to receive all three types; cannot be an empty array. |
Returns:
Batch Synthesis

The entire text input is processed in a single request, and the complete synthesized audio is returned only after generation is finished.

In this mode, the API returns a TTSResponse object with the following fields/methods:

| Field/Method | Type | Description |
| --- | --- | --- |
| content | bytes | Raw audio bytes of the synthesized speech. |
| word_boundaries | List[TTSWordBoundaryEvent] (optional) | Word timing metadata. Present only when enable_word_boundary=True. |
| write_to_file(file) | method | Saves the audio content to the specified file. |
| stream_to_file(file, chunk_size) | method | Streams the audio to a file in chunks. |
| iter_bytes(chunk_size) | method | Iterates over the audio in byte chunks. |
| aiter_bytes(chunk_size) | method | Asynchronously iterates over the audio in byte chunks. |
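To illustrate the chunked access pattern, iter_bytes(chunk_size) yields the audio in fixed-size slices. The following is a minimal, SDK-free sketch of the same slicing logic; the helper name and sample bytes are made up for this example:

```python
def iter_fixed_chunks(data: bytes, chunk_size: int = 4096):
    """Yield successive chunk_size slices of data, mimicking the
    chunking behavior that iter_bytes() provides over response.content."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]

# Example: 10 bytes of fake audio split into 4-byte chunks
chunks = list(iter_fixed_chunks(b"0123456789", chunk_size=4))
print(chunks)  # [b'0123', b'4567', b'89']
```

The final chunk may be shorter than chunk_size, which is also how the SDK's iterators behave at the end of the audio.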
Streaming

Synthesized audio is returned incrementally in chunks as it is generated, allowing playback to begin before the full audio is ready.

In this mode, the API returns a StreamingResponse object with the following fields/methods:

| Field/Method | Type | Description |
| --- | --- | --- |
| iter(stream_generator()) | iterator | Iterator of bytes chunks (or mixed bytes/TTSWordBoundaryEvent items when word boundaries are enabled). |
| stream_generator.__aiter__() | async iterator | Async iterator of bytes chunks (or mixed types when word boundaries are enabled). |
| stream_to_file(file_path) | method | Saves the full streamed audio content to the specified file. Automatically handles sync or async behavior depending on is_async. |
Supported Audio Formats

Different use cases prioritize different trade-offs—fidelity, size, compatibility, or streaming efficiency. Supporting multiple formats ensures the API can serve everything from phone-based IVR to high-quality media production.

| Format | Type | Characteristics | Typical Use Cases |
| --- | --- | --- | --- |
| WAV / PCM | Uncompressed | Highest fidelity, large files | Studio recording, audio processing |
| MP3 | Lossy compression | Small file size, universally supported | Web playback, mobile apps, archival |
| Ogg Opus | Modern codec | Excellent quality at low bitrates, efficient streaming | Real-time communication, low-bandwidth scenarios |
Supported Sampling Rates
| Sampling Rate (Hz) | Typical Use |
| --- | --- |
| 8000 | Telephony / IVR |
| 16000 | Wide-band speech |
| 22050 / 24000 | High-quality voice assistants |
| 44100 / 48000 | Broadcast / studio quality |
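When sizing buffers or estimating file sizes, note that raw PCM carries sample_rate × channels × bytes-per-sample bytes each second; for the 16-bit mono PCM used in the streaming examples below, 16000 Hz works out to 32000 bytes/s. A small sketch of that arithmetic (the helper function is illustrative, not part of the SDK):

```python
def pcm_bytes_per_second(sample_rate: int, channels: int = 1,
                         sample_width_bytes: int = 2) -> int:
    """Raw PCM data rate: sample_rate * channels * bytes per sample."""
    return sample_rate * channels * sample_width_bytes

# Data rate of 16-bit mono PCM at some of the rates above
for rate in (8000, 16000, 24000, 48000):
    print(rate, pcm_bytes_per_second(rate))  # e.g. 16000 -> 32000 bytes/s
```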

Word Boundary Events (Azure-specific)

When enable_word_boundary is set to true in extra_body, the API returns timing metadata for words, punctuation, and sentences during synthesis.

TTSWordBoundaryEvent fields:

| Field | Type | Description |
| --- | --- | --- |
| type | string | Event type identifier. Always "word_boundary". |
| text | string | The word or punctuation text. |
| audio_offset_ms | float | Time offset in milliseconds from the start of the audio. |
| duration_ms | float | Duration of the word in milliseconds. |
| text_offset | integer | Character offset in the original input text. |
| word_length | integer | Length of the word in characters. |
| boundary_type | string | Type of boundary. Supported values: "word", "punctuation", "sentence". |
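Together, text_offset and word_length locate each event in the original input string, which is useful for karaoke-style highlighting. The sketch below shows that mapping using a hypothetical stand-in class rather than the real TTSWordBoundaryEvent:

```python
from dataclasses import dataclass

@dataclass
class BoundaryEvent:
    """Stand-in holding only the fields needed to locate a word."""
    text: str
    text_offset: int
    word_length: int

def highlight(input_text: str, event: BoundaryEvent) -> str:
    """Bracket the span of input_text that the event refers to."""
    start = event.text_offset
    end = start + event.word_length
    return input_text[:start] + "[" + input_text[start:end] + "]" + input_text[end:]

text = "Hello, this is a test."
event = BoundaryEvent(text="test", text_offset=17, word_length=4)
print(highlight(text, event))  # Hello, this is a [test].
```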

Batch mode response (with word boundaries): Returns JSON containing audio (base64-encoded) and word_boundaries array.

Streaming mode response (with word boundaries): Returns NDJSON stream with mixed {"type": "audio", "data": "..."} and {"type": "word_boundary", ...} events.
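The SDK decodes these payloads for you, but a client reading the raw HTTP stream can parse the NDJSON line by line, since each line is a standalone JSON object. The sketch below parses a fabricated two-line sample in the shapes described above:

```python
import base64
import json

# Two sample NDJSON lines in the shapes described above;
# the audio payload is illustrative, not real synthesized audio
ndjson_stream = (
    '{"type": "audio", "data": "'
    + base64.b64encode(b"\x00\x01").decode("ascii")
    + '"}\n'
    '{"type": "word_boundary", "text": "Hello", "audio_offset_ms": 50.0}\n'
)

audio = bytearray()
events = []
for line in ndjson_stream.splitlines():
    msg = json.loads(line)
    if msg["type"] == "audio":
        audio.extend(base64.b64decode(msg["data"]))  # decode an audio chunk
    elif msg["type"] == "word_boundary":
        events.append(msg)  # keep the timing metadata

print(len(audio), events[0]["text"])  # 2 Hello
```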

Example Usage:
Batch Synthesis
import os
import asyncio
from air import AsyncAIRefinery
from dotenv import load_dotenv

load_dotenv() # loads your API_KEY from your local '.env' file
api_key = str(os.getenv("API_KEY"))


async def tts_synthesis_async():

    # Initialize the AI Refinery client
    client = AsyncAIRefinery(api_key=api_key)

    # Generate speech from text (batch mode, async)
    # Speech synthesis language and sample rate can
    # be specified using the `extra_body` parameter
    # Speed can be adjusted from 0.25x (very slow) to 4.0x (very fast)
    response = await client.audio.speech.create(
        model="Azure/AI-Speech", # Specify the model to generate audio
        input="Hello, this is a test of text-to-speech synthesis.",
        voice="en-US-JennyNeural", # Specify the voice used for speech synthesis
        response_format="wav",
        speed=1.0, # e.g. speed = 0.75 results in slow speech, speed = 1.5 results in fast speech
        extra_body={
            "speech_synthesis_language": "en-US",
            "sample_rate": 24000
        }
    )

    # Save the audio to a file
    response.write_to_file("output.wav")
    print(f"Audio saved! Size: {len(response.content)} bytes")

# Run the example
if __name__ == "__main__":
    asyncio.run(tts_synthesis_async())
Streaming
import os
import asyncio
import wave
from air import AsyncAIRefinery
from dotenv import load_dotenv

load_dotenv() # loads your API_KEY from your local '.env' file
api_key = str(os.getenv("API_KEY"))


async def tts_synthesis_async():

    # Initialize the AsyncAIRefinery client
    client = AsyncAIRefinery(api_key=api_key)

    # Generate speech from text (streaming mode, async)
    # Speech synthesis language and sample rate can
    # be specified using the `extra_body` parameter
    # Speed can be adjusted from 0.25x (very slow) to 4.0x (very fast)
    async with await client.audio.speech.with_streaming_response.create(
        model="Azure/AI-Speech", # Specify the model to generate audio chunks
        input="Hello, this is a test of text-to-speech synthesis.",
        voice="en-US-JennyNeural", # Specify the voice used for speech synthesis
        response_format="pcm",
        speed=1.0, # e.g. speed = 0.75 results in slow speech, speed = 1.5 results in fast speech
        extra_body={
            "speech_synthesis_language": "en-US",
            "sample_rate": 16000
        }
    ) as response:

        # Collect audio chunks as they stream in
        audio_data = bytearray()
        async for chunk in response:
            audio_data.extend(chunk)

    # Convert PCM to WAV format to save audio to a file
    with wave.open("streaming_output.wav", "wb") as wav_file:
        wav_file.setnchannels(1)  # Mono audio
        wav_file.setsampwidth(2)  # 16-bit audio (2 bytes per sample)
        wav_file.setframerate(16000)  # Match the sample rate from extra_body
        wav_file.writeframes(audio_data)

    print(f"Audio saved! Size: {len(audio_data)} bytes")

# Run the example
if __name__ == "__main__":
    asyncio.run(tts_synthesis_async())
Batch Synthesis with Word Boundaries
import os
import asyncio
from air import AsyncAIRefinery
from dotenv import load_dotenv

load_dotenv() # loads your API_KEY from your local '.env' file
api_key = str(os.getenv("API_KEY"))


async def tts_with_word_boundaries():

    # Initialize the AsyncAIRefinery client
    client = AsyncAIRefinery(api_key=api_key)

    # Generate speech from text (batch mode, async)
    # Enable word boundary events via `extra_body` to get
    # timing metadata (offset, duration) for words and punctuation
    # Use `boundary_types` to filter: "word", "punctuation", "sentence"
    response = await client.audio.speech.create(
        model="Azure/AI-Speech",
        input="Hello, this is a test.",
        voice="en-US-JennyNeural",
        response_format="wav",
        extra_body={
            "speech_synthesis_language": "en-US",
            "sample_rate": 24000,
            "enable_word_boundary": True,
            "boundary_types": ["word", "punctuation", "sentence"]
        }
    )

    response.write_to_file("output.wav")

    # Access word boundary events
    for event in response.word_boundaries or []:
        print(
            f"[{event.boundary_type:>11}] '{event.text}' "
            f"@ {event.audio_offset_ms:.0f}ms "
            f"(duration: {event.duration_ms:.0f}ms)"
        )

# Run the example
if __name__ == "__main__":
    asyncio.run(tts_with_word_boundaries())
Streaming with Word Boundaries
import os
import asyncio
from air import AsyncAIRefinery
from dotenv import load_dotenv

load_dotenv() # loads your API_KEY from your local '.env' file
api_key = str(os.getenv("API_KEY"))


async def tts_streaming_with_word_boundaries():

    # Initialize the AsyncAIRefinery client
    client = AsyncAIRefinery(api_key=api_key)

    # Generate speech from text (streaming mode, async)
    # Enable word boundary events via `extra_body` to get
    # timing metadata (offset, duration) for words and punctuation
    async with await client.audio.speech.with_streaming_response.create(
        model="Azure/AI-Speech",
        input="Hello, this is a test.",
        voice="en-US-JennyNeural",
        response_format="pcm",
        extra_body={
            "speech_synthesis_language": "en-US",
            "sample_rate": 16000,
            "enable_word_boundary": True
        }
    ) as response:
        async for chunk in response:
            if isinstance(chunk, bytes):

                # Handle audio chunk; process_audio() is a placeholder
                # for your own playback or buffering logic
                process_audio(chunk)
            else:

                # Handle word boundary event
                print(
                    f"[{chunk.boundary_type:>11}] '{chunk.text}' "
                    f"@ {chunk.audio_offset_ms:.0f}ms "
                    f"(duration: {chunk.duration_ms:.0f}ms)"
                )

# Run the example
if __name__ == "__main__":
    asyncio.run(tts_streaming_with_word_boundaries())

Synchronous TTS

The AIRefinery client generates speech from text synchronously. It supports the same parameters, batch and streaming modes, and return structures as the asynchronous client.

Example Usage:
Batch Synthesis
import os
from air import AIRefinery
from dotenv import load_dotenv

load_dotenv() # loads your API_KEY from your local '.env' file
api_key = str(os.getenv("API_KEY"))


def tts_synthesis_sync():
    # Initialize the AI Refinery client
    client = AIRefinery(api_key=api_key)

    # Generate speech from text (batch mode, sync)
    # Speech synthesis language and sample rate can
    # be specified using the `extra_body` parameter
    # Speed can be adjusted from 0.25x (very slow) to 4.0x (very fast)
    response = client.audio.speech.create(
        model="Azure/AI-Speech", # Specify the model to generate audio
        input="Hello, this is a synchronous text-to-speech example.",
        voice="en-US-JennyNeural", # Specify the voice used for speech synthesis
        response_format="wav",
        speed=1.0, # e.g. speed = 0.75 results in slow speech, speed = 1.5 results in fast speech
        extra_body={
            "speech_synthesis_language": "en-US",
            "sample_rate": 22050
        }
    )

    # Save the audio to a file
    response.write_to_file("sync_output.wav")
    print(f"Audio saved! Size: {len(response.content)} bytes")

# Run the example
if __name__ == "__main__":
    tts_synthesis_sync()
Streaming
import os
import wave
from air import AIRefinery
from dotenv import load_dotenv

load_dotenv() # loads your API_KEY from your local '.env' file
api_key = str(os.getenv("API_KEY"))


def tts_synthesis_sync():

    # Initialize the AI Refinery client
    client = AIRefinery(api_key=api_key)

    # Generate speech from text (streaming mode, sync)
    # Speech synthesis language and sample rate can
    # be specified using the `extra_body` parameter
    # Speed can be adjusted from 0.25x (very slow) to 4.0x (very fast)
    with client.audio.speech.with_streaming_response.create(
        model="Azure/AI-Speech", # Specify the model to generate audio chunks
        input="Hello, this is a test of text-to-speech synthesis.",
        voice="en-US-JennyNeural", # Specify the voice used for speech synthesis
        response_format="pcm",
        speed=1.0, # e.g. speed = 0.75 results in slow speech, speed = 1.5 results in fast speech
        extra_body={
            "speech_synthesis_language": "en-US",
            "sample_rate": 16000
        }
    ) as response:

        # Collect audio chunks as they stream in
        audio_data = bytearray()
        for chunk in response:
            audio_data.extend(chunk)

    # Convert PCM to WAV format to save audio to a file
    with wave.open("streaming_output.wav", "wb") as wav_file:
        wav_file.setnchannels(1)  # Mono audio
        wav_file.setsampwidth(2)  # 16-bit audio (2 bytes per sample)
        wav_file.setframerate(16000)  # Match the sample rate from extra_body
        wav_file.writeframes(audio_data)

    print(f"Audio saved! Size: {len(audio_data)} bytes")

# Run the example
if __name__ == "__main__":
    tts_synthesis_sync()