Text-to-Speech (TTS) API

The Text-to-Speech (TTS) API generates spoken audio from text input using either the AIRefinery (synchronous) or the AsyncAIRefinery (asynchronous) client.

This API supports two modes: batch synthesis mode, which waits for complete synthesis before returning all audio data at once, and streaming mode, which yields audio chunks as they're produced during synthesis.

Asynchronous TTS

The AsyncAIRefinery client asynchronously generates speech from input text.

Batch and Streaming Methods

  • audio.speech.create() - Returns complete audio after synthesis (batch synthesis mode)
  • audio.speech.with_streaming_response.create() - Returns audio chunks during synthesis (streaming mode)
Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| model | string (required) | Model ID used to generate the speech. |
| input | string (required) | The text to convert to speech. |
| voice | string (required) | Voice name for speech synthesis (e.g., "en-US-JennyNeural"). |
| response_format | string (optional) | Audio format of the output. Supported values: "wav", "mp3", "pcm", "opus". Default: "wav". |
| speed | number (optional) | Speech speed multiplier (0.25 to 4.0). Default: 1.0. |
| timeout | number (optional) | Request timeout in seconds. |
| extra_headers | map (optional) | Additional HTTP headers to include with the request. |
| extra_body | map (optional) | Additional parameters for speech synthesis. See the Extra Body Parameters table below. |

Extra Body Parameters (extra_body dict):

These parameters should be passed as a dictionary to the extra_body parameter:

| Parameter | Type | Description |
| --- | --- | --- |
| speech_synthesis_language | string (optional) | Language code for speech synthesis (e.g., "en-US", "fr-FR"). |
| sample_rate | integer (optional) | Audio sampling rate in Hz (e.g., 16000, 24000, 48000). See the Supported Sampling Rates table. |
| enable_word_boundary | boolean (optional) | If true, returns word timing metadata alongside the audio. Default: false. |
| boundary_types | List[string] (optional) | Filters which boundary types to include. Supported values: "word", "punctuation", "sentence". Omit to receive all three types; cannot be an empty array. |
Returns:
Batch Synthesis

The entire text input is processed in a single request, and the complete synthesized audio is returned only after generation is finished.

In this mode, the API returns a TTSResponse object with the following fields/methods:

| Field/Method | Type | Description |
| --- | --- | --- |
| content | bytes | Raw audio bytes of the synthesized speech. |
| word_boundaries | List[TTSWordBoundaryEvent] (optional) | Word timing metadata. Present only when enable_word_boundary=True. |
| write_to_file(file) | method | Saves the audio content to the specified file. |
| stream_to_file(file, chunk_size) | method | Streams the audio to a file in chunks. |
| iter_bytes(chunk_size) | method | Iterates over the audio in byte chunks. |
| aiter_bytes(chunk_size) | method | Asynchronously iterates over the audio in byte chunks. |
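To illustrate the chunked access pattern, iter_bytes(chunk_size) yields the audio in fixed-size slices. The following is a minimal, SDK-free sketch of the same slicing logic; the helper name and sample bytes are made up for this example:

```python
def iter_fixed_chunks(data: bytes, chunk_size: int = 4096):
    """Yield successive chunk_size slices of data, mimicking the
    chunking behavior that iter_bytes() provides over response.content."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]

# Example: 10 bytes of fake audio split into 4-byte chunks
chunks = list(iter_fixed_chunks(b"0123456789", chunk_size=4))
print(chunks)  # [b'0123', b'4567', b'89']
```

The final chunk may be shorter than chunk_size, which is also how the SDK's iterators behave at the end of the audio.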
Streaming

Synthesized audio is returned incrementally in chunks as it is generated, allowing playback to begin before the full audio is ready.

In this mode, the API returns a StreamingResponse object with the following fields/methods:

| Field/Method | Type | Description |
| --- | --- | --- |
| iter(stream_generator()) | iterator | Iterator of bytes chunks (or mixed bytes/TTSWordBoundaryEvent items when word boundaries are enabled). |
| stream_generator.__aiter__() | async iterator | Async iterator of bytes chunks (or mixed types when word boundaries are enabled). |
| stream_to_file(file_path) | method | Saves the full streamed audio content to the specified file. Automatically handles sync or async behavior depending on is_async. |
Supported Audio Formats

Different use cases prioritize different trade-offs—fidelity, size, compatibility, or streaming efficiency. Supporting multiple formats ensures the API can serve everything from phone-based IVR to high-quality media production.

| Format | Type | Characteristics | Typical Use Cases |
| --- | --- | --- | --- |
| WAV / PCM | Uncompressed | Highest fidelity, large files | Studio recording, audio processing |
| MP3 | Lossy compression | Small file size, universally supported | Web playback, mobile apps, archival |
| Ogg Opus | Modern codec | Excellent quality at low bitrates, efficient streaming | Real-time communication, low-bandwidth scenarios |
Supported Sampling Rates
| Sampling Rate (Hz) | Typical Use |
| --- | --- |
| 8000 | Telephony / IVR |
| 16000 | Wide-band speech |
| 22050 / 24000 | High-quality voice assistants |
| 44100 / 48000 | Broadcast / studio quality |
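When sizing buffers or estimating file sizes, note that raw PCM carries sample_rate × channels × bytes-per-sample bytes each second; for the 16-bit mono PCM used in the streaming examples below, 16000 Hz works out to 32000 bytes/s. A small sketch of that arithmetic (the helper function is illustrative, not part of the SDK):

```python
def pcm_bytes_per_second(sample_rate: int, channels: int = 1,
                         sample_width_bytes: int = 2) -> int:
    """Raw PCM data rate: sample_rate * channels * bytes per sample."""
    return sample_rate * channels * sample_width_bytes

# Data rate of 16-bit mono PCM at some of the rates above
for rate in (8000, 16000, 24000, 48000):
    print(rate, pcm_bytes_per_second(rate))  # e.g. 16000 -> 32000 bytes/s
```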

Word Boundary Events (Azure-specific)

When enable_word_boundary is set to true in extra_body, the API returns timing metadata for words, punctuation, and sentences during synthesis.

TTSWordBoundaryEvent fields:

| Field | Type | Description |
| --- | --- | --- |
| type | string | Event type identifier. Always "word_boundary". |
| text | string | The word or punctuation text. |
| audio_offset_ms | float | Time offset in milliseconds from the start of the audio. |
| duration_ms | float | Duration of the word in milliseconds. |
| text_offset | integer | Character offset in the original input text. |
| word_length | integer | Length of the word in characters. |
| boundary_type | string | Type of boundary. Supported values: "word", "punctuation", "sentence". |
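Together, text_offset and word_length locate each event in the original input string, which is useful for karaoke-style highlighting. The sketch below shows that mapping using a hypothetical stand-in class rather than the real TTSWordBoundaryEvent:

```python
from dataclasses import dataclass

@dataclass
class BoundaryEvent:
    """Stand-in holding only the fields needed to locate a word."""
    text: str
    text_offset: int
    word_length: int

def highlight(input_text: str, event: BoundaryEvent) -> str:
    """Bracket the span of input_text that the event refers to."""
    start = event.text_offset
    end = start + event.word_length
    return input_text[:start] + "[" + input_text[start:end] + "]" + input_text[end:]

text = "Hello, this is a test."
event = BoundaryEvent(text="test", text_offset=17, word_length=4)
print(highlight(text, event))  # Hello, this is a [test].
```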

Batch mode response (with word boundaries): Returns JSON containing audio (base64-encoded) and word_boundaries array.

Streaming mode response (with word boundaries): Returns NDJSON stream with mixed {"type": "audio", "data": "..."} and {"type": "word_boundary", ...} events.
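The SDK decodes these payloads for you, but a client reading the raw HTTP stream can parse the NDJSON line by line, since each line is a standalone JSON object. The sketch below parses a fabricated two-line sample in the shapes described above:

```python
import base64
import json

# Two sample NDJSON lines in the shapes described above;
# the audio payload is illustrative, not real synthesized audio
ndjson_stream = (
    '{"type": "audio", "data": "'
    + base64.b64encode(b"\x00\x01").decode("ascii")
    + '"}\n'
    '{"type": "word_boundary", "text": "Hello", "audio_offset_ms": 50.0}\n'
)

audio = bytearray()
events = []
for line in ndjson_stream.splitlines():
    msg = json.loads(line)
    if msg["type"] == "audio":
        audio.extend(base64.b64decode(msg["data"]))  # decode an audio chunk
    elif msg["type"] == "word_boundary":
        events.append(msg)  # keep the timing metadata

print(len(audio), events[0]["text"])  # 2 Hello
```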

Example Usage:
Batch Synthesis
import os
import asyncio
from air import AsyncAIRefinery
from dotenv import load_dotenv

load_dotenv() # loads your API_KEY from your local '.env' file
api_key = str(os.getenv("API_KEY"))


async def tts_synthesis_async():

    # Initialize the AI Refinery client
    client = AsyncAIRefinery(api_key=api_key)

    # Generate speech from text (batch mode, async)
    # Speech synthesis language and sample rate can
    # be specified using the `extra_body` parameter
    # Speed can be adjusted from 0.25x (very slow) to 4.0x (very fast)
    response = await client.audio.speech.create(
        model="Azure/AI-Speech", # Specify the model to generate audio
        input="Hello, this is a test of text-to-speech synthesis.",
        voice="en-US-JennyNeural", # Specify the voice used for speech synthesis
        response_format="wav",
        speed=1.0, # e.g. speed = 0.75 results in slow speech, speed = 1.5 results in fast speech
        extra_body={
            "speech_synthesis_language": "en-US",
            "sample_rate": 24000
        }
    )

    # Save the audio to a file
    response.write_to_file("output.wav")
    print(f"Audio saved! Size: {len(response.content)} bytes")

# Run the example
if __name__ == "__main__":
    asyncio.run(tts_synthesis_async())
Streaming
import os
import asyncio
import wave
from air import AsyncAIRefinery
from dotenv import load_dotenv

load_dotenv() # loads your API_KEY from your local '.env' file
api_key = str(os.getenv("API_KEY"))


async def tts_synthesis_async():

    # Initialize the AsyncAIRefinery client
    client = AsyncAIRefinery(api_key=api_key)

    # Generate speech from text (streaming mode, async)
    # Speech synthesis language and sample rate can
    # be specified using the `extra_body` parameter
    # Speed can be adjusted from 0.25x (very slow) to 4.0x (very fast)
    async with await client.audio.speech.with_streaming_response.create(
        model="Azure/AI-Speech", # Specify the model to generate audio chunks
        input="Hello, this is a test of text-to-speech synthesis.",
        voice="en-US-JennyNeural", # Specify the voice used for speech synthesis
        response_format="pcm",
        speed=1.0, # e.g. speed = 0.75 results in slow speech, speed = 1.5 results in fast speech
        extra_body={
            "speech_synthesis_language": "en-US",
            "sample_rate": 16000
        }
    ) as response:

        # Collect audio chunks as they stream in
        audio_data = bytearray()
        async for chunk in response:
            audio_data.extend(chunk)

    # Convert PCM to WAV format to save audio to a file
    with wave.open("streaming_output.wav", "wb") as wav_file:
        wav_file.setnchannels(1)  # Mono audio
        wav_file.setsampwidth(2)  # 16-bit audio (2 bytes per sample)
        wav_file.setframerate(16000)  # Match the sample rate from extra_body
        wav_file.writeframes(audio_data)

    print(f"Audio saved! Size: {len(audio_data)} bytes")

# Run the example
if __name__ == "__main__":
    asyncio.run(tts_synthesis_async())
Batch Synthesis with Word Boundaries
import os
import asyncio
from air import AsyncAIRefinery
from dotenv import load_dotenv

load_dotenv() # loads your API_KEY from your local '.env' file
api_key = str(os.getenv("API_KEY"))


async def tts_with_word_boundaries():

    # Initialize the AsyncAIRefinery client
    client = AsyncAIRefinery(api_key=api_key)

    # Generate speech from text (batch mode, async)
    # Enable word boundary events via `extra_body` to get
    # timing metadata (offset, duration) for words and punctuation
    # Use `boundary_types` to filter: "word", "punctuation", "sentence"
    response = await client.audio.speech.create(
        model="Azure/AI-Speech",
        input="Hello, this is a test.",
        voice="en-US-JennyNeural",
        response_format="wav",
        extra_body={
            "speech_synthesis_language": "en-US",
            "sample_rate": 24000,
            "enable_word_boundary": True,
            "boundary_types": ["word", "punctuation", "sentence"]
        }
    )

    response.write_to_file("output.wav")

    # Access word boundary events
    for event in response.word_boundaries or []:
        print(
            f"[{event.boundary_type:>11}] '{event.text}' "
            f"@ {event.audio_offset_ms:.0f}ms "
            f"(duration: {event.duration_ms:.0f}ms)"
        )

# Run the example
if __name__ == "__main__":
    asyncio.run(tts_with_word_boundaries())
Streaming with Word Boundaries
import os
import asyncio
from air import AsyncAIRefinery
from dotenv import load_dotenv

load_dotenv() # loads your API_KEY from your local '.env' file
api_key = str(os.getenv("API_KEY"))


async def tts_streaming_with_word_boundaries():

    # Initialize the AsyncAIRefinery client
    client = AsyncAIRefinery(api_key=api_key)

    # Generate speech from text (streaming mode, async)
    # Enable word boundary events via `extra_body` to get
    # timing metadata (offset, duration) for words and punctuation
    async with await client.audio.speech.with_streaming_response.create(
        model="Azure/AI-Speech",
        input="Hello, this is a test.",
        voice="en-US-JennyNeural",
        response_format="pcm",
        extra_body={
            "speech_synthesis_language": "en-US",
            "sample_rate": 16000,
            "enable_word_boundary": True
        }
    ) as response:
        async for chunk in response:
            if isinstance(chunk, bytes):

                # Handle audio chunk; process_audio() is a placeholder
                # for your own playback or buffering logic
                process_audio(chunk)
            else:

                # Handle word boundary event
                print(
                    f"[{chunk.boundary_type:>11}] '{chunk.text}' "
                    f"@ {chunk.audio_offset_ms:.0f}ms "
                    f"(duration: {chunk.duration_ms:.0f}ms)"
                )

# Run the example
if __name__ == "__main__":
    asyncio.run(tts_streaming_with_word_boundaries())

Synchronous TTS

The AIRefinery client generates speech from text synchronously. It supports the same parameters, batch and streaming modes, and return structures as the asynchronous client.

Example Usage:
Batch Synthesis
import os
from air import AIRefinery
from dotenv import load_dotenv

load_dotenv() # loads your API_KEY from your local '.env' file
api_key = str(os.getenv("API_KEY"))


def tts_synthesis_sync():
    # Initialize the AI Refinery client
    client = AIRefinery(api_key=api_key)

    # Generate speech from text (batch mode, sync)
    # Speech synthesis language and sample rate can
    # be specified using the `extra_body` parameter
    # Speed can be adjusted from 0.25x (very slow) to 4.0x (very fast)
    response = client.audio.speech.create(
        model="Azure/AI-Speech", # Specify the model to generate audio
        input="Hello, this is a synchronous text-to-speech example.",
        voice="en-US-JennyNeural", # Specify the voice used for speech synthesis
        response_format="wav",
        speed=1.0, # e.g. speed = 0.75 results in slow speech, speed = 1.5 results in fast speech
        extra_body={
            "speech_synthesis_language": "en-US",
            "sample_rate": 22050
        }
    )

    # Save the audio to a file
    response.write_to_file("sync_output.wav")
    print(f"Audio saved! Size: {len(response.content)} bytes")

# Run the example
if __name__ == "__main__":
    tts_synthesis_sync()
Streaming
import os
import wave
from air import AIRefinery
from dotenv import load_dotenv

load_dotenv() # loads your API_KEY from your local '.env' file
api_key = str(os.getenv("API_KEY"))


def tts_synthesis_sync():

    # Initialize the AI Refinery client
    client = AIRefinery(api_key=api_key)

    # Generate speech from text (streaming mode, sync)
    # Speech synthesis language and sample rate can
    # be specified using the `extra_body` parameter
    # Speed can be adjusted from 0.25x (very slow) to 4.0x (very fast)
    with client.audio.speech.with_streaming_response.create(
        model="Azure/AI-Speech", # Specify the model to generate audio chunks
        input="Hello, this is a test of text-to-speech synthesis.",
        voice="en-US-JennyNeural", # Specify the voice used for speech synthesis
        response_format="pcm",
        speed=1.0, # e.g. speed = 0.75 results in slow speech, speed = 1.5 results in fast speech
        extra_body={
            "speech_synthesis_language": "en-US",
            "sample_rate": 16000
        }
    ) as response:

        # Collect audio chunks as they stream in
        audio_data = bytearray()
        for chunk in response:
            audio_data.extend(chunk)

    # Convert PCM to WAV format to save audio to a file
    with wave.open("streaming_output.wav", "wb") as wav_file:
        wav_file.setnchannels(1)  # Mono audio
        wav_file.setsampwidth(2)  # 16-bit audio (2 bytes per sample)
        wav_file.setframerate(16000)  # Match the sample rate from extra_body
        wav_file.writeframes(audio_data)

    print(f"Audio saved! Size: {len(audio_data)} bytes")

# Run the example
if __name__ == "__main__":
    tts_synthesis_sync()