Text-to-Speech (TTS) API

This documentation provides an overview of the TTS API, which converts text to speech using batch synthesis. The TTS API currently uses Azure AI Speech as the underlying text-to-speech service. You can access this API through our SDK using either the AIRefinery or AsyncAIRefinery client.

Note: This API currently supports batch synthesis only. Streaming output capabilities will be available in a future release.

Asynchronous TTS

AsyncAIRefinery.audio.speech.create()

The AsyncAIRefinery client generates speech from text asynchronously, supporting batch synthesis.

Parameters:

  • model (string): Model ID used to generate the speech. Currently supports "Azure/AI-Speech". For detailed model specifications and capabilities, see the Text-to-Speech model catalog. Required.
  • input (string): The text to convert to speech. Required.
  • voice (string): Voice name for speech synthesis (e.g., "en-US-JennyNeural"). See Voice Options for available voices. Required.
  • response_format (string): Audio format for output. See Supported Audio Formats for format details. Optional. Options: "wav", "mp3", "pcm", "opus". Default: "wav".
  • speed (number): Speech speed multiplier (0.25 to 4.0). Optional. Default: 1.0.
  • timeout (number): Request timeout in seconds. Optional.
  • extra_headers (object): Additional HTTP headers. Optional.
  • extra_body (object): Additional parameters like speech_synthesis_language and sample_rate. See Supported Sample Rates for available sample rates. Optional.

Returns:

  • Returns a TTSResponse object containing the complete audio data. It exposes the following attribute and methods:

    • content: Raw audio bytes
    • write_to_file(file): Save audio to file
    • stream_to_file(file, chunk_size): Stream audio to file in chunks
    • iter_bytes(chunk_size): Iterate over audio in byte chunks
    • aiter_bytes(chunk_size): Async iterate over audio in byte chunks
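
To illustrate how the chunk-based helpers behave, the sketch below uses a stand-in class (FakeResponse is purely illustrative and not part of the SDK) that mimics the documented content attribute and iter_bytes(chunk_size) method:

```python
class FakeResponse:
    """Illustrative stand-in for a TTSResponse; not part of the SDK."""

    def __init__(self, content: bytes):
        self.content = content  # raw audio bytes

    def iter_bytes(self, chunk_size: int = 1024):
        # Yield the audio in fixed-size chunks, as iter_bytes does
        for i in range(0, len(self.content), chunk_size):
            yield self.content[i:i + chunk_size]

    def write_to_file(self, path: str):
        # Save the complete audio to a file in one write
        with open(path, "wb") as f:
            f.write(self.content)


resp = FakeResponse(b"\x00" * 2500)
chunks = list(resp.iter_bytes(chunk_size=1024))
print([len(c) for c in chunks])  # → [1024, 1024, 452]
```

The last chunk is simply whatever bytes remain, so callers should not assume every chunk is exactly chunk_size long.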

Supported Options:

Voice Options

The API supports various voices for different languages and regions. Common examples include:

  • en-US-JennyNeural: Female, American English
  • en-US-GuyNeural: Male, American English
  • en-GB-LibbyNeural: Female, British English
  • es-ES-ElviraNeural: Female, Spanish (Spain)
  • fr-FR-DeniseNeural: Female, French (France)

For a complete list of available voices, see the Azure AI Speech voice gallery.
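
If your application serves multiple locales, a small lookup keyed by locale can pick a sensible default voice. The mapping below is a sketch using only the example voices listed above; the helper name is hypothetical, not part of the SDK:

```python
# Illustrative subset of voices from the list above, keyed by locale
VOICES = {
    "en-US": "en-US-JennyNeural",
    "en-GB": "en-GB-LibbyNeural",
    "es-ES": "es-ES-ElviraNeural",
    "fr-FR": "fr-FR-DeniseNeural",
}


def default_voice(locale: str) -> str:
    """Return a default voice for a locale, falling back to US English."""
    return VOICES.get(locale, "en-US-JennyNeural")


print(default_voice("fr-FR"))  # → fr-FR-DeniseNeural
```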

Audio Formats

The API supports multiple output formats:

  • wav: Uncompressed WAV format (high quality, larger file size)
  • mp3: Compressed MP3 format (good quality, smaller file size)
  • pcm: Raw PCM audio data (low-level audio processing)
  • opus: Opus codec in an OGG container (efficient compression, web streaming)

Sample Rates

The following sample rates are supported for each format:

  • 8000 Hz: Telephone quality
  • 16000 Hz: Wide-band speech
  • 22050 Hz: Half CD-quality sample rate
  • 24000 Hz: High quality speech
  • 44100 Hz: CD quality
  • 48000 Hz: Professional audio
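
A lightweight client-side check can catch unsupported combinations before a request is sent. The helper below is a hypothetical sketch that mirrors the documented limits (formats, sample rates, and the 0.25 to 4.0 speed range); it is not part of the SDK:

```python
# Documented limits, duplicated here for a client-side sanity check
SUPPORTED_FORMATS = {"wav", "mp3", "pcm", "opus"}
SUPPORTED_SAMPLE_RATES = {8000, 16000, 22050, 24000, 44100, 48000}


def validate_tts_options(response_format="wav", sample_rate=24000, speed=1.0):
    """Validate options against the documented limits; return kwargs for create()."""
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {response_format!r}")
    if sample_rate not in SUPPORTED_SAMPLE_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError(f"speed must be in [0.25, 4.0], got {speed}")
    return {
        "response_format": response_format,
        "speed": speed,
        "extra_body": {"sample_rate": sample_rate},
    }


opts = validate_tts_options("opus", 48000, 1.5)
print(opts["extra_body"])  # → {'sample_rate': 48000}
```

The returned dictionary can be splatted directly into audio.speech.create(**opts, ...) alongside model, input, and voice.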

Example Usage:

import os
import asyncio
from air import AsyncAIRefinery, login

# Authenticate using environment variables
auth = login(
    account=str(os.getenv("ACCOUNT")),
    api_key=str(os.getenv("API_KEY")),
    oauth_server=os.getenv("OAUTH_SERVER", ""),
)
base_url = os.getenv("AIREFINERY_ADDRESS", "")

async def tts_synthesis_async():

    # Initialize the AI Refinery client
    client = AsyncAIRefinery(**auth.openai(base_url=base_url))

    # Generate speech from text (batch mode, async)
    # Speech synthesis language and sample rate can 
    # be specified using the `extra_body` parameter
    # Speed can be adjusted from 0.25x (very slow) to 4.0x (very fast)
    response = await client.audio.speech.create(
        model="Azure/AI-Speech",
        input="Hello, this is a test of text-to-speech synthesis.",
        voice="en-US-JennyNeural",
        response_format="wav",
        speed=1.0, # e.g. speed = 0.75 results in slow speech, speed = 1.5 results in fast speech
        extra_body={
            "speech_synthesis_language": "en-US",
            "sample_rate": 24000
        }
    )

    # Save the audio to a file
    response.write_to_file("output.wav")
    print(f"Audio saved! Size: {len(response.content)} bytes")

# Run the example
if __name__ == "__main__":
    asyncio.run(tts_synthesis_async())

Below is an example of batch processing with concurrency:

import os
import asyncio
from air import AsyncAIRefinery, login

auth = login(
    account=str(os.getenv("ACCOUNT")),
    api_key=str(os.getenv("API_KEY")),
    oauth_server=os.getenv("OAUTH_SERVER", ""),
)
base_url = os.getenv("AIREFINERY_ADDRESS", "")

async def batch_text_to_speech():

    client = AsyncAIRefinery(**auth.openai(base_url=base_url))

    # Multiple texts to synthesize
    texts = [
        "This is the first sentence.",
        "Here comes the second sentence.",
        "And finally, the third sentence."
    ]

    # Create concurrent tasks
    async def synthesize_text(text, index):
        response = await client.audio.speech.create(
            model="Azure/AI-Speech",
            input=text,
            voice="en-US-JennyNeural",
            response_format="mp3"
        )

        # Save each audio file
        response.write_to_file(f"batch_output_{index}.mp3")
        return len(response.content)

    # Execute all tasks concurrently
    tasks = [synthesize_text(text, i) for i, text in enumerate(texts)]
    sizes = await asyncio.gather(*tasks)

    print(f"Generated {len(texts)} audio files")
    print(f"Total audio data: {sum(sizes)} bytes")

if __name__ == "__main__":
    asyncio.run(batch_text_to_speech())
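
When synthesizing many texts, you may want to cap concurrency (for example, to stay within rate limits). The pattern below is a self-contained sketch of wrapping each task in an asyncio.Semaphore; synthesize_stub stands in for the real client.audio.speech.create call:

```python
import asyncio


async def synthesize_stub(text: str) -> int:
    """Stand-in for a real synthesis call; returns the text length."""
    await asyncio.sleep(0)  # simulate an awaitable network call
    return len(text)


async def bounded_gather(texts, limit=2):
    # Allow at most `limit` synthesis calls in flight at once
    sem = asyncio.Semaphore(limit)

    async def run(text):
        async with sem:
            return await synthesize_stub(text)

    return await asyncio.gather(*(run(t) for t in texts))


sizes = asyncio.run(bounded_gather(["one", "two", "three"]))
print(sizes)  # → [3, 3, 5]
```

Results come back in the original order because asyncio.gather preserves task order regardless of completion order.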

Synchronous TTS

AIRefinery.audio.speech.create()

The AIRefinery client generates speech from text synchronously. This method supports the same parameters, return structure, and concurrent batch processing as the asynchronous method AsyncAIRefinery.audio.speech.create().

Example Usage:

import os
from air import AIRefinery, login

# Authenticate using environment variables
auth = login(
    account=str(os.getenv("ACCOUNT")),
    api_key=str(os.getenv("API_KEY")),
    oauth_server=os.getenv("OAUTH_SERVER", ""),
)
base_url = os.getenv("AIREFINERY_ADDRESS", "")

def tts_synthesis_sync():
    # Initialize the AI Refinery client
    client = AIRefinery(**auth.openai(base_url=base_url))

    # Generate speech from text (batch mode, sync)
    # Speech synthesis language and sample rate can 
    # be specified using the `extra_body` parameter
    # Speed can be adjusted from 0.25x (very slow) to 4.0x (very fast)
    response = client.audio.speech.create(
        model="Azure/AI-Speech",
        input="Hello, this is a synchronous text-to-speech example.",
        voice="en-US-JennyNeural",
        response_format="wav",
        speed=1.0, # e.g. speed = 0.75 results in slow speech, speed = 1.5 results in fast speech 
        extra_body={
            "speech_synthesis_language": "en-US",
            "sample_rate": 22050
        }
    )

    # Save the audio to a file
    response.write_to_file("sync_output.wav")
    print(f"Audio saved! Size: {len(response.content)} bytes")

# Run the example
if __name__ == "__main__":
    tts_synthesis_sync()