Text-to-Speech (TTS) API

This documentation provides an overview of the TTS API, which converts text to speech using batch synthesis. The TTS API currently uses Azure AI Speech as the underlying text-to-speech service. You can access this API through our SDK using either the AIRefinery or AsyncAIRefinery client.

Note: This API currently supports batch synthesis only. Streaming output capabilities will be available in a future release.

Asynchronous TTS

AsyncAIRefinery.audio.speech.create()

The AsyncAIRefinery client generates speech from text asynchronously, supporting batch synthesis.

Parameters:

  • model (string): Model ID used to generate the speech. Currently supports "Azure/AI-Speech". For detailed model specifications and capabilities, see the Text-to-Speech model catalog. Required.
  • input (string): The text to convert to speech. Required.
  • voice (string): Voice name for speech synthesis (e.g., "en-US-JennyNeural"). See Voice Options for available voices. Required.
  • response_format (string): Audio format for output. See Supported Audio Formats for format details. Optional. Options: "wav", "mp3", "pcm", "opus". Default: "wav".
  • speed (number): Speech speed multiplier (0.25 to 4.0). Optional. Default: 1.0.
  • timeout (number): Request timeout in seconds. Optional.
  • extra_headers (object): Additional HTTP headers. Optional.
  • extra_body (object): Additional parameters like speech_synthesis_language and sample_rate. See Supported Sample Rates for available sample rates. Optional.

Returns:

  • Returns a TTSResponse object containing the complete audio data. It exposes the following attribute and methods:

    • content: Raw audio bytes
    • write_to_file(file): Save audio to file
    • stream_to_file(file, chunk_size): Stream audio to file in chunks
    • iter_bytes(chunk_size): Iterate over audio in byte chunks
    • aiter_bytes(chunk_size): Async iterate over audio in byte chunks
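
To illustrate how the chunk-based helpers behave, the sketch below uses a stand-in class (FakeResponse is purely illustrative and not part of the SDK) that mimics the documented content attribute and iter_bytes(chunk_size) method:

```python
class FakeResponse:
    """Illustrative stand-in for a TTSResponse; not part of the SDK."""

    def __init__(self, content: bytes):
        self.content = content  # raw audio bytes

    def iter_bytes(self, chunk_size: int = 1024):
        # Yield the audio in fixed-size chunks, as iter_bytes does
        for i in range(0, len(self.content), chunk_size):
            yield self.content[i:i + chunk_size]

    def write_to_file(self, path: str):
        # Save the complete audio to a file in one write
        with open(path, "wb") as f:
            f.write(self.content)


resp = FakeResponse(b"\x00" * 2500)
chunks = list(resp.iter_bytes(chunk_size=1024))
print([len(c) for c in chunks])  # → [1024, 1024, 452]
```

The last chunk is simply whatever bytes remain, so callers should not assume every chunk is exactly chunk_size long.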

Supported Options:

Voice Options

The API supports various voices for different languages and regions. Common examples include:

  • en-US-JennyNeural: Female, American English
  • en-US-GuyNeural: Male, American English
  • en-GB-LibbyNeural: Female, British English
  • es-ES-ElviraNeural: Female, Spanish (Spain)
  • fr-FR-DeniseNeural: Female, French (France)

For a complete list of available voices, see the Azure AI Speech voice gallery.
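
If your application serves multiple locales, a small lookup keyed by locale can pick a sensible default voice. The mapping below is a sketch using only the example voices listed above; the helper name is hypothetical, not part of the SDK:

```python
# Illustrative subset of voices from the list above, keyed by locale
VOICES = {
    "en-US": "en-US-JennyNeural",
    "en-GB": "en-GB-LibbyNeural",
    "es-ES": "es-ES-ElviraNeural",
    "fr-FR": "fr-FR-DeniseNeural",
}


def default_voice(locale: str) -> str:
    """Return a default voice for a locale, falling back to US English."""
    return VOICES.get(locale, "en-US-JennyNeural")


print(default_voice("fr-FR"))  # → fr-FR-DeniseNeural
```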

Audio Formats

The API supports multiple output formats:

  • wav: Uncompressed WAV format (high quality, larger file size)
  • mp3: Compressed MP3 format (good quality, smaller file size)
  • pcm: Raw PCM audio data (low-level audio processing)
  • opus: Opus codec in an OGG container (efficient compression, web streaming)

Sample Rates

The following sample rates are supported for each format:

  • 8000 Hz: Telephone quality
  • 16000 Hz: Wide-band speech
  • 22050 Hz: Half CD-quality sample rate
  • 24000 Hz: High quality speech
  • 44100 Hz: CD quality
  • 48000 Hz: Professional audio
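
A lightweight client-side check can catch unsupported combinations before a request is sent. The helper below is a hypothetical sketch that mirrors the documented limits (formats, sample rates, and the 0.25 to 4.0 speed range); it is not part of the SDK:

```python
# Documented limits, duplicated here for a client-side sanity check
SUPPORTED_FORMATS = {"wav", "mp3", "pcm", "opus"}
SUPPORTED_SAMPLE_RATES = {8000, 16000, 22050, 24000, 44100, 48000}


def validate_tts_options(response_format="wav", sample_rate=24000, speed=1.0):
    """Validate options against the documented limits; return kwargs for create()."""
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {response_format!r}")
    if sample_rate not in SUPPORTED_SAMPLE_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError(f"speed must be in [0.25, 4.0], got {speed}")
    return {
        "response_format": response_format,
        "speed": speed,
        "extra_body": {"sample_rate": sample_rate},
    }


opts = validate_tts_options("opus", 48000, 1.5)
print(opts["extra_body"])  # → {'sample_rate': 48000}
```

The returned dictionary can be splatted directly into audio.speech.create(**opts, ...) alongside model, input, and voice.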

Example Usage:

import os
import asyncio
from air import AsyncAIRefinery, login

# Authenticate using environment variables
auth = login(
    account=str(os.getenv("ACCOUNT")),
    api_key=str(os.getenv("API_KEY")),
    oauth_server=os.getenv("OAUTH_SERVER", ""),
)
base_url = os.getenv("AIREFINERY_ADDRESS", "")

async def tts_synthesis_async():

    # Initialize the AI Refinery client
    client = AsyncAIRefinery(**auth.openai(base_url=base_url))

    # Generate speech from text (batch mode, async)
    # Speech synthesis language and sample rate can 
    # be specified using the `extra_body` parameter
    # Speed can be adjusted from 0.25x (very slow) to 4.0x (very fast)
    response = await client.audio.speech.create(
        model="Azure/AI-Speech",
        input="Hello, this is a test of text-to-speech synthesis.",
        voice="en-US-JennyNeural",
        response_format="wav",
        speed=1.0, # e.g. speed = 0.75 results in slow speech, speed = 1.5 results in fast speech
        extra_body={
            "speech_synthesis_language": "en-US",
            "sample_rate": 24000
        }
    )

    # Save the audio to a file
    response.write_to_file("output.wav")
    print(f"Audio saved! Size: {len(response.content)} bytes")

# Run the example
if __name__ == "__main__":
    asyncio.run(tts_synthesis_async())

Below is an example of batch processing with concurrency:

import os
import asyncio
from air import AsyncAIRefinery, login

auth = login(
    account=str(os.getenv("ACCOUNT")),
    api_key=str(os.getenv("API_KEY")),
    oauth_server=os.getenv("OAUTH_SERVER", ""),
)
base_url = os.getenv("AIREFINERY_ADDRESS", "")

async def batch_text_to_speech():

    client = AsyncAIRefinery(**auth.openai(base_url=base_url))

    # Multiple texts to synthesize
    texts = [
        "This is the first sentence.",
        "Here comes the second sentence.",
        "And finally, the third sentence."
    ]

    # Create concurrent tasks
    async def synthesize_text(text, index):
        response = await client.audio.speech.create(
            model="Azure/AI-Speech",
            input=text,
            voice="en-US-JennyNeural",
            response_format="mp3"
        )

        # Save each audio file
        response.write_to_file(f"batch_output_{index}.mp3")
        return len(response.content)

    # Execute all tasks concurrently
    tasks = [synthesize_text(text, i) for i, text in enumerate(texts)]
    sizes = await asyncio.gather(*tasks)

    print(f"Generated {len(texts)} audio files")
    print(f"Total audio data: {sum(sizes)} bytes")

if __name__ == "__main__":
    asyncio.run(batch_text_to_speech())
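
When synthesizing many texts, you may want to cap concurrency (for example, to stay within rate limits). The pattern below is a self-contained sketch of wrapping each task in an asyncio.Semaphore; synthesize_stub stands in for the real client.audio.speech.create call:

```python
import asyncio


async def synthesize_stub(text: str) -> int:
    """Stand-in for a real synthesis call; returns the text length."""
    await asyncio.sleep(0)  # simulate an awaitable network call
    return len(text)


async def bounded_gather(texts, limit=2):
    # Allow at most `limit` synthesis calls in flight at once
    sem = asyncio.Semaphore(limit)

    async def run(text):
        async with sem:
            return await synthesize_stub(text)

    return await asyncio.gather(*(run(t) for t in texts))


sizes = asyncio.run(bounded_gather(["one", "two", "three"]))
print(sizes)  # → [3, 3, 5]
```

Results come back in the original order because asyncio.gather preserves task order regardless of completion order.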

Synchronous TTS

AIRefinery.audio.speech.create()

The AIRefinery client generates speech from text synchronously. This method supports the same parameters, return structure, and concurrent batch processing as the asynchronous method AsyncAIRefinery.audio.speech.create().

Example Usage:

import os
from air import AIRefinery, login

# Authenticate using environment variables
auth = login(
    account=str(os.getenv("ACCOUNT")),
    api_key=str(os.getenv("API_KEY")),
    oauth_server=os.getenv("OAUTH_SERVER", ""),
)
base_url = os.getenv("AIREFINERY_ADDRESS", "")

def tts_synthesis_sync():
    # Initialize the AI Refinery client
    client = AIRefinery(**auth.openai(base_url=base_url))

    # Generate speech from text (batch mode, sync)
    # Speech synthesis language and sample rate can 
    # be specified using the `extra_body` parameter
    # Speed can be adjusted from 0.25x (very slow) to 4.0x (very fast)
    response = client.audio.speech.create(
        model="Azure/AI-Speech",
        input="Hello, this is a synchronous text-to-speech example.",
        voice="en-US-JennyNeural",
        response_format="wav",
        speed=1.0, # e.g. speed = 0.75 results in slow speech, speed = 1.5 results in fast speech 
        extra_body={
            "speech_synthesis_language": "en-US",
            "sample_rate": 22050
        }
    )

    # Save the audio to a file
    response.write_to_file("sync_output.wav")
    print(f"Audio saved! Size: {len(response.content)} bytes")

# Run the example
if __name__ == "__main__":
    tts_synthesis_sync()