# Text-to-Speech (TTS) API
This page provides an overview of the TTS API, which converts text to speech using batch synthesis. The TTS API currently uses Azure AI Speech as the underlying text-to-speech service. You can access this API through our SDK using either the `AIRefinery` or `AsyncAIRefinery` client.
> **Note:** This API currently supports batch synthesis only. Streaming output capabilities will be available in a future release.
## Asynchronous TTS

`AsyncAIRefinery.audio.speech.create()`

The `AsyncAIRefinery` client generates speech from text asynchronously, supporting batch synthesis.
### Parameters

- `model` (string): Model ID used to generate the speech. Currently supports `"Azure/AI-Speech"`. For detailed model specifications and capabilities, see the Text-to-Speech model catalog. Required.
- `input` (string): The text to convert to speech. Required.
- `voice` (string): Voice name for speech synthesis (e.g., `"en-US-JennyNeural"`). See Voice Options for available voices. Required.
- `response_format` (string): Audio format for the output. Options: `"wav"`, `"mp3"`, `"pcm"`, `"opus"`. Default: `"wav"`. See Supported Audio Formats for format details. Optional.
- `speed` (number): Speech speed multiplier (0.25 to 4.0). Default: 1.0. Optional.
- `timeout` (number): Request timeout in seconds. Optional.
- `extra_headers` (object): Additional HTTP headers. Optional.
- `extra_body` (object): Additional parameters such as `speech_synthesis_language` and `sample_rate`. See Supported Sample Rates for available sample rates. Optional.
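Because `speed` accepts only values between 0.25 and 4.0, it can be convenient to fail fast on the client before making a request. The helper below is purely illustrative (`validate_speed` is not part of the SDK), a minimal sketch of such a check:

```python
def validate_speed(speed: float) -> float:
    """Reject speed values outside the documented 0.25-4.0 range."""
    if not 0.25 <= speed <= 4.0:
        raise ValueError(f"speed must be between 0.25 and 4.0, got {speed}")
    return speed

print(validate_speed(1.5))  # 1.5
```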
### Returns

A `TTSResponse` object containing the complete audio data, with the following attribute and methods:

- `content`: Raw audio bytes
- `write_to_file(file)`: Save audio to a file
- `stream_to_file(file, chunk_size)`: Stream audio to a file in chunks
- `iter_bytes(chunk_size)`: Iterate over audio in byte chunks
- `aiter_bytes(chunk_size)`: Asynchronously iterate over audio in byte chunks
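To illustrate how the chunked-iteration methods behave, here is a minimal sketch using a stand-in class (`FakeTTSResponse` is hypothetical and only mimics the documented surface of `TTSResponse`; the real class lives in the `air` SDK):

```python
class FakeTTSResponse:
    """Stand-in mimicking the documented TTSResponse interface."""

    def __init__(self, content: bytes):
        self.content = content  # raw audio bytes

    def iter_bytes(self, chunk_size: int):
        # Yield the audio in fixed-size chunks, as iter_bytes() does
        for start in range(0, len(self.content), chunk_size):
            yield self.content[start:start + chunk_size]

    def write_to_file(self, file: str):
        with open(file, "wb") as f:
            f.write(self.content)


resp = FakeTTSResponse(b"\x00" * 10_000)  # pretend this is WAV data
chunks = list(resp.iter_bytes(chunk_size=4096))
print(len(chunks))  # 3 chunks: 4096 + 4096 + 1808 bytes
```

Iterating in chunks avoids holding a second full copy of the audio in memory when forwarding it to another sink.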
### Supported Options

#### Voice Options

The API supports various voices for different languages and regions. Common examples include:

- `en-US-JennyNeural`: Female, American English
- `en-US-GuyNeural`: Male, American English
- `en-GB-LibbyNeural`: Female, British English
- `es-ES-ElviraNeural`: Female, Spanish (Spain)
- `fr-FR-DeniseNeural`: Female, French (France)

For a complete list of available voices, see the Azure AI Speech voice gallery.
#### Audio Formats

The API supports multiple output formats:

| Format | Description | Use Case |
|---|---|---|
| `wav` | Uncompressed WAV format | High quality, larger file size |
| `mp3` | Compressed MP3 format | Good quality, smaller file size |
| `pcm` | Raw PCM audio data | Low-level audio processing |
| `opus` | Opus codec in OGG container | Compression, web streaming |
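When saving output, the file extension should match the requested `response_format` so players recognize the file. A small illustrative helper (the `output_path` function and its extension map are assumptions, not part of the SDK; note Opus audio arrives in an OGG container, hence `.ogg`):

```python
# Hypothetical mapping from response_format to a sensible file extension
EXTENSIONS = {"wav": ".wav", "mp3": ".mp3", "pcm": ".pcm", "opus": ".ogg"}


def output_path(stem: str, response_format: str) -> str:
    """Build an output filename whose extension matches the format."""
    try:
        return stem + EXTENSIONS[response_format]
    except KeyError:
        raise ValueError(f"Unsupported response_format: {response_format!r}")


print(output_path("speech", "opus"))  # speech.ogg
```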
#### Sample Rates

The following sample rates are supported for each format:

- 8000 Hz: Telephone quality
- 16000 Hz: Wide-band speech
- 22050 Hz: Half CD quality
- 24000 Hz: High-quality speech
- 44100 Hz: CD quality
- 48000 Hz: Professional audio
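Sample rate directly determines the size of raw `pcm` output. As a rough sizing aid, here is a sketch that assumes 16-bit mono samples (a common PCM layout; the actual bit depth and channel count used by the service are not specified here, so treat these numbers as illustrative):

```python
def pcm_size_bytes(duration_s: float, sample_rate: int,
                   bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Approximate size of raw PCM audio for a given duration and rate."""
    return int(duration_s * sample_rate * bytes_per_sample * channels)


# 5 seconds of speech at each supported rate (16-bit mono assumed)
for rate in (8000, 16000, 22050, 24000, 44100, 48000):
    print(rate, pcm_size_bytes(5.0, rate))
```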
### Example Usage

```python
import os
import asyncio

from air import AsyncAIRefinery, login

# Authenticate using environment variables
auth = login(
    account=str(os.getenv("ACCOUNT")),
    api_key=str(os.getenv("API_KEY")),
    oauth_server=os.getenv("OAUTH_SERVER", ""),
)
base_url = os.getenv("AIREFINERY_ADDRESS", "")


async def tts_synthesis_async():
    # Initialize the AI Refinery client
    client = AsyncAIRefinery(**auth.openai(base_url=base_url))

    # Generate speech from text (batch mode, async).
    # Speech synthesis language and sample rate can be specified
    # using the `extra_body` parameter.
    # Speed can be adjusted from 0.25x (very slow) to 4.0x (very fast).
    response = await client.audio.speech.create(
        model="Azure/AI-Speech",
        input="Hello, this is a test of text-to-speech synthesis.",
        voice="en-US-JennyNeural",
        response_format="wav",
        speed=1.0,  # e.g. speed=0.75 slows speech down, speed=1.5 speeds it up
        extra_body={
            "speech_synthesis_language": "en-US",
            "sample_rate": 24000,
        },
    )

    # Save the audio to a file
    response.write_to_file("output.wav")
    print(f"Audio saved! Size: {len(response.content)} bytes")


# Run the example
if __name__ == "__main__":
    asyncio.run(tts_synthesis_async())
```
Below is an example of batch processing with concurrency:

```python
import os
import asyncio

from air import AsyncAIRefinery, login

auth = login(
    account=str(os.getenv("ACCOUNT")),
    api_key=str(os.getenv("API_KEY")),
    oauth_server=os.getenv("OAUTH_SERVER", ""),
)
base_url = os.getenv("AIREFINERY_ADDRESS", "")


async def batch_text_to_speech():
    client = AsyncAIRefinery(**auth.openai(base_url=base_url))

    # Multiple texts to synthesize
    texts = [
        "This is the first sentence.",
        "Here comes the second sentence.",
        "And finally, the third sentence.",
    ]

    # Create one synthesis task per text
    async def synthesize_text(text, index):
        response = await client.audio.speech.create(
            model="Azure/AI-Speech",
            input=text,
            voice="en-US-JennyNeural",
            response_format="mp3",
        )
        # Save each audio file
        response.write_to_file(f"batch_output_{index}.mp3")
        return len(response.content)

    # Execute all tasks concurrently
    tasks = [synthesize_text(text, i) for i, text in enumerate(texts)]
    sizes = await asyncio.gather(*tasks)
    print(f"Generated {len(texts)} audio files")
    print(f"Total audio data: {sum(sizes)} bytes")


if __name__ == "__main__":
    asyncio.run(batch_text_to_speech())
```
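With large batches, firing every request at once may overwhelm the service or hit rate limits. One common pattern is to cap in-flight requests with `asyncio.Semaphore`. The sketch below demonstrates the pattern with a stand-in coroutine (`bounded_gather`, `fake_synthesize`, and the limit of 4 are illustrative, not part of the SDK):

```python
import asyncio


async def bounded_gather(coros, limit: int = 4):
    """Run coroutines concurrently, but at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:  # blocks while `limit` coroutines are in flight
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))


async def fake_synthesize(i):
    # Stand-in for an actual client.audio.speech.create() call
    await asyncio.sleep(0)
    return i * 2


results = asyncio.run(bounded_gather([fake_synthesize(i) for i in range(5)]))
print(results)  # [0, 2, 4, 6, 8]
```

`asyncio.gather` preserves input order, so each result lines up with its source text regardless of completion order.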
## Synchronous TTS

`AIRefinery.audio.speech.create()`

The `AIRefinery` client generates speech from text synchronously. This method accepts the same parameters and returns the same `TTSResponse` structure as the asynchronous method `AsyncAIRefinery.audio.speech.create()`.
### Example Usage

```python
import os

from air import AIRefinery, login

# Authenticate using environment variables
auth = login(
    account=str(os.getenv("ACCOUNT")),
    api_key=str(os.getenv("API_KEY")),
    oauth_server=os.getenv("OAUTH_SERVER", ""),
)
base_url = os.getenv("AIREFINERY_ADDRESS", "")


def tts_synthesis_sync():
    # Initialize the AI Refinery client
    client = AIRefinery(**auth.openai(base_url=base_url))

    # Generate speech from text (batch mode, sync).
    # Speech synthesis language and sample rate can be specified
    # using the `extra_body` parameter.
    # Speed can be adjusted from 0.25x (very slow) to 4.0x (very fast).
    response = client.audio.speech.create(
        model="Azure/AI-Speech",
        input="Hello, this is a synchronous text-to-speech example.",
        voice="en-US-JennyNeural",
        response_format="wav",
        speed=1.0,  # e.g. speed=0.75 slows speech down, speed=1.5 speeds it up
        extra_body={
            "speech_synthesis_language": "en-US",
            "sample_rate": 22050,
        },
    )

    # Save the audio to a file
    response.write_to_file("sync_output.wav")
    print(f"Audio saved! Size: {len(response.content)} bytes")


# Run the example
if __name__ == "__main__":
    tts_synthesis_sync()
```