Automatic Speech Recognition (ASR) Transcription API¶
The Automatic Speech Recognition (ASR) transcription API generates text transcriptions of an input audio file using the AIRefinery or the AsyncAIRefinery client.
This API supports two modes: batch inference, which processes the complete audio file and returns the final transcription once processing finishes, and streaming, which returns transcription results incrementally as the audio is processed.
Asynchronous Transcription¶
AsyncAIRefinery.audio.transcriptions.create()¶
This method asynchronously generates the text transcription of an input audio file.
Parameters¶
| Parameter | Type | Description |
|---|---|---|
| `model` | string (required) | Model ID of the ASR model used to generate the transcription. |
| `file` | IO[bytes] (required) | Open file-like object containing the audio to transcribe, in WAV or PCM format. |
| `chunking_strategy` | string \| ChunkingStrategy (optional) | Configures server-side VAD and chunking. Accepts `"auto"` or a `ChunkingStrategy` object. (default: `"auto"`) |
| `language` | string (optional) | Language to detect and transcribe. (default: `"en-US"`) |
| `response_format` | string (optional) | Desired output format. Supported values: `"json"`, `"verbose_json"`. (default: `"json"`) |
| `timestamp_granularities` | List[string] (optional) | Timestamp types to include in the response. Supported values: `"segment"`, `"word"`. Requires `response_format="verbose_json"`. |
| `stream` | boolean (optional) | If `True`, enables streaming transcription output. (default: `False`) |
| `extra_headers` | map (optional) | Additional HTTP headers to include with the request. |
| `extra_body` | map (optional) | Additional fields to merge with or override top-level request parameters. |
| `timeout` | integer (optional) | Request timeout in seconds. (default: 60) |
Chunking Strategy (ChunkingStrategy)¶
| Field | Type | Description |
|---|---|---|
| `type` | string (`"server_vad"`) | Enables server-side voice activity detection (VAD)–based chunking. |
| `prefix_padding_ms` | integer (0–5000 ms, optional) | Lead-in audio retained before detected speech. Recommended value: ≥4000 ms. |
| `silence_duration_ms` | integer (0–5000 ms, optional) | Trailing silence duration that marks the end of a chunk. Recommended value: 5000 ms. |
| `threshold` | float (0.0–1.0, optional) | VAD sensitivity threshold. Currently ignored. |
Note
For audio files with initial silence, set `prefix_padding_ms` to at least 4000 ms to avoid premature cutoff of detected speech.
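For reference, here is a minimal sketch (using the `ChunkingStrategy` import shown in the examples below) that applies the recommended settings for audio with a long leading silence:

```python
from air.types.audio import ChunkingStrategy

# Minimal sketch: server-side VAD chunking tuned for audio that begins with silence.
# prefix_padding_ms keeps lead-in audio so the first detected words are not clipped;
# silence_duration_ms controls how much trailing silence closes a chunk.
chunking = ChunkingStrategy(
    type="server_vad",
    prefix_padding_ms=4000,    # retain 4 s of audio before detected speech
    silence_duration_ms=5000,  # end a chunk after 5 s of silence
)
```

Pass this object as `chunking_strategy` in the `create()` call, as the detailed examples below do.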
Returns:¶
Batch Inference¶
The entire audio file is uploaded and processed as a single request, and the final transcription is returned only after processing is complete.
-   `ASRResponse`

    In this mode (`stream=False`, default) and without `timestamp_granularities`, the API returns an `ASRResponse` object.

    | Field | Type | Description |
    |---|---|---|
    | `text` | string \| null | Transcription of the audio file. `null` if no text was produced. |

-   `TranscriptionVerbose`

    With `timestamp_granularities` included in the transcription request, the API returns a `TranscriptionVerbose` object (see the sketch after this list).

    TranscriptionVerbose

    | Field | Type | Description |
    |---|---|---|
    | `task` | string (`"transcribe"`) | Type of task performed. Always `"transcribe"`. |
    | `language` | string | Detected or specified language code (e.g., `en-US`, `fr-FR`). |
    | `duration` | float | Total duration of the audio in seconds. |
    | `text` | string | Complete transcribed text aggregated from all segments. |
    | `segments` | List[Segment] | Segment-level transcription results. Included when `"segment"` is requested in `timestamp_granularities`. |
    | `words` | List[Word] (optional) | Word-level timing and confidence data. Included when `"word"` is requested in `timestamp_granularities`. |
    | `speakers` | List[string] (optional) | List of unique speaker identifiers detected in the audio. |

    Segment (`TranscriptionVerbose.Segment`)

    | Field | Type | Description |
    |---|---|---|
    | `id` | integer | Unique identifier for the segment. |
    | `seek` | float | Offset indicating where the segment starts in the original audio. |
    | `start` | float | Start time of the segment in seconds. |
    | `end` | float | End time of the segment in seconds. |
    | `text` | string | Transcribed text for this segment. |
    | `avg_logprob` | float | Average log probability of word-level confidence scores within the segment. |
    | `compression_ratio` | float | Average characters-per-word compression ratio for the segment. |
    | `speaker_id` | string (optional) | Speaker label (e.g., `"Guest-1"`, `"Guest-2"`, …, `"Guest-N"`, or `"Unknown"`). |

    Word (`TranscriptionVerbose.Word`)

    | Field | Type | Description |
    |---|---|---|
    | `word` | string | Transcribed word text. |
    | `start` | float | Start time of the word in seconds. |
    | `end` | float | End time of the word in seconds. |
    | `confidence` | float (0.0–1.0, optional) | Word-level confidence score. |
    | `segment` | integer (optional) | ID of the segment this word belongs to. |
Streaming¶
Transcription results are returned incrementally as the audio is processed, so partial text can be displayed before the full transcription is complete.
In this mode (stream=True), the API returns an AsyncStream[TranscriptionStreamEvent] object, which yields:
-   `TranscriptionTextDeltaEvent`

    Represents an incremental transcription update emitted during streaming. Provides a newly transcribed text segment (“delta”) as it becomes available, enabling display of partial results (a minimal consumption sketch follows this list).

    | Field | Type | Description |
    |---|---|---|
    | `delta` | string | Newly transcribed text segment emitted as a partial update. |
    | `type` | string (`"transcript.text.delta"`) | Event type identifier. Always `"transcript.text.delta"`. |
    | `logprobs` | array \| null | Optional token-level log probabilities associated with the delta. |

-   `TranscriptionTextDoneEvent`

    Represents the final transcription result emitted at the end of audio processing. Marks the completion of the transcription stream and contains the full transcribed text.

    | Field | Type | Description |
    |---|---|---|
    | `text` | string | Complete transcription of the audio input. |
    | `type` | string (`"transcript.text.done"`) | Event type identifier. Always `"transcript.text.done"`. |
    | `logprobs` | array \| null | Optional token-level log probabilities for the final transcription. |

-   `TranscriptionWordEvent`

    Represents a real-time word-level transcription event with timing and confidence. This event provides detailed word-level information as it becomes available during streaming transcription, including precise timing and confidence scores. Emitted only when `"word"` is included in `timestamp_granularities`.

    | Field | Type | Description |
    |---|---|---|
    | `word` | string | Transcribed word text. |
    | `start` | float | Start time of the word in seconds. |
    | `end` | float | End time of the word in seconds. |
    | `confidence` | float (0.0–1.0) | Confidence score for the word. |
    | `segment` | integer | Segment ID the word belongs to. |
    | `type` | string (`"transcript.word"`) | Event type identifier. Always `"transcript.word"`. |

-   `TranscriptionSegmentEvent`

    Represents a real-time segment-level transcription event with timing and metadata. This event provides detailed segment-level information as it becomes available during streaming transcription, including timing, confidence statistics, and speaker attribution. Emitted only when `"segment"` is included in `timestamp_granularities`.

    | Field | Type | Description |
    |---|---|---|
    | `segment` | TranscriptionVerbose.Segment | Complete segment data with timing and metadata. |
    | `type` | string (`"transcript.segment"`) | Event type identifier. Always `"transcript.segment"`. |
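The sketch below is one way to consume these events, assuming the basic streaming case (no `timestamp_granularities`) in which only delta and done events are emitted; it accumulates partial text and returns the final transcript.

```python
# Sketch: assemble the transcript from a basic streaming response.
# transcript.text.delta events carry partial text in `delta`;
# the final transcript.text.done event carries the complete text in `text`.
async def collect_transcript(transcription_stream) -> str:
    parts: list[str] = []
    async for event in transcription_stream:
        if event.type == "transcript.text.delta":
            parts.append(event.delta)   # partial update
        elif event.type == "transcript.text.done":
            return event.text           # authoritative final transcription
    return "".join(parts)
```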
Example Usage:¶
Batch Inference (Basic response - text transcription only)¶
```python
import asyncio
import os

from air import AsyncAIRefinery
from dotenv import load_dotenv

# Load environment variables from .env file (contains API_KEY)
load_dotenv()
api_key = str(os.getenv("API_KEY"))


async def generate_transcription(file_name):
    # Initialize the async client with your API key
    client = AsyncAIRefinery(api_key=api_key)

    # Open audio file in binary read mode (supports WAV or PCM format)
    audio_file = open(file_name, "rb")

    # Send transcription request and wait for complete result (batch mode)
    # Returns ASRResponse with text, success, error, and confidence fields
    transcription = await client.audio.transcriptions.create(
        model="Azure/AI-Transcription",  # ASR model ID
        file=audio_file,
    )

    # Access the transcribed text from the response
    print(transcription.text)
    return transcription.text


if __name__ == "__main__":
    asyncio.run(generate_transcription("audio/sample1.wav"))
```
Batch Inference (Detailed response - Transcription with Timestamps)¶
```python
import asyncio
import os

from air import AsyncAIRefinery
from air.types.audio import ChunkingStrategy
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()
api_key = str(os.getenv("API_KEY"))


async def generate_verbose_transcription(file_name):
    client = AsyncAIRefinery(api_key=api_key)
    audio_file = open(file_name, "rb")

    # Request verbose transcription with segment and word-level timestamps
    # Returns TranscriptionVerbose with detailed timing and speaker info
    transcription = await client.audio.transcriptions.create(
        model="Azure/AI-Transcription",
        file=audio_file,
        response_format="verbose_json",  # Required for timestamp data
        timestamp_granularities=["segment", "word"],  # Request both segment and word timestamps
        # Configure Voice Activity Detection (VAD) for chunking
        chunking_strategy=ChunkingStrategy(
            type="server_vad",  # Use server-side VAD
            prefix_padding_ms=4000,  # Keep 4s of audio before detected speech
            silence_duration_ms=5000,  # End chunk after 5s of silence
            threshold=1,  # VAD sensitivity (currently ignored by server)
        ),
    )

    # Access aggregated transcription text and total audio duration
    print(f"Full text: {transcription.text}")
    print(f"Duration: {transcription.duration}s")

    # Iterate through segments with timing and speaker attribution
    for segment in transcription.segments:
        print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.speaker_id}: {segment.text}")

    return transcription


if __name__ == "__main__":
    asyncio.run(generate_verbose_transcription("audio/sample1.wav"))
```
Streaming Inference (Basic response - text transcription only)¶
```python
import asyncio
import os

from air import AsyncAIRefinery
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()
api_key = str(os.getenv("API_KEY"))


async def generate_transcription(file_name):
    client = AsyncAIRefinery(api_key=api_key)
    audio_file = open(file_name, "rb")

    # Enable streaming mode to receive transcription results incrementally
    # Returns AsyncStream[TranscriptionStreamEvent] for real-time processing
    transcription_stream = await client.audio.transcriptions.create(
        model="Azure/AI-Transcription",
        file=audio_file,
        stream=True,  # Enable streaming mode
    )

    print("\n[Streaming Transcription Output]")
    # Iterate over stream events as they arrive
    # Events: TranscriptionTextDeltaEvent (partial) and TranscriptionTextDoneEvent (final)
    async for event in transcription_stream:
        print(event)


if __name__ == "__main__":
    asyncio.run(generate_transcription("audio/sample1.wav"))
```
Streaming Inference (Detailed response - Transcription with Timestamps)¶
```python
import asyncio
import os

from air import AsyncAIRefinery
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()
api_key = str(os.getenv("API_KEY"))


async def generate_streaming_transcription_with_timestamps(file_name):
    client = AsyncAIRefinery(api_key=api_key)
    audio_file = open(file_name, "rb")

    # Combine streaming with verbose output for real-time timestamps
    # Emits word and segment events as audio is processed
    transcription_stream = await client.audio.transcriptions.create(
        model="Azure/AI-Transcription",
        file=audio_file,
        response_format="verbose_json",  # Required for timestamp events
        stream=True,  # Enable streaming mode
        timestamp_granularities=["segment", "word"],  # Request both granularities
    )

    print("\n[Streaming Transcription with Timestamps]")
    # Process each event based on its type
    async for event in transcription_stream:
        if hasattr(event, "type"):
            event_type = event.type

            # TranscriptionTextDeltaEvent: incremental text updates (delta field)
            if event_type == "transcript.text.delta":
                delta = getattr(event, "delta", "")
                print(f"Delta: {delta}")

            # TranscriptionWordEvent: word-level timing and confidence
            elif event_type == "transcript.word":
                word = getattr(event, "word", "")
                start = getattr(event, "start", 0)
                end = getattr(event, "end", 0)
                confidence = getattr(event, "confidence", 0)
                print(f"Word: {word} [{start:.2f}s - {end:.2f}s] (confidence: {confidence:.2f})")

            # TranscriptionSegmentEvent: segment with speaker attribution
            elif event_type == "transcript.segment":
                segment = getattr(event, "segment", {})
                start = segment.get("start", 0)
                end = segment.get("end", 0)
                speaker_id = segment.get("speaker_id", "Unknown")
                text = segment.get("text", "")
                print(f"Segment: [{start:.2f}s - {end:.2f}s] {speaker_id}: {text}")

            # TranscriptionTextDoneEvent: final complete transcription
            elif event_type == "transcript.text.done":
                text = getattr(event, "text", "")
                print(f"\nFinal text: {text}")


if __name__ == "__main__":
    asyncio.run(generate_streaming_transcription_with_timestamps("audio/sample1.wav"))
```
Synchronous Transcription¶
AIRefinery.audio.transcriptions.create()¶
This method synchronously generates the text transcription of an input audio file. It supports the same parameters and return structure as the asynchronous method.
Example Usage:¶
Batch Inference (Basic response - text transcription only)¶
```python
import os

from air import AIRefinery
from dotenv import load_dotenv

# Load environment variables from .env file (contains API_KEY)
load_dotenv()
api_key = str(os.getenv("API_KEY"))


def generate_transcription(file_name):
    # Initialize the synchronous client with your API key
    client = AIRefinery(api_key=api_key)

    # Open audio file in binary read mode (supports WAV or PCM format)
    audio_file = open(file_name, "rb")

    # Send transcription request and wait for complete result (batch mode)
    # Returns ASRResponse with text, success, error, and confidence fields
    transcription = client.audio.transcriptions.create(
        model="Azure/AI-Transcription",  # ASR model ID
        file=audio_file,
    )

    # Access the transcribed text from the response
    print(transcription.text)
    return transcription.text


if __name__ == "__main__":
    generate_transcription("audio/sample1.wav")
```
Batch Inference (Detailed response - Transcription with Timestamps)¶
```python
import os

from air import AIRefinery
from air.types.audio import ChunkingStrategy
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()
api_key = str(os.getenv("API_KEY"))


def generate_verbose_transcription(file_name):
    client = AIRefinery(api_key=api_key)
    audio_file = open(file_name, "rb")

    # Request verbose transcription with segment and word-level timestamps
    # Returns TranscriptionVerbose with detailed timing and speaker info
    transcription = client.audio.transcriptions.create(
        model="Azure/AI-Transcription",
        file=audio_file,
        response_format="verbose_json",  # Required for timestamp data
        timestamp_granularities=["segment", "word"],  # Request both segment and word timestamps
        # Configure Voice Activity Detection (VAD) for chunking
        chunking_strategy=ChunkingStrategy(
            type="server_vad",  # Use server-side VAD
            prefix_padding_ms=4000,  # Keep 4s of audio before detected speech
            silence_duration_ms=5000,  # End chunk after 5s of silence
            threshold=1,  # VAD sensitivity (currently ignored by server)
        ),
    )

    # Access aggregated transcription text and total audio duration
    print(f"Full text: {transcription.text}")
    print(f"Duration: {transcription.duration}s")

    # Iterate through segments with timing and speaker attribution
    for segment in transcription.segments:
        print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.speaker_id}: {segment.text}")

    return transcription


if __name__ == "__main__":
    generate_verbose_transcription("audio/sample1.wav")
```
Streaming Inference (Basic response - text transcription only)¶
```python
import os

from air import AIRefinery
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()
api_key = str(os.getenv("API_KEY"))


def generate_transcription(file_name):
    client = AIRefinery(api_key=api_key)
    audio_file = open(file_name, "rb")

    # Enable streaming mode to receive transcription results incrementally
    # Returns Stream[TranscriptionStreamEvent] for real-time processing
    transcription_stream = client.audio.transcriptions.create(
        model="Azure/AI-Transcription",
        file=audio_file,
        stream=True,  # Enable streaming mode
    )

    # Iterate over stream events as they arrive
    # Events: TranscriptionTextDeltaEvent (partial) and TranscriptionTextDoneEvent (final)
    for event in transcription_stream:
        print(event)


if __name__ == "__main__":
    generate_transcription("audio/sample1.wav")
```
Streaming Inference (Detailed response - Transcription with Timestamps)¶
```python
import os

from air import AIRefinery
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()
api_key = str(os.getenv("API_KEY"))


def generate_streaming_transcription_with_timestamps(file_name):
    client = AIRefinery(api_key=api_key)
    audio_file = open(file_name, "rb")

    # Combine streaming with verbose output for real-time timestamps
    # Emits word and segment events as audio is processed
    transcription_stream = client.audio.transcriptions.create(
        model="Azure/AI-Transcription",
        file=audio_file,
        response_format="verbose_json",  # Required for timestamp events
        stream=True,  # Enable streaming mode
        timestamp_granularities=["segment", "word"],  # Request both granularities
    )

    print("\n[Streaming Transcription with Timestamps]")
    # Process each event based on its type
    for event in transcription_stream:
        if hasattr(event, "type"):
            event_type = event.type

            # TranscriptionTextDeltaEvent: incremental text updates (delta field)
            if event_type == "transcript.text.delta":
                delta = getattr(event, "delta", "")
                print(f"Delta: {delta}")

            # TranscriptionWordEvent: word-level timing and confidence
            elif event_type == "transcript.word":
                word = getattr(event, "word", "")
                start = getattr(event, "start", 0)
                end = getattr(event, "end", 0)
                confidence = getattr(event, "confidence", 0)
                print(f"Word: {word} [{start:.2f}s - {end:.2f}s] (confidence: {confidence:.2f})")

            # TranscriptionSegmentEvent: segment with speaker attribution
            elif event_type == "transcript.segment":
                segment = getattr(event, "segment", {})
                start = segment.get("start", 0)
                end = segment.get("end", 0)
                speaker_id = segment.get("speaker_id", "Unknown")
                text = segment.get("text", "")
                print(f"Segment: [{start:.2f}s - {end:.2f}s] {speaker_id}: {text}")

            # TranscriptionTextDoneEvent: final complete transcription
            elif event_type == "transcript.text.done":
                text = getattr(event, "text", "")
                print(f"\nFinal text: {text}")


if __name__ == "__main__":
    generate_streaming_transcription_with_timestamps("audio/sample1.wav")
```