LLMs that are not running as a Realtime model need help converting the user's speech into text and converting the LLM's responses into audio the user can hear. To achieve this, the Agent class exposes two parameters, `tts` and `stt`, allowing developers to pass in any text-to-speech and speech-to-text service they like. Using this approach, output voices can be configured, the transcription rate can be adjusted, and more. Internally, the Agent class manages the coordination between these services, for example setting up the audio track for the STT provider.

## STT (Speech-to-Text)

STT components convert audio input into text for processing by the LLM. All implementations follow a standardised interface with consistent event emission. These components process real-time audio as `PcmData` objects from `getstream.video.rtc.track_util`, provide partial transcripts for responsive UI updates, and include comprehensive error handling and connection management. Multiple providers are supported, including Deepgram, ElevenLabs, Fast Whisper, and others. Every STT provider requires `await stt.start()` to be called before processing audio, to initialise connections and resources. Some STT providers include built-in turn detection (indicated by the `turn_detection` property); when this is the case, the Agent automatically skips any separately configured `TurnDetector` to avoid conflicts.
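The turn-detection skip described above can be sketched as follows. All class names here are illustrative stand-ins, not the SDK's actual implementation; only the `stt`, `tts`, and `turn_detection` names come from the docs.

```python
# Sketch of how an Agent-like coordinator might decide whether to use a
# separately configured turn detector. Class names are illustrative stand-ins.

class SketchSTT:
    """Toy STT with built-in turn detection."""
    turn_detection = True

class SketchTurnDetector:
    """Toy standalone turn detector."""

class SketchAgent:
    def __init__(self, stt, tts=None, turn_detector=None):
        self.stt = stt
        self.tts = tts
        # If the STT provider detects turns itself, drop the separate
        # detector so the two components don't fight over turn events.
        if stt is not None and getattr(stt, "turn_detection", False):
            self.turn_detector = None
        else:
            self.turn_detector = turn_detector

agent = SketchAgent(stt=SketchSTT(), turn_detector=SketchTurnDetector())
print(agent.turn_detector)  # None: the STT's built-in detection wins
```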

### STT Methods

| Method | Description |
| --- | --- |
| `start()` | Initialize connections and resources. Must be called before use. |
| `process_audio(pcm_data, participant)` | Process an audio frame (~20ms chunks). |
| `clear()` | Clear any pending audio or internal state. |
| `close()` | Clean up resources. |
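The lifecycle in the table maps to roughly the shape below. `ToySTT` is a hand-rolled stand-in, not a real provider class; its buffering and "transcription" are invented for illustration, but the method names and calling order match the documented interface.

```python
import asyncio

# Toy STT illustrating the documented lifecycle: start() before any audio,
# process_audio() once per ~20ms frame, then close().

class ToySTT:
    def __init__(self):
        self.started = False
        self.transcripts = []
        self._buffer = []

    async def start(self):
        # A real provider would open its connection/websocket here.
        self.started = True

    async def process_audio(self, pcm_data, participant=None):
        if not self.started:
            raise RuntimeError("call start() before processing audio")
        self._buffer.append(pcm_data)

    async def clear(self):
        self._buffer.clear()

    async def close(self):
        # Flush whatever is buffered as a final fake "transcript".
        if self._buffer:
            self.transcripts.append(f"{len(self._buffer)} frames transcribed")
        self.started = False

async def main():
    stt = ToySTT()
    await stt.start()
    for frame in [b"\x00" * 640] * 50:  # ~1s of fake 20ms PCM frames
        await stt.process_audio(frame)
    await stt.close()
    return stt.transcripts

print(asyncio.run(main()))  # ['50 frames transcribed']
```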

### STT Events

| Event | Description |
| --- | --- |
| `STTTranscriptEvent` | Final transcript result. |
| `STTPartialTranscriptEvent` | Interim transcript for real-time display. |
| `STTErrorEvent` | Temporary, recoverable error. |
| `STTConnectionEvent` | Connection state change (`DISCONNECTED`, `CONNECTING`, `CONNECTED`, `RECONNECTING`, `ERROR`). |
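A typical partial-vs-final flow looks like this sketch. The event class names come from the table above, but their fields and the `on()`/`emit()` helpers are assumptions for illustration: partials drive a live caption, and the final transcript is what you would hand to the LLM.

```python
from dataclasses import dataclass

# Illustrative event classes; field names are assumed, not documented.
@dataclass
class STTPartialTranscriptEvent:
    text: str

@dataclass
class STTTranscriptEvent:
    text: str

class Emitter:
    """Minimal stand-in for the STT's event emission."""
    def __init__(self):
        self._handlers = {}

    def on(self, event_type, handler):
        self._handlers.setdefault(event_type, []).append(handler)

    def emit(self, event):
        for handler in self._handlers.get(type(event), []):
            handler(event)

ui_line = []
stt = Emitter()
# Partials update a live caption; the final replaces it and goes to the LLM.
stt.on(STTPartialTranscriptEvent, lambda e: ui_line.append(("partial", e.text)))
stt.on(STTTranscriptEvent, lambda e: ui_line.append(("final", e.text)))

stt.emit(STTPartialTranscriptEvent("hel"))
stt.emit(STTPartialTranscriptEvent("hello wor"))
stt.emit(STTTranscriptEvent("hello world"))
print(ui_line[-1])  # ('final', 'hello world')
```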

## TTS (Text-to-Speech)

TTS components convert LLM responses into audio output, handling synthesis and streaming to the output track. They provide streaming audio synthesis for low latency, multiple voice options and customisation, and audio format standardisation using `PcmData` and `AudioFormat` from `getstream.video.rtc.track_util`, with support for providers such as ElevenLabs, Cartesia, and others.

### TTS Methods

| Method | Description |
| --- | --- |
| `set_output_format(sample_rate, channels, audio_format)` | Configure the output audio format. Audio is automatically resampled and re-channeled to match. |
| `send(text, participant=None)` | Convert text to speech and emit audio events. |
| `stop_audio()` | Clear the audio queue and stop current playback. |
| `close()` | Clean up resources. |
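The methods above can be exercised as in this sketch. `ToyTTS` is a stand-in whose "synthesis" (one chunk per word) is invented; only the method names and signatures follow the table, and a real implementation would resample actual PCM rather than format strings.

```python
import asyncio

# Toy TTS following the documented method shapes; internals are invented.
class ToyTTS:
    def __init__(self):
        self.sample_rate = 16000
        self.channels = 1
        self.audio_chunks = []

    def set_output_format(self, sample_rate, channels, audio_format="s16"):
        # A real implementation resamples/re-channels audio to match.
        self.sample_rate = sample_rate
        self.channels = channels

    async def send(self, text, participant=None):
        # Pretend each word becomes one synthesised audio chunk.
        for word in text.split():
            self.audio_chunks.append(f"<pcm:{word}@{self.sample_rate}Hz>")

    def stop_audio(self):
        self.audio_chunks.clear()

    async def close(self):
        pass

async def main():
    tts = ToyTTS()
    tts.set_output_format(sample_rate=48000, channels=2)
    await tts.send("hello there")
    return tts.audio_chunks

print(asyncio.run(main()))  # ['<pcm:hello@48000Hz>', '<pcm:there@48000Hz>']
```

Calling `set_output_format` before the first `send` ensures every emitted chunk already matches the output track's format.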

### TTS Events

| Event | Description |
| --- | --- |
| `TTSAudioEvent` | Audio chunk ready for playback. |
| `TTSSynthesisStartEvent` | Synthesis has begun for a text input. |
| `TTSSynthesisCompleteEvent` | Synthesis finished (includes metrics such as `synthesis_time_ms`, `chunk_count`, `real_time_factor`). |
| `TTSErrorEvent` | Synthesis error (with recoverability flag). |
| `TTSConnectionEvent` | Connection state change. |
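Of the `TTSSynthesisCompleteEvent` metrics, `real_time_factor` is commonly defined as synthesis time divided by the duration of the audio produced; whether the SDK uses exactly this formula is an assumption. Values below 1.0 mean synthesis runs faster than playback.

```python
# Common definition of real-time factor (an assumption about the SDK's metric):
# processing time over produced-audio duration. Below 1.0 = faster than real time.
def real_time_factor(synthesis_time_ms: float, audio_duration_ms: float) -> float:
    return synthesis_time_ms / audio_duration_ms

# 300ms to synthesise 2s of speech: comfortably faster than real time.
print(real_time_factor(300, 2000))  # 0.15
```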

## Interruption support

The TTS base class exposes an `epoch` property and an `interrupt()` method for handling barge-in scenarios:

| Member | Type | Description |
| --- | --- | --- |
| `epoch` | `int` | Monotonic counter that increments on each interruption. Used to identify stale audio events. |
| `interrupt()` | async | Increments the epoch and stops the current audio synthesis. Stale events are automatically dropped. |

You do not need to call `interrupt()` manually; the Agent class invokes it when a `TurnStartedEvent` is received.
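The epoch mechanism can be sketched as follows. Each audio event is stamped with the epoch current at synthesis time; after an interruption the epoch advances, so late-arriving events from the old synthesis no longer match and are dropped. The internals here are invented; only `epoch` and `interrupt()` are documented members.

```python
import asyncio

# Sketch of the epoch/interrupt() pattern for barge-in handling.
class ToyTTS:
    def __init__(self):
        self.epoch = 0
        self.played = []

    async def interrupt(self):
        # Everything stamped with the old epoch is now stale.
        self.epoch += 1

    def on_audio_event(self, chunk, event_epoch):
        if event_epoch != self.epoch:
            return  # stale chunk from before the interruption: drop it
        self.played.append(chunk)

async def main():
    tts = ToyTTS()
    first = tts.epoch
    tts.on_audio_event("chunk-1", first)
    await tts.interrupt()                     # user barged in (TurnStartedEvent)
    tts.on_audio_event("chunk-2", first)      # stale, silently dropped
    tts.on_audio_event("chunk-3", tts.epoch)  # new turn's audio plays
    return tts.played

print(asyncio.run(main()))  # ['chunk-1', 'chunk-3']
```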