The Realtime component provides end-to-end speech-to-speech communication, combining STT, LLM, and TTS functionality in a single, optimized interface. It delivers ultra-low latency speech processing, direct audio streaming without intermediate text conversion, and support for multiple modalities (audio, video, text).

When to Use Realtime

Use a Realtime LLM when you want the lowest latency voice interactions. The model handles speech recognition, response generation, and speech synthesis natively—no separate STT or TTS services required. Use the traditional STT → LLM → TTS pipeline when you need custom voices (e.g., Cartesia, ElevenLabs), specific transcription providers, or models that don’t support realtime audio.
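For comparison, a pipeline-style Agent wires the stages explicitly. The sketch below is illustrative only: the `deepgram` and `elevenlabs` plugin modules and the `stt=`/`tts=` constructor parameters are assumptions for illustration, not confirmed APIs — check the integration docs for the exact names.

```python
# Hypothetical pipeline configuration (plugin names and Agent parameters
# below are illustrative assumptions, not confirmed APIs).
from vision_agents.plugins import deepgram, elevenlabs, getstream, openai
from vision_agents.core.agents import Agent
from vision_agents.core.edge.types import User

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You're a helpful voice assistant",
    stt=deepgram.STT(),              # dedicated transcription provider
    llm=openai.LLM(model="gpt-4o"),  # text-only LLM
    tts=elevenlabs.TTS(),            # custom voice provider
    processors=[],
)
```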

Supported Providers

Realtime LLMs are available through the OpenAI, Gemini, AWS Bedrock, and Qwen integrations.

Basic Usage

from vision_agents.plugins import openai, getstream
from vision_agents.core.agents import Agent
from vision_agents.core.edge.types import User

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You're a helpful voice assistant",
    llm=openai.Realtime(model="gpt-realtime", voice="marin"),
    processors=[]
)

Methods

simple_response(text, processors=None, participant=None)

Sends a text prompt to the realtime model. The model responds with audio.
await agent.llm.simple_response("What do you see in the video?")

simple_audio_response(pcm, participant=None)

Sends raw PCM audio data directly to the model for processing.
await agent.llm.simple_audio_response(audio_pcm_data)
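The expected PCM format (sample rate, bit depth, channel count) depends on the provider. As a minimal sketch, assuming 16 kHz mono 16-bit little-endian PCM, a test tone can be built from the standard library:

```python
import math
import struct

def make_test_tone(freq_hz: float = 440.0, sample_rate: int = 16_000,
                   duration_s: float = 1.0) -> bytes:
    """Generate mono 16-bit little-endian PCM for a sine tone."""
    n_samples = int(sample_rate * duration_s)
    samples = (
        int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * t / sample_rate))
        for t in range(n_samples)
    )
    return struct.pack(f"<{n_samples}h", *samples)

# One second of audio: 16,000 samples x 2 bytes each = 32,000 bytes
audio_pcm_data = make_test_tone()
```

Verify your provider's expected sample rate and frame size before sending audio this way.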

Properties

| Property | Type | Description |
| --- | --- | --- |
| `connected` | `bool` | True if the realtime session is active |
| `fps` | `int` | Video frames per second sent to the model (default: 1) |
| `session_id` | `str` | UUID identifying the current session |
| `epoch` | `int` | Monotonic interruption counter; increments each time `interrupt()` is called, allowing stale audio output events to be identified and dropped |

Methods

interrupt()

Increments the epoch counter so that any in-flight RealtimeAudioOutputEvent from a previous response is detected as stale and discarded by the Agent. The Agent calls this automatically on barge-in.
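The epoch mechanism can be illustrated with a small self-contained sketch (not the library's actual implementation): each audio event records the epoch at which it was produced, and any event from before the most recent `interrupt()` is treated as stale.

```python
from dataclasses import dataclass

@dataclass
class AudioOutputEvent:
    epoch: int   # epoch captured when the audio was produced
    data: bytes

class RealtimeSketch:
    def __init__(self) -> None:
        self.epoch = 0  # monotonic interruption counter

    def interrupt(self) -> None:
        # Bump the epoch; anything produced earlier is now stale.
        self.epoch += 1

    def is_stale(self, event: AudioOutputEvent) -> bool:
        return event.epoch < self.epoch

rt = RealtimeSketch()
event = AudioOutputEvent(epoch=rt.epoch, data=b"\x00\x00")
rt.interrupt()        # barge-in: epoch goes from 0 to 1
rt.is_stale(event)    # the in-flight event is now stale and can be dropped
```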

Events

The Realtime class emits events for monitoring conversations:
| Event | Description |
| --- | --- |
| `RealtimeConnectedEvent` | Connection established, with session config and capabilities |
| `RealtimeDisconnectedEvent` | Connection closed (includes reason and clean-close flag) |
| `RealtimeUserSpeechTranscriptionEvent` | Transcript of user speech |
| `RealtimeAgentSpeechTranscriptionEvent` | Transcript of agent speech |
| `RealtimeResponseEvent` | AI response text (with `is_complete` flag) |
| `RealtimeAudioInputEvent` | Audio sent to the realtime LLM |
| `RealtimeAudioOutputEvent` | Audio received from the realtime LLM |
| `RealtimeAudioOutputDoneEvent` | Audio output complete for a response |
| `RealtimeConversationItemEvent` | Conversation state update (message, function call, etc.) |
| `RealtimeErrorEvent` | Error during processing (with recoverability flag) |

from vision_agents.core.llm.events import RealtimeUserSpeechTranscriptionEvent

@agent.llm.events.on(RealtimeUserSpeechTranscriptionEvent)
async def on_user_speech(event):
    print(f"User said: {event.text}")

For provider-specific parameters and configuration, see the integration docs for OpenAI, Gemini, AWS Bedrock, or Qwen.