The Realtime component provides end-to-end speech-to-speech communication, combining speech-to-text (STT), LLM, and text-to-speech (TTS) functionality in a single interface. It offers low-latency speech processing, direct audio streaming without intermediate text conversion, and support for multiple modalities (audio, video, text).

When to Use Realtime

Use a Realtime LLM when you want the lowest-latency voice interactions. The model handles speech recognition, response generation, and speech synthesis natively, so no separate STT or TTS services are required. Use the traditional STT → LLM → TTS pipeline when you need custom voices (e.g., Cartesia, ElevenLabs), specific transcription providers, or models that don’t support realtime audio.

Supported Providers

OpenAI, Gemini, AWS Bedrock, and Qwen.

Basic Usage

from vision_agents.plugins import openai, getstream
from vision_agents.core.agents import Agent
from vision_agents.core.edge.types import User

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You're a helpful voice assistant",
    llm=openai.Realtime(model="gpt-realtime", voice="marin"),
    processors=[]
)

Agent Methods with Realtime

await agent.simple_response("What do you see in the video?", interrupt=True)

Use agent.simple_response(...) to inject text prompts and agent.say(...) for scripted speech. You usually do not call realtime audio methods directly from app code.
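As a rough illustration of how these two calls might be combined, the sketch below wraps them in a hypothetical helper, greet_then_ask. It assumes agent.say(...) is awaitable and takes the text as its first argument; a small stand-in class replaces the real Agent so the snippet runs without a live session.

```python
import asyncio


class StubAgent:
    """Stand-in for Agent that records calls instead of talking to a model."""

    def __init__(self):
        self.calls = []

    async def say(self, text):
        # Assumption: the real agent.say(...) is awaitable and takes the text first.
        self.calls.append(("say", text))

    async def simple_response(self, text, interrupt=False):
        self.calls.append(("simple_response", text, interrupt))


async def greet_then_ask(agent, question):
    """Hypothetical helper: speak a fixed greeting, then prompt the model."""
    # Scripted speech: the exact words are chosen by the app, not the model.
    await agent.say("Hi, I'm ready when you are.")
    # Text prompt: the model generates the reply; interrupt=True mirrors
    # the call shown in the documentation.
    await agent.simple_response(question, interrupt=True)


stub = StubAgent()
asyncio.run(greet_then_ask(stub, "What do you see in the video?"))
```

With a real Agent, the same helper would be awaited from inside the application's own async code rather than through asyncio.run on a stub.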

Properties

| Property | Type | Description |
| --- | --- | --- |
| connected | bool | True if the realtime session is active |
| fps | int | Video frames per second sent to the model (default: 1) |
| session_id | str | UUID identifying the current session |
| epoch | int | Monotonic interruption counter. Increments each time interrupt() is called, allowing stale audio output events to be identified and dropped. |
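
To make the fps property more concrete, here is a minimal sketch of time-based frame throttling: a frame is forwarded only when at least 1/fps seconds have passed since the last forwarded one. FrameThrottle and should_forward are hypothetical names used for illustration; this page does not describe how the library actually samples frames.

```python
class FrameThrottle:
    """Hypothetical helper: lets through at most `fps` frames per second."""

    def __init__(self, fps=1):
        if fps <= 0:
            raise ValueError("fps must be positive")
        self.min_interval = 1.0 / fps
        self._last_sent = None  # timestamp of the last forwarded frame

    def should_forward(self, timestamp):
        """Return True if a frame captured at `timestamp` (seconds) should be sent."""
        if self._last_sent is None or timestamp - self._last_sent >= self.min_interval:
            self._last_sent = timestamp
            return True
        return False


# A 4 fps source (one frame every 0.25 s) throttled to the default of 1 fps.
throttle = FrameThrottle(fps=1)
timestamps = [i * 0.25 for i in range(12)]  # 0.0, 0.25, ..., 2.75
forwarded = [t for t in timestamps if throttle.should_forward(t)]
```

In this run only the frames at 0.0, 1.0, and 2.0 seconds are forwarded; the other nine are skipped.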

Realtime Methods

interrupt()

Increments the epoch counter so that any in-flight audio output from a previous response is detected as stale and discarded by the Agent. The Agent calls this automatically on barge-in.
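
The epoch mechanism can be sketched in a few lines. This is an illustrative model only, not the library's implementation: EpochSession, AudioChunk, emit, and is_stale are hypothetical names, and the sketch assumes each audio output event is tagged with the epoch that was current when it was produced.

```python
from dataclasses import dataclass


@dataclass
class AudioChunk:
    """Hypothetical audio output event, tagged with the epoch it belongs to."""
    epoch: int
    data: bytes


class EpochSession:
    """Toy model of a session with a monotonic interruption counter."""

    def __init__(self):
        self.epoch = 0

    def interrupt(self):
        # Bump the counter; everything tagged with an older epoch is now stale.
        self.epoch += 1

    def emit(self, data):
        # Tag outgoing audio with the current epoch.
        return AudioChunk(epoch=self.epoch, data=data)

    def is_stale(self, chunk):
        return chunk.epoch < self.epoch


session = EpochSession()
in_flight = [session.emit(b"old-1"), session.emit(b"old-2")]

session.interrupt()  # user barges in while the old chunks are still queued
fresh = session.emit(b"new-1")

# The consumer drops anything produced before the interruption.
playable = [c.data for c in in_flight + [fresh] if not session.is_stale(c)]
```

Comparing integers is enough here because the counter only ever increases, so a chunk never needs to be cancelled individually.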

Events

The Realtime class emits a small set of events for connection state:

| Event | Description |
| --- | --- |
| RealtimeConnectedEvent | Connection established with session config & capabilities |
| RealtimeDisconnectedEvent | Connection closed (includes reason and clean flag) |

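
A handler for the disconnect event will often branch on the clean flag. The sketch below shows one way that decision might look. DisconnectInfo is a stand-in dataclass rather than the library's event class, the attribute names reason and clean are assumed from the description above, and should_reconnect is a hypothetical policy function.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DisconnectInfo:
    """Stand-in for the disconnect event payload (attribute names assumed)."""
    reason: Optional[str]
    clean: bool


def should_reconnect(info: DisconnectInfo, attempts: int, max_attempts: int = 3) -> bool:
    """Hypothetical policy: retry only after unclean closes, up to a limit."""
    if info.clean:
        # Assumption: a clean close means the session ended on purpose.
        return False
    return attempts < max_attempts


dropped = DisconnectInfo(reason="network error", clean=False)
ended = DisconnectInfo(reason="session closed", clean=True)
```

Here should_reconnect(dropped, attempts=0) allows a retry, while should_reconnect(ended, attempts=0) does not.
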
For conversation events, subscribe to the agent-level events. UserTranscriptEvent fires in both classic STT and realtime modes:

from vision_agents.core.agents.events import UserTranscriptEvent

@agent.events.subscribe
async def on_user_speech(event: UserTranscriptEvent):
    print(f"User said: {event.text}")

See Events Reference for the full event surface, including LLM, tool, and error events.

For provider-specific parameters and configuration, see the integration docs for OpenAI, Gemini, AWS Bedrock, or Qwen.