The Agent class is the central orchestrator of the Vision Agents framework. It manages conversation flow, handles real-time audio/video processing, coordinates responses, and integrates with external tools via MCP (Model Context Protocol) servers. As the main interface for building AI-powered video and voice applications, it supports both traditional STT/TTS workflows and modern realtime speech-to-speech models, making it flexible for a wide range of use cases.
from vision_agents.core import agents
from vision_agents.plugins import openai, deepgram, elevenlabs, getstream
from vision_agents.core.edge.types import User

# Traditional STT/TTS mode
agent = agents.Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You're a helpful AI assistant",
    llm=openai.LLM(model="gpt-5.4"),
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
    processors=[yolo_processor]  # e.g. a YOLO-based video processor created elsewhere
)

# Realtime mode
agent = agents.Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You're a helpful AI assistant",
    llm=openai.Realtime(model="gpt-realtime", voice="marin"),
    processors=[yolo_processor]  # e.g. a YOLO-based video processor created elsewhere
)

Constructor Parameters

  • edge (EdgeTransport): The edge network provider for video & audio transport (you can choose any provider here)
  • llm (LLM | AudioLLM | VideoLLM): The language model — can be a text LLM, audio-capable (AudioLLM), video-capable (VideoLLM), or a combined Realtime model
  • agent_user (User): The agent’s user information (name, id, etc.)
Optional Parameters
  • instructions (str): System instructions for the agent. Supports @file.md references to load instructions from markdown files (default: “Keep your replies short and dont use special characters.”)
  • stt (Optional[STT]): Speech-to-text service (not needed for realtime mode)
  • tts (Optional[TTS]): Text-to-speech service (not needed for realtime mode)
  • turn_detection (Optional[TurnDetector]): Turn detection service for managing conversation turns. Ignored automatically if the configured STT plugin already provides built-in turn detection (e.g., elevenlabs.STT, deepgram.STT with eager_turn_detection)
  • vad (Optional[VAD]): Voice activity detection service
  • processors (Optional[List[Processor]]): List of processors for video/audio processing
  • mcp_servers (Optional[List[MCPBaseServer]]): MCP servers for external tool access
  • options (Optional[AgentOptions]): Configuration options including model directory path
  • streaming_tts (bool): Stream TTS audio chunks as sentences complete, rather than waiting for the full LLM response (default: False)
  • broadcast_metrics (bool): Broadcast agent metrics to call participants via custom events (default: False)
  • broadcast_metrics_interval (float): Interval in seconds between metrics broadcasts (default: 5.0)
  • multi_speaker_filter (Optional[AudioFilter]): Filter for multi-speaker audio routing. Defaults to FirstSpeakerWinsFilter, which locks onto the first active speaker and drops other participants’ audio until that speaker finishes
  • tracer (Tracer): OpenTelemetry tracer for distributed tracing (default: trace.get_tracer("agents"))

Core Lifecycle Methods

async join(call: Call, participant_wait_timeout: Optional[float] = 10.0) -> AsyncIterator[None] Joins a video call. Must be used as an async context manager. The agent can join a call only once; once the call ends, the agent closes itself. Parameters
  • call (Call): the call to join.
  • participant_wait_timeout (Optional[float]): timeout in seconds to wait for other participants to join before proceeding. If 0, do not wait at all; if None, wait forever (default: 10.0).
async with agent.join(call):
    # Agent is now active in the call
    await agent.finish()  # Wait for call to end
async finish() Waits for the call to end gracefully by subscribing to the call-ended event.
async close() Cleans up all connections and resources. Safe to call multiple times.
async create_user() Creates the agent user in the edge provider if required.

Response Methods

async simple_response(text: str, participant: Optional[Participant] = None) Sends a text prompt to the LLM for processing. The LLM generates a response which is then sent through TTS (or directly as audio in realtime mode).
await agent.simple_response("Hello, how can I help you?")
async simple_audio_response(pcm: PcmData, participant: Optional[Participant] = None) Sends raw PCM audio directly to the LLM for processing. Only works with AudioLLM implementations (e.g., realtime models).
await agent.simple_audio_response(audio_pcm_data)
async say(text: str, user_id: Optional[str] = None, metadata: Optional[dict] = None) Makes the agent speak text directly using TTS, bypassing the LLM entirely. Useful for greetings, status updates, or scripted responses.
await agent.say("Welcome to the call!")

Monitoring Methods

idle_for() -> float Returns the number of seconds since the last participant joined or left the call.
on_call_for() -> float Returns the number of seconds since the agent joined the call.
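A common use of idle_for() is an idle watchdog that shuts the agent down once the call has been quiet for a while. The helper below is illustrative; its name, threshold, and polling interval are ours, not part of the Agent API:

```python
import asyncio

IDLE_TIMEOUT = 60.0  # illustrative threshold, in seconds

async def leave_when_idle(agent, poll_interval: float = 5.0):
    # Poll idle_for() and close the agent once no participant has
    # joined or left the call for IDLE_TIMEOUT seconds.
    while agent.idle_for() <= IDLE_TIMEOUT:
        await asyncio.sleep(poll_interval)
    await agent.close()
```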

MCP Integration

The Agent supports Model Context Protocol (MCP) for external tool integration:
from vision_agents.core.mcp import MCPServerRemote

# Create MCP server
github_server = MCPServerRemote(
    url="https://api.githubcopilot.com/mcp/",
    headers={"Authorization": f"Bearer {github_pat}"}
)

# Add to agent
agent = agents.Agent(
    # ... other parameters
    mcp_servers=[github_server]
)
MCP tools are automatically registered with the LLM’s function registry and can be called during conversations. See our MCP guide for more detail.

Event System

The Agent makes it easy to subscribe to events from every component in one place. The event system merges events from the core and all plugins, so you can listen for any event by its type.

Core Events

  • Audio Events: AudioReceivedEvent, TrackAddedEvent, TrackRemovedEvent
  • Transcript Events: STTTranscriptEvent, STTPartialTranscriptEvent
  • LLM Events: LLMRequestStartedEvent, LLMResponseChunkEvent, LLMResponseCompletedEvent, LLMErrorEvent
  • Tool Events: ToolStartEvent, ToolEndEvent
  • Turn Detection Events: TurnStartedEvent, TurnEndedEvent
  • Realtime Events: RealtimeConnectedEvent, RealtimeDisconnectedEvent, RealtimeAudioInputEvent, RealtimeAudioOutputEvent, RealtimeAudioOutputDoneEvent, RealtimeResponseEvent, RealtimeConversationItemEvent, RealtimeUserSpeechTranscriptionEvent, RealtimeAgentSpeechTranscriptionEvent, RealtimeErrorEvent
  • Agent Events: AgentSayEvent, AgentSayStartedEvent, AgentSayCompletedEvent, AgentSayErrorEvent
  • Call Events: CallEndedEvent, CallSessionParticipantJoinedEvent

Event Subscription

@agent.events.subscribe
async def on_audio_received(event: AudioReceivedEvent):
    # Handle audio data
    pass

Debugging with local video files

For testing and debugging video processing without a live camera, you can use a local video file as the video source. This is useful for reproducible testing and development.

Using the CLI

Pass the --video-track-override option when running your agent:
uv run agent.py run --video-track-override=/path/to/video.mp4

Using the API

You can also set the video override programmatically:
agent = Agent(...)
agent.set_video_track_override_path("/path/to/video.mp4")
When a video override is set, the local video file plays in a loop at 30 FPS instead of any incoming video tracks from call participants. The track lifecycle remains intact (starts when a user joins, stops when they leave).