The Agent class is the central orchestrator that brings together all other components in the Vision Agents framework. It manages the conversation flow, handles real-time audio/video processing, coordinates responses, and integrates with external tools via MCP (Model Context Protocol) servers. As the main interface for building AI-powered video and voice applications, it supports both traditional STT/TTS workflows and modern realtime speech-to-speech models, making it flexible for a variety of use cases.
Constructor Parameters
- `edge` (EdgeTransport): The edge network provider for video & audio transport (you can choose any provider here)
- `llm` (LLM | AudioLLM | VideoLLM): The language model: a text LLM, audio-capable (AudioLLM), video-capable (VideoLLM), or a combined Realtime model
- `agent_user` (User): The agent's user information (name, id, etc.)
- `instructions` (str): System instructions for the agent. Supports `@file.md` references to load instructions from markdown files (default: "Keep your replies short and dont use special characters.")
- `stt` (Optional[STT]): Speech-to-text service (not needed for realtime mode)
- `tts` (Optional[TTS]): Text-to-speech service (not needed for realtime mode)
- `turn_detection` (Optional[TurnDetector]): Turn detection service for managing conversation turns. Ignored automatically if the configured STT plugin already provides built-in turn detection (e.g., `elevenlabs.STT`, or `deepgram.STT` with `eager_turn_detection`)
- `vad` (Optional[VAD]): Voice activity detection service
- `processors` (Optional[List[Processor]]): List of processors for video/audio processing
- `mcp_servers` (Optional[List[MCPBaseServer]]): MCP servers for external tool access
- `options` (Optional[AgentOptions]): Configuration options, including the model directory path
- `streaming_tts` (bool): Stream TTS audio chunks as sentences complete, rather than waiting for the full LLM response (default: `False`)
- `broadcast_metrics` (bool): Broadcast agent metrics to call participants via custom events (default: `False`)
- `broadcast_metrics_interval` (float): Interval in seconds between metrics broadcasts (default: `5.0`)
- `multi_speaker_filter` (Optional[AudioFilter]): Filter for multi-speaker audio routing. Defaults to `FirstSpeakerWinsFilter`, which locks onto the first active speaker and drops other participants' audio until that speaker finishes
- `tracer` (Tracer): OpenTelemetry tracer for distributed tracing (default: `trace.get_tracer("agents")`)
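As a minimal sketch of wiring these parameters together for a traditional STT/TTS pipeline (the import paths and plugin names below are assumptions, so check your installed plugins for the actual modules):

```python
# Hypothetical sketch: module paths and plugin classes are assumptions,
# not verified against a specific Vision Agents release.
from vision_agents.core import Agent, User
from vision_agents.plugins import getstream, openai, deepgram, elevenlabs

agent = Agent(
    edge=getstream.Edge(),                        # edge transport provider
    agent_user=User(name="AI assistant", id="agent"),
    instructions="@instructions.md",              # load instructions from a markdown file
    llm=openai.LLM(),                             # text LLM (STT/TTS pipeline)
    stt=deepgram.STT(),                           # speech-to-text
    tts=elevenlabs.TTS(),                         # text-to-speech
)
```

With a realtime speech-to-speech model, the `stt` and `tts` arguments would be omitted, since the model handles audio directly.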
Core Lifecycle Methods
async join(call: Call, participant_wait_timeout: Optional[float] = 10.0) -> AsyncIterator[None]
Joins a video call. Must be used as an async context manager. The agent can join a call only once; once the call ends, the agent closes itself.
Parameters
- `call` (Call): The call to join.
- `participant_wait_timeout` (Optional[float]): Timeout in seconds to wait for other participants to join before proceeding. If `0`, do not wait at all; if `None`, wait forever (default: `10.0`).
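A typical lifecycle, sketched against the signature above (the `async with` form follows from the `AsyncIterator[None]` return type; your app's call-creation code will differ):

```python
# Sketch only: assumes `agent` and `call` were created as in your app setup.
async def run(agent, call):
    await agent.create_user()  # ensure the agent user exists on the edge provider
    async with agent.join(call, participant_wait_timeout=10.0):
        await agent.finish()   # wait for the call to end gracefully
```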
async finish()
Waits for the call to end gracefully. Subscribes to the call ended event.
async close()
Cleans up all connections and resources. Safe to call multiple times.
async create_user()
Creates the agent user in the edge provider if required.
Response Methods
async simple_response(text: str, participant: Optional[Participant] = None)
Sends a text prompt to the LLM for processing. The LLM generates a response which is then sent through TTS (or directly as audio in realtime mode).
async simple_audio_response(pcm: PcmData, participant: Optional[Participant] = None)
Sends raw PCM audio directly to the LLM for processing. Only works with AudioLLM implementations (e.g., realtime models).
async say(text: str, user_id: Optional[str] = None, metadata: Optional[dict] = None)
Makes the agent speak text directly using TTS, bypassing the LLM entirely. Useful for greetings, status updates, or scripted responses.
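The difference between the two text methods is whether the LLM is involved. A short sketch (assuming an `agent` already joined to a call):

```python
# Sketch: method names come from the reference above; the prompt text is illustrative.
async def greet_and_answer(agent):
    # Scripted speech: goes straight to TTS, the LLM is never invoked.
    await agent.say("Welcome! Give me a second to look that up.")

    # LLM-backed response: the model generates text, which is then
    # spoken via TTS (or emitted directly as audio in realtime mode).
    await agent.simple_response("Summarize the key points of the meeting so far.")
```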
Monitoring Methods
idle_for() -> float
Returns the number of seconds since the last participant joined or left the call.
on_call_for() -> float
Returns the number of seconds since the agent joined the call.
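Both timers behave like simple monotonic-clock deltas. A self-contained sketch of the same semantics (for illustration only, not the framework's internal implementation):

```python
import time

class CallTimers:
    """Tracks how long the agent has been on a call and how long the
    participant roster has been unchanged (mirroring the semantics of
    the Agent monitoring methods above)."""

    def __init__(self):
        now = time.monotonic()
        self._joined_at = now            # when the agent joined the call
        self._last_roster_change = now   # last participant join/leave

    def participant_changed(self):
        # Call this whenever a participant joins or leaves.
        self._last_roster_change = time.monotonic()

    def idle_for(self) -> float:
        # Seconds since the last participant joined or left.
        return time.monotonic() - self._last_roster_change

    def on_call_for(self) -> float:
        # Seconds since the agent joined the call.
        return time.monotonic() - self._joined_at
```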
MCP Integration
The Agent supports the Model Context Protocol (MCP) for external tool integration via the `mcp_servers` constructor parameter.
Event System
The Agent makes it easy for developers to subscribe and listen to events happening across all components. The event system merges events from the plugins and the core, allowing you to listen for any event in a single place by its type.
Core Events
- Audio Events: `AudioReceivedEvent`, `TrackAddedEvent`, `TrackRemovedEvent`
- Transcript Events: `STTTranscriptEvent`, `STTPartialTranscriptEvent`
- LLM Events: `LLMRequestStartedEvent`, `LLMResponseChunkEvent`, `LLMResponseCompletedEvent`, `LLMErrorEvent`
- Tool Events: `ToolStartEvent`, `ToolEndEvent`
- Turn Detection Events: `TurnStartedEvent`, `TurnEndedEvent`
- Realtime Events: `RealtimeConnectedEvent`, `RealtimeDisconnectedEvent`, `RealtimeAudioInputEvent`, `RealtimeAudioOutputEvent`, `RealtimeAudioOutputDoneEvent`, `RealtimeResponseEvent`, `RealtimeConversationItemEvent`, `RealtimeUserSpeechTranscriptionEvent`, `RealtimeAgentSpeechTranscriptionEvent`, `RealtimeErrorEvent`
- Agent Events: `AgentSayEvent`, `AgentSayStartedEvent`, `AgentSayCompletedEvent`, `AgentSayErrorEvent`
- Call Events: `CallEndedEvent`, `CallSessionParticipantJoinedEvent`
Event Subscription
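The exact subscription API is framework-specific, but the underlying pattern is dispatch by event type. A self-contained sketch of that pattern (illustrative only; `EventBus` and the dataclass definitions here are not the actual Vision Agents API):

```python
from dataclasses import dataclass
from collections import defaultdict
from typing import Callable, Type

# Stand-in event types for illustration; the framework provides its own.
@dataclass
class STTTranscriptEvent:
    text: str

@dataclass
class CallEndedEvent:
    reason: str

class EventBus:
    """Minimal type-keyed event bus: handlers are registered per event
    class and invoked only for events of that exact type."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type: Type, handler: Callable):
        self._handlers[event_type].append(handler)

    def emit(self, event):
        for handler in self._handlers[type(event)]:
            handler(event)

bus = EventBus()
seen = []
bus.subscribe(STTTranscriptEvent, lambda e: seen.append(e.text))
bus.emit(STTTranscriptEvent(text="hello"))
bus.emit(CallEndedEvent(reason="hangup"))  # no handler registered: ignored
```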
Debugging with local video files
For testing and debugging video processing without a live camera, you can use a local video file as the video source. This is useful for reproducible testing and development.
Using the CLI
Pass the `--video-track-override` option when running your agent:
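For example (the entry-point filename and video path below are placeholders):

```shell
# Replay a local clip instead of a live camera feed (paths are examples).
python my_agent.py --video-track-override ./fixtures/sample_call.mp4
```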

