The Agent class is the central orchestrator that brings together all other components in the Vision Agents framework. It manages the conversation flow, handles real-time audio/video processing, coordinates responses, and integrates with external tools via MCP (Model Context Protocol) servers. It is the main interface for building AI-powered video and voice applications, and it supports both traditional STT/TTS workflows and modern realtime speech-to-speech models.
Constructor Parameters
- edge (EdgeTransport): The edge network provider for video and audio transport (you can choose any provider here).
- llm (LLM | AudioLLM | VideoLLM): The language model. It can be a text LLM, audio-capable (AudioLLM), video-capable (VideoLLM), or a combined Realtime model.
- agent_user (User): The agent’s user information (name, id, etc.).
- instructions (str): System instructions for the agent. Supports @file.md references to load instructions from markdown files (default: “Keep your replies short and dont use special characters.”).
- stt (Optional[STT]): Speech-to-text service (automatically disabled in realtime mode).
- tts (Optional[TTS]): Text-to-speech service (automatically disabled in realtime mode).
- turn_detection (Optional[TurnDetector]): Turn detection service for managing conversation turns. Automatically disabled in realtime mode. Ignored automatically if the configured STT plugin already provides built-in turn detection (e.g., elevenlabs.STT, deepgram.STT with eager_turn_detection).
- processors (Optional[List[Processor]]): List of processors for video/audio processing.
- avatar (Optional[Avatar]): Avatar provider for lip-synced video output (for example anam.Avatar(...)).
- mcp_servers (Optional[List[MCPBaseServer]]): MCP servers for external tool access.
- options (Optional[AgentOptions]): Configuration options, including the model directory path.
- broadcast_metrics (bool): Broadcast agent metrics to call participants via custom events (default: False).
- broadcast_metrics_interval (float): Interval in seconds between metrics broadcasts (default: 5.0).
- multi_speaker_filter (Optional[AudioFilter]): Filter for multi-speaker audio routing. Defaults to FirstSpeakerWinsFilter, which locks onto the first active speaker and drops other participants’ audio until that speaker finishes.
- tracer (Tracer): OpenTelemetry tracer for distributed tracing (default: trace.get_tracer("agents")).
- profiler (Optional[Profiler]): Performance profiler for the agent and its plugins.
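The default multi_speaker_filter behavior (lock onto the first active speaker, drop everyone else until that speaker finishes) can be pictured with a small self-contained sketch. The class below is a hypothetical stand-in, not the framework's FirstSpeakerWinsFilter: the class name, the method names, and the explicit end_of_turn call are all assumptions made for illustration.

```python
from typing import Optional


class ToyFirstSpeakerWins:
    """Hypothetical illustration of first-speaker-wins audio routing."""

    def __init__(self) -> None:
        # Participant currently holding the floor, if any.
        self.active_speaker: Optional[str] = None

    def accept(self, participant_id: str) -> bool:
        """Return True if audio from this participant should be forwarded."""
        if self.active_speaker is None:
            # Nobody holds the floor yet, so lock onto this speaker.
            self.active_speaker = participant_id
        return participant_id == self.active_speaker

    def end_of_turn(self, participant_id: str) -> None:
        """Release the lock once the active speaker finishes."""
        if participant_id == self.active_speaker:
            self.active_speaker = None
```

In this toy version, audio from a second participant is rejected until end_of_turn is called for the first one. How the real filter decides that a speaker has finished is not described in this section.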
Core Lifecycle Methods
async join(call: Call, participant_wait_timeout: Optional[float] = 10.0) -> AsyncIterator[None]
Joins a video call. Must be called as an async context manager.
The agent can join the call only once.
Once the call is ended, the agent closes itself.
Parameters
- call (Call): The call to join.
- participant_wait_timeout (Optional[float]): Timeout in seconds to wait for other participants to join before proceeding. If 0, do not wait at all. If None, wait forever. Default: 10.0.
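The three timeout cases (0, None, and a positive number) map naturally onto asyncio primitives. The following is a minimal sketch of those semantics only. The helper name wait_for_participant and the use of an asyncio.Event as the "participant joined" signal are assumptions; this is not how join() is implemented.

```python
import asyncio
from typing import Optional


async def wait_for_participant(
    joined: asyncio.Event, timeout: Optional[float] = 10.0
) -> bool:
    """Hypothetical helper: True if a participant joined within the timeout."""
    if timeout == 0:
        # 0 means "do not wait at all": report the current state and move on.
        return joined.is_set()
    try:
        # asyncio.wait_for treats timeout=None as "wait forever".
        await asyncio.wait_for(joined.wait(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        # Nobody joined in time; the caller proceeds anyway.
        return False
```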
async finish()
Waits for the call to end gracefully. Subscribes to the call ended event.
async close()
Cleans up all connections and resources. Safe to call multiple times.
async authenticate()
Authenticates the agent user with the edge provider. Idempotent — safe to call multiple times. Called automatically from join() and create_call(), so you usually don’t need to call it directly.
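The lifecycle rules above (join is an async context manager, an agent joins only once, close is safe to repeat) can be sketched with a stand-in class. ToyAgent and run_once are hypothetical names. Raising RuntimeError on a second join and closing on context exit are assumptions, since this section does not say how the real Agent reports or performs either.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator


class ToyAgent:
    """Hypothetical stand-in that mimics the documented lifecycle rules."""

    def __init__(self) -> None:
        self.joined_once = False
        self.closed = False
        self.close_calls = 0

    @asynccontextmanager
    async def join(self, call: str) -> AsyncIterator[None]:
        if self.joined_once:
            # Assumption: a second join is rejected with an exception.
            raise RuntimeError("this agent has already joined a call")
        self.joined_once = True
        try:
            yield  # the caller's `async with` body runs here
        finally:
            await self.close()

    async def close(self) -> None:
        self.close_calls += 1
        if self.closed:
            return  # already cleaned up, so repeated calls are harmless
        self.closed = True


async def run_once(agent: ToyAgent) -> None:
    async with agent.join("demo-call"):
        # Placeholder for waiting until the call ends (finish() in the real API).
        await asyncio.sleep(0)
```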
Response Methods
async simple_response(text: str, participant: Optional[Participant] = None, interrupt: bool = True)
Sends a text prompt to the active inference flow. The LLM generates a response, which is then spoken through TTS (or realtime audio output). With interrupt=True (the default), this request preempts an in-flight response.
async say(text: str, interrupt: bool = False)
Makes the agent speak text directly using TTS, bypassing the LLM entirely. Use interrupt=True to stop current output before speaking.
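The difference between the two response paths is whether the text passes through the model first. The sketch below contrasts them using stand-ins: ToyVoiceAgent, echo_llm, and the spoken list (which plays the role of TTS output) are all hypothetical, and the interrupt flag is left out entirely.

```python
import asyncio
from typing import Awaitable, Callable, List


class ToyVoiceAgent:
    """Hypothetical stand-in contrasting the two response paths."""

    def __init__(self, llm: Callable[[str], Awaitable[str]]) -> None:
        self.llm = llm
        self.spoken: List[str] = []  # stands in for TTS / realtime audio output

    async def simple_response(self, text: str) -> None:
        # The prompt goes through the model; the model's reply is what gets spoken.
        reply = await self.llm(text)
        self.spoken.append(reply)

    async def say(self, text: str) -> None:
        # The text is spoken verbatim; the model is never called.
        self.spoken.append(text)


async def echo_llm(prompt: str) -> str:
    """Fake model used only for this sketch."""
    return f"LLM reply to: {prompt}"


async def demo() -> List[str]:
    agent = ToyVoiceAgent(echo_llm)
    await agent.simple_response("Greet the user")
    await agent.say("Hello there")
    return agent.spoken


if __name__ == "__main__":
    print(asyncio.run(demo()))
```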
Monitoring Methods
idle_for() -> float
Returns the number of seconds since the last participant joined or left the call.
on_call_for() -> float
Returns the number of seconds since the agent joined the call.
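Both monitoring methods are elapsed-time readings against a stored timestamp. A minimal sketch of that bookkeeping follows. ToyCallClock and its method names are hypothetical, the injectable clock exists only to make the example deterministic, and starting the idle timer when the agent joins is an assumption.

```python
import time
from typing import Callable, Optional


class ToyCallClock:
    """Hypothetical tracker mirroring idle_for() and on_call_for()."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._joined_at: Optional[float] = None
        self._last_change_at: Optional[float] = None

    def agent_joined(self) -> None:
        self._joined_at = self._clock()
        # Assumption: the idle timer starts when the agent joins.
        self._last_change_at = self._joined_at

    def participant_changed(self) -> None:
        """Call whenever a participant joins or leaves."""
        self._last_change_at = self._clock()

    def on_call_for(self) -> float:
        if self._joined_at is None:
            return 0.0
        return self._clock() - self._joined_at

    def idle_for(self) -> float:
        if self._last_change_at is None:
            return 0.0
        return self._clock() - self._last_change_at
```

A typical use of readings like these would be a watchdog that leaves the call once the idle time passes some threshold, though that policy is up to the application.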
MCP Integration
The Agent supports Model Context Protocol (MCP) for external tool integration.
Event System
The Agent makes it easy to subscribe to events from every component. The event system merges plugin and core events, so you can listen to all of them in a single place, selecting each by its event type.
Core Events
- Agent Lifecycle Events: UserTurnStartedEvent, UserTurnEndedEvent, UserTranscriptEvent, AgentTurnStartedEvent, AgentTurnEndedEvent, AgentJoinedCallEvent, AgentLeftCallEvent, AgentFinishEvent
- Edge / Call Events: ParticipantJoinedEvent, ParticipantLeftEvent, CallEndedEvent, TrackAddedEvent, TrackRemovedEvent, AudioReceivedEvent
- LLM Events: LLMResponseChunkEvent, LLMResponseCompletedEvent, LLMResponseFinalEvent, LLMErrorEvent
- Tool Events: ToolStartEvent, ToolEndEvent
- Realtime Events: RealtimeConnectedEvent, RealtimeDisconnectedEvent
- STT Events: STTConnectedEvent, STTDisconnectedEvent, STTErrorEvent
- TTS Events: TTSSynthesisStartEvent, TTSSynthesisCompleteEvent, TTSConnectedEvent, TTSDisconnectedEvent, TTSErrorEvent
Event Subscription
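Listening "by type" usually means a handler declares which event class it wants and the dispatcher routes on that class. The sketch below shows the idea with a toy bus that reads the handler's type annotation. It is not the framework's event system: ToyEventBus, its subscribe and send methods, and the two dataclasses (which only borrow the event names listed above) are hypothetical stand-ins.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Dict, List, Type, get_type_hints


@dataclass
class ParticipantJoinedEvent:  # toy stand-in, not the framework's class
    user_id: str


@dataclass
class CallEndedEvent:  # toy stand-in, not the framework's class
    reason: str


Handler = Callable[[Any], Awaitable[None]]


class ToyEventBus:
    """Hypothetical bus that routes events to handlers by annotated type."""

    def __init__(self) -> None:
        self._handlers: Dict[Type, List[Handler]] = {}

    def subscribe(self, handler: Handler) -> Handler:
        """Register a handler under the type annotated on its first parameter."""
        hints = get_type_hints(handler)
        hints.pop("return", None)
        if not hints:
            raise TypeError("handler needs a type-annotated event parameter")
        event_type = next(iter(hints.values()))
        self._handlers.setdefault(event_type, []).append(handler)
        return handler

    async def send(self, event: Any) -> None:
        # Only handlers registered for this exact event type are called.
        for handler in self._handlers.get(type(event), []):
            await handler(event)


async def demo() -> List[str]:
    bus = ToyEventBus()
    seen: List[str] = []

    @bus.subscribe
    async def on_join(event: ParticipantJoinedEvent) -> None:
        seen.append(f"joined:{event.user_id}")

    await bus.send(ParticipantJoinedEvent(user_id="alice"))
    await bus.send(CallEndedEvent(reason="hangup"))  # no subscriber, so ignored
    return seen
```

Routing on the annotation keeps each handler's signature self-describing, at the cost of requiring every handler to be annotated.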
Debugging with local video files
For testing and debugging video processing without a live camera, you can use a local video file as the video source. This is useful for reproducible testing and development.
Using the CLI
Pass the --video-track-override option when running your agent.

