Vision Agents requires a Stream account for real-time transport.
Before You Build
Many providers support OpenAI-compatible APIs. Before writing a custom plugin, check whether an existing plugin with a custom `base_url` already covers your provider. Write a custom plugin only when:
- The provider uses a proprietary API format
- You need provider-specific features not exposed through Chat Completions
- The provider requires custom authentication or connection handling
Plugin Categories
| Category | Base Class | Abstract Methods | Example |
|---|---|---|---|
| STT | STT | process_audio() | Deepgram |
| TTS | TTS | stream_audio(), stop_audio() | ElevenLabs |
| LLM | LLM | simple_response() | OpenRouter |
| VLM | VideoLLM | watch_video_track(), stop_watching_video_track() | NVIDIA |
| Realtime | Realtime | connect(), simple_audio_response() | Gemini Live |
| Turn Detection | TurnDetector | process_audio() | SmartTurn |
| Processor | Processor / VideoProcessor / AudioProcessor | close() | Ultralytics |
Quickstart Template
Create your plugin in `plugins/acme/`:
- `pyproject.toml`
- `__init__.py`
- `stt.py`
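A minimal `stt.py` might look like the sketch below. The real `STT` base class ships with the framework; the stand-in defined here exists only to keep the example self-contained, and the Acme provider is hypothetical:

```python
import asyncio


class STT:
    """Stand-in for the framework's STT base class (illustration only)."""

    def __init__(self):
        self._listeners = []

    def on_transcript(self, callback):
        self._listeners.append(callback)

    def _emit_transcript_event(self, text: str):
        for callback in self._listeners:
            callback(text)


class AcmeSTT(STT):
    """Streams PCM audio to a hypothetical Acme speech API."""

    def __init__(self, api_key: str):
        super().__init__()
        self.api_key = api_key

    async def process_audio(self, pcm: bytes) -> None:
        # A real plugin would send `pcm` to the provider and await results.
        text = f"<transcribed {len(pcm)} bytes>"
        self._emit_transcript_event(text)


async def main():
    stt = AcmeSTT(api_key="sk-test")
    stt.on_transcript(print)
    await stt.process_audio(b"\x00\x00" * 160)  # 10 ms of 16 kHz mono silence

asyncio.run(main())
```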
Base Class Interfaces
STT
TTS
LLM
VLM (Video Language Model)
VLM plugins process video frames alongside text. The framework provides `VideoForwarder` for frame management.
- VideoForwarder: Manages frame buffering and distributes to multiple handlers at different FPS
- Shared forwarder: Multiple plugins can share one forwarder to avoid duplicate frame processing
- Frame buffer: Store recent frames for context (configurable size)
- FPS control: Request frames at the rate your model needs (1-30 fps typical)
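The real `VideoForwarder` API lives in the framework; the toy forwarder below only illustrates the idea of one frame source fanning out to handlers at different frame rates, with a small buffer of recent frames (all names are illustrative):

```python
class MiniForwarder:
    """Distributes frames from one 30 fps source to handlers at lower rates."""

    def __init__(self, source_fps: int = 30, buffer_size: int = 10):
        self.source_fps = source_fps
        self.buffer = []          # most recent frames, kept for context
        self.buffer_size = buffer_size
        self._handlers = []       # (callback, keep_every_nth)
        self._count = 0

    def add_handler(self, callback, fps: int):
        # A 1 fps handler sees every 30th frame of a 30 fps source.
        self._handlers.append((callback, max(1, self.source_fps // fps)))

    def push(self, frame):
        self.buffer.append(frame)
        if len(self.buffer) > self.buffer_size:
            self.buffer.pop(0)
        for callback, nth in self._handlers:
            if self._count % nth == 0:
                callback(frame)
        self._count += 1


forwarder = MiniForwarder()
seen = []
forwarder.add_handler(seen.append, fps=1)   # a VLM that only needs 1 fps
for i in range(60):                         # two seconds of 30 fps video
    forwarder.push(i)
print(seen)  # frames 0 and 30 only
```

Sharing one forwarder among several handlers is what avoids decoding or copying the same frame stream once per plugin.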
Realtime (Speech-to-Speech)
Audio Utilities
Vision Agents provides utilities to simplify audio handling in STT and TTS plugins.

PcmData Resampling
Most STT providers expect 16kHz mono audio. Use the built-in resampling:

TTS Output Format
The TTS base class handles output format conversion automatically:

AudioQueue for Buffering
For plugins that need to buffer audio (e.g., accumulating before processing):

Function Calling
To support function calling in your LLM plugin, override these methods:
- Function registration via `@llm.register_function()`
- Tool execution with `_execute_tools()` (concurrent, with timeout)
- Tool call deduplication by `(name, arguments)`
- Multi-round tool calling (configurable via `max_tool_rounds`)
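The bullet points above can be sketched roughly as follows; the method names mirror the list, but the bodies are illustrative stand-ins, not the framework's actual implementation:

```python
import asyncio
import json


class ToolCallingLLM:
    """Sketch of function registration, dedup, and concurrent tool execution."""

    def __init__(self, max_tool_rounds: int = 3):
        self.max_tool_rounds = max_tool_rounds
        self._functions = {}

    def register_function(self):
        # Usable as @llm.register_function()
        def wrap(fn):
            self._functions[fn.__name__] = fn
            return fn
        return wrap

    async def _execute_tools(self, calls, timeout: float = 5.0):
        # Deduplicate by (name, arguments) so repeated calls run only once.
        seen, unique = set(), []
        for call in calls:
            key = (call["name"], json.dumps(call["arguments"], sort_keys=True))
            if key not in seen:
                seen.add(key)
                unique.append(call)

        async def run(call):
            fn = self._functions[call["name"]]
            return await asyncio.wait_for(fn(**call["arguments"]), timeout)

        # Execute the surviving calls concurrently, each with its own timeout.
        return await asyncio.gather(*(run(c) for c in unique))


llm = ToolCallingLLM()

@llm.register_function()
async def get_weather(city: str) -> str:
    return f"sunny in {city}"

calls = [
    {"name": "get_weather", "arguments": {"city": "Boulder"}},
    {"name": "get_weather", "arguments": {"city": "Boulder"}},  # duplicate
]
print(asyncio.run(llm._execute_tools(calls)))  # ['sunny in Boulder']
```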
Event Emission
Base classes provide helper methods for common events:

| Base Class | Helper Methods |
|---|---|
| STT | _emit_transcript_event(), _emit_partial_transcript_event(), _emit_turn_started_event(), _emit_turn_ended_event() |
| TTS | _emit_chunk() (called automatically by send()) |
| Realtime | _emit_connected_event(), _emit_disconnected_event(), _emit_audio_output_event() |
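For example, a TTS `send()` that slices synthesized audio and emits each slice via `_emit_chunk()` could be sketched like this; the base-class internals shown are simplified stand-ins, not the framework's real code:

```python
class TTS:
    """Stand-in for the framework's TTS base class (illustration only)."""

    CHUNK_SIZE = 960  # 20 ms of 24 kHz mono 16-bit PCM

    def __init__(self):
        self._chunks = []

    def _emit_chunk(self, chunk: bytes):
        # The real helper emits an audio event into the agent pipeline.
        self._chunks.append(chunk)

    def send(self, audio: bytes):
        # send() slices the synthesized audio and emits one event per chunk.
        for start in range(0, len(audio), self.CHUNK_SIZE):
            self._emit_chunk(audio[start:start + self.CHUNK_SIZE])


tts = TTS()
tts.send(b"\x00" * 2400)  # 50 ms of audio
print(len(tts._chunks))   # 3 chunks: 960 + 960 + 480 bytes
```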
Gotchas & Best Practices
Connection Lifecycle
Owned vs shared clients: Track whether your plugin created the client or received it:

Cleanup Order
Follow this order in `close()` to prevent deadlocks:
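A `close()` that respects both concerns, owned-vs-shared clients and cleanup ordering, might look like this sketch (attribute names are illustrative):

```python
import asyncio


class AcmePlugin:
    def __init__(self, client=None):
        # Track ownership: only close what this plugin created.
        self._owns_client = client is None
        self.client = client if client is not None else object()  # stand-in SDK client
        self._tasks: list[asyncio.Task] = []
        self._closed = False

    async def close(self):
        if self._closed:          # make close() idempotent
            return
        self._closed = True
        # 1. Stop accepting new work first (the _closed flag above).
        # 2. Cancel in-flight tasks and wait for them to finish.
        for task in self._tasks:
            task.cancel()
        await asyncio.gather(*self._tasks, return_exceptions=True)
        # 3. Release the client last, and only if this plugin owns it.
        if self._owns_client:
            self.client = None
```

Closing a shared client here would break every other component still using it, which is why the ownership flag is checked last.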
Error Handling
Temporary errors (network timeouts, transient API errors): Emit and continue:

Threading for Blocking Operations
Some SDKs have blocking calls. Use a thread pool:

Concurrency Control
Prevent concurrent processing when your provider doesn’t support it:

Sample Rate Requirements
| Plugin Type | Expected Input | Standard Rate |
|---|---|---|
| STT | Resampled audio | 16kHz mono |
| TTS | Output configurable | 16-48kHz |
| Realtime | Raw PCM | 24kHz or 48kHz (provider-specific) |
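To make the 16kHz STT requirement concrete: for the common 48 kHz to 16 kHz mono case the ratio is an integer, so the conversion reduces to 3:1 decimation. The sketch below is a naive stdlib version; a real plugin should use the framework's `PcmData` resampling (or a proper resampler with low-pass filtering) instead:

```python
import array


def downsample_48k_to_16k(pcm: bytes) -> bytes:
    """Naive 3:1 decimation by averaging; real code should low-pass filter first."""
    samples = array.array("h", pcm)          # 16-bit signed PCM
    out = array.array("h")
    for i in range(0, len(samples) - 2, 3):
        out.append((samples[i] + samples[i + 1] + samples[i + 2]) // 3)
    return out.tobytes()


one_ms_at_48k = array.array("h", [300] * 48).tobytes()
resampled = downsample_48k_to_16k(one_ms_at_48k)
print(len(resampled) // 2)  # 16 samples: 1 ms at 16 kHz
```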
Reconnection with Backoff
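A reconnect loop with jittered exponential backoff might look like this sketch, where the `connect` coroutine stands in for your provider's WebSocket handshake:

```python
import asyncio
import random


async def reconnect_with_backoff(connect, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry connect() with jittered exponential backoff; re-raise on final failure."""
    for attempt in range(max_attempts):
        try:
            return await connect()
        except (ConnectionError, asyncio.TimeoutError):
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            await asyncio.sleep(delay * random.uniform(0.5, 1.0))  # jitter


# Usage sketch: a connection that fails twice, then succeeds.
attempts = {"n": 0}

async def flaky_connect():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("handshake failed")
    return "socket"

print(asyncio.run(reconnect_with_backoff(flaky_connect, base_delay=0.01)))
```

The jitter keeps many clients that disconnected at the same moment from reconnecting in lockstep.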
For WebSocket-based plugins:

Testing
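Plugin tests usually feed synthetic PCM and assert on the emitted events, with the provider client stubbed out so no network is hit. A pytest-style sketch (the `FakeSTT` plugin is hypothetical; real tests would import your plugin and monkeypatch its client):

```python
import asyncio


class FakeSTT:
    """Hypothetical plugin under test; real tests import your plugin instead."""

    def __init__(self):
        self.transcripts = []

    async def process_audio(self, pcm: bytes) -> None:
        # In a real test, the provider call is stubbed so this stays offline.
        self.transcripts.append(f"<{len(pcm)} bytes>")


def test_process_audio_emits_transcript():
    stt = FakeSTT()
    silence = b"\x00\x00" * 160  # 10 ms of 16 kHz mono PCM
    asyncio.run(stt.process_audio(silence))
    assert stt.transcripts == ["<320 bytes>"]


test_process_audio_emits_transcript()  # pytest would collect and run this
```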
Contribution Checklist
- Implement required abstract methods
- Add tests with reasonable coverage
- Pass `uv run pre-commit run --all-files`
- Add a `README.md` documenting usage and events
- Open a PR to the Vision Agents repo

