Vision Agents ships with 30+ plugins that connect AI providers to your real-time voice and video applications. Each plugin wraps a provider’s API with a consistent interface — swap providers without rewriting your agent logic.
Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.

Which plugin do I need?

Pick based on what your agent needs to do:
| I want to… | Start here | What you get |
| --- | --- | --- |
| Handle calls and respond naturally by voice | Realtime | End-to-end voice agent with multimodal support, unified under one plugin and model |
| Connect to my own tools, APIs, or knowledge base | Language Models | Function calling, RAG, and full control over STT/TTS choices |
| Transcribe what users say in real time | Speech-to-Text | Streaming transcription, some with built-in turn detection |
| Give my agent a distinct, natural voice | Text-to-Speech | Cloud and local options, from expressive to ultra-low latency |
| See and understand what's on camera | Vision & Video | Object detection, video analysis, and style transfer |
| Put a face on my agent | Avatars | Real-time lip-synced visual characters |
| Make conversations feel natural, not robotic | Turn Detection | Smart interruption handling and silence detection |
| Run open-source models on my own infrastructure | Infrastructure | Self-hosted inference, model routing, and vector search |

Installation

Plugins install as extras. Add only the ones you need:
```bash
uv add "vision-agents[gemini,deepgram,elevenlabs]"
```
See the Installation guide for the full list of available extras.

Browse by Category

Language Models

Text generation with function calling. Requires separate STT/TTS plugins.
| Provider | Notes |
| --- | --- |
| Gemini | Built-in tools: search, code execution, RAG |
| OpenAI | Responses API (GPT-5+) and ChatCompletions |
| xAI (Grok) | Advanced reasoning, function calling |
| OpenRouter | Unified API for Claude, Gemini, GPT, and more |
| Kimi AI | OpenAI-compatible via ChatCompletions |
| Qwen | DashScope API via ChatCompletions |
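With a language-model plugin, the agent runs as a pipeline: speech-to-text feeds the LLM, and the reply is voiced by text-to-speech. The sketch below shows that flow with stub classes; every name here is illustrative, not the actual Vision Agents interface:

```python
# Conceptual sketch of the STT -> LLM -> TTS pipeline that a language-model
# plugin sits in. All classes are illustrative stand-ins, not the real
# Vision Agents plugin interfaces.

class StubSTT:
    """Turns incoming audio (here, just bytes) into text."""
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")  # pretend the audio is already text

class StubLLM:
    """Generates a reply from the transcribed text."""
    def generate(self, prompt: str) -> str:
        return f"You said: {prompt}"

class StubTTS:
    """Turns the reply text back into audio bytes."""
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")

def run_turn(stt, llm, tts, audio: bytes) -> bytes:
    """One conversational turn: audio in, audio out."""
    text = stt.transcribe(audio)
    reply = llm.generate(text)
    return tts.synthesize(reply)

audio_out = run_turn(StubSTT(), StubLLM(), StubTTS(), b"hello")
print(audio_out)  # b'You said: hello'
```

This is the trade-off versus Realtime plugins below: three independently swappable stages and full control, at the cost of extra hops in the loop.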

Realtime

End-to-end speech-to-speech with built-in STT/TTS. Lowest latency, simplest setup.
| Provider | Notes |
| --- | --- |
| Gemini Realtime | WebSocket, optional video, built-in VAD |
| OpenAI Realtime | WebRTC, built-in STT/TTS |
| Qwen Realtime | Native audio I/O, video support |
| AWS Bedrock | Amazon Nova models, auto session management |

Speech-to-Text

Real-time transcription. Some include built-in turn detection.
| Provider | Notes |
| --- | --- |
| Deepgram | Nova-3, built-in turn detection |
| ElevenLabs | Scribe v2, ~150ms latency, built-in VAD |
| AssemblyAI | Punctuation-based turn detection |
| Fish Audio | Auto language detection |
| Mistral Voxtral | WebSocket streaming, requires separate turn detection |
| Fast-Whisper | Local, CPU/GPU accelerated |
| Wizper | Whisper v3, on-the-fly translation |

Text-to-Speech

Voice synthesis for agent responses.
| Provider | Notes |
| --- | --- |
| ElevenLabs | Highly realistic, multilingual |
| Cartesia | Low-latency Sonic model |
| Deepgram | Aura-2, low latency |
| OpenAI | gpt-4o-mini-tts, streaming |
| Fish Audio | Prosody control, voice cloning |
| Inworld | Expressive game character voices |
| Kokoro | Local, runs on CPU, no API key |
| Pocket TTS | Local, ~200ms latency, voice cloning |
| AWS Polly | Standard and neural engines |

Vision & Video

Video understanding, object detection, and video transformation.
| Provider | Notes |
| --- | --- |
| Moondream | Zero-shot detection, VQA, cloud or local |
| NVIDIA | Cosmos Reason2, real-time video understanding |
| Roboflow | Pre-trained and custom detection models |
| Ultralytics YOLO | Pose estimation, object detection |
| Decart | Real-time AI video style transfer |

Avatars

Visual AI characters with synchronized lip-sync.
| Provider | Notes |
| --- | --- |
| HeyGen | Realistic AI avatars, automatic lip-sync |
| LemonSlice | Real-time interactive avatars |

Turn Detection

Controls when the agent should start and stop speaking.
| Provider | Notes |
| --- | --- |
| Smart Turn | Silero VAD + Whisper features |
| Vogent | Neural turn completion prediction |
Deepgram and ElevenLabs STT include built-in turn detection — no separate plugin needed.
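The simplest form of turn detection is silence-based: when audio energy stays below a threshold for long enough, the agent assumes the user has finished speaking. The sketch below shows only that baseline idea; the Smart Turn and Vogent plugins use learned models rather than a fixed threshold:

```python
# Illustrative silence-based end-of-turn detection. Each frame is a
# per-frame energy value; a sustained run of low-energy frames marks
# the end of the user's turn.

def detect_turn_end(frames, energy_threshold=0.01, silence_frames_needed=3):
    """Return the frame index where the turn ends, or None if still speaking."""
    silent_run = 0
    for i, energy in enumerate(frames):
        if energy < energy_threshold:
            silent_run += 1
            if silent_run >= silence_frames_needed:
                return i
        else:
            silent_run = 0  # speech resumed, reset the counter
    return None

# Speech for four frames, then sustained silence:
print(detect_turn_end([0.5, 0.4, 0.3, 0.2, 0.001, 0.002, 0.001]))  # 6
```

Fixed thresholds misfire on mid-sentence pauses, which is why neural turn prediction (and the built-in detection in Deepgram and ElevenLabs STT) exists.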

Infrastructure

Inference platforms and data services for running models on your own terms.
| Provider | Notes |
| --- | --- |
| Baseten | OpenAI-compatible endpoints for open-source models |
| HuggingFace Inference | Unified API routing to Together, Groq, Cerebras, and more |
| TurboPuffer | Vector database for RAG with hybrid search |

Consistent Interface

Plugins of the same type share a common interface — swap providers in one line:
```python
# Any STT plugin works the same way
stt = deepgram.STT()
stt = elevenlabs.STT()
stt = fish.STT()

# Any TTS plugin works the same way
tts = elevenlabs.TTS()
tts = cartesia.TTS()
tts = kokoro.TTS()

# Any LLM plugin works the same way
llm = gemini.LLM("gemini-3-flash-preview")
llm = openai.LLM(model="gpt-5.4")
llm = openrouter.LLM(model="anthropic/claude-sonnet-4")
```
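The swap works because same-type plugins satisfy one shared interface, so agent code never depends on a specific provider. A minimal, self-contained sketch of that idea using `typing.Protocol` (the class and method names here are illustrative, not the actual Vision Agents base classes):

```python
from typing import Protocol

class TTSLike(Protocol):
    """Anything with a synthesize() method counts as a TTS plugin here."""
    def synthesize(self, text: str) -> bytes: ...

# Two illustrative providers sharing the same interface:
class ProviderA:
    def synthesize(self, text: str) -> bytes:
        return b"A:" + text.encode("utf-8")

class ProviderB:
    def synthesize(self, text: str) -> bytes:
        return b"B:" + text.encode("utf-8")

def speak(tts: TTSLike, text: str) -> bytes:
    # Agent code depends only on the shared interface,
    # so swapping providers is a one-line change at construction time.
    return tts.synthesize(text)

print(speak(ProviderA(), "hi"))  # b'A:hi'
print(speak(ProviderB(), "hi"))  # b'B:hi'
```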

Creating Custom Plugins

Don’t see your provider? Build your own plugin to connect additional services. See the Create Your Own Plugin guide.
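In outline, a custom plugin implements the shared interface for its type and wraps the provider's API behind it. The sketch below shows that shape for STT; `BaseSTT` is a stand-in defined here for illustration, and the real base class and lifecycle hooks are documented in the plugin guide:

```python
# Hypothetical shape of a custom STT plugin. BaseSTT is an illustrative
# stand-in, not the actual Vision Agents base class.
from abc import ABC, abstractmethod

class BaseSTT(ABC):
    """Stand-in for the framework's STT interface."""
    @abstractmethod
    def transcribe(self, audio: bytes) -> str: ...

class MyProviderSTT(BaseSTT):
    """Wraps a fictional provider's API behind the shared STT interface."""
    def __init__(self, api_key: str):
        self.api_key = api_key

    def transcribe(self, audio: bytes) -> str:
        # A real plugin would stream audio to the provider's endpoint;
        # this just echoes input to keep the sketch self-contained.
        return audio.decode("utf-8")

stt = MyProviderSTT(api_key="demo")
print(stt.transcribe(b"hello world"))  # hello world
```

Because the agent only sees the interface, a plugin built this way drops into existing agent code the same as the bundled providers.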