Introduction to Integrations

Vision Agents ships with 30+ plugins that connect AI providers to your real-time voice and video applications. Each plugin wraps a provider’s API in a consistent interface, so you can swap providers without rewriting your agent logic.

Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.

Which plugin do I need?

Pick based on what your agent needs to do:

I want to…	Start here	What you get
Handle calls and respond naturally by voice	Realtime	End-to-end voice agent with multimodal support, unified under one plugin and model
Connect to my own tools, APIs, or knowledge base	Language Models	Function calling, RAG, and full control over STT/TTS choices
Transcribe what users say in real time	Speech-to-Text	Streaming transcription, some with built-in turn detection
Give my agent a distinct, natural voice	Text-to-Speech	Cloud and local options, from expressive to ultra-low latency
See and understand what’s on camera	Vision & Video	Object detection, video analysis, and style transfer
Put a face on my agent	Avatars	Real-time lip-synced visual characters
Make conversations feel natural, not robotic	Turn Detection	Smart interruption handling and silence detection
Run open-source models on my own infrastructure	Infrastructure	Self-hosted inference, model routing, and vector search
Connect users to my agent over WebRTC	Edge Transport	Stream’s global edge network — sub-500ms latency with frontend SDKs
Deploy agents over Tencent’s network in China	Edge Transport	Alternative transport layer with low latency in mainland China

Installation

Plugins install as extras. Add only the ones you need:

uv add "vision-agents[gemini,deepgram,elevenlabs]"

See the Installation guide for the full list of available extras.
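
To confirm which extras actually landed in your environment, a small check like the one below may help. It is a minimal sketch that assumes each extra exposes a module at `vision_agents.plugins.<name>`; that module path and the helper name `installed_plugins` are assumptions for illustration, so adjust them to match your install.

```python
import importlib.util


def installed_plugins(names, package="vision_agents.plugins"):
    """Return the subset of `names` whose plugin module can be located.

    Assumption: each extra installs a module at `<package>.<name>`.
    """
    found = []
    for name in names:
        try:
            spec = importlib.util.find_spec(f"{package}.{name}")
        except ImportError:
            # Raised when the parent package itself is missing or not a package.
            spec = None
        if spec is not None:
            found.append(name)
    return found


if __name__ == "__main__":
    print(installed_plugins(["gemini", "deepgram", "elevenlabs"]))
```

The helper only locates modules and never imports the plugins themselves, so it does not need any API keys.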

Browse by Category

Language Models

Text generation with function calling. Requires separate STT/TTS plugins.

Provider	Notes
Anthropic (Claude)	Messages API, streaming, function calling
Gemini	Built-in tools: search, code execution, RAG
OpenAI	Responses API (GPT-5+) and ChatCompletions
xAI (Grok)	Advanced reasoning, function calling
OpenRouter	Unified API for Claude, Gemini, GPT, and more
Kimi AI	OpenAI-compatible via ChatCompletions
Qwen	DashScope API via ChatCompletions

Realtime

End-to-end speech-to-speech with built-in STT/TTS. Lowest latency, simplest setup.

Provider	Notes
Gemini Realtime	WebSocket, optional video, built-in VAD
Inworld Realtime	WebRTC, protocol-compatible with OpenAI
OpenAI Realtime	WebRTC, built-in STT/TTS
Qwen Realtime	Native audio I/O, video support
xAI Realtime	WebSocket, server VAD, web + X search
AWS Bedrock	Amazon Nova models, auto session management

Speech-to-Text

Real-time transcription. Some include built-in turn detection.

Provider	Notes
Deepgram	Nova-3, built-in turn detection
ElevenLabs	Scribe v2, ~150ms latency, built-in VAD
AssemblyAI	Punctuation-based turn detection
Fish Audio	Auto language detection
Mistral Voxtral	WebSocket streaming, requires separate turn detection
Fast-Whisper	Local, CPU/GPU accelerated
Wizper	Whisper v3, on-the-fly translation

Text-to-Speech

Voice synthesis for agent responses.

Provider	Notes
ElevenLabs	Highly realistic, multilingual
Cartesia	Low-latency Sonic model
Deepgram	Aura-2, low-latency
OpenAI	gpt-4o-mini-tts, streaming
Fish Audio	Prosody control, voice cloning
Inworld	Expressive game character voices
Kokoro	Local, runs on CPU, no API key
Pocket TTS	Local, ~200ms latency, voice cloning
xAI	Five expressive voices, speech tags
AWS Polly	Standard and neural engines

Vision & Video

Video understanding, object detection, and video transformation.

Provider	Notes
Moondream	Zero-shot detection, VQA, cloud or local
NVIDIA	Cosmos Reason2, real-time video understanding
Roboflow	Pre-trained and custom detection models
Ultralytics YOLO	Pose estimation, object detection
Decart	Real-time AI video style transfer

Avatars

Visual AI characters with real-time lip-sync.

Provider	Notes
Anam	Real-time conversational avatars
LiveAvatar	Realistic AI avatars (HeyGen), automatic lip-sync
LemonSlice	Real-time interactive avatars

Turn Detection

Controls when the agent should start and stop speaking.

Provider	Notes
Smart Turn	Silero VAD + Whisper features
Vogent	Neural turn completion prediction

Deepgram and ElevenLabs STT include built-in turn detection, so no separate plugin is needed.
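
The rule in that note can be captured as a small lookup when you assemble an agent from configuration. This is a sketch only: the helper name and the provider keys are hypothetical, and the set contains just the two providers the note names. Extend it after checking a provider's page (the Speech-to-Text table also lists punctuation-based turn detection for AssemblyAI, for example).

```python
# Providers named in the note as shipping built-in turn detection.
# Hypothetical keys for illustration; not identifiers from the library.
BUILT_IN_TURN_DETECTION = {"deepgram", "elevenlabs"}


def needs_turn_detection_plugin(stt_provider: str) -> bool:
    """Return True when a separate turn detection plugin should be added.

    Providers not in the set are conservatively treated as needing one.
    """
    return stt_provider.strip().lower() not in BUILT_IN_TURN_DETECTION


if __name__ == "__main__":
    for provider in ("deepgram", "ElevenLabs", "mistral"):
        print(provider, needs_turn_detection_plugin(provider))
```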

Infrastructure

Inference platforms and data services for running models on your own terms.

Provider	Notes
Baseten	OpenAI-compatible endpoints for open-source models
HuggingFace Inference	Unified API routing to Together, Groq, Cerebras, and more
TurboPuffer	Vector database for RAG with hybrid search

Edge Transport

Real-time transport layers that connect users to your agent. Stream is the default; the others cover local devices and specific regions.

Provider	Notes
Stream Video RTC	Default transport — global WebRTC, chat-backed conversation, frontend SDKs
Local transport	Microphone, speakers, and camera as the agent edge
Tencent RTC	Low-latency in China, frontend SDKs

Consistent Interface

Plugins of the same type share a common interface, so swapping providers is a one-line change:

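# Import path is an assumption; adjust it if your install differs
from vision_agents.plugins import (
    cartesia, deepgram, elevenlabs, fish, gemini, kokoro, openai, openrouter,
)

# Any STT plugin works the same way (keep one of these lines)
stt = deepgram.STT()
stt = elevenlabs.STT()
stt = fish.STT()

# Any TTS plugin works the same way (keep one of these lines)
tts = elevenlabs.TTS()
tts = cartesia.TTS()
tts = kokoro.TTS()

# Any LLM plugin works the same way (keep one of these lines)
llm = gemini.LLM("gemini-3-flash-preview")
llm = openai.LLM(model="gpt-5.4")
llm = openrouter.LLM(model="anthropic/claude-sonnet-4")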
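
Why a shared interface makes the swap a one-line change can be shown with stand-in classes. The sketch below does not use the Vision Agents API: `SpeechToText`, `ProviderA`, `ProviderB`, and `handle_audio` are hypothetical names, and the real plugin base classes and method signatures may differ.

```python
from typing import Protocol


class SpeechToText(Protocol):
    """Hypothetical shared interface; the real plugin base classes may differ."""

    def transcribe(self, audio: bytes) -> str: ...


class ProviderA:
    """Stand-in for one STT provider."""

    def transcribe(self, audio: bytes) -> str:
        return f"[provider-a] {len(audio)} bytes"


class ProviderB:
    """Stand-in for another STT provider with the same method shape."""

    def transcribe(self, audio: bytes) -> str:
        return f"[provider-b] {len(audio)} bytes"


def handle_audio(stt: SpeechToText, audio: bytes) -> str:
    # Agent logic depends only on the interface, never on a concrete provider.
    return stt.transcribe(audio).upper()


if __name__ == "__main__":
    stt = ProviderA()  # change this one line to ProviderB() to swap providers
    print(handle_audio(stt, b"\x00\x01"))
```

Because `handle_audio` is written against the interface, nothing downstream changes when the constructor on the first line of the usage block does.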

Creating Custom Plugins

Don’t see your provider? Build your own plugin to connect additional services. See the Create Your Own Plugin guide.
