Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.

Which plugin do I need?
Pick based on what your agent needs to do:

| I want to… | Start here | What you get |
|---|---|---|
| Handle calls and respond naturally by voice | Realtime | End-to-end voice agent with multimodal support, unified under one plugin and model |
| Connect to my own tools, APIs, or knowledge base | Language Models | Function calling, RAG, and full control over STT/TTS choices |
| Transcribe what users say in real time | Speech-to-Text | Streaming transcription, some with built-in turn detection |
| Give my agent a distinct, natural voice | Text-to-Speech | Cloud and local options, from expressive to ultra-low latency |
| See and understand what’s on camera | Vision & Video | Object detection, video analysis, and style transfer |
| Put a face on my agent | Avatars | Real-time lip-synced visual characters |
| Make conversations feel natural, not robotic | Turn Detection | Smart interruption handling and silence detection |
| Run open-source models on my own infrastructure | Infrastructure | Self-hosted inference, model routing, and vector search |

Installation
Plugins install as extras. Add only the ones you need.

Browse by Category

Language Models
Text generation with function calling. Requires separate STT/TTS plugins.

| Provider | Notes |
|---|---|
| Gemini | Built-in tools: search, code execution, RAG |
| OpenAI | Responses API (GPT-5+) and ChatCompletions |
| xAI (Grok) | Advanced reasoning, function calling |
| OpenRouter | Unified API for Claude, Gemini, GPT, and more |
| Kimi AI | OpenAI-compatible via ChatCompletions |
| Qwen | DashScope API via ChatCompletions |

Realtime
End-to-end speech-to-speech with built-in STT/TTS. Lowest latency, simplest setup.

| Provider | Notes |
|---|---|
| Gemini Realtime | WebSocket, optional video, built-in VAD |
| OpenAI Realtime | WebRTC, built-in STT/TTS |
| Qwen Realtime | Native audio I/O, video support |
| AWS Bedrock | Amazon Nova models, auto session management |
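
The difference between the two categories above comes down to which other plugins they need: a Realtime plugin bundles STT and TTS, while a Language Models plugin needs both added separately. The sketch below encodes that rule in plain Python. The category names and the `missing_categories` helper are hypothetical, chosen for illustration, and are not part of the Vision Agents API.

```python
# Hypothetical category identifiers mirroring the section headings on this
# page. They are illustrative only, not names from the Vision Agents API.
REALTIME = "realtime"
LLM = "llm"
STT = "stt"
TTS = "tts"


def missing_categories(selected: set[str]) -> set[str]:
    """Return the plugin categories still needed for a working voice agent."""
    if REALTIME in selected:
        # Realtime plugins include STT and TTS, so nothing else is required.
        return set()
    # A Language Models plugin only generates text, so it needs a
    # Speech-to-Text and a Text-to-Speech plugin alongside it.
    return {LLM, STT, TTS} - selected


print(missing_categories({"realtime"}))     # set()
print(sorted(missing_categories({"llm"})))  # ['stt', 'tts']
```

A check like this could run before installing extras, to catch a Language Models setup that lacks a voice in or voice out component.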

Speech-to-Text
Real-time transcription. Some include built-in turn detection.

| Provider | Notes |
|---|---|
| Deepgram | Nova-3, built-in turn detection |
| ElevenLabs | Scribe v2, ~150ms latency, built-in VAD |
| AssemblyAI | Punctuation-based turn detection |
| Fish Audio | Auto language detection |
| Mistral Voxtral | WebSocket streaming, requires separate turn detection |
| Fast-Whisper | Local, CPU/GPU accelerated |
| Wizper | Whisper v3, on-the-fly translation |

Text-to-Speech
Voice synthesis for agent responses.

| Provider | Notes |
|---|---|
| ElevenLabs | Highly realistic, multilingual |
| Cartesia | Low-latency Sonic model |
| Deepgram | Aura-2, low-latency |
| OpenAI | gpt-4o-mini-tts, streaming |
| Fish Audio | Prosody control, voice cloning |
| Inworld | Expressive game character voices |
| Kokoro | Local, runs on CPU, no API key |
| Pocket TTS | Local, ~200ms latency, voice cloning |
| AWS Polly | Standard and neural engines |

Vision & Video
Video understanding, object detection, and video transformation.

| Provider | Notes |
|---|---|
| Moondream | Zero-shot detection, VQA, cloud or local |
| NVIDIA | Cosmos Reason2, real-time video understanding |
| Roboflow | Pre-trained and custom detection models |
| Ultralytics YOLO | Pose estimation, object detection |
| Decart | Real-time AI video style transfer |

Avatars
Visual AI characters with real-time lip sync.

| Provider | Notes |
|---|---|
| HeyGen | Realistic AI avatars, automatic lip-sync |
| LemonSlice | Real-time interactive avatars |

Turn Detection
Controls when the agent should start and stop speaking.

| Provider | Notes |
|---|---|
| Smart Turn | Silero VAD + Whisper features |
| Vogent | Neural turn completion prediction |
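
Whether a separate turn-detection plugin is needed depends on the Speech-to-Text choice, since some STT providers detect turns themselves. The sketch below is a minimal lookup built only from the Notes column of the Speech-to-Text table on this page. The `needs_turn_detection_plugin` helper and the dictionary are hypothetical, and `None` marks providers whose notes do not say either way.

```python
from typing import Optional

# Flags transcribed from the Speech-to-Text table on this page.
# True  = notes mention turn detection, False = notes say a separate
# plugin is required, None = the table does not say.
STT_BUILTIN_TURN_DETECTION: dict[str, Optional[bool]] = {
    "Deepgram": True,
    "AssemblyAI": True,
    "Mistral Voxtral": False,
    "ElevenLabs": None,  # notes mention built-in VAD only
    "Fish Audio": None,
    "Fast-Whisper": None,
    "Wizper": None,
}


def needs_turn_detection_plugin(stt_provider: str) -> Optional[bool]:
    """Return True/False when the table is explicit, else None (check docs)."""
    builtin = STT_BUILTIN_TURN_DETECTION.get(stt_provider)
    if builtin is None:
        return None
    return not builtin


print(needs_turn_detection_plugin("Mistral Voxtral"))  # True
print(needs_turn_detection_plugin("Deepgram"))         # False
```

When the helper returns `None`, the provider's own documentation is the place to confirm before deciding on a Turn Detection plugin.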

Infrastructure
Inference platforms and data services for running models on your own terms.

| Provider | Notes |
|---|---|
| Baseten | OpenAI-compatible endpoints for open-source models |
| HuggingFace Inference | Unified API routing to Together, Groq, Cerebras, and more |
| TurboPuffer | Vector database for RAG with hybrid search |

