Copy this prompt into Claude Code, Cursor, Windsurf, or any coding agent to scaffold your project.
Vision Agents requires a Stream account for real-time transport. Stream offers 333,000 free participant minutes monthly, plus additional credits through the Maker Program for indie developers. Most AI providers also offer free tiers.
Two Modes
| Mode | Best For |
|---|---|
| Realtime Models | Fastest path, built-in STT/TTS |
| Custom Pipeline | Full control over STT, LLM, TTS |
openai.Realtime() and gemini.Realtime() handle speech-to-speech natively via WebRTC or WebSocket — no separate STT/TTS needed. The Quickstart uses this approach.
Custom pipelines let you mix providers: Deepgram for STT, any LLM, ElevenLabs for TTS, with configurable turn detection.
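To make the custom-pipeline idea concrete, here is a minimal sketch of one conversational turn passing through three stages. The stage types, the `run_turn` helper, and the `fake_*` functions are hypothetical placeholders for illustration, not the Vision Agents API; in a real agent each stage would be a provider plugin.

```python
from typing import Callable

# Hypothetical stage signatures, for illustration only
# (these are NOT the Vision Agents API).
STT = Callable[[bytes], str]    # audio in, transcript out
LLM = Callable[[str], str]      # transcript in, reply text out
TTS = Callable[[str], bytes]    # reply text in, audio out


def run_turn(audio: bytes, stt: STT, llm: LLM, tts: TTS) -> bytes:
    """Run one user turn through the STT -> LLM -> TTS chain."""
    transcript = stt(audio)
    reply = llm(transcript)
    return tts(reply)


# Stand-in stages so the sketch runs without any provider account.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


if __name__ == "__main__":
    print(run_turn(b"hello", fake_stt, fake_llm, fake_tts))
```

Because each stage is just a callable with a fixed input and output type, swapping one provider for another only changes the argument passed to `run_turn`. A realtime model, by contrast, would replace all three stages with a single audio-in, audio-out component.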
Custom Pipeline Mode
For granular control over your voice pipeline, use separate STT, LLM, and TTS components. Add the plugins you need beyond the quickstart and put the matching provider API keys in .env. Each component has several options:
| Component | Options |
|---|---|
| LLM | Gemini, OpenAI, OpenRouter, Anthropic, Grok, HuggingFace |
| STT | Deepgram, ElevenLabs, Fast-Whisper, Fish, Wizper |
| TTS | ElevenLabs, Cartesia, Deepgram, Pocket, AWS Polly |
| Turn Detection | Deepgram (built-in), ElevenLabs (built-in), Smart Turn, Vogent |
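One way to read the Turn Detection row is that Deepgram and ElevenLabs handle turn detection themselves when used for STT, while other STT choices need a standalone detector such as Smart Turn or Vogent. That reading is an assumption, and the helper below is a hypothetical sketch of checking a provider selection against the table, not part of the Vision Agents API.

```python
from typing import Optional

# Provider names copied from the options table. The "built-in" grouping
# reflects an ASSUMED reading of the Turn Detection row.
STT_WITH_BUILT_IN_TURN_DETECTION = frozenset({"Deepgram", "ElevenLabs"})
STANDALONE_TURN_DETECTORS = frozenset({"Smart Turn", "Vogent"})


def needs_turn_detector(stt_provider: str) -> bool:
    """Return True when the STT choice has no built-in turn detection."""
    return stt_provider not in STT_WITH_BUILT_IN_TURN_DETECTION


def describe_pipeline(
    stt: str, llm: str, tts: str, turn_detector: Optional[str] = None
) -> str:
    """Validate a provider selection and summarize it (hypothetical helper)."""
    if turn_detector is not None and turn_detector not in STANDALONE_TURN_DETECTORS:
        raise ValueError(f"unknown turn detector: {turn_detector}")
    if needs_turn_detector(stt) and turn_detector is None:
        options = ", ".join(sorted(STANDALONE_TURN_DETECTORS))
        raise ValueError(f"{stt} has no built-in turn detection; pick one of: {options}")
    turn = turn_detector or f"{stt} (built-in)"
    return f"STT={stt}, LLM={llm}, TTS={tts}, turn detection={turn}"


if __name__ == "__main__":
    print(describe_pipeline("Deepgram", "Gemini", "ElevenLabs"))
    print(describe_pipeline("Fast-Whisper", "OpenAI", "Cartesia", "Smart Turn"))
```

Checking the selection up front surfaces a missing turn detector as a clear error at configuration time rather than as an agent that never notices the user has stopped speaking.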
Function Calling & MCP
Register functions that your agent can call.
What’s Next
- Phone Integration: Connect agents to Twilio for inbound and outbound calls
- RAG Support: Add knowledge bases with Gemini FileSearch or TurboPuffer
- Production Deployment: Deploy with Docker, Kubernetes, and monitoring
- Built-in HTTP Server: Console mode and HTTP server for running agents
Examples
- Simple Agent — Minimal voice agent with Deepgram STT + ElevenLabs TTS + Gemini LLM
- Phone & RAG — Twilio calling with TurboPuffer knowledge base

