Vision Agents is provider-agnostic—bring your own API keys for Stream, LLM providers (OpenAI, Google, etc.), and speech services (Deepgram, ElevenLabs, etc.). Most offer free tiers to get started.
## Two Approaches
Vision Agents supports two modes for building voice agents:

| Mode | Best For |
|---|---|
| Realtime Models | Fastest path, built-in STT/TTS |
| Custom Pipeline | Full control over STT, LLM, TTS |
`openai.Realtime()` and `gemini.Realtime()` handle speech-to-speech natively over WebRTC or WebSocket, so no separate STT or TTS is needed.
Custom pipelines let you mix providers: Deepgram for STT, any LLM, ElevenLabs for TTS, with configurable turn detection.
## Quickstart: Realtime Mode
The fastest path to a working voice agent is to use one of the realtime models in a freshly created Python project. Vision Agents requires CPython; make sure it is set up with Python 3.12 before continuing to avoid issues later in the tutorial. If you use AI coding tools, we recommend adding our MCP server and Skill.md for the best experience.
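If you are starting from scratch, setup might look like the following sketch (the uv workflow and the `vision-agents` package name are assumptions; check the repository README for the exact install command):

```bash
# Create a fresh project on Python 3.12 and install the SDK (assumed package name)
uv init my-voice-agent --python 3.12
cd my-voice-agent
uv add vision-agents
```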
Create a .env file in your project root containing the required API keys for the services you will be connecting to. By default, Vision Agents scans for these keys and registers them automatically, so you don't need to pass them to each client manually. If you choose plugins beyond the ones in this tutorial, check their respective integration pages for the expected .env keys.
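For the realtime quickstart, the file might look like this (the key names below are conventional assumptions; confirm them on each integration page):

```
# Assumed key names; confirm on each integration page
STREAM_API_KEY=your-stream-api-key
STREAM_API_SECRET=your-stream-api-secret
GEMINI_API_KEY=your-gemini-api-key
```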
In your main.py file, add the following for a realtime voice agent with Gemini:
```python
llm=gemini.Realtime()
```
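In context, a minimal agent might look like the sketch below. The import paths, the `Agent` constructor, and the join/finish flow follow the public Vision Agents examples but should be treated as assumptions; check the docs for the exact API in your version.

```python
# A minimal sketch of a realtime voice agent (assumed API surface).
import asyncio
from uuid import uuid4

from vision_agents.core import Agent, User
from vision_agents.plugins import getstream, gemini


async def main() -> None:
    agent = Agent(
        edge=getstream.Edge(),               # Stream carries the call media
        agent_user=User(name="Assistant"),   # identity the agent joins with
        instructions="You are a friendly voice assistant. Keep replies short.",
        llm=gemini.Realtime(),               # speech-to-speech, no STT/TTS stages
    )

    await agent.create_user()
    call = agent.create_call("default", str(uuid4()))
    with await agent.join(call):             # assumed context-manager API
        await agent.finish()                 # run until the call ends


if __name__ == "__main__":
    asyncio.run(main())
```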
## Custom Pipeline Mode
For use cases where you'd like more granular control over your voice pipeline, Vision Agents supports a wide variety of LLM providers, STT and TTS plugins, and several models for turn detection. Since Vision Agents follows a BYOK (bring your own keys) model, check each integration on the Integrations tab for the required additions to your .env file.
For this tutorial, you will need the following in your .env:
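This tutorial's pipeline uses Deepgram for STT and ElevenLabs for TTS, with a Gemini key standing in for whichever LLM you pick. The key names below are assumptions; verify them on each integration page:

```
# Assumed key names; confirm on each integration page
STREAM_API_KEY=your-stream-api-key
STREAM_API_SECRET=your-stream-api-secret
GEMINI_API_KEY=your-gemini-api-key
DEEPGRAM_API_KEY=your-deepgram-api-key
ELEVENLABS_API_KEY=your-elevenlabs-api-key
```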
Next, update your main.py to define the various stages in our Agent:
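The sketch below wires the stages together, assuming plugin class names (`deepgram.STT`, `elevenlabs.TTS`, `gemini.LLM`) that follow the integration pages; treat them as placeholders until you confirm the exact names.

```python
# A sketch of a custom pipeline with explicit STT, LLM, and TTS stages.
from vision_agents.core import Agent, User
from vision_agents.plugins import getstream, gemini, deepgram, elevenlabs

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant"),
    instructions="You are a concise support agent.",
    llm=gemini.LLM(),          # a text LLM instead of a realtime model
    stt=deepgram.STT(),        # speech-to-text stage
    tts=elevenlabs.TTS(),      # text-to-speech stage
    # turn detection is also configurable; see the Turn Detection integrations
)
```

Each stage accepts any of the supported providers: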
| Component | Options |
|---|---|
| LLM | Gemini, OpenAI, OpenRouter, Anthropic, Grok, HuggingFace |
| STT | Deepgram, Fast-Whisper, Fish, Wizper |
| TTS | ElevenLabs, Cartesia, Deepgram, Pocket, AWS Polly |
| Turn Detection | Deepgram (built-in), Smart Turn, Vogent |
Function Calling & MCP
Register functions that your agent can call, or connect to MCP servers for external tools:
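A hypothetical sketch is below; the `register_function` decorator and its signature are assumptions, so consult the Function Calling docs for the exact API. MCP servers are typically attached when the agent is constructed.

```python
# Hypothetical function-calling sketch (assumed decorator name and signature).
from vision_agents.plugins import gemini

llm = gemini.LLM()

@llm.register_function(description="Look up the current weather for a city")
def get_weather(city: str) -> dict:
    # Replace with a real lookup; the return value is sent back to the
    # model as the tool result.
    return {"city": city, "forecast": "sunny", "temp_c": 21}
```

## What's Next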
- Phone Integration — Connect agents to Twilio for inbound and outbound calls
- RAG Support — Add knowledge bases with Gemini FileSearch or Turbopuffer
- Production Deployment — Deploy with Docker, Kubernetes, and monitoring
- Running Agents — Console mode and HTTP server for running agents
## Examples
- Simple Agent — Minimal voice agent
- Phone & RAG — Twilio + knowledge base

