Build voice agents with swappable providers, phone integration, function calling, and production deployment with built-in metrics.

Vision Agents requires a Stream account for real-time transport. Stream offers 333,000 free participant minutes monthly, plus additional credits through the Maker Program for indie developers. Most AI providers also offer free tiers.
Prerequisites: Complete the Quickstart first.

Two Modes

| Mode | Best For |
| --- | --- |
| Realtime Models | Fastest path, built-in STT/TTS |
| Custom Pipeline | Full control over STT, LLM, TTS |
Realtime models like openai.Realtime() and gemini.Realtime() handle speech-to-speech natively via WebRTC or WebSocket — no separate STT/TTS needed. The Quickstart uses this approach. Custom pipelines let you mix providers: Deepgram for STT, any LLM, ElevenLabs for TTS, with configurable turn detection.
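To make the contrast concrete, a realtime-mode agent drops the separate `stt=`/`tts=` components and passes the realtime model as the LLM. This is a minimal sketch based on the custom-pipeline example below — the exact constructor arguments for `gemini.Realtime()` may differ in your SDK version:

```python
from vision_agents.core import Agent, User
from vision_agents.plugins import getstream, gemini


async def create_agent(**kwargs) -> Agent:
    # The realtime model handles speech-to-speech natively,
    # so no separate stt= or tts= components are wired in.
    return Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Assistant", id="agent"),
        instructions="You're a helpful voice assistant.",
        llm=gemini.Realtime(),
    )
```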

Custom Pipeline Mode

For granular control over your voice pipeline, use separate STT, LLM, and TTS components. Install the additional plugins on top of the Quickstart setup:
uv add "vision-agents[deepgram,elevenlabs]"
Add these keys to your .env:
DEEPGRAM_API_KEY=your_deepgram_api_key
ELEVENLABS_API_KEY=your_elevenlabs_api_key
Then update your agent to use the custom pipeline:
from dotenv import load_dotenv

from vision_agents.core import Agent, AgentLauncher, User, Runner
from vision_agents.plugins import getstream, gemini, deepgram, elevenlabs

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    return Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Assistant", id="agent"),
        instructions="You're a helpful voice assistant.",
        llm=gemini.LLM(),
        stt=deepgram.STT(eager_turn_detection=True),
        tts=elevenlabs.TTS(),
    )


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        await agent.simple_response("Greet the user")
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
Mix and match any combination:
| Component | Options |
| --- | --- |
| LLM | Gemini, OpenAI, OpenRouter, Anthropic, Grok, HuggingFace |
| STT | Deepgram, ElevenLabs, Fast-Whisper, Fish, Wizper |
| TTS | ElevenLabs, Cartesia, Deepgram, Pocket, AWS Polly |
| Turn Detection | Deepgram (built-in), ElevenLabs (built-in), Smart Turn, Vogent |
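For example, swapping in OpenAI for the LLM and Cartesia for TTS only changes the component arguments — a sketch assuming the `openai` and `cartesia` plugin modules follow the same naming pattern as the plugins above (and that you've run `uv add "vision-agents[openai,cartesia,deepgram]"`):

```python
from vision_agents.core import Agent, User
from vision_agents.plugins import getstream, openai, cartesia, deepgram

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You're a helpful voice assistant.",
    llm=openai.LLM(),      # swapped from gemini.LLM()
    stt=deepgram.STT(),
    tts=cartesia.TTS(),    # swapped from elevenlabs.TTS()
)
```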

Function Calling & MCP

Register functions on your LLM instance so the agent can call them:
llm = gemini.LLM()

@llm.register_function(description="Get weather for a location")
async def get_weather(location: str) -> dict:
    return {"temperature": "22C", "condition": "Sunny"}
Functions are automatically converted to the right format for each LLM provider. For MCP servers, external tools, and advanced patterns, see the Function Calling & MCP guide.
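That conversion can be approximated as follows. This is a hand-rolled sketch, not Vision Agents' actual implementation: it turns a typed Python function into an OpenAI-style tool schema using the standard library's `inspect` and `typing` modules.

```python
import inspect
from typing import get_type_hints

# Minimal mapping from Python annotations to JSON Schema types.
_JSON_TYPES = {str: "string", int: "integer", float: "number",
               bool: "boolean", dict: "object"}


def to_tool_schema(func, description: str) -> dict:
    """Build an OpenAI-style function-tool schema from a Python callable."""
    hints = get_type_hints(func)
    hints.pop("return", None)
    params = inspect.signature(func).parameters
    properties = {
        name: {"type": _JSON_TYPES.get(hints.get(name, str), "string")}
        for name in params
    }
    # Parameters without defaults are required.
    required = [name for name, p in params.items()
                if p.default is inspect.Parameter.empty]
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


async def get_weather(location: str) -> dict:
    return {"temperature": "22C", "condition": "Sunny"}


schema = to_tool_schema(get_weather, "Get weather for a location")
print(schema["function"]["name"])  # get_weather
```

A real provider adapter would also handle nested types, defaults, and provider-specific schema dialects; the point is that your function's name, signature, and description carry all the information the LLM needs.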

What’s Next

Phone Integration

Connect agents to Twilio for inbound and outbound calls

RAG Support

Add knowledge bases with Gemini FileSearch or TurboPuffer

Production Deployment

Deploy with Docker, Kubernetes, and monitoring

Built-in HTTP Server

Console mode and HTTP server for running agents

Examples

  • Simple Agent — Minimal voice agent with Deepgram STT + ElevenLabs TTS + Gemini LLM
  • Phone & RAG — Twilio calling with TurboPuffer knowledge base