Build low-latency voice agents on Stream’s edge network with 25+ AI provider integrations. Swap providers in one line, deploy to production with built-in metrics, and connect to phone networks.

Vision Agents is provider-agnostic—bring your own API keys for Stream, LLM providers (OpenAI, Google, etc.), and speech services (Deepgram, ElevenLabs, etc.). Most offer free tiers to get started.

Two Approaches

Vision Agents supports two modes for building voice agents:
Mode               Best For
Realtime Models    Fastest path, built-in STT/TTS
Custom Pipeline    Full control over STT, LLM, TTS
Realtime models like openai.Realtime() and gemini.Realtime() handle speech-to-speech natively via WebRTC or WebSocket—no separate STT/TTS needed. Custom pipelines let you mix providers: Deepgram for STT, any LLM, ElevenLabs for TTS, with configurable turn detection.
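In code, the choice comes down to which components you hand the Agent. These are the same constructors used in the sections below:
# Realtime: one model handles listening, reasoning, and speaking
llm = gemini.Realtime()

# Custom pipeline: each stage is a separate, swappable plugin
stt = deepgram.STT()
llm = gemini.LLM("gemini-2.5-flash")
tts = elevenlabs.TTS()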

Quickstart: Realtime Mode

The fastest path to a working voice agent is to use one of the realtime models in a freshly created Python project:
uv add "vision-agents[getstream,gemini,deepgram,elevenlabs,openai]" python-dotenv
Vision Agents requires CPython 3.12 on your machine. Before continuing, make sure it is installed to avoid issues later in the tutorial. If you use AI coding tools, we recommend adding our MCP server and Skill.md for the best experience.
Next, add a .env file to your project root containing the required API keys for the services we will connect to. By default, Vision Agents scans and registers these keys automatically, so you don't need to pass them to each client manually. If you use plugins beyond the ones in this tutorial, see their respective integration pages for the expected .env keys.
# Stream API credentials
STREAM_API_KEY=your_stream_api_key_here
STREAM_API_SECRET=your_stream_api_secret_here

# Google API Key from Google AI Studio
GOOGLE_API_KEY=your_google_api_key_here

# API Key from OpenAI.com
OPENAI_API_KEY=your_openai_api_key_here
Finally, in your main.py file, add the following for a realtime voice agent with Gemini:
from dotenv import load_dotenv

from vision_agents.core import Agent, AgentLauncher, User, Runner
from vision_agents.plugins import getstream, gemini

load_dotenv() # Automatically loads your keys from .env


async def create_agent(**kwargs) -> Agent:
    return Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Assistant", id="agent"),
        instructions="You're a helpful voice assistant. Be concise.",
        llm=gemini.Realtime(),
    )

async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        await agent.simple_response("Greet the user")  # speak first once connected
        await agent.finish()  # run until the call session ends

if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
To run:
uv run main.py run
Swap to OpenAI in one line: llm=openai.Realtime()
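In context, the swap touches only the import and the llm argument of create_agent (assuming the openai plugin follows the same import pattern as gemini):
from vision_agents.plugins import getstream, openai

async def create_agent(**kwargs) -> Agent:
    return Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Assistant", id="agent"),
        instructions="You're a helpful voice assistant. Be concise.",
        llm=openai.Realtime(),  # was gemini.Realtime()
    )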

Custom Pipeline Mode

For use cases where you want more granular control over your voice pipeline, Vision Agents supports a wide range of LLM providers, STT and TTS plugins, and several turn-detection models. Since Vision Agents follows a BYOK (bring-your-own-keys) model, check each integration on the integrations tab for the required additions to your .env file. For this tutorial, you will need the following in your .env:
# Stream API credentials
STREAM_API_KEY=your_stream_api_key_here
STREAM_API_SECRET=your_stream_api_secret_here

# Google API Key from Google AI Studio
GOOGLE_API_KEY=your_google_api_key_here

# API Key from OpenAI.com
OPENAI_API_KEY=your_openai_api_key_here

# ElevenLabs API credentials
ELEVENLABS_API_KEY=your_elevenlabs_api_key_here

# Deepgram API credentials
DEEPGRAM_API_KEY=your_deepgram_api_key_here
Next, we can update our main.py to define the various stages in our Agent:
from dotenv import load_dotenv

from vision_agents.core import Agent, User
from vision_agents.plugins import getstream, gemini, deepgram, elevenlabs

load_dotenv()  # Automatically loads your keys from .env

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You're a helpful voice assistant.",
    llm=gemini.LLM("gemini-2.5-flash"),           # text LLM instead of a realtime model
    stt=deepgram.STT(eager_turn_detection=True),  # speech-to-text with eager turn detection
    tts=elevenlabs.TTS(),                         # text-to-speech voice
)
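This snippet only constructs the Agent; to run it, plug it into the same launcher scaffold from the realtime quickstart. A minimal sketch, reusing the join_call defined there:
from vision_agents.core import AgentLauncher, Runner

async def create_agent(**kwargs) -> Agent:
    # Fine for a single session; construct a fresh Agent here if you
    # expect the launcher to create more than one.
    return agent

if __name__ == "__main__":
    # join_call is identical to the one in the realtime quickstart
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()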
Mix and match any combination:
Component        Options
LLM              Gemini, OpenAI, OpenRouter, Anthropic, Grok, HuggingFace
STT              Deepgram, Fast-Whisper, Fish, Wizper
TTS              ElevenLabs, Cartesia, Deepgram, Pocket, AWS Polly
Turn Detection   Deepgram (built-in), Smart Turn, Vogent
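For instance, you could swap the pipeline's LLM for OpenAI while keeping Deepgram and ElevenLabs. The constructor and model name below are assumptions for illustration; check the OpenAI integration page for the real ones:
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You're a helpful voice assistant.",
    llm=openai.LLM("gpt-4o"),  # hypothetical constructor and model name
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
)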

Function Calling & MCP

Register functions that your agent can call, or connect to MCP servers for external tools:
# `llm` is any LLM instance, e.g. the gemini.LLM from the pipeline above
@llm.register_function(description="Get weather for a location")
async def get_weather(location: str) -> dict:
    # A real implementation would call a weather API here
    return {"temperature": "22°C", "condition": "Sunny"}
Functions are automatically converted to the right format for each LLM provider. For MCP servers, external tools, and advanced patterns, see the Function Calling & MCP guide.
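Putting it together with the pipeline from earlier, the decorated function travels with whichever LLM instance you pass to the Agent. A sketch using the constructors shown above:
from vision_agents.core import Agent, User
from vision_agents.plugins import getstream, gemini, deepgram, elevenlabs

llm = gemini.LLM("gemini-2.5-flash")

@llm.register_function(description="Get weather for a location")
async def get_weather(location: str) -> dict:
    return {"temperature": "22°C", "condition": "Sunny"}

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You're a helpful voice assistant.",
    llm=llm,  # the agent can now call get_weather mid-conversation
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
)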

What’s Next

Examples