Vision Agents is an open-source framework for building real-time voice and video AI applications. Built by Stream, it ships with our global edge network for low-latency transport, but is edge-agnostic—bring any transport layer you like.

What Can You Build?

  • Voice Agents — customer support, phone bots, voice assistants
  • Video AI — coaching, avatars, surveillance, manufacturing
  • Phone Integration — inbound and outbound calling via Twilio
  • RAG Applications — knowledge-powered agents with Gemini or Turbopuffer

Key Features

| Feature | Description |
| --- | --- |
| 25+ Integrations | OpenAI, Gemini, Deepgram, ElevenLabs, NVIDIA, and more |
| Two Modes | Realtime (WebRTC/WebSocket) or custom STT→LLM→TTS pipelines |
| Video Processing | YOLO, Roboflow, custom processors for computer vision |
| Production Ready | HTTP server, Prometheus metrics, Docker deployment |
| Phone Support | Twilio integration for voice calls |
| RAG | Gemini FileSearch, Turbopuffer vector search |
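
To make the custom STT→LLM→TTS mode more concrete, here is a minimal, framework-independent sketch of how such a pipeline chains its three stages for one conversational turn. It does not use the Vision Agents API: the `STT`, `LLM`, `TTS`, and `Pipeline` names and the stub classes are hypothetical, chosen only for illustration, and a real pipeline would stream audio asynchronously rather than process whole byte strings.

```python
from dataclasses import dataclass
from typing import Protocol


# Hypothetical stage interfaces (assumptions, not the Vision Agents API).
class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LLM(Protocol):
    def reply(self, text: str) -> str: ...


class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class Pipeline:
    """Chains speech-to-text, a language model, and text-to-speech."""

    stt: STT
    llm: LLM
    tts: TTS

    def handle_turn(self, audio: bytes) -> bytes:
        text = self.stt.transcribe(audio)   # audio in -> transcript
        answer = self.llm.reply(text)       # transcript -> response text
        return self.tts.synthesize(answer)  # response text -> audio out


# Stub stages so the sketch runs without any external service.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")  # pretend the bytes are the spoken words


class UpperLLM:
    def reply(self, text: str) -> str:
        return text.upper()  # stand-in for a model response


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")  # pretend the bytes are synthesized audio


pipeline = Pipeline(stt=EchoSTT(), llm=UpperLLM(), tts=BytesTTS())
```

Because each stage is only an interface here, swapping one provider for another means replacing a single object, which is the main appeal of a pipeline over a single realtime model.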

Built-in Integrations

  • LLMs: OpenAI, Gemini, Anthropic, xAI, OpenRouter, HuggingFace
  • Realtime: OpenAI (WebRTC), Gemini (WebSocket), Qwen, AWS Nova
  • STT: Deepgram, Fast-Whisper, Fish, Wizper
  • TTS: ElevenLabs, Cartesia, Deepgram, Pocket, AWS Polly
  • Vision: NVIDIA Cosmos, HuggingFace VLMs, Moondream, Roboflow, Ultralytics
  • Turn Detection: Deepgram (built-in), Vogent, Smart Turn

Each integration is extensible. Build custom processors with VideoProcessor, add new LLM providers, or integrate any speech service.
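
As a rough illustration of what a custom processor does conceptually, the sketch below runs a frame through a chain of small processing steps. It is not the library's `VideoProcessor` interface: `FrameProcessor`, `InvertProcessor`, `ThresholdProcessor`, and `run_chain` are hypothetical names, and a frame is assumed to be a plain list of grayscale pixel rows rather than a real video frame object.

```python
from typing import List, Protocol, Sequence

# Assumption: a frame is a grid of grayscale pixel values in the range 0..255.
Frame = List[List[int]]


class FrameProcessor(Protocol):
    """Hypothetical interface: take a frame, return a processed frame."""

    def process(self, frame: Frame) -> Frame: ...


class InvertProcessor:
    def process(self, frame: Frame) -> Frame:
        # Flip each pixel's brightness.
        return [[255 - pixel for pixel in row] for row in frame]


class ThresholdProcessor:
    def __init__(self, cutoff: int) -> None:
        self.cutoff = cutoff

    def process(self, frame: Frame) -> Frame:
        # Map each pixel to pure white or pure black around the cutoff.
        return [[255 if pixel >= self.cutoff else 0 for pixel in row] for row in frame]


def run_chain(frame: Frame, processors: Sequence[FrameProcessor]) -> Frame:
    """Apply each processor in order, feeding each output into the next step."""
    for processor in processors:
        frame = processor.process(frame)
    return frame


result = run_chain([[0, 100, 255]], [InvertProcessor(), ThresholdProcessor(128)])
```

In this toy run the input row `[0, 100, 255]` is inverted to `[255, 155, 0]` and then thresholded to `[255, 255, 0]`. A real processor would typically wrap a detection model such as YOLO and emit annotations alongside the frame.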

Next Steps