Star Vision Agents on GitHub
Get started with examples, contribute, and stay updated
Vision Agents is provider-agnostic—bring your own API keys for Stream, LLM providers (OpenAI, Google, NVIDIA, etc.), and speech services. Most offer free tiers to get started.
Three Approaches
| Mode | Best For | How It Works |
|---|---|---|
| Realtime Models | Lowest latency, native video | WebRTC/WebSocket direct to OpenAI or Gemini |
| VLMs | Video understanding, analysis | Frame buffering + chat completions API |
| Processors | Computer vision, detection | Custom ML pipelines before the LLM |
Realtime Mode
Stream video directly to models with native vision support. Before we run these, please ensure you already have a fresh Python3.12 project configured on your machine.
To get started, we’ll need to add the dependencies for this project:
For those using AI Coding tools, we recommend adding our MCP server and Skill.md for the best experience
python-dotenv.
main.py file with the code needed to run our Agent in realtime mode:
Vision Language Models (VLMs)
For video understanding and analysis, use VLMs that support the chat completions spec. Vision Agents automatically buffers frames and includes them with each request.| Provider | Use Case |
|---|---|
| NVIDIA | Cosmos 2 for advanced video reasoning |
| HuggingFace | Open-source VLMs (Qwen2-VL, etc.) via inference API |
| OpenRouter | Unified access to Claude, Gemini, and more |
Video Processors
For computer vision tasks like object detection, pose estimation, or custom ML models, use processors. They intercept video frames, run inference, and forward results to the LLM.| Processor | What It Does |
|---|---|
| Ultralytics YOLO | Object detection, pose estimation, segmentation |
| Roboflow | Cloud or local detection with RF-DETR |
| Custom | Extend VideoProcessor for any ML model |
Custom Pipeline
Combine VLMs with separate STT and TTS for full control:What’s Next
Video Processors
Build custom detection and analysis pipelines
Production Deployment
Deploy with Docker, Kubernetes, and monitoring
Examples
- Golf Coach — Realtime pose detection + coaching
- Security Camera — Face recognition + package detection
- Football Commentator — Object detection + live commentary

