Copy this prompt into Claude Code, Cursor, Windsurf, or any coding agent to scaffold your project.
Vision Agents requires a Stream account for real-time transport. Stream offers 333,000 free participant minutes monthly, plus additional credits through the Maker Program for indie developers. Most AI providers also offer free tiers.
Three Approaches
| Mode | Best For | How It Works |
|---|---|---|
| Realtime Models | Lowest latency, native video | WebRTC/WebSocket direct to OpenAI or Gemini |
| VLMs | Video understanding, analysis | Frame buffering + chat completions API |
| Processors | Computer vision, detection | Custom ML pipelines alongside the LLM |
Realtime Mode
Stream video directly to models with native vision support. Thefps parameter controls how many frames per second are sent to the model:
Vision Language Models (VLMs)
For video understanding and analysis, use VLMs that support the chat completions spec. Vision Agents automatically buffers frames and includes them with each request. Add the video-specific plugins:.env:
| Provider | Use Case |
|---|---|
| NVIDIA | Cosmos 2 for advanced video reasoning |
| HuggingFace | Open-source VLMs (Qwen2-VL, etc.) via inference API |
| OpenRouter | Unified access to Claude, Gemini, and more |
Video Processors
For computer vision tasks like object detection, pose estimation, or custom ML models, use processors. They intercept video frames, run inference, and forward results to the LLM.| Processor | What It Does |
|---|---|
| Ultralytics YOLO | Object detection, pose estimation, segmentation |
| Roboflow | Cloud or local detection with RF-DETR |
| Custom | Extend VideoProcessor for any ML model |
Custom Pipeline with VLM
Combine VLMs with separate STT and TTS for full control:What’s Next
Video Processors
Build custom detection and analysis pipelines
Production Deployment
Deploy with Docker, Kubernetes, and monitoring
Examples
- Golf Coach — Realtime pose detection + coaching
- Security Camera — Face recognition + package detection
- Football Commentator — Object detection + live commentary

