Build real-time video AI agents that process video with computer vision models, analyze frames with VLMs, or stream directly to realtime models. Deploy to production with built-in metrics.
Vision Agents requires a Stream account for real-time transport. Stream offers 333,000 free participant minutes monthly, plus additional credits through the Maker Program for indie developers. Most AI providers also offer free tiers.
For video understanding and analysis, use VLMs that support the chat completions spec. Vision Agents automatically buffers frames and includes them with each request. Add the video-specific plugins:
For computer vision tasks like object detection, pose estimation, or custom ML models, use processors. They intercept video frames, run inference, and forward results to the LLM.
uv add "vision-agents[ultralytics]"
from dotenv import load_dotenv
from vision_agents.core import Agent, AgentLauncher, User, Runner
from vision_agents.plugins import getstream, gemini, ultralytics

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    return Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Golf Coach", id="agent"),
        instructions="Analyze the user's golf swing and provide feedback.",
        llm=gemini.Realtime(fps=3),
        processors=[
            ultralytics.YOLOPoseProcessor(model_path="yolo26n-pose.pt")
        ],
    )


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        await agent.simple_response("Say hi and offer to analyze their swing")
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
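The `fps=3` argument in the example suggests that only a few frames per second are forwarded to the realtime model. The sketch below illustrates that idea with timestamp-based sampling in plain Python. `FrameSampler` and `sample_timestamps` are hypothetical names used for illustration only; they are not part of the Vision Agents API, and the library's actual sampling logic may differ.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class FrameSampler:
    """Hypothetical helper: keep at most `fps` frames per second of video."""

    fps: float
    # Timestamp (in seconds) of the most recently kept frame, if any.
    _last_kept: Optional[float] = field(default=None, init=False, repr=False)

    def should_keep(self, timestamp: float) -> bool:
        """Return True when enough time has passed since the last kept frame."""
        min_interval = 1.0 / self.fps
        if self._last_kept is None or timestamp - self._last_kept >= min_interval:
            self._last_kept = timestamp
            return True
        return False


def sample_timestamps(timestamps: Iterable[float], fps: float) -> List[float]:
    """Filter a stream of frame timestamps down to roughly `fps` per second."""
    sampler = FrameSampler(fps=fps)
    return [t for t in timestamps if sampler.should_keep(t)]


# One second of a 25 fps camera feed, reduced to at most 3 frames per second.
camera_timestamps = [i / 25 for i in range(25)]
kept = sample_timestamps(camera_timestamps, fps=3)
print(f"kept {len(kept)} of {len(camera_timestamps)} frames")
```

A lower rate reduces the number of frames the model has to process, at the cost of possibly missing fast movements such as the fastest part of a swing.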
Available processors:
| Processor | What It Does |
| --- | --- |
| Ultralytics YOLO | Object detection, pose estimation, segmentation |
| Roboflow | Cloud or local detection with RF-DETR |
| Custom | Extend VideoProcessor for any ML model |
Processors can be chained: run detection first, then pass the annotated frames to the LLM.
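To make the chaining idea concrete, here is a minimal sketch in plain Python. It does not use the Vision Agents API: `Frame`, `FrameProcessor`, `BrightSpotDetector`, `Annotator`, and `run_chain` are hypothetical stand-ins. A real custom processor would extend the library's `VideoProcessor` class, whose interface is not shown in this section.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Frame:
    """Hypothetical frame: a grid of pixel brightness values plus metadata."""

    pixels: list[list[int]]
    detections: list[dict] = field(default_factory=list)
    annotations: list[str] = field(default_factory=list)


class FrameProcessor(Protocol):
    """Hypothetical interface: take a frame, return a (possibly enriched) frame."""

    def process(self, frame: Frame) -> Frame: ...


class BrightSpotDetector:
    """Toy 'detector': flags every pixel above a brightness threshold."""

    def __init__(self, threshold: int = 200) -> None:
        self.threshold = threshold

    def process(self, frame: Frame) -> Frame:
        for y, row in enumerate(frame.pixels):
            for x, value in enumerate(row):
                if value > self.threshold:
                    frame.detections.append({"x": x, "y": y, "value": value})
        return frame


class Annotator:
    """Turns detections from earlier stages into text annotations."""

    def process(self, frame: Frame) -> Frame:
        for det in frame.detections:
            frame.annotations.append(f"bright spot at ({det['x']}, {det['y']})")
        return frame


def run_chain(frame: Frame, processors: list[FrameProcessor]) -> Frame:
    """Pass the frame through each processor in order."""
    for processor in processors:
        frame = processor.process(frame)
    return frame


# Detection runs first, then annotation; the result is what an LLM would receive.
frame = Frame(pixels=[[10, 250], [30, 40]])
result = run_chain(frame, [BrightSpotDetector(), Annotator()])
print(result.annotations)
```

Order matters in this sketch: the annotator only sees detections added by earlier stages, which mirrors running detection before handing annotated frames to the LLM.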