Build real-time video AI agents that process video with computer vision models, analyze frames with VLMs, or stream directly to realtime models. Deploy to production with built-in metrics.

Vision Agents requires a Stream account for real-time transport. Stream offers 333,000 free participant minutes monthly, plus additional credits through the Maker Program for indie developers. Most AI providers also offer free tiers.
Prerequisites: Complete the Quickstart first.

Three Approaches

| Mode | Best For | How It Works |
| --- | --- | --- |
| Realtime Models | Lowest latency, native video | WebRTC/WebSocket direct to OpenAI or Gemini |
| VLMs | Video understanding, analysis | Frame buffering + chat completions API |
| Processors | Computer vision, detection | Custom ML pipelines alongside the LLM |

Realtime Mode

Stream video directly to models with native vision support. The fps parameter controls how many frames per second are sent to the model:
from dotenv import load_dotenv

from vision_agents.core import Agent, AgentLauncher, User, Runner
from vision_agents.plugins import getstream, gemini

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    return Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Assistant", id="agent"),
        instructions="Describe what you see. Be concise.",
        llm=gemini.Realtime(fps=3),
    )


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        await agent.simple_response("What do you see?")
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
Swap providers in one line:
llm=openai.Realtime(fps=3)   # OpenAI
llm=gemini.Realtime(fps=3)   # Gemini
llm=qwen.Realtime(fps=1)     # Qwen 3 OMNI

Vision Language Models (VLMs)

For video understanding and analysis, use VLMs that support the chat completions spec. Vision Agents automatically buffers frames and includes them with each request. Add the video-specific plugins:
uv add "vision-agents[nvidia,deepgram,elevenlabs]"
Add to your .env:
NVIDIA_API_KEY=your_nvidia_api_key
DEEPGRAM_API_KEY=your_deepgram_api_key
ELEVENLABS_API_KEY=your_elevenlabs_api_key
from dotenv import load_dotenv

from vision_agents.core import Agent, AgentLauncher, User, Runner
from vision_agents.plugins import nvidia, getstream, deepgram, elevenlabs

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    return Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Assistant", id="agent"),
        instructions="Analyze the video and answer questions.",
        llm=nvidia.VLM(
            model="nvidia/cosmos-reason2-8b",
            fps=1,
            frame_buffer_seconds=10,
        ),
        stt=deepgram.STT(),
        tts=elevenlabs.TTS(),
    )


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        await agent.simple_response("Describe what you see")
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
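To see how fps and frame_buffer_seconds interact, here is a minimal, self-contained sketch of the buffering idea: keep only the most recent fps × frame_buffer_seconds frames and attach that window to each request. This is an illustration of the concept, not the library's internal implementation; the class and method names are assumptions.

```python
from collections import deque


class FrameBuffer:
    """Illustrative sketch (not the library's internal class): keeps the
    most recent fps * frame_buffer_seconds frames, mirroring the window
    a VLM plugin would include with each chat completions request."""

    def __init__(self, fps: int, frame_buffer_seconds: int):
        self.capacity = fps * frame_buffer_seconds
        self._frames = deque(maxlen=self.capacity)

    def add(self, frame: str) -> None:
        # Oldest frames are evicted automatically once capacity is reached.
        self._frames.append(frame)

    def snapshot(self) -> list:
        # Frames attached to the next request, oldest first.
        return list(self._frames)


# With fps=1 and frame_buffer_seconds=10, each request carries at most
# 10 frames, no matter how long the call has been running.
buf = FrameBuffer(fps=1, frame_buffer_seconds=10)
for i in range(25):
    buf.add(f"frame-{i}")

print(len(buf.snapshot()))  # 10
print(buf.snapshot()[0])    # frame-15
```

Raising fps or frame_buffer_seconds grows this window linearly, which increases both context size and per-request cost.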
Supported VLM providers:
| Provider | Use Case |
| --- | --- |
| NVIDIA | Cosmos 2 for advanced video reasoning |
| HuggingFace | Open-source VLMs (Qwen2-VL, etc.) via inference API |
| OpenRouter | Unified access to Claude, Gemini, and more |

Video Processors

For computer vision tasks like object detection, pose estimation, or custom ML models, use processors. They intercept video frames, run inference, and forward results to the LLM.
uv add "vision-agents[ultralytics]"
from dotenv import load_dotenv

from vision_agents.core import Agent, AgentLauncher, User, Runner
from vision_agents.plugins import getstream, gemini, ultralytics

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    return Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Golf Coach", id="agent"),
        instructions="Analyze the user's golf swing and provide feedback.",
        llm=gemini.Realtime(fps=3),
        processors=[
            ultralytics.YOLOPoseProcessor(model_path="yolo26n-pose.pt")
        ],
    )


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        await agent.simple_response("Say hi and offer to analyze their swing")
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
Available processors:
| Processor | What It Does |
| --- | --- |
| Ultralytics YOLO | Object detection, pose estimation, segmentation |
| Roboflow | Cloud or local detection with RF-DETR |
| Custom | Extend VideoProcessor for any ML model |
Processors can be chained — run detection first, then pass annotated frames to the LLM.
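The intercept-infer-forward pattern, including chaining, can be sketched in plain Python. The Frame fields, the process signature, and the stub detectors below are illustrative assumptions, not the actual VideoProcessor interface; they only show the shape of a custom processor and how chained processors hand annotated frames to each other.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """Hypothetical stand-in for a video frame plus detection results."""
    pixels: bytes
    annotations: list = field(default_factory=list)


class VideoProcessor:
    """Sketch of the base class; the real one lives in vision_agents
    and its method names may differ."""

    def process(self, frame: Frame) -> Frame:
        raise NotImplementedError


class StubPoseProcessor(VideoProcessor):
    """Stands in for a pose model such as YOLO pose."""

    def process(self, frame: Frame) -> Frame:
        frame.annotations.append("pose: 17 keypoints")
        return frame


class StubBallDetector(VideoProcessor):
    """Stands in for a second, task-specific detector."""

    def process(self, frame: Frame) -> Frame:
        frame.annotations.append("ball: (x=0.4, y=0.7)")
        return frame


def run_chain(frame: Frame, processors: list) -> Frame:
    # Chaining: each processor sees the previous one's annotated frame;
    # the final, fully annotated frame is what reaches the LLM.
    for p in processors:
        frame = p.process(frame)
    return frame


out = run_chain(Frame(pixels=b""), [StubPoseProcessor(), StubBallDetector()])
print(out.annotations)  # ['pose: 17 keypoints', 'ball: (x=0.4, y=0.7)']
```

A custom model slots in the same way: subclass the processor base class, run inference in process, and attach the results to the frame.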

Custom Pipeline with VLM

Combine VLMs with separate STT and TTS for full control:
from vision_agents.plugins import huggingface, getstream, deepgram, elevenlabs

# Inside create_agent:
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You're a visual assistant.",
    llm=huggingface.VLM(
        model="Qwen/Qwen2-VL-7B-Instruct",
        fps=1,
        frame_buffer_seconds=10,
    ),
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
)
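The data flow of that pipeline can be summarized with stand-in functions: audio goes through STT to text, the VLM reasons over the text plus buffered frames, and TTS turns the reply back into audio. Everything below is a synchronous toy sketch; the real plugins stream asynchronously and these function bodies are placeholders.

```python
# Illustrative data flow for the STT -> VLM -> TTS pipeline above.

def stt(audio: bytes) -> str:
    # Deepgram's role: speech to text (canned transcript for illustration).
    return "what is on the table?"


def vlm(prompt: str, frames: list) -> str:
    # The VLM's role: reason over the transcript plus buffered frames.
    return f"Answering '{prompt}' using {len(frames)} buffered frames."


def tts(text: str) -> bytes:
    # ElevenLabs' role: text back to audio (bytes stand in for audio here).
    return text.encode()


def handle_turn(audio: bytes, frames: list) -> bytes:
    # One user turn: transcribe, reason over text + video, speak the reply.
    return tts(vlm(stt(audio), frames))


reply = handle_turn(b"\x00", ["frame-0", "frame-1"])
print(reply.decode())
```

Owning each stage separately is what "full control" buys you: any of the three functions can be swapped for a different provider without touching the other two.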

What’s Next

- Video Processors: build custom detection and analysis pipelines
- Production Deployment: deploy with Docker, Kubernetes, and monitoring
- Examples