In addition to voice agents, which we discussed in the previous section, developers can also build fast, realtime video AI applications using Vision Agents. Video AI agents on Stream can be configured in two ways:
  • WebRTC: Natively stream realtime video at full FPS to LLMs over WebRTC, with no interval sampling or image snapshots required
  • Interval-based processing: A Video Processor intercepts video frames at a set interval, runs them through custom ML models, and then forwards the results to LLMs for further processing.
Like voice agents, the Agent class automatically handles much of this logic for you under the hood. Both Gemini Live and OpenAI Realtime support native WebRTC video by default, while LLMs configured with dedicated STT, TTS, and Processors will automatically forward video frames. Video AI agents are a great fit for applications in real-time coaching, manufacturing, healthcare, retail, virtual avatars, and more.
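At a glance, the two modes differ only in how the Agent is configured. The sketch below previews both setups; every class and plugin name in it is taken from the full examples built step by step in the rest of this guide.
# Native WebRTC video: the realtime model receives the raw video track
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="My happy AI friend", id="agent"),
    instructions="...",  # see the full example below
    llm=openai.Realtime(),
)

# Interval-based processing: a standard LLM with dedicated STT, TTS, and a processor
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="My happy AI friend", id="agent"),
    instructions="...",  # see the full example below
    llm=openai.LLM(),
    stt=deepgram.STT(),
    tts=cartesia.TTS(),
    processors=[YOLOPoseProcessor()],
)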

Building with OpenAI Realtime over WebRTC

Let’s get started by adding the dependencies required for our project. This example assumes a fresh Python project running Python 3.12 or newer. Throughout the guide, we use uv as our package manager of choice.
# Initialize a project in your working directory
uv init

uv add "vision-agents[getstream, openai]"
Next, in our main.py file, we can start by importing the packages required for our project:
import logging

from dotenv import load_dotenv

from vision_agents.core import User, Agent, cli
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import getstream, openai

logger = logging.getLogger(__name__)

load_dotenv()
This sets up basic logging and loads the .env variables required for our sample. Since we are using OpenAI in this example, you will need the following in your .env:
# Stream API credentials
STREAM_API_KEY=
STREAM_API_SECRET=

# OpenAI
OPENAI_API_KEY=
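If you want to double-check that these variables are actually being picked up, a small throwaway snippet like the one below works. It is optional and not part of the sample.
import os

from dotenv import load_dotenv

load_dotenv()

# Confirm the required keys are present before starting the agent
for key in ("STREAM_API_KEY", "STREAM_API_SECRET", "OPENAI_API_KEY"):
    assert os.environ.get(key), f"{key} is missing from your .env"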
Next, let’s set up the Agent with some basic instructions, configure our edge layer, and instantiate the LLM we are using:
async def create_agent(**kwargs) -> Agent:
    """Create the agent with OpenAI Realtime."""
    # create an agent to run with Stream's edge, OpenAI llm
    agent = Agent(
        edge=getstream.Edge(),  # low latency edge. clients for React, iOS, Android, RN, Flutter etc.
        agent_user=User(name="My happy AI friend", id="agent"),  # the user object for the agent (name, image etc)
        instructions="You're a video AI assistant. Keep responses short and conversational. Don't use special characters or formatting. Be friendly and helpful. Your main job is to describe the world you see to the user. Make it fun!",
        llm=openai.Realtime(),
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    """Join the call and start the agent."""
    # Ensure the agent user is created
    await agent.create_user()
    # Create a call
    call = await agent.create_call(call_type, call_id)

    logger.info("🤖 Starting OpenAI Realtime Agent...")

    # Have the agent join the call/room
    with await agent.join(call):
        logger.info("Joining call")
        # Open simple demo UI for quickly testing your agent
        await agent.edge.open_demo(call)
        logger.info("LLM ready")

        await agent.llm.simple_response("Tell me what you see in the frame")
        
        # run till the call ends
        await agent.finish()


if __name__ == "__main__":
    cli(AgentLauncher(create_agent=create_agent, join_call=join_call))
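With main.py in place, you can start the agent from your project directory. The cli helper may expose additional options, but at its simplest you just run the script with uv:
# Start the agent; the join_call hook above opens the demo UI once the agent joins
uv run main.py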
Rather than passing in instructions directly in the agent creation step, you can also use @mention syntax in the instructions string, like so:
instructions="Read @voice-agent-instructions.md"
Since we are using OpenAI directly over WebRTC, we automatically benefit from OpenAI’s voices, turn detection, and more. Under the hood, the raw WebRTC tracks go directly to OpenAI, with no interval sampling or intermediate steps. The result is an LLM that can see and hear the world around you and respond with minimal delay. This approach is a great fit for games and applications where no advanced image processing is needed before the model. In the next section, we will look at building an advanced video AI pipeline that performs interval processing before the model.

Building a custom Video AI pipeline

A powerful component of the Vision Agents SDK is the ability to connect realtime video to any external computer vision model or provider through our processor pipeline. Processors are special classes that give developers direct access to the raw frames. In this section, we will build an advanced video AI pipeline capable of detecting the user’s poses. For our processor, we will use the out-of-the-box integration with Ultralytics’ YOLO Pose Detection; however, as we cover in more detail in the Processors section, the same approach works with any AI solution capable of processing images. To get started, let’s make a few modifications to our original sample:
from vision_agents.core.processors import YOLOPoseProcessor # Add import for YOLO
from vision_agents.plugins import deepgram, cartesia  # Add import for deepgram and cartesia plugins

async def create_agent(**kwargs) -> Agent:
    """Create the agent with OpenAI LLM and YOLO Pose Processor."""
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="My happy AI friend", id="agent"),  # the user object for the agent (name, image etc)
        instructions="You're a video AI assistant built to detect poses. Keep responses short and conversational. Don't use special characters or formatting. Be friendly and helpful. Make it fun!",
        llm=openai.LLM(),
        stt=deepgram.STT(),
        tts=cartesia.TTS(),
        processors=[
            YOLOPoseProcessor()
        ],  # processors can fetch extra data, check images/audio data or transform video
    )
    return agent
In the above snippet, we made a few changes to the code:
  1. Instead of using the OpenAI Realtime model, we now use the standard openai.LLM
  2. STT and TTS are broken out to use Deepgram and Cartesia directly
  3. We pass in YOLOPoseProcessor to the processors list on the Agent (the sketch after this list shows roughly the kind of inference it runs on each frame).
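To make the processor step less abstract, here is what pose detection on a single frame looks like when calling Ultralytics directly, outside of Vision Agents. This is not the Vision Agents processor API, just a standalone sketch of the kind of inference YOLOPoseProcessor performs on each intercepted frame before the results reach the LLM; the weights file and frame path are illustrative.
from ultralytics import YOLO

# Load a pretrained pose-estimation checkpoint (file name is illustrative)
model = YOLO("yolov8n-pose.pt")

# Run inference on a single frame; a file path, URL, or numpy array all work
results = model("frame.jpg")

for result in results:
    # Each result exposes the detected people and their body keypoints
    print(result.keypoints)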
Don’t forget to also update your .env with API keys from Deepgram and Cartesia. Both offer free developer tiers, and the keys are available on their respective dashboards.
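Concretely, your .env grows by two entries. The variable names below are the conventional ones for each SDK; double-check them against the plugin docs if the agent cannot find your keys.
# Deepgram
DEEPGRAM_API_KEY=

# Cartesia
CARTESIA_API_KEY=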
In this example, we use only one processor; however, you can pass in multiple processors and chain them together. Processors are also not limited to video: they can process audio too, allowing you to manipulate the user’s audio as well. For more on Processors, LLMs, and Realtime, check out the other guides in our docs. Building something with Vision Agents? Tell us about it; we love seeing (and sharing) projects from the community.