OpenAI Realtime is a low-latency API that combines real-time video analysis, transcription, text-to-speech synthesis and more into a single streamlined pipeline. The OpenAI plugin in the Vision Agents SDK provides four integrations:
  1. OpenAI Realtime - A native integration for realtime video and audio with out-of-the-box support for OpenAI’s realtime models. Stream both video and audio to OpenAI over WebRTC and receive responses in real time. Supports MCP and function calling.
  2. OpenAI LLM - Access OpenAI’s language models (like gpt-4o) with full support for the Responses API. Includes conversation history management, tool calling, and streaming responses. Works with separate STT/TTS components for voice interactions.
  3. OpenAI Chat Completions - A flexible LLM integration that works with any OpenAI-compatible API (including OSS models). Ideal for using custom models, fine-tuned models, or third-party providers that implement the OpenAI Chat Completions API.
  4. OpenAI TTS - A text-to-speech implementation using OpenAI’s TTS models with streaming support for low-latency audio synthesis.
These integrations are ideal for building conversational agents, AI avatars, fitness coaches, visual accessibility assistants, remote support tools with visual guidance, interactive tutors, and much more!

Installation

Install the Stream OpenAI plugin with:
uv add "vision-agents[openai]"

Tutorials

The Voice AI quickstart and Video AI quickstart pages have examples to get you up and running.

Example

Check out our OpenAI example to see a practical implementation of the plugin and get inspiration for your own projects, or read on for some key details.

OpenAI Realtime

The OpenAI Realtime plugin is used as the LLM component of an Agent for real-time speech-to-speech interactions.

Usage

Here’s a complete example:
from vision_agents.plugins import openai, getstream
from vision_agents.core.agents import Agent
from getstream import AsyncStream

# Create Stream client and user
client = AsyncStream()
agent_user = await client.create_user(name="AI Assistant")

# Create agent with OpenAI Realtime
agent = Agent(
    edge=getstream.Edge(),
    agent_user=agent_user,
    instructions="You are a helpful voice assistant.",
    llm=openai.Realtime(model="gpt-realtime", voice="marin", fps=1),
    processors=[]
)

# Create and join a call
call_id = "openai-realtime-example"  # any unique call ID
call = client.video.call("default", call_id)
await call.get_or_create(data={"created_by_id": agent.agent_user.id})

with await agent.join(call):
    # Prompt the model to greet the user
    await agent.llm.simple_response(text="Please greet the user.")
    # Keep running until call ends
    await agent.finish()

Parameters

Name  | Type | Default        | Description
model | str  | "gpt-realtime" | The OpenAI model to use for speech-to-speech. Supports real-time models only.
voice | str  | "marin"        | The voice to use for spoken responses (e.g., “marin”, “alloy”, “echo”).
fps   | int  | 1              | Number of video frames per second to send (for video-enabled agents).
The API key is read from the OPENAI_API_KEY environment variable. Instructions are set via the Agent’s instructions parameter.
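
If you rely purely on the environment variable rather than passing a key explicitly, a minimal sketch for failing fast when the key is missing:
import os

# Optional sanity check: openai.Realtime reads OPENAI_API_KEY from the environment,
# so verify it is set before constructing the agent.
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("Set OPENAI_API_KEY before starting the agent")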

Methods

connect()

Establishes the WebRTC connection to OpenAI’s Realtime API. This is called automatically when the agent joins a call and should not be called directly in most cases.
await agent.llm.connect()

simple_response(text)

Sends a text message to the OpenAI Realtime session. The model will respond with audio output.
await agent.llm.simple_response(text="What do you see in the video?")

simple_audio_response(pcm_data)

Sends raw PCM audio data to OpenAI. Audio should be 48 kHz, 16-bit PCM format.
await agent.llm.simple_audio_response(pcm_data)
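
If you need a test payload, here is a minimal sketch that synthesizes one second of a 440 Hz tone. It assumes the method accepts raw little-endian 16-bit mono PCM bytes at 48 kHz; check the method signature for the exact payload type.
import math
import struct

# Assumption: simple_audio_response() accepts raw little-endian 16-bit mono PCM bytes.
SAMPLE_RATE = 48_000  # 48 kHz, as required above
samples = (
    int(32767 * 0.2 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE))
    for i in range(SAMPLE_RATE)  # one second of audio
)
pcm_data = b"".join(struct.pack("<h", s) for s in samples)

await agent.llm.simple_audio_response(pcm_data)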

request_session_info()

Requests session information from the OpenAI API.
await agent.llm.request_session_info()

Properties

output_track

The output_track property provides access to the audio output stream from OpenAI. This is an AudioStreamTrack that contains the synthesized speech responses.
audio_track = agent.llm.output_track
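
If you want to consume the synthesized audio yourself (for recording or analysis), a rough sketch might look like the following. It assumes the track exposes an aiortc-style recv() coroutine; check the SDK’s AudioStreamTrack for the exact interface.
# Sketch only: assumes an aiortc-style track interface (recv() returning an audio frame).
async def consume_output_audio():
    track = agent.llm.output_track
    while agent.llm.is_connected:  # see the is_connected property below
        frame = await track.recv()  # one chunk of synthesized speech
        ...  # record, analyze, or forward the frame elsewhere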

is_connected

Returns True if the realtime session is currently active.
if agent.llm.is_connected:
    print("Connected to OpenAI Realtime API")

Function calling

You can give the model the ability to call functions in your code while using the Realtime plugin via the main Agent class. Follow the instructions in the MCP tool calling guide, replacing the LLM with the OpenAI Realtime class.

Events

The OpenAI Realtime plugin emits various events during conversations that you can subscribe to. The plugin wraps OpenAI’s native events into a strongly-typed event system with better ergonomics.
from vision_agents.core.llm.events import (
    RealtimeConnectedEvent,
    RealtimeResponseEvent,
    RealtimeTranscriptEvent,
    RealtimeAudioOutputEvent,
    RealtimeErrorEvent
)

# Subscribe to events
@agent.llm.events.on(RealtimeConnectedEvent)
async def on_connected(event: RealtimeConnectedEvent):
    print(f"Connected! Session ID: {event.session_id}")
    print(f"Capabilities: {event.capabilities}")

@agent.llm.events.on(RealtimeTranscriptEvent)
async def on_transcript(event: RealtimeTranscriptEvent):
    print(f"Transcript: {event.text}")
    print(f"Role: {event.user_metadata.get('role')}")

@agent.llm.events.on(RealtimeResponseEvent)
async def on_response(event: RealtimeResponseEvent):
    print(f"Response: {event.text}")
    print(f"Complete: {event.is_complete}")

@agent.llm.events.on(RealtimeAudioOutputEvent)
async def on_audio_output(event: RealtimeAudioOutputEvent):
    # Handle audio output
    audio_data = event.audio_data
    sample_rate = event.sample_rate

OpenAI LLM

An alternative way to interact with OpenAI is through the conventional LLM pattern. This is useful when your model does not support connecting directly over WebRTC, or when you’re using a custom or fine-tuned model.

Usage

The plugin exposes an LLM class that can be passed directly to the Agent. In this mode you must also supply STT and TTS plugins: the user’s audio is first transcribed to text and sent to the model, and the model’s response is then synthesized back into audio.

Parameters

Name     | Type                  | Default | Description
model    | str                   | -       | The OpenAI model to use. See OpenAI models for available options.
api_key  | Optional[str]         | None    | OpenAI API key. If not provided, reads from OPENAI_API_KEY environment variable.
base_url | Optional[str]         | None    | Custom base URL for OpenAI API. Useful for proxies or OpenAI-compatible endpoints.
client   | Optional[AsyncOpenAI] | None    | Custom AsyncOpenAI client instance. If not provided, a new client is created with the provided or environment API key.
Here’s a complete example:
from vision_agents.plugins import openai, getstream, deepgram
from vision_agents.core import User
from vision_agents.core.agents import Agent
from getstream import AsyncStream

# Create Stream client and user
client = AsyncStream()

# Create agent with OpenAI LLM (Deepgram STT + OpenAI TTS)
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="My happy AI friend", id="agent"),
    instructions="You are a helpful assistant.",
    stt=deepgram.STT(),
    llm=openai.LLM(), 
    tts=openai.TTS(model="gpt-4o-mini-tts", voice="alloy"),
)

# Create and join a call
call_id = "openai-llm-example"  # any unique call ID
call = client.video.call("default", call_id)
await call.get_or_create(data={"created_by_id": agent.agent_user.id})

with await agent.join(call):
    # Use agent.say() to synthesize speech
    await agent.say("Hello! How can I help you today?")
    await agent.finish()

OpenAI Chat Completions

The OpenAI Chat Completions plugin provides a flexible way to use any model that implements the OpenAI Chat Completions API. This includes OpenAI models, custom fine-tuned models, and third-party providers that offer OpenAI-compatible endpoints. This is ideal for:
  • Using OSS models hosted on services like Together AI, Fireworks, or Replicate
  • Deploying custom fine-tuned models
  • Testing different model providers without changing your code
  • Using models that don’t have native realtime support

Usage

from vision_agents.plugins import openai
from vision_agents.core import Agent, User
from vision_agents.plugins import getstream, deepgram, elevenlabs

# Use with any OpenAI-compatible endpoint
llm = openai.ChatCompletionsLLM(
    model="deepseek-ai/DeepSeek-V3.1",
    base_url="https://api.together.xyz/v1",
    api_key="your_together_api_key"
)

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="AI Assistant", id="agent"),
    instructions="You are a helpful assistant.",
    llm=llm,
    stt=deepgram.STT(),
    tts=elevenlabs.TTS()
)

Parameters

Name     | Type                  | Default | Description
model    | str                   | -       | The model identifier to use (e.g., "gpt-4o", "deepseek-ai/DeepSeek-V3.1").
api_key  | Optional[str]         | None    | API key for the service. If not provided, reads from OPENAI_API_KEY environment variable.
base_url | Optional[str]         | None    | Custom base URL for the API endpoint. Required for non-OpenAI providers.
client   | Optional[AsyncOpenAI] | None    | Custom AsyncOpenAI client instance. If not provided, a new client is created with the provided or environment API key.

Features

  • Streaming responses: Real-time text generation with chunk events
  • Conversation memory: Automatic conversation history management (see the sketch after this list)
  • Event-driven: Emits LLM events for integration with other components
  • Provider flexibility: Works with any OpenAI-compatible API
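
As a quick illustration of the conversation memory noted above, here is a sketch that uses only the simple_response() method documented below; follow-up turns can refer back to earlier ones because the plugin keeps the history for you.
# Sketch: the plugin stores each exchange, so later turns can reference earlier ones.
first = await llm.simple_response("My name is Ada.")
follow_up = await llm.simple_response("What is my name?")
print(follow_up.text)  # the model answers from the stored history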

Methods

simple_response(text, processors, participant)

Generate a response to text input:
response = await llm.simple_response("Hello, how are you?")
print(response.text)

Example with OSS Models

from vision_agents.plugins import openai

# Use DeepSeek
llm = openai.ChatCompletionsLLM(
    model="deepseek-chat",
    base_url="https://api.deepseek.com",
    api_key="your_api_key"
)

# Use Qwen via Baseten
llm = openai.ChatCompletionsVLM(
    model="qwen3vl",
    base_url="https://model-vq0nkx7w.api.baseten.co/development/sync/v1",
    api_key="your_api_key"
)

OpenAI TTS

The OpenAI TTS plugin provides text-to-speech synthesis using OpenAI’s TTS models. It supports streaming audio output for low-latency speech generation.

Usage

Use the OpenAI TTS plugin as the tts parameter when creating an Agent:
from vision_agents.plugins import openai, getstream
from vision_agents.core.agents import Agent
from getstream import AsyncStream

# Create Stream client and user
client = AsyncStream()
agent_user = await client.create_user(name="AI Assistant")

# Create agent with OpenAI TTS
agent = Agent(
    edge=getstream.Edge(),
    agent_user=agent_user,
    instructions="You are a helpful assistant.",
    llm=openai.LLM(),  # any LLM that outputs text works here
    tts=openai.TTS(model="gpt-4o-mini-tts", voice="alloy"),
)

# Create and join a call
call_id = "openai-tts-example"  # any unique call ID
call = client.video.call("default", call_id)
await call.get_or_create(data={"created_by_id": agent.agent_user.id})

with await agent.join(call):
    # Use agent.say() to synthesize speech
    await agent.say("Hello! How can I help you today?")
    await agent.finish()

Parameters

Name    | Type                  | Default           | Description
api_key | Optional[str]         | None              | OpenAI API key. If not provided, reads from OPENAI_API_KEY environment variable.
model   | str                   | "gpt-4o-mini-tts" | The OpenAI TTS model to use. See OpenAI TTS docs for options.
voice   | str                   | "alloy"           | The voice to use for synthesis. Options include “alloy”, “echo”, “fable”, “onyx”, “nova”, and “shimmer”.
client  | Optional[AsyncOpenAI] | None              | Custom AsyncOpenAI client instance. If not provided, a new client is created with the provided or environment API key.

Methods

stream_audio(text)

Synthesizes speech from text and returns PCM audio data. This method is called internally by the Agent when using agent.say().
pcm_data = await tts.stream_audio("Hello, world!")

stop_audio()

Stops any ongoing audio synthesis. For OpenAI TTS, this is a no-op as the agent manages the output track.
await tts.stop_audio()

Audio format

OpenAI TTS returns audio in the following format:
  • Sample rate: 24,000 Hz
  • Channels: 1 (mono)
  • Format: 16-bit signed PCM
The SDK automatically handles resampling and format conversion as needed for your Stream call.
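
For debugging, you can dump the PCM returned by stream_audio() to a WAV file with these parameters. A minimal sketch, assuming the returned PCM is raw bytes:
import wave

pcm_data = await tts.stream_audio("Hello, world!")

# Assumption: pcm_data is raw bytes in the format listed above (24 kHz, mono, 16-bit PCM).
with wave.open("tts_output.wav", "wb") as wav:
    wav.setnchannels(1)        # mono
    wav.setsampwidth(2)        # 16-bit samples
    wav.setframerate(24_000)   # 24 kHz
    wav.writeframes(pcm_data)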

Events

The OpenAI TTS plugin emits standard TTS events that you can subscribe to:
from vision_agents.core.tts.events import (
    TTSAudioEvent,
    TTSSynthesisStartedEvent,
    TTSSynthesisCompletedEvent
)

@agent.tts.events.on(TTSSynthesisStartedEvent)
async def on_synthesis_started(event: TTSSynthesisStartedEvent):
    print(f"Starting synthesis: {event.text}")

@agent.tts.events.on(TTSAudioEvent)
async def on_audio_chunk(event: TTSAudioEvent):
    print(f"Received audio chunk: {len(event.audio_data)} bytes")

@agent.tts.events.on(TTSSynthesisCompletedEvent)
async def on_synthesis_completed(event: TTSSynthesisCompletedEvent):
    print("Synthesis completed")