Using Vision Agents, developers can build voice agents in one of two modes. The first uses our out-of-the-box support for OpenAI Realtime or Gemini Live; the second uses a more traditional Speech-to-Text (STT) -> LLM -> Text-to-Speech (TTS) pipeline. In this guide, we show examples of both so you can pick the option that best fits your use case. We recommend the real-time OpenAI and Gemini models for fast, low-latency agents. If you want full control over your voice pipeline, for example to use a different LLM such as Grok or Anthropic, consider the second approach. Both approaches follow our philosophy of thin wrapping: if the Agent does not expose something directly, the underlying client can either be passed in or accessed directly.

Building with Real-Time OpenAI and Gemini Models

Both OpenAI and Gemini support voice agents directly at the model layer. This means developers are not required to manually pass in text-to-speech, speech-to-text, or voice activity/turn-taking models to the agent; the model has built-in support for these. Let’s build a simple voice agent using the Gemini Live model to get started. For this, we will need to install the following in a new Python 3.12+ project:
uv init

uv add "vision-agents[getstream, gemini]" python-dotenv
Next, in our main.py file, we can start by importing the packages required for our project:
import logging

from dotenv import load_dotenv

from vision_agents.core import User, Agent, cli
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import getstream, gemini

logger = logging.getLogger(__name__)

load_dotenv()
This sets up some basic logging and loads in the .env variables required for our sample. Since we are running the Gemini model in this example, we will need to have the following in our .env file:
# Stream API credentials
STREAM_API_KEY=
STREAM_API_SECRET=

# Gemini
GOOGLE_API_KEY=
Both Stream and Google offer free API keys. For Gemini, developers can get a free API key in Google's AI Studio, while Stream developers can find theirs in the Stream Dashboard.
Next, we can define our create_agent function to set up the Agent with basic instructions, the edge layer, and the user our agent will join the call as:
async def create_agent(**kwargs) -> Agent:
    """Create the agent with Gemini Realtime."""
    # Initialize Gemini with realtime capabilities
    llm = gemini.Realtime()

    # create an agent to run with Stream's edge, Gemini llm
    agent = Agent(
        edge=getstream.Edge(),  # low latency edge. clients for React, iOS, Android, RN, Flutter etc.
        agent_user=User(name="My happy AI friend", id="agent"),  # the user object for the agent (name, image etc)
        instructions="You're a voice AI assistant. Keep responses short and conversational. Don't use special characters or formatting. Be friendly and helpful.",
        llm=llm,
    )
    return agent
The Agent allows us to interact with the Gemini model in two ways (both are sketched just after this list):
  1. Using simple_response, a convenience method for quickly sending some text to the model without changing any additional parameters.
  2. Using send_realtime_input, the native Gemini Realtime Input method which allows us to interact with the model directly.
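As a quick sketch, both calls could sit inside join_call like this. simple_response is the standardized helper used later in this guide; the send_realtime_input parameters below are an assumption based on Google's Live API, so check the Gemini plugin reference for the exact signature:
# Standardized helper: works the same across Realtime providers
await agent.llm.simple_response("Greet the user and ask about their day.")

# Native Gemini Realtime input (assumed keyword argument, mirroring Google's Live API)
await agent.llm.send_realtime_input(text="Greet the user and ask about their day.")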
Rather than passing in instructions directly in the agent creation step, we can also use @mention syntax in the instructions string, like so:
instructions="Read @voice-agent-instructions.md"
We also need to define a join_call function that handles joining the call and starting the agent:
async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    """Join the call and start the agent."""
    # Ensure the agent user is created
    await agent.create_user()
    # Create a call
    call = await agent.create_call(call_type, call_id)

    logger.info("🤖 Starting Gemini Realtime Agent...")

    # Have the agent join the call/room
    with await agent.join(call):
        logger.info("Joining call")

        await agent.edge.open_demo(call)
        logger.info("LLM ready")

        # Example 1: standardized simple response
        await agent.llm.simple_response("chat with the user about the weather.")
        
        # run till the call ends
        await agent.finish()


if __name__ == "__main__":
    cli(AgentLauncher(create_agent=create_agent, join_call=join_call))
To run our example, call uv run main.py, which kicks off the agent and automatically opens the Stream Video demo app as the UI 🎉.

Custom voice agent pipelines

For advanced voice pipelines, such as using a different LLM provider, custom voices, or VADs, the Agent framework also allows you to override these properties directly. Unlike the previous approach, which relies on the OpenAI Realtime WebRTC connection or the Gemini Live API to handle everything at the model layer, this method breaks the pipeline into its individual parts and connects them together internally within the Agent class. For example, we could use OpenAI's GPT-5 as the underlying model but customise the responses by creating a custom voice with Cartesia. In this case, we only need a few small changes to our earlier example. Before starting, we update our dependencies to include the openai, cartesia, and deepgram plugins:
uv add "vision-agents[openai, cartesia, deepgram]"
In our imports, let’s remove the gemini plugin and replace it with openai. We will also add the cartesia and deepgram packages since we will be using their TTS and STT services respectively.
from vision_agents.plugins import getstream, openai, cartesia, deepgram
Next, we need to update our .env with the API keys for OpenAI, Cartesia, and Deepgram. Each of these services lets developers create a free API key on its website, with generous limits.
# Deepgram API credentials
DEEPGRAM_API_KEY=
# Cartesia API credentials
CARTESIA_API_KEY=
# OpenAI API credentials
OPENAI_API_KEY=
Finally, in our create_agent function we can change the configuration of our Agent class. We can change the LLM in use and pass in the clients for TTS and STT:
async def create_agent(**kwargs) -> Agent:
    """Create the agent with OpenAI LLM."""
    # Create the OpenAI LLM object
    llm = openai.LLM()

    # create an agent to run with Stream's edge, openAI llm
    agent = Agent(
        edge=getstream.Edge(),  # low latency edge. clients for React, iOS, Android, RN, Flutter etc.
        agent_user=User(name="My happy AI friend", id="agent"),  # the user object for the agent (name, image etc)
        instructions="You're a voice AI assistant. Keep responses short and conversational. Don't use special characters or formatting. Be friendly and helpful.",
        llm=llm,
        tts=cartesia.TTS(),
        stt=deepgram.STT()
    )
    return agent
In our join_call function, we can also add OpenAI’s create_response method directly for advanced requests:
async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    """Join the call and start the agent."""
    # Ensure the agent user is created
    await agent.create_user()
    # Create a call
    call = await agent.create_call(call_type, call_id)

    logger.info("🤖 Starting OpenAI Agent...")

    # Have the agent join the call/room
    with await agent.join(call):
        logger.info("Joining call")

        await agent.edge.open_demo(call)
        logger.info("LLM ready")

        # Example: use native openAI create response
        await agent.llm.create_response(input=[
            {
                "role": "user",
                "content": [
                    {"type": "input_text", "text": "Tell me a short poem about this image"},
                    {"type": "input_image", "image_url": f"https://images.unsplash.com/photo-1757495361144-0c2bfba62b9e?q=80&w=2340&auto=format&fit=crop&ixlib=rb-4.1.0&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"},
                ],
            }
        ],)

        # run till the call ends
        await agent.finish()
Running uv run main.py again brings our agent back to life behind the familiar Stream demo screen.

Advanced

Both the Realtime and traditional LLM modes support conversation, memory, and function calling out of the box. By default, the Agent writes STT and LLM responses to Stream's real-time Chat API, which is linked to the call ID. For function calling and MCP, the framework handles the wiring: functions annotated with @llm.register_function are automatically picked up and transformed into the right format for the LLM:
@llm.register_function(description="Get current weather for a location")
def get_weather(location: str):
    """Get the current weather for a location."""
    return {
        "location": location,
        "temperature": "22°C",
        "condition": "Sunny",
        "humidity": "65%"
    }
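Since the decorator is called on the llm object, a natural place to register functions is inside create_agent, after the LLM is constructed and before it is handed to the Agent. A minimal sketch, reusing the OpenAI pipeline from the previous section:
async def create_agent(**kwargs) -> Agent:
    llm = openai.LLM()

    # Register the tool on the llm instance so the model can call it
    @llm.register_function(description="Get current weather for a location")
    def get_weather(location: str):
        return {"location": location, "temperature": "22°C", "condition": "Sunny"}

    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="My happy AI friend", id="agent"),
        instructions="You're a voice AI assistant. Keep responses short and conversational.",
        llm=llm,
        tts=cartesia.TTS(),
        stt=deepgram.STT(),
    )
    return agent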
MCP servers (more info here) can be passed directly to the Agent class as a list:
# Create GitHub MCP server
github_server = MCPServerRemote(
    url="https://api.githubcopilot.com/mcp/",
    headers={"Authorization": f"Bearer {github_pat}"},
    timeout=10.0,  # Shorter connection timeout
    session_timeout=300.0
)

agent = Agent(
    edge=edge,
    llm=llm,
    agent_user=agent_user,
    instructions="You are a helpful AI assistant with access to GitHub via MCP server. You can help with GitHub operations like creating issues, managing pull requests, searching repositories, and more. Keep responses conversational and helpful.",
    mcp_servers=[github_server],
)
For more on these topics, check out our guides on MCP and Function Calling, Chat and Memory, and Processors. Building with Vision Agents? Share it with us; we're always keen to see (and share) projects from the community.