Google’s Gemini provides realtime multimodal capabilities over WebSocket. Using Vision Agents with Gemini, developers can quickly add audio and video directly to their apps and receive responses in real time. The plugin includes built-in tools for search, code execution, and RAG, and supports both LLM and Realtime models.
Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.

Installation

uv add "vision-agents[gemini]"
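The plugin reads the Gemini key from the GOOGLE_API_KEY environment variable when api_key is not passed explicitly. A quick stdlib-only sanity check before starting an agent:

```python
import os

# Gemini auth: the plugin falls back to the GOOGLE_API_KEY env var
# when api_key is not passed explicitly.
configured = bool(os.environ.get("GOOGLE_API_KEY"))
print("GOOGLE_API_KEY configured:", configured)
```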

Realtime

Native speech-to-speech with optional video over WebSocket.
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a helpful assistant.",
    llm=gemini.Realtime(fps=3),  # Video frames sent to model
)
| Name | Type | Default | Description |
|------|------|---------|-------------|
| model | str | "gemini-2.5-flash" | Gemini model |
| fps | int | 1 | Video frames per second |
| api_key | str | None | API key (defaults to GOOGLE_API_KEY env var) |

LLM

Standard chat completions. Requires separate STT/TTS.
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream, deepgram, elevenlabs

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a helpful assistant.",
    llm=gemini.LLM("gemini-2.5-flash"),
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
)

Built-in Tools

Gemini provides built-in tools you can enable:
llm = gemini.LLM(
    model="gemini-2.5-flash",
    tools=[
        gemini.tools.GoogleSearch(),
        gemini.tools.CodeExecution(),
        gemini.tools.FileSearch(store),  # RAG
        gemini.tools.URLContext(),
    ]
)
| Tool | Description |
|------|-------------|
| GoogleSearch | Ground responses with web data |
| CodeExecution | Run Python code |
| FileSearch | RAG over your documents |
| URLContext | Read specific web pages |

File Search (RAG)

Managed RAG with automatic chunking and retrieval:
from vision_agents.plugins import gemini

store = gemini.GeminiFilesearchRAG(name="my-knowledge-base")
await store.create()
await store.add_directory("./knowledge")

llm = gemini.LLM(
    model="gemini-2.5-flash",
    tools=[gemini.tools.FileSearch(store)]
)
See the RAG guide for more details.

Function Calling

@agent.llm.register_function(description="Get weather for a location")
async def get_weather(location: str) -> dict:
    return {"temperature": "22°C", "condition": "Sunny"}
See the Function Calling guide for details.
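Conceptually, function calling works by the model returning a registered function's name plus JSON-encoded arguments, which the agent executes locally and whose result is sent back to the model. A minimal stdlib sketch of that dispatch loop (the registry and tool-call shape here are illustrative, not the plugin's actual implementation):

```python
import asyncio
import json

# Illustrative registry mapping tool names to local async handlers.
registry = {}

def register_function(description: str):
    def wrap(fn):
        registry[fn.__name__] = {"fn": fn, "description": description}
        return fn
    return wrap

@register_function(description="Get weather for a location")
async def get_weather(location: str) -> dict:
    return {"temperature": "22°C", "condition": "Sunny"}

async def dispatch(tool_call: dict) -> str:
    # The model emits a function name and JSON arguments; look up the
    # handler, run it, and serialize the result for the model.
    entry = registry[tool_call["name"]]
    args = json.loads(tool_call["arguments"])
    result = await entry["fn"](**args)
    return json.dumps(result)

result = asyncio.run(
    dispatch({"name": "get_weather", "arguments": '{"location": "Paris"}'})
)
```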

Events

The Gemini plugin emits events for connection state and responses. Most developers should use the core events (LLMResponseCompletedEvent, etc.) for provider-agnostic code.
from vision_agents.plugins.gemini.events import (
    GeminiConnectedEvent,
    GeminiErrorEvent,
)

@agent.events.subscribe
async def on_gemini_connected(event: GeminiConnectedEvent):
    print(f"Connected to Gemini model: {event.model}")

@agent.events.subscribe
async def on_gemini_error(event: GeminiErrorEvent):
    print(f"Gemini error: {event.error}")
| Event | Description |
|-------|-------------|
| GeminiConnectedEvent | Realtime connection established |
| GeminiErrorEvent | Error occurred |
| GeminiAudioEvent | Audio output received |
| GeminiTextEvent | Text output received |
| GeminiResponseEvent | Response chunk received |

Next Steps