Google’s Gemini provides realtime multimodal capabilities over WebSocket. Using Vision Agents with Gemini, developers can quickly add audio and video directly to their apps and receive responses in real time. The plugin includes built-in tools for search, code execution, and RAG, and supports both LLM and Realtime models.
Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.

Installation

uv add "vision-agents[gemini]"
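The plugin reads the Gemini key from the GOOGLE_API_KEY environment variable when api_key is not passed explicitly. A quick stdlib-only sanity check before starting an agent:

```python
import os

# Gemini auth: the plugin falls back to the GOOGLE_API_KEY env var
# when api_key is not passed explicitly.
configured = bool(os.environ.get("GOOGLE_API_KEY"))
print("GOOGLE_API_KEY configured:", configured)
```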

Realtime

Native speech-to-speech with optional video over WebSocket.
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a helpful assistant.",
    llm=gemini.Realtime(fps=3),  # Video frames sent to model
)
| Name | Type | Default | Description |
|------|------|---------|-------------|
| model | str | "gemini-2.5-flash" | Gemini model |
| fps | int | 1 | Video frames per second |
| api_key | str | None | API key (defaults to GOOGLE_API_KEY env var) |

LLM

Standard chat completions. Requires separate STT/TTS.
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream, deepgram, elevenlabs

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a helpful assistant.",
    llm=gemini.LLM("gemini-2.5-flash"),
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
)

Built-in Tools

Gemini provides built-in tools you can enable:
llm = gemini.LLM(
    model="gemini-2.5-flash",
    tools=[
        gemini.tools.GoogleSearch(),
        gemini.tools.CodeExecution(),
        gemini.tools.FileSearch(store),  # RAG
        gemini.tools.URLContext(),
    ]
)
| Tool | Description |
|------|-------------|
| GoogleSearch | Ground responses with web data |
| CodeExecution | Run Python code |
| FileSearch | RAG over your documents |
| URLContext | Read specific web pages |

File Search (RAG)

Managed RAG with automatic chunking and retrieval:
from vision_agents.plugins import gemini

store = gemini.GeminiFilesearchRAG(name="my-knowledge-base")
await store.create()
await store.add_directory("./knowledge")

llm = gemini.LLM(
    model="gemini-2.5-flash",
    tools=[gemini.tools.FileSearch(store)]
)
See the RAG guide for more details.

Function Calling

@agent.llm.register_function(description="Get weather for a location")
async def get_weather(location: str) -> dict:
    return {"temperature": "22°C", "condition": "Sunny"}
See the Function Calling guide for details.
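Conceptually, function calling works by the model returning a registered function's name plus JSON-encoded arguments, which the agent executes locally and whose result is sent back to the model. A minimal stdlib sketch of that dispatch loop (the registry and tool-call shape here are illustrative, not the plugin's actual implementation):

```python
import asyncio
import json

# Illustrative registry mapping tool names to local async handlers.
registry = {}

def register_function(description: str):
    def wrap(fn):
        registry[fn.__name__] = {"fn": fn, "description": description}
        return fn
    return wrap

@register_function(description="Get weather for a location")
async def get_weather(location: str) -> dict:
    return {"temperature": "22°C", "condition": "Sunny"}

async def dispatch(tool_call: dict) -> str:
    # The model emits a function name and JSON arguments; look up the
    # handler, run it, and serialize the result for the model.
    entry = registry[tool_call["name"]]
    args = json.loads(tool_call["arguments"])
    result = await entry["fn"](**args)
    return json.dumps(result)

result = asyncio.run(
    dispatch({"name": "get_weather", "arguments": '{"location": "Paris"}'})
)
```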

Events

The Gemini plugin emits events for connection state and responses. Most developers should use the core events (LLMResponseCompletedEvent, etc.) for provider-agnostic code.
from vision_agents.plugins.gemini.events import (
    GeminiConnectedEvent,
    GeminiErrorEvent,
)

@agent.events.subscribe
async def on_gemini_connected(event: GeminiConnectedEvent):
    print(f"Connected to Gemini model: {event.model}")

@agent.events.subscribe
async def on_gemini_error(event: GeminiErrorEvent):
    print(f"Gemini error: {event.error}")
| Event | Description |
|-------|-------------|
| GeminiConnectedEvent | Realtime connection established |
| GeminiErrorEvent | Error occurred |
| GeminiAudioEvent | Audio output received |
| GeminiTextEvent | Text output received |
| GeminiResponseEvent | Response chunk received |

Next Steps