ElevenLabs provides highly realistic, expressive voices for text-to-speech, plus real-time speech-to-text via Scribe. It supports multiple languages and voice styles.
Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.

Installation

uv add "vision-agents[elevenlabs]"

Quick Start

from vision_agents.core import Agent, User
from vision_agents.plugins import elevenlabs, gemini, deepgram, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a helpful assistant.",
    llm=gemini.LLM("gemini-2.5-flash"),
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
)
Set ELEVENLABS_API_KEY in your environment or pass api_key directly.
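
The two options above can be combined in a small helper that prefers an explicitly passed key and falls back to the environment. This is only a sketch: `resolve_api_key` is a hypothetical helper, not part of the plugin, and the only name it takes from this page is the `ELEVENLABS_API_KEY` variable.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the explicit key if given, else ELEVENLABS_API_KEY from env.

    Hypothetical helper for illustration; not part of vision_agents.
    """
    # Allow a custom mapping so the lookup can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env
    key = explicit or env.get("ELEVENLABS_API_KEY", "")
    if not key.strip():
        raise RuntimeError(
            "No ElevenLabs API key: set ELEVENLABS_API_KEY or pass api_key."
        )
    return key.strip()
```

The result could then be passed along as `elevenlabs.TTS(api_key=resolve_api_key())`, so that a missing key is reported at startup with a clear message.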

TTS

Expressive text-to-speech synthesis.
tts = elevenlabs.TTS(
    voice_id="VR6AewLTigWG4xSOukaG",
    model_id="eleven_multilingual_v2",
)
Name       Type   Default                     Description
voice_id   str    "VR6AewLTigWG4xSOukaG"      ElevenLabs voice ID
model_id   str    "eleven_multilingual_v2"    TTS model
api_key    str    None                        API key (defaults to ELEVENLABS_API_KEY env var)
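
When the voice or model should vary between deployments, the keyword arguments can be assembled from the environment, with the defaults documented above as fallbacks. This is a rough sketch under stated assumptions: `tts_kwargs`, `ELEVENLABS_VOICE_ID`, and `ELEVENLABS_MODEL_ID` are hypothetical names chosen for illustration, and nothing here assumes the plugin reads those variables itself.

```python
import os
from typing import Dict, Mapping, Optional

# Defaults copied from the TTS parameter list above.
DEFAULT_VOICE_ID = "VR6AewLTigWG4xSOukaG"
DEFAULT_MODEL_ID = "eleven_multilingual_v2"


def tts_kwargs(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build keyword arguments for elevenlabs.TTS from optional overrides.

    ELEVENLABS_VOICE_ID and ELEVENLABS_MODEL_ID are hypothetical variable
    names used only by this sketch.
    """
    env = os.environ if env is None else env
    return {
        # Unset or empty variables fall back to the documented defaults.
        "voice_id": env.get("ELEVENLABS_VOICE_ID") or DEFAULT_VOICE_ID,
        "model_id": env.get("ELEVENLABS_MODEL_ID") or DEFAULT_MODEL_ID,
    }
```

With the plugin installed, this would be used as `tts = elevenlabs.TTS(**tts_kwargs())`.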

STT

Real-time transcription via Scribe v2 (~150ms latency, 99 languages).
stt = elevenlabs.STT(
    model_id="scribe_v2_realtime",
    language_code="en",
)
Name            Type   Default                 Description
model_id        str    "scribe_v2_realtime"    Scribe model
language_code   str    "en"                    Language code
api_key         str    None                    API key (defaults to ELEVENLABS_API_KEY env var)
Scribe v2 does not support turn detection. If you need it, use a separate turn detection plugin such as Smart Turn.
