xAI provides realtime speech-to-speech over WebSocket with server-side voice activity detection, built-in web search, and X search. No separate STT/TTS needed.
Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.
Installation
Quick start
Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| model | str | "grok-voice-think-fast-1.0" | Grok realtime model |
| voice | str | "ara" | Voice ("ara", "rex", "sal", "eve", "leo") |
| api_key | str | None | API key (defaults to XAI_API_KEY env var) |
| turn_detection | str or None | "server_vad" | Turn detection mode ("server_vad" or None for manual) |
| vad_interrupt_response | bool | False | Allow VAD to auto-cancel the assistant response on detected speech |
| web_search | bool | True | Enable web search tool |
| x_search | bool | True | Enable X (Twitter) search tool |
| x_search_allowed_handles | list[str] | None | Restrict X search to specific handles |
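
To make the defaults in the table concrete, here is a minimal sketch that mirrors them in a plain dataclass. `RealtimeSettings`, `VOICES`, and `as_kwargs` are hypothetical names made up for illustration and are not part of the plugin; the sketch only assumes the parameter names, defaults, and allowed values listed above, plus the `XAI_API_KEY` fallback.

```python
import os
from dataclasses import asdict, dataclass
from typing import Optional

# Voices listed in the parameter table.
VOICES = {"ara", "rex", "sal", "eve", "leo"}


@dataclass
class RealtimeSettings:
    """Hypothetical stand-in that mirrors the parameter table (not the plugin's class)."""

    model: str = "grok-voice-think-fast-1.0"
    voice: str = "ara"
    api_key: Optional[str] = None
    turn_detection: Optional[str] = "server_vad"  # None means manual turn handling
    vad_interrupt_response: bool = False
    web_search: bool = True
    x_search: bool = True
    x_search_allowed_handles: Optional[list[str]] = None

    def __post_init__(self) -> None:
        # Reject values the table does not list.
        if self.voice not in VOICES:
            raise ValueError(f"unknown voice: {self.voice!r}")
        if self.turn_detection not in ("server_vad", None):
            raise ValueError(f"unknown turn_detection: {self.turn_detection!r}")
        # Fall back to the environment variable when no key is passed explicitly.
        if self.api_key is None:
            self.api_key = os.environ.get("XAI_API_KEY")

    def as_kwargs(self) -> dict:
        """Return the settings as a dict of keyword arguments."""
        return asdict(self)


# Example: manual turn detection, X search limited to one handle.
settings = RealtimeSettings(
    voice="rex",
    turn_detection=None,
    x_search_allowed_handles=["xai"],
    api_key="example-key",
)
kwargs = settings.as_kwargs()
```

A dict like `kwargs` could then be unpacked into whichever realtime constructor the plugin exposes, assuming it accepts these names as keyword arguments.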
vad_interrupt_response defaults to False because speaker-to-mic echo can cause the server to cancel the agent’s own response mid-sentence. Set it to True only if your audio setup avoids echo feedback.
Function calling
Next steps
- xAI LLM: Advanced reasoning with Grok
- xAI TTS: Text-to-speech with expressive voices

