Deepgram is a powerful Speech-to-Text (STT) platform that provides fast, accurate, and customizable transcription services. It’s designed for real-time and batch audio processing, with support for features like word-level timestamps, speaker diarization, and multilingual transcription.
The Deepgram plugin in the Vision Agents SDK enables real-time transcription of voice input, making it ideal for voice agents, call analysis, meeting transcriptions, and more.
Installation
Install the Deepgram plugin with:

```bash
uv add vision-agents[deepgram]
```
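If your shell treats square brackets as glob characters (zsh does), quote the package spec:

```bash
uv add "vision-agents[deepgram]"
```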
Example
Check out our Deepgram example to see a practical implementation of the plugin and get inspiration for your own projects, or read on for some key details.
Initialisation
The Deepgram plugin is exposed via the STT class:
```python
from vision_agents.plugins import deepgram

# Initialize with default settings
stt = deepgram.STT()

# Or with custom options
stt = deepgram.STT(
    api_key="your-api-key",
    sample_rate=48000,
    language="en-US",
    interim_results=True
)
```
To initialise without passing the API key explicitly, make sure `DEEPGRAM_API_KEY` is available as an environment variable. You can do this either by defining it in a `.env` file or by exporting it directly in your terminal.
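For example, in a `.env` file (the key value here is a placeholder):

```bash
DEEPGRAM_API_KEY=your-api-key
```

Or exported in your terminal:

```bash
export DEEPGRAM_API_KEY="your-api-key"
```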
Parameters
These are the parameters available in the DeepgramSTT plugin for you to customise:
| Name | Type | Default | Description |
|------|------|---------|-------------|
| `api_key` | `str` or `None` | `None` | Your Deepgram API key. If not provided, the plugin will use the `DEEPGRAM_API_KEY` environment variable. |
| `options` | `LiveOptions` or `None` | `None` | Optional Deepgram configuration options, such as tier, model, or features like punctuation or diarization. |
| `sample_rate` | `int` | `48000` | The sample rate (in Hz) of the audio stream being transcribed. |
| `language` | `str` | `"en-US"` | Language code for transcription. |
| `keep_alive_interval` | `float` | `3.0` | Interval (in seconds) between keep-alive messages that maintain the WebSocket connection. |
| `interim_results` | `bool` | `True` | Whether to emit partial transcript events while the user is speaking. |
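For example, to pass custom options via the `options` parameter, you can build a `LiveOptions` object. A minimal sketch, assuming the `LiveOptions` class from Deepgram's Python SDK (check Deepgram's documentation for the full set of fields):

```python
from deepgram import LiveOptions
from vision_agents.plugins import deepgram

# A sketch of custom Deepgram options: the nova-2 model with
# punctuation and speaker diarization enabled.
options = LiveOptions(
    model="nova-2",
    language="en-US",
    punctuate=True,
    diarize=True,
)

stt = deepgram.STT(options=options)
```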
Functionality
Process Audio
The Deepgram STT service automatically starts when you initialize it and connects to Deepgram’s streaming API. When used with an Agent, audio is automatically processed through the STT service.
If using standalone, you can process audio directly:
```python
from getstream.video.rtc.track_util import PcmData

# Process audio through Deepgram STT
await stt.process_audio(pcm_data, participant)
```
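When running with an Agent, the wiring is a single constructor argument. A minimal sketch, assuming the `Agent` class from `vision_agents.core` (the exact import path and the remaining constructor arguments may differ in your version of the SDK):

```python
from vision_agents.core import Agent
from vision_agents.plugins import deepgram

# A sketch: pass the STT plugin to the Agent, which then feeds
# incoming audio to Deepgram automatically. The other Agent
# arguments (llm, tts, edge, instructions, ...) depend on your setup.
agent = Agent(
    stt=deepgram.STT(),
    # llm=..., tts=..., edge=..., instructions=...
)
```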
Events
Transcript Event
The transcript event is triggered when a final transcript is available from Deepgram:
```python
from vision_agents.core.stt.events import STTTranscriptEvent

@stt.events.subscribe
async def on_transcript(event: STTTranscriptEvent):
    print(f"Final transcript: {event.text}")
    print(f"User: {event.participant.user_id}")
    print(f"Confidence: {event.response.confidence}")
```
Partial Transcript Event
The partial transcript event is fired in real time as Deepgram generates intermediate (partial) transcriptions:
```python
from vision_agents.core.stt.events import STTPartialTranscriptEvent

@stt.events.subscribe
async def on_partial_transcript(event: STTPartialTranscriptEvent):
    print(f"Partial transcript: {event.text}")
    print(f"User: {event.participant.user_id}")
```
Error Event
If an error occurs, an error event is fired:
```python
from vision_agents.core.stt.events import STTErrorEvent

@stt.events.subscribe
async def on_stt_error(event: STTErrorEvent):
    print(f"STT error: {event.error}")
    print(f"Context: {event.context}")
```
Close
The Deepgram STT service automatically manages its WebSocket connection lifecycle. When used with an Agent, cleanup is handled automatically. For standalone usage, you can close the connection:
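A minimal sketch, assuming the plugin exposes an async `close()` method:

```python
# Close the Deepgram WebSocket connection when you are done.
await stt.close()
```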