Deepgram is a powerful Speech-to-Text (STT) platform that provides fast, accurate, and customizable transcription services. It’s designed for real-time and batch audio processing, with support for features like word-level timestamps, speaker diarization, and multilingual transcription. The Deepgram plugin in the Vision Agents SDK enables real-time transcription of voice input, making it ideal for voice agents, call analysis, meeting transcriptions, and more.

Installation

Install the Deepgram plugin with:
uv add vision-agents[deepgram]
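If you manage dependencies with pip rather than uv, the standard extras syntax should work equivalently:
pip install "vision-agents[deepgram]"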

Example

Check out our Deepgram example to see a practical implementation of the plugin and get inspiration for your own projects, or read on for some key details.

Initialisation

The Deepgram plugin is exposed via the STT class:
from vision_agents.plugins import deepgram

# Initialize with default settings
stt = deepgram.STT()

# Or with custom options
stt = deepgram.STT(
    api_key="your-api-key",
    sample_rate=48000,
    language="en-US",
    interim_results=True
)
To initialise without passing in the API key, make sure DEEPGRAM_API_KEY is set as an environment variable. You can do this either by defining it in a .env file or exporting it directly in your terminal, as shown below.
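For example (the key value is a placeholder):
# In a .env file at your project root
DEEPGRAM_API_KEY=your-api-key

# Or exported directly in your shell
export DEEPGRAM_API_KEY=your-api-key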

Parameters

These are the parameters available on the deepgram.STT plugin for you to customise:
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| api_key | str or None | None | Your Deepgram API key. If not provided, the plugin will use the DEEPGRAM_API_KEY environment variable. |
| options | LiveOptions or None | None | Optional Deepgram configuration options, such as tier, model, or features like punctuation or diarization. |
| sample_rate | int | 48000 | The sample rate (in Hz) of the audio stream being transcribed. |
| language | str | "en-US" | Language code for transcription. |
| keep_alive_interval | float | 3.0 | Interval (in seconds) for sending keep-alive messages to maintain the WebSocket connection. |
| interim_results | bool | True | Whether to receive partial transcript events while speaking. |
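The options parameter gives you finer control over transcription. A minimal sketch, assuming LiveOptions is imported from the Deepgram Python SDK and that the fields shown (model, punctuate, diarize) are supported by your model tier:
from deepgram import LiveOptions
from vision_agents.plugins import deepgram

# Sketch: enable punctuation and speaker diarization via LiveOptions
stt = deepgram.STT(
    options=LiveOptions(
        model="nova-2",   # assumption: any Deepgram streaming model name works here
        punctuate=True,
        diarize=True,
    )
)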

Functionality

Process Audio

The Deepgram STT service starts automatically when you initialise it and connects to Deepgram’s streaming API. When used with an Agent, audio is automatically processed through the STT service. When used standalone, you can process audio directly:
from getstream.video.rtc.track_util import PcmData

# Process a PcmData chunk through Deepgram STT; `participant` identifies
# the speaker so transcripts can be attributed to them
await stt.process_audio(pcm_data, participant)
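For a fuller standalone flow, here is a minimal sketch; transcribe_frames, frames, and participant are illustrative names, and frames stands in for whatever produces PcmData chunks in your capture pipeline:
# Sketch: feed PcmData chunks from your own capture pipeline into the service
async def transcribe_frames(stt, frames, participant):
    for pcm_data in frames:
        await stt.process_audio(pcm_data, participant)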

Events

Transcript Event

The transcript event is triggered when a final transcript is available from Deepgram:
from vision_agents.core.stt.events import STTTranscriptEvent

@stt.events.subscribe
async def on_transcript(event: STTTranscriptEvent):
    print(f"Final transcript: {event.text}")
    print(f"User: {event.participant.user_id}")
    print(f"Confidence: {event.response.confidence}")

Partial Transcript Event

The partial transcript event is fired in real time as Deepgram generates intermediate (partial) transcriptions:
from vision_agents.core.stt.events import STTPartialTranscriptEvent

@stt.events.subscribe
async def on_partial_transcript(event: STTPartialTranscriptEvent):
    print(f"Partial transcript: {event.text}")
    print(f"User: {event.participant.user_id}")

Error Event

If an error occurs, an error event is fired:
from vision_agents.core.stt.events import STTErrorEvent

@stt.events.subscribe
async def on_stt_error(event: STTErrorEvent):
    print(f"STT error: {event.error}")
    print(f"Context: {event.context}")

Close

The Deepgram STT service automatically manages its WebSocket connection lifecycle. When used with an Agent, cleanup is handled automatically. For standalone usage, you can close the connection:
await stt.close()
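In a standalone script, it’s worth pairing this with try/finally so the connection is released even if processing raises. A short sketch:
stt = deepgram.STT()
try:
    await stt.process_audio(pcm_data, participant)
finally:
    # Always release the WebSocket connection
    await stt.close()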