Fish Audio is a high-quality AI voice platform that provides both Speech-to-Text (STT) and Text-to-Speech (TTS) capabilities. It offers fast, accurate transcription with automatic language detection and natural-sounding voice synthesis with support for voice cloning.
The Fish Audio plugin for Vision Agents enables real-time transcription and speech synthesis, making it ideal for voice agents, multilingual applications, and conversational AI systems.
Installation
Install the Fish Audio plugin with:

```bash
uv add "vision-agents[fish]"
```
Example
Check out our Fish Audio example to see a practical implementation of the plugin, or read on for some key details.
Text-to-Speech (TTS)
Initialisation
The Fish Audio TTS plugin is exposed via the TTS class:
```python
from vision_agents.plugins import fish

# Initialize with default settings
tts = fish.TTS()

# Or with custom options
tts = fish.TTS(
    api_key="your-api-key",
    reference_id="your_reference_voice_id",
)
```
To initialise without passing in the API key, make sure the FISH_API_KEY environment variable is set.
You can do this either by defining it in a .env file or exporting it directly in your terminal.
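For example, in a .env file at the project root (the value shown is a placeholder):

```
FISH_API_KEY=your-api-key
```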
Parameters
These are the parameters available in the Fish TTS plugin:
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| api_key | str or None | None | Your Fish Audio API key. If not provided, uses the FISH_API_KEY environment variable. |
| reference_id | str or None | None | Optional reference voice ID for voice cloning. Uses a default voice if not specified. |
| base_url | str or None | None | Optional custom API endpoint. |
| client | Session or None | None | Optionally pass your own Fish Audio Session instance. |
Functionality
Send text to convert to speech
The send() method sends text to Fish Audio for synthesis. The resulting audio is played through the configured output track:
```python
await tts.send("Hello, this is a test of Fish Audio text-to-speech.")
```
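As a minimal sketch of the call pattern (assuming FISH_API_KEY is set in the environment; in a full agent the framework wires up the audio output track, so this only illustrates how send() is invoked):

```python
import asyncio

from vision_agents.plugins import fish


async def main():
    # Reads FISH_API_KEY from the environment when no api_key is passed
    tts = fish.TTS()
    await tts.send("Hello, this is a test of Fish Audio text-to-speech.")


asyncio.run(main())
```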
Voice Cloning
Fish Audio supports voice cloning using reference audio:
```python
# Using a reference voice ID
tts = fish.TTS(reference_id="your_reference_voice_id")

# The reference voice will be used for all subsequent synthesis
await tts.send("This will use the reference voice.")
```
Speech-to-Text (STT)
Initialisation
The Fish Audio STT plugin is exposed via the STT class:
```python
from vision_agents.plugins import fish

# Initialize with default settings
stt = fish.STT()

# Or with custom options
stt = fish.STT(
    api_key="your-api-key",
    language="en",
)
```
To initialise without passing in the API key, make sure the FISH_API_KEY environment variable is set.
Parameters
These are the parameters available in the Fish STT plugin:
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| api_key | str or None | None | Your Fish Audio API key. If not provided, uses the FISH_API_KEY environment variable. |
| language | str or None | None | Language code for transcription (e.g., "en", "zh"). If None, automatic language detection is used. |
| client | Session or None | None | Optionally pass your own Fish Audio Session instance. |
Functionality
Process Audio
Once you join the call, you can listen for audio events and pass them to the STT class for processing:
```python
from getstream.video import rtc

async with rtc.join(call, bot_user_id) as connection:

    @connection.on("audio")
    async def on_audio(pcm: PcmData, user):
        # Process audio through Fish Audio STT
        await stt.process_audio(pcm, user)
```
Events
Transcript Event
The transcript event is triggered when a final transcript is available from Fish Audio:
```python
from vision_agents.core.stt.events import STTTranscriptEvent

@stt.events.subscribe
async def on_transcript(event: STTTranscriptEvent):
    print(f"Final transcript: {event.text}")
    print(f"User: {event.participant.user_id}")
    print(f"Language: {event.response.language}")
```
Error Event
If an error occurs during transcription, an error event is fired:
```python
from vision_agents.core.stt.events import STTErrorEvent

@stt.events.subscribe
async def on_stt_error(event: STTErrorEvent):
    print(f"STT error: {event.error}")
```
Supported Languages
Fish Audio STT supports multiple languages with automatic detection:
- en - English
- zh - Chinese
- es - Spanish
- fr - French
- de - German
- ja - Japanese
- ko - Korean
- pt - Portuguese
For automatic language detection, set language=None (default).
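For example, to pin transcription to one of the codes above, or fall back to detection:

```python
# Force transcription to a specific language
stt = fish.STT(language="zh")

# Or rely on automatic language detection (the default)
stt = fish.STT()
```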
The STT implementation accepts PCM audio data with the following specifications:
- Sample rate: 16kHz or higher recommended
- Format: Mono, 16-bit PCM
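If your audio comes from somewhere other than the call (a file, for instance), a rough sketch of converting float samples to mono, 16-bit PCM with numpy is shown below; resampling to at least 16kHz and wrapping the bytes in a PcmData object depend on your audio source and the getstream SDK, and are not shown here:

```python
import numpy as np


def to_mono_16bit(samples: np.ndarray) -> bytes:
    """Convert float samples in [-1, 1] to mono, 16-bit PCM bytes."""
    # Downmix multi-channel audio to mono
    if samples.ndim == 2:
        samples = samples.mean(axis=1)
    # Clip and scale to the signed 16-bit range
    samples = np.clip(samples, -1.0, 1.0)
    return (samples * 32767).astype(np.int16).tobytes()
```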
Getting Your API Key
- Sign up for a Fish Audio account at https://fish.audio
- Navigate to the API Keys section in your dashboard
- Create a new API key
- Set the FISH_API_KEY environment variable or pass it directly to the plugin