Fish Audio is a high-quality AI voice platform that provides both Speech-to-Text (STT) and Text-to-Speech (TTS) capabilities. It offers fast, accurate transcription with automatic language detection and natural-sounding voice synthesis with support for voice cloning.
The Fish Audio plugin for Vision Agents enables real-time transcription and speech synthesis, making it ideal for voice agents, multilingual applications, and conversational AI systems.
Installation
Install the Fish Audio plugin with:

```bash
uv add "vision-agents[fish]"
```
Example
Check out our Fish Audio example to see a practical implementation of the plugin, or read on for some key details.
Text-to-Speech (TTS)
Initialisation
The Fish Audio TTS plugin is exposed via the TTS class:
```python
from vision_agents.plugins import fish

# Initialize with default settings
tts = fish.TTS()

# Or with custom options
tts = fish.TTS(
    api_key="your-api-key",
    reference_id="your_reference_voice_id",
)
```
To initialise without passing in the API key, make sure the FISH_API_KEY environment variable is set.
You can do this either by defining it in a .env file or exporting it directly in your terminal.
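For example, in a .env file at the project root (the value shown is a placeholder):

```
FISH_API_KEY=your-api-key
```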
Parameters
These are the parameters available in the Fish TTS plugin:
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| api_key | str or None | None | Your Fish Audio API key. If not provided, uses the FISH_API_KEY environment variable. |
| reference_id | str or None | None | Optional reference voice ID for voice cloning. Uses a default voice if not specified. |
| base_url | str or None | None | Optional custom API endpoint. |
| client | Session or None | None | Optionally pass your own Fish Audio Session instance. |
Functionality
Send text to convert to speech
The send() method sends text to Fish Audio for synthesis. The resulting audio is played through the configured output track:
```python
await tts.send("Hello, this is a test of Fish Audio text-to-speech.")
```
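As a minimal sketch of the call pattern (assuming FISH_API_KEY is set in the environment; in a full agent the framework wires up the audio output track, so this only illustrates how send() is invoked):

```python
import asyncio

from vision_agents.plugins import fish


async def main():
    # Reads FISH_API_KEY from the environment when no api_key is passed
    tts = fish.TTS()
    await tts.send("Hello, this is a test of Fish Audio text-to-speech.")


asyncio.run(main())
```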
Voice Cloning
Fish Audio supports voice cloning using reference audio:
```python
# Using a reference voice ID
tts = fish.TTS(reference_id="your_reference_voice_id")

# The reference voice will be used for all subsequent synthesis
await tts.send("This will use the reference voice.")
```
Speech-to-Text (STT)
Initialisation
The Fish Audio STT plugin is exposed via the STT class:
```python
from vision_agents.plugins import fish

# Initialize with default settings
stt = fish.STT()

# Or with custom options
stt = fish.STT(
    api_key="your-api-key",
    language="en",
)
```
To initialise without passing in the API key, make sure the FISH_API_KEY environment variable is set.
Parameters
These are the parameters available in the Fish STT plugin:
| Name | Type | Default | Description |
| --- | --- | --- | --- |
| api_key | str or None | None | Your Fish Audio API key. If not provided, uses the FISH_API_KEY environment variable. |
| language | str or None | None | Language code for transcription (e.g., "en", "zh"). If None, automatic language detection is used. |
| client | Session or None | None | Optionally pass your own Fish Audio Session instance. |
Functionality
Process Audio
Once you join the call, you can listen for audio events and pass them to the STT class for processing:
```python
from getstream.video import rtc

async with rtc.join(call, bot_user_id) as connection:

    @connection.on("audio")
    async def on_audio(pcm: PcmData, user):
        # Process audio through Fish Audio STT
        await stt.process_audio(pcm, user)
```
Events
Transcript Event
The transcript event is triggered when a final transcript is available from Fish Audio:
```python
from vision_agents.core.stt.events import STTTranscriptEvent

@stt.events.subscribe
async def on_transcript(event: STTTranscriptEvent):
    print(f"Final transcript: {event.text}")
    print(f"User: {event.participant.user_id}")
    print(f"Language: {event.response.language}")
```
Error Event
If an error occurs during transcription, an error event is fired:
```python
from vision_agents.core.stt.events import STTErrorEvent

@stt.events.subscribe
async def on_stt_error(event: STTErrorEvent):
    print(f"STT error: {event.error}")
```
Supported Languages
Fish Audio STT supports multiple languages with automatic detection:
- en - English
- zh - Chinese
- es - Spanish
- fr - French
- de - German
- ja - Japanese
- ko - Korean
- pt - Portuguese
For automatic language detection, set language=None (default).
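For example, to pin transcription to one of the codes above, or fall back to detection:

```python
# Force transcription to a specific language
stt = fish.STT(language="zh")

# Or rely on automatic language detection (the default)
stt = fish.STT()
```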
The STT implementation accepts PCM audio data with the following specifications:
- Sample rate: 16kHz or higher recommended
- Format: Mono, 16-bit PCM
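If your audio comes from somewhere other than the call (a file, for instance), a rough sketch of converting float samples to mono, 16-bit PCM with numpy is shown below; resampling to at least 16kHz and wrapping the bytes in a PcmData object depend on your audio source and the getstream SDK, and are not shown here:

```python
import numpy as np


def to_mono_16bit(samples: np.ndarray) -> bytes:
    """Convert float samples in [-1, 1] to mono, 16-bit PCM bytes."""
    # Downmix multi-channel audio to mono
    if samples.ndim == 2:
        samples = samples.mean(axis=1)
    # Clip and scale to the signed 16-bit range
    samples = np.clip(samples, -1.0, 1.0)
    return (samples * 32767).astype(np.int16).tobytes()
```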
Getting Your API Key
- Sign up for a Fish Audio account at https://fish.audio
- Navigate to the API Keys section in your dashboard
- Create a new API key
- Set the FISH_API_KEY environment variable or pass it directly to the plugin