Smart Turn is an advanced turn detection system that combines Silero VAD, Whisper feature extraction, and neural turn completion models to intelligently detect when a speaker has completed their turn in a conversation. It analyzes audio in real time to determine whether speech is incomplete or complete, enabling natural conversation flow in voice agents. With the Vision Agents SDK you can use Smart Turn to manage conversational turns in your video calls with just a few lines of code.

Installation

Install the Smart Turn plugin with:
uv add "vision-agents[smart_turn]"

(The quotes prevent some shells, such as zsh, from interpreting the square brackets as a glob pattern.)

Example

Check out our simple agent example to see a practical implementation of the plugin and get inspiration for your own projects, or read on for some key details.
from vision_agents.core import Agent
from vision_agents.plugins import smart_turn

# Create turn detection with custom settings
turn_detection = smart_turn.TurnDetection(
    buffer_in_seconds=2.0,
    confidence_threshold=0.5
)

# Use with an agent
agent = Agent(
    turn_detection=turn_detection,
    # ... other agent configuration
)

# Or use standalone
await turn_detection.start()

# Listen for turn events
@turn_detection.events.subscribe
async def on_turn_started(event: smart_turn.TurnStartedEvent):
    print(f"User {event.participant.user_id} started speaking")

@turn_detection.events.subscribe
async def on_turn_ended(event: smart_turn.TurnEndedEvent):
    print(f"User {event.participant.user_id} finished speaking")
    print(f"Confidence: {event.confidence}")

# Stop when finished
await turn_detection.stop()

Initialisation

The Smart Turn plugin is exposed via the TurnDetection class:
from vision_agents.plugins import smart_turn

# Default settings
turn_detection = smart_turn.TurnDetection()

# Custom settings for more aggressive turn detection
turn_detection = smart_turn.TurnDetection(
    buffer_in_seconds=1.5,
    confidence_threshold=0.7
)

# Start detection (downloads models if needed)
await turn_detection.start()

Parameters

You can customise the behaviour of Smart Turn through the following parameters:
Name                  Type   Default  Description
buffer_in_seconds     float  2.0      Duration in seconds to buffer audio before processing.
confidence_threshold  float  0.5      Probability threshold (0.0–1.0) for determining turn completion.
sample_rate           int    16000    Audio sample rate in Hz for processing (audio is resampled automatically).
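As the sample_rate row notes, incoming audio is resampled to the processing rate automatically. As a rough illustration of what that entails (not the plugin's actual resampler), integer-factor decimation from 48 kHz to 16 kHz keeps every third sample; a real resampler would also low-pass filter first to avoid aliasing:

```python
# Hypothetical sketch of downsampling to the 16 kHz rate Smart Turn processes.
# The plugin handles this for you; this only illustrates the rate conversion.

def decimate(samples: list[float], src_rate: int, dst_rate: int) -> list[float]:
    """Naive integer-factor decimation (src_rate must be a multiple of dst_rate)."""
    if src_rate % dst_rate != 0:
        raise ValueError("source rate must be an integer multiple of the target rate")
    step = src_rate // dst_rate
    return samples[::step]

# One second of 48 kHz audio becomes one second of 16 kHz audio
one_second_48k = [0.0] * 48000
one_second_16k = decimate(one_second_48k, 48000, 16000)
print(len(one_second_16k))  # 16000
```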

Functionality

Start and Stop

Control turn detection with the start() and stop() methods:
# Start turn detection (downloads models if needed)
await turn_detection.start()

# Check if detection is active
if turn_detection.is_active:
    print("Turn detection is active")

# Stop turn detection
await turn_detection.stop()

Events

The plugin emits turn detection events through the Vision Agents event system:

Turn Started Event

Fired when a user begins speaking:
from vision_agents.core.turn_detection.events import TurnStartedEvent

@turn_detection.events.subscribe
async def on_turn_started(event: TurnStartedEvent):
    print(f"Turn started by {event.participant.user_id}")
    print(f"Confidence: {event.confidence}")

Turn Ended Event

Fired when a user completes their turn (based on the model’s prediction and confidence threshold):
from vision_agents.core.turn_detection.events import TurnEndedEvent

@turn_detection.events.subscribe
async def on_turn_ended(event: TurnEndedEvent):
    print(f"Turn ended by {event.participant.user_id}")
    print(f"Confidence: {event.confidence}")
    print(f"Duration: {event.duration_ms}ms")
    print(f"Trailing silence: {event.trailing_silence_ms}ms")

Event Properties

Both TurnStartedEvent and TurnEndedEvent include the following properties:
Property             Type         Description
participant          Participant  Participant object with user_id and metadata.
confidence           float|None   Confidence level of the turn detection (0.0–1.0).
trailing_silence_ms  float|None   Milliseconds of silence after speech (TurnEndedEvent only).
duration_ms          float|None   Duration of the turn in milliseconds (TurnEndedEvent only).
custom               dict|None    Additional model-specific data.
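Because confidence, trailing_silence_ms, and duration_ms may all be None, handlers should guard before formatting them. A minimal sketch of that pattern, using a stand-in dataclass rather than the real event types (the actual event carries a Participant object instead of a bare user_id):

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in for TurnEndedEvent, for illustration only.
@dataclass
class FakeTurnEnded:
    user_id: str
    confidence: Optional[float] = None
    duration_ms: Optional[float] = None
    trailing_silence_ms: Optional[float] = None

def describe_turn(event: FakeTurnEnded) -> str:
    """Format a turn-ended event, skipping any fields that are None."""
    parts = [f"turn ended by {event.user_id}"]
    if event.confidence is not None:
        parts.append(f"confidence {event.confidence:.2f}")
    if event.duration_ms is not None:
        parts.append(f"duration {event.duration_ms:.0f}ms")
    if event.trailing_silence_ms is not None:
        parts.append(f"silence {event.trailing_silence_ms:.0f}ms")
    return ", ".join(parts)

print(describe_turn(FakeTurnEnded("alice", confidence=0.82, duration_ms=3100.0)))
# turn ended by alice, confidence 0.82, duration 3100ms
```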

How It Works

Smart Turn uses a multi-stage pipeline to detect turn completion:
  1. Silero VAD: Detects speech activity and segments audio
  2. Audio Buffering: Buffers audio based on buffer_in_seconds
  3. Whisper Feature Extraction: Extracts acoustic features from the audio
  4. Neural Turn Completion: Predicts turn completion probability using an ONNX model
  5. Event Emission: Emits TurnStartedEvent when speech begins and TurnEndedEvent when turn completion probability exceeds confidence_threshold
The system automatically downloads required models (Silero VAD and Smart Turn ONNX) on first use.
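The gating logic in steps 4–5 can be sketched in miniature. This toy stand-in fakes the completion probability from trailing silence; the real plugin derives it from Whisper features fed through the ONNX model:

```python
# Toy sketch of the turn-completion gate: a (faked) completion probability
# is compared against confidence_threshold to decide whether the turn ended.

CONFIDENCE_THRESHOLD = 0.5  # mirrors the plugin's default confidence_threshold

def fake_completion_probability(trailing_silence_ms: float) -> float:
    """Stand-in for the model: longer trailing silence -> higher probability."""
    return min(1.0, trailing_silence_ms / 1000.0)

def turn_is_complete(trailing_silence_ms: float) -> bool:
    """Emit a turn-ended decision once the probability exceeds the threshold."""
    return fake_completion_probability(trailing_silence_ms) > CONFIDENCE_THRESHOLD

print(turn_is_complete(200.0))  # False: 0.2, speaker is likely mid-sentence
print(turn_is_complete(800.0))  # True: 0.8 exceeds the 0.5 threshold
```

Raising confidence_threshold in the real plugin makes the agent wait for stronger evidence before treating a pause as the end of a turn, trading responsiveness for fewer interruptions.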