Smart Turn is an advanced turn detection system that combines Silero VAD, Whisper feature extraction, and neural turn completion models to intelligently detect when a speaker has completed their turn in a conversation. It analyzes audio in real time to determine whether speech is incomplete or complete, enabling natural conversation flow in voice agents. With the Vision Agents SDK you can use Smart Turn to manage conversational turns in your video calls with just a few lines of code.

Installation

Install the Smart Turn plugin with:
uv add "vision-agents[smart_turn]"

(The quotes prevent some shells, such as zsh, from interpreting the square brackets as a glob pattern.)

Example

Check out our simple agent example to see a practical implementation of the plugin and get inspiration for your own projects, or read on for some key details.
from vision_agents.core import Agent
from vision_agents.plugins import smart_turn

# Create turn detection with custom settings
turn_detection = smart_turn.TurnDetection(
    buffer_in_seconds=2.0,
    confidence_threshold=0.5
)

# Use with an agent
agent = Agent(
    turn_detection=turn_detection,
    # ... other agent configuration
)

# Or use standalone
await turn_detection.start()

# Listen for turn events
@turn_detection.events.subscribe
async def on_turn_started(event: smart_turn.TurnStartedEvent):
    print(f"User {event.participant.user_id} started speaking")

@turn_detection.events.subscribe
async def on_turn_ended(event: smart_turn.TurnEndedEvent):
    print(f"User {event.participant.user_id} finished speaking")
    print(f"Confidence: {event.confidence}")

# Stop when finished
await turn_detection.stop()

Initialisation

The Smart Turn plugin is exposed via the TurnDetection class:
from vision_agents.plugins import smart_turn

# Default settings
turn_detection = smart_turn.TurnDetection()

# Custom settings for more aggressive turn detection
turn_detection = smart_turn.TurnDetection(
    buffer_in_seconds=1.5,
    confidence_threshold=0.7
)

# Start detection (downloads models if needed)
await turn_detection.start()

Parameters

You can customise the behaviour of Smart Turn through the following parameters:
Name                  Type   Default  Description
buffer_in_seconds     float  2.0      Duration in seconds to buffer audio before processing.
confidence_threshold  float  0.5      Probability threshold (0.0–1.0) for determining turn completion.
sample_rate           int    16000    Audio sample rate in Hz for processing (audio is resampled automatically).
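As the sample_rate row notes, incoming audio is resampled to the processing rate automatically. As a rough illustration of what that entails (not the plugin's actual resampler), integer-factor decimation from 48 kHz to 16 kHz keeps every third sample; a real resampler would also low-pass filter first to avoid aliasing:

```python
# Hypothetical sketch of downsampling to the 16 kHz rate Smart Turn processes.
# The plugin handles this for you; this only illustrates the rate conversion.

def decimate(samples: list[float], src_rate: int, dst_rate: int) -> list[float]:
    """Naive integer-factor decimation (src_rate must be a multiple of dst_rate)."""
    if src_rate % dst_rate != 0:
        raise ValueError("source rate must be an integer multiple of the target rate")
    step = src_rate // dst_rate
    return samples[::step]

# One second of 48 kHz audio becomes one second of 16 kHz audio
one_second_48k = [0.0] * 48000
one_second_16k = decimate(one_second_48k, 48000, 16000)
print(len(one_second_16k))  # 16000
```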

Functionality

Start and Stop

Control turn detection with the start() and stop() methods:
# Start turn detection (downloads models if needed)
await turn_detection.start()

# Check if detection is active
if turn_detection.is_active:
    print("Turn detection is active")

# Stop turn detection
await turn_detection.stop()

Events

The plugin emits turn detection events through the Vision Agents event system:

Turn Started Event

Fired when a user begins speaking:
from vision_agents.core.turn_detection.events import TurnStartedEvent

@turn_detection.events.subscribe
async def on_turn_started(event: TurnStartedEvent):
    print(f"Turn started by {event.participant.user_id}")
    print(f"Confidence: {event.confidence}")

Turn Ended Event

Fired when a user completes their turn (based on the model’s prediction and confidence threshold):
from vision_agents.core.turn_detection.events import TurnEndedEvent

@turn_detection.events.subscribe
async def on_turn_ended(event: TurnEndedEvent):
    print(f"Turn ended by {event.participant.user_id}")
    print(f"Confidence: {event.confidence}")
    print(f"Duration: {event.duration_ms}ms")
    print(f"Trailing silence: {event.trailing_silence_ms}ms")

Event Properties

Both TurnStartedEvent and TurnEndedEvent include the following properties:
Property             Type         Description
participant          Participant  Participant object with user_id and metadata.
confidence           float|None   Confidence level of the turn detection (0.0–1.0).
trailing_silence_ms  float|None   Milliseconds of silence after speech (TurnEndedEvent only).
duration_ms          float|None   Duration of the turn in milliseconds (TurnEndedEvent only).
custom               dict|None    Additional model-specific data.
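Because confidence, trailing_silence_ms, and duration_ms may all be None, handlers should guard before formatting them. A minimal sketch of that pattern, using a stand-in dataclass rather than the real event types (the actual event carries a Participant object instead of a bare user_id):

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in for TurnEndedEvent, for illustration only.
@dataclass
class FakeTurnEnded:
    user_id: str
    confidence: Optional[float] = None
    duration_ms: Optional[float] = None
    trailing_silence_ms: Optional[float] = None

def describe_turn(event: FakeTurnEnded) -> str:
    """Format a turn-ended event, skipping any fields that are None."""
    parts = [f"turn ended by {event.user_id}"]
    if event.confidence is not None:
        parts.append(f"confidence {event.confidence:.2f}")
    if event.duration_ms is not None:
        parts.append(f"duration {event.duration_ms:.0f}ms")
    if event.trailing_silence_ms is not None:
        parts.append(f"silence {event.trailing_silence_ms:.0f}ms")
    return ", ".join(parts)

print(describe_turn(FakeTurnEnded("alice", confidence=0.82, duration_ms=3100.0)))
# turn ended by alice, confidence 0.82, duration 3100ms
```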

How It Works

Smart Turn uses a multi-stage pipeline to detect turn completion:
  1. Silero VAD: Detects speech activity and segments audio
  2. Audio Buffering: Buffers audio based on buffer_in_seconds
  3. Whisper Feature Extraction: Extracts acoustic features from the audio
  4. Neural Turn Completion: Predicts turn completion probability using an ONNX model
  5. Event Emission: Emits TurnStartedEvent when speech begins and TurnEndedEvent when turn completion probability exceeds confidence_threshold
The system automatically downloads required models (Silero VAD and Smart Turn ONNX) on first use.
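The gating logic in steps 4–5 can be sketched in miniature. This toy stand-in fakes the completion probability from trailing silence; the real plugin derives it from Whisper features fed through the ONNX model:

```python
# Toy sketch of the turn-completion gate: a (faked) completion probability
# is compared against confidence_threshold to decide whether the turn ended.

CONFIDENCE_THRESHOLD = 0.5  # mirrors the plugin's default confidence_threshold

def fake_completion_probability(trailing_silence_ms: float) -> float:
    """Stand-in for the model: longer trailing silence -> higher probability."""
    return min(1.0, trailing_silence_ms / 1000.0)

def turn_is_complete(trailing_silence_ms: float) -> bool:
    """Emit a turn-ended decision once the probability exceeds the threshold."""
    return fake_completion_probability(trailing_silence_ms) > CONFIDENCE_THRESHOLD

print(turn_is_complete(200.0))  # False: 0.2, speaker is likely mid-sentence
print(turn_is_complete(800.0))  # True: 0.8 exceeds the 0.5 threshold
```

Raising confidence_threshold in the real plugin makes the agent wait for stronger evidence before treating a pause as the end of a turn, trading responsiveness for fewer interruptions.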