Vision Agents requires a Stream account for real-time transport. Some models on HuggingFace are gated and require a HuggingFace account and access token (`HF_TOKEN`).

Installation
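A minimal install sketch. The PyPI package name `vision-agents` is an assumption; the `[transformers-quantized]` extra comes from the Quantization section of this page.

```shell
# Package name assumed; check the project's own install docs.
pip install vision-agents

# Optional: 4-bit/8-bit quantization support via the [transformers-quantized] extra
pip install "vision-agents[transformers-quantized]"

# Gated HuggingFace models need an access token in the environment
export HF_TOKEN=<your-token>
```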
Local LLM
Run text language models locally with streaming and function calling.

Function Calling
Quantization
Reduce memory usage with 4-bit or 8-bit quantization. Requires the `[transformers-quantized]` extra.
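As a rough intuition for what quantization saves, weight memory scales linearly with bits per weight. The helper below is a back-of-envelope estimate only; it ignores activations, the KV cache, and per-layer quantization overhead, so real footprints will be somewhat higher.

```python
def approx_weight_memory_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Rough weight-only memory estimate in GB; ignores activations,
    KV cache, and quantization overhead."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 7B-parameter model: ~14 GB at float16, ~7 GB at 8-bit, ~3.5 GB at 4-bit
for bits in (16, 8, 4):
    print(f"{bits}-bit: {approx_weight_memory_gb(7, bits):.1f} GB")
```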
LLM Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| model | str | — | HuggingFace model ID |
| device | str | "auto" | "auto", "cuda", "mps", or "cpu" |
| quantization | str | "none" | "none", "4bit", or "8bit" |
| torch_dtype | str | "auto" | "auto", "float16", "bfloat16", or "float32" |
| trust_remote_code | bool | False | Allow custom model code (needed for Qwen, Phi, etc.) |
| max_new_tokens | int | 512 | Maximum tokens to generate per response |
| max_tool_rounds | int | 3 | Maximum tool-call rounds per response |
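To illustrate what `max_tool_rounds` bounds, here is a hypothetical sketch of a tool-calling loop: the model may request a tool at most that many times before it must produce a final text answer. All names below (`run_with_tools`, `generate`, `execute_tool`) are illustrative stand-ins, not this library's API.

```python
def run_with_tools(generate, execute_tool, prompt: str, max_tool_rounds: int = 3) -> str:
    """Bounded tool-call loop (illustrative sketch).

    `generate` returns either ("text", answer) or ("tool", call);
    `execute_tool` runs a call and returns its result string.
    After max_tool_rounds tool calls, one final text-only generation runs.
    """
    messages = [prompt]
    for _ in range(max_tool_rounds):
        kind, payload = generate(messages)
        if kind == "text":
            return payload
        messages.append(execute_tool(payload))
    # Budget exhausted: force a final answer with no further tool calls.
    _, payload = generate(messages)
    return payload

# Toy stand-ins: one tool call, then a final answer.
def fake_generate(messages):
    if len(messages) == 1:
        return ("tool", "get_time")
    return ("text", f"done after {len(messages) - 1} tool result(s)")

print(run_with_tools(fake_generate, lambda call: f"{call}: 12:00", "what time is it?"))
```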
Local VLM
Run vision-language models that can see video frames from the call. Supports function calling.

VLM Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| model | str | — | HuggingFace model ID |
| device | str | "auto" | "auto", "cuda", "mps", or "cpu" |
| quantization | str | "none" | "none", "4bit", or "8bit" |
| torch_dtype | str | "auto" | "auto", "float16", "bfloat16", or "float32" |
| trust_remote_code | bool | True | Allow custom model code |
| fps | int | 1 | Frames per second to capture from video |
| frame_buffer_seconds | int | 10 | Seconds of video frames to buffer |
| max_frames | int | 4 | Maximum frames sent per inference (evenly sampled) |
| max_new_tokens | int | 512 | Maximum tokens to generate per response |
| max_tool_rounds | int | 3 | Maximum tool-call rounds per response |
| do_sample | bool | True | Use sampling for generation |
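The frame parameters interact: the buffer holds roughly `fps × frame_buffer_seconds` frames, and at inference time up to `max_frames` of them are evenly sampled. A minimal sketch of that even-sampling behavior (the real implementation may differ):

```python
def sample_frames(buffer: list, max_frames: int = 4) -> list:
    """Evenly sample up to max_frames frames from a buffer, always
    keeping the first and last frame (illustrative sketch)."""
    if len(buffer) <= max_frames:
        return list(buffer)
    step = (len(buffer) - 1) / (max_frames - 1)
    return [buffer[round(i * step)] for i in range(max_frames)]

# Defaults fps=1, frame_buffer_seconds=10 give a ~10-frame buffer,
# sampled down to 4 frames per inference.
print(sample_frames(list(range(10))))  # → [0, 3, 6, 9]
```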
Object Detection
Run detection models like RT-DETRv2 on live video frames. Emits a `DetectionCompletedEvent` with bounding boxes for each processed frame.
Detection Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| model | str | "PekingU/rtdetr_v2_r101vd" | HuggingFace detection model ID |
| conf_threshold | float | 0.5 | Confidence threshold (0-1) |
| fps | int | 10 | Frame processing rate |
| classes | list[str] | None | Filter to specific class names (e.g. ["person"]) |
| device | str | "auto" | "auto", "cuda", "mps", or "cpu" |
| annotate | bool | True | Draw bounding boxes on output video |
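To show how `conf_threshold` and `classes` combine, here is an illustrative filter over detection results. The dict shape (`label`, `score`, `box`) is an assumption for the example, not the event payload format.

```python
def filter_detections(detections, conf_threshold=0.5, classes=None):
    """Keep detections at or above the confidence threshold, optionally
    restricted to the given class names (illustrative sketch)."""
    return [
        d for d in detections
        if d["score"] >= conf_threshold
        and (classes is None or d["label"] in classes)
    ]

raw = [
    {"label": "person", "score": 0.92, "box": [10, 20, 110, 220]},
    {"label": "dog",    "score": 0.81, "box": [200, 40, 320, 180]},
    {"label": "person", "score": 0.31, "box": [400, 60, 450, 200]},  # below threshold
]
# classes=["person"] drops the dog; conf_threshold=0.5 drops the 0.31 person
print(filter_detections(raw, conf_threshold=0.5, classes=["person"]))
```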
Next Steps
- HuggingFace Inference: cloud-based inference (no GPU needed)
- Build a Voice Agent: get started with voice
- Build a Video Agent: add video processing
- Video Processors: process video frames in real-time

