Vision Agents requires a Stream account for real-time transport. Some models on HuggingFace are gated and require a HuggingFace account and access token (`HF_TOKEN`).

Installation
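A minimal install sketch. The PyPI package name `vision-agents` is an assumption; the `[transformers-quantized]` extra comes from the Quantization section of this page.

```shell
# Package name assumed; check the project's own install docs.
pip install vision-agents

# Optional: 4-bit/8-bit quantization support via the [transformers-quantized] extra
pip install "vision-agents[transformers-quantized]"

# Gated HuggingFace models need an access token in the environment
export HF_TOKEN=<your-token>
```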
Local LLM
Run text language models locally with streaming and function calling.

Function Calling
Quantization
Reduce memory usage with 4-bit or 8-bit quantization. Requires the `[transformers-quantized]` extra.
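As a rough intuition for what quantization saves, weight memory scales linearly with bits per weight. The helper below is a back-of-envelope estimate only; it ignores activations, the KV cache, and per-layer quantization overhead, so real footprints will be somewhat higher.

```python
def approx_weight_memory_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Rough weight-only memory estimate in GB; ignores activations,
    KV cache, and quantization overhead."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 7B-parameter model: ~14 GB at float16, ~7 GB at 8-bit, ~3.5 GB at 4-bit
for bits in (16, 8, 4):
    print(f"{bits}-bit: {approx_weight_memory_gb(7, bits):.1f} GB")
```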
LLM Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| model | str | — | HuggingFace model ID |
| device | str | "auto" | "auto", "cuda", "mps", or "cpu" |
| quantization | str | "none" | "none", "4bit", or "8bit" |
| torch_dtype | str | "auto" | "auto", "float16", "bfloat16", or "float32" |
| trust_remote_code | bool | False | Allow custom model code (needed for Qwen, Phi, etc.) |
| max_new_tokens | int | 512 | Maximum tokens to generate per response |
| max_tool_rounds | int | 3 | Maximum tool-call rounds per response |
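To illustrate what `max_tool_rounds` bounds, here is a hypothetical sketch of a tool-calling loop: the model may request a tool at most that many times before it must produce a final text answer. All names below (`run_with_tools`, `generate`, `execute_tool`) are illustrative stand-ins, not this library's API.

```python
def run_with_tools(generate, execute_tool, prompt: str, max_tool_rounds: int = 3) -> str:
    """Bounded tool-call loop (illustrative sketch).

    `generate` returns either ("text", answer) or ("tool", call);
    `execute_tool` runs a call and returns its result string.
    After max_tool_rounds tool calls, one final text-only generation runs.
    """
    messages = [prompt]
    for _ in range(max_tool_rounds):
        kind, payload = generate(messages)
        if kind == "text":
            return payload
        messages.append(execute_tool(payload))
    # Budget exhausted: force a final answer with no further tool calls.
    _, payload = generate(messages)
    return payload

# Toy stand-ins: one tool call, then a final answer.
def fake_generate(messages):
    if len(messages) == 1:
        return ("tool", "get_time")
    return ("text", f"done after {len(messages) - 1} tool result(s)")

print(run_with_tools(fake_generate, lambda call: f"{call}: 12:00", "what time is it?"))
```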
Local VLM
Run vision-language models that can see video frames from the call. Supports function calling.

VLM Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| model | str | — | HuggingFace model ID |
| device | str | "auto" | "auto", "cuda", "mps", or "cpu" |
| quantization | str | "none" | "none", "4bit", or "8bit" |
| torch_dtype | str | "auto" | "auto", "float16", "bfloat16", or "float32" |
| trust_remote_code | bool | True | Allow custom model code |
| fps | int | 1 | Frames per second to capture from video |
| frame_buffer_seconds | int | 10 | Seconds of video frames to buffer |
| max_frames | int | 4 | Maximum frames sent per inference (evenly sampled) |
| max_new_tokens | int | 512 | Maximum tokens to generate per response |
| max_tool_rounds | int | 3 | Maximum tool-call rounds per response |
| do_sample | bool | True | Use sampling for generation |
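The frame parameters interact: the buffer holds roughly `fps × frame_buffer_seconds` frames, and at inference time up to `max_frames` of them are evenly sampled. A minimal sketch of that even-sampling behavior (the real implementation may differ):

```python
def sample_frames(buffer: list, max_frames: int = 4) -> list:
    """Evenly sample up to max_frames frames from a buffer, always
    keeping the first and last frame (illustrative sketch)."""
    if len(buffer) <= max_frames:
        return list(buffer)
    step = (len(buffer) - 1) / (max_frames - 1)
    return [buffer[round(i * step)] for i in range(max_frames)]

# Defaults fps=1, frame_buffer_seconds=10 give a ~10-frame buffer,
# sampled down to 4 frames per inference.
print(sample_frames(list(range(10))))  # → [0, 3, 6, 9]
```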
Object Detection
Run detection models like RT-DETRv2 on live video frames. Emits a `DetectionCompletedEvent` with bounding boxes for each processed frame.
Detection Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| model | str | "PekingU/rtdetr_v2_r101vd" | HuggingFace detection model ID |
| conf_threshold | float | 0.5 | Confidence threshold (0-1) |
| fps | int | 10 | Frame processing rate |
| classes | list[str] | None | Filter to specific class names (e.g. ["person"]) |
| device | str | "auto" | "auto", "cuda", "mps", or "cpu" |
| annotate | bool | True | Draw bounding boxes on output video |
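To show how `conf_threshold` and `classes` combine, here is an illustrative filter over detection results. The dict shape (`label`, `score`, `box`) is an assumption for the example, not the event payload format.

```python
def filter_detections(detections, conf_threshold=0.5, classes=None):
    """Keep detections at or above the confidence threshold, optionally
    restricted to the given class names (illustrative sketch)."""
    return [
        d for d in detections
        if d["score"] >= conf_threshold
        and (classes is None or d["label"] in classes)
    ]

raw = [
    {"label": "person", "score": 0.92, "box": [10, 20, 110, 220]},
    {"label": "dog",    "score": 0.81, "box": [200, 40, 320, 180]},
    {"label": "person", "score": 0.31, "box": [400, 60, 450, 200]},  # below threshold
]
# classes=["person"] drops the dog; conf_threshold=0.5 drops the 0.31 person
print(filter_detections(raw, conf_threshold=0.5, classes=["person"]))
```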
Next Steps
- HuggingFace Inference: cloud-based inference (no GPU needed)
- Build a Voice Agent: get started with voice
- Build a Video Agent: add video processing
- Video Processors: process video frames in real-time

