Run open-weight models locally on your own hardware using HuggingFace Transformers. Supports text LLMs, vision-language models, and real-time object detection, all without API calls.
Vision Agents requires a Stream account for real-time transport. Some models on HuggingFace are gated and require a HuggingFace account and access token (HF_TOKEN).
For cloud-based inference via HuggingFace’s Inference Providers API (no GPU required), see HuggingFace Inference.

Installation

# Local inference (LLM, VLM, object detection)
uv add "vision-agents-plugins-huggingface[transformers]"

# With 4-bit / 8-bit quantization support (BitsAndBytes)
uv add "vision-agents-plugins-huggingface[transformers-quantized]"
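
As noted above, some gated models require a HuggingFace access token. A minimal setup sketch (the token value is a placeholder; create a read token in your HuggingFace account settings):

```shell
# Placeholder token; gated repos also require accepting the license on the model page
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx

# Install the plugin with local-inference support
uv add "vision-agents-plugins-huggingface[transformers]"
```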

Local LLM

Run text language models locally with streaming and function calling.
from vision_agents.core import Agent, User
from vision_agents.plugins import huggingface, getstream, deepgram

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a helpful assistant.",
    llm=huggingface.TransformersLLM(
        model="google/gemma-3n-E2B-it",
    ),
    stt=deepgram.STT(),
    tts=deepgram.TTS(),
)

Function Calling

llm = huggingface.TransformersLLM(model="google/gemma-3-4b-it")

@llm.register_function(description="Get current weather for a city")
async def get_weather(city: str) -> str:
    return f"The weather in {city} is sunny."
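
Registered functions are invoked in rounds: the model may request a tool, the result is fed back, and generation continues, up to max_tool_rounds times. The loop below is an illustrative sketch of that flow, not the plugin's actual internals; fake_model and TOOLS are hypothetical stand-ins.

```python
def fake_model(messages):
    # Stand-in model: requests the weather tool once, then answers in text.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "get_weather", "arguments": {"city": "Paris"}}}
    return {"text": "It is sunny in Paris."}

def get_weather(city: str) -> str:
    return f"The weather in {city} is sunny."

TOOLS = {"get_weather": get_weather}

def run(messages, max_tool_rounds=3):
    for _ in range(max_tool_rounds):
        out = fake_model(messages)
        if "text" in out:          # final answer: stop looping
            return out["text"]
        call = out["tool_call"]    # execute the requested tool
        result = TOOLS[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": result})
    return "Tool round limit reached."

print(run([{"role": "user", "content": "Weather in Paris?"}]))
# -> It is sunny in Paris.
```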

Quantization

Reduce memory usage with 4-bit or 8-bit quantization. Requires the [transformers-quantized] extra.
llm = huggingface.TransformersLLM(
    model="meta-llama/Llama-3.2-3B-Instruct",
    quantization="4bit",
)
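
Back-of-envelope arithmetic shows why quantization matters: weight memory scales linearly with bits per parameter. (Real footprints are higher due to the KV cache, activations, and quantization overhead.)

```python
# Approximate weight memory for a 3B-parameter model at different precisions.
params = 3_000_000_000

def weight_gb(bits_per_param: float) -> float:
    return params * bits_per_param / 8 / 1e9

print(f"float16: {weight_gb(16):.1f} GB")  # 6.0 GB
print(f"8bit:    {weight_gb(8):.1f} GB")   # 3.0 GB
print(f"4bit:    {weight_gb(4):.1f} GB")   # 1.5 GB
```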

LLM Parameters

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| model | str | (required) | HuggingFace model ID |
| device | str | "auto" | "auto", "cuda", "mps", or "cpu" |
| quantization | str | "none" | "none", "4bit", or "8bit" |
| torch_dtype | str | "auto" | "auto", "float16", "bfloat16", or "float32" |
| trust_remote_code | bool | False | Allow custom model code (needed for Qwen, Phi, etc.) |
| max_new_tokens | int | 512 | Maximum tokens to generate per response |
| max_tool_rounds | int | 3 | Maximum tool-call rounds per response |

Local VLM

Run vision-language models that can see video frames from the call. Supports function calling.
from vision_agents.core import Agent, User
from vision_agents.plugins import huggingface, getstream, deepgram

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a visual assistant. Describe what you see.",
    llm=huggingface.TransformersVLM(
        model="Qwen/Qwen2-VL-2B-Instruct",
        fps=1,
        frame_buffer_seconds=10,
    ),
    stt=deepgram.STT(),
    tts=deepgram.TTS(),
)

VLM Parameters

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| model | str | (required) | HuggingFace model ID |
| device | str | "auto" | "auto", "cuda", "mps", or "cpu" |
| quantization | str | "none" | "none", "4bit", or "8bit" |
| torch_dtype | str | "auto" | "auto", "float16", "bfloat16", or "float32" |
| trust_remote_code | bool | True | Allow custom model code |
| fps | int | 1 | Frames per second to capture from video |
| frame_buffer_seconds | int | 10 | Seconds of video frames to buffer |
| max_frames | int | 4 | Maximum frames sent per inference (evenly sampled) |
| max_new_tokens | int | 512 | Maximum tokens to generate per response |
| max_tool_rounds | int | 3 | Maximum tool-call rounds per response |
| do_sample | bool | True | Use sampling for generation |
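
The fps, frame_buffer_seconds, and max_frames parameters interact: with fps=1 over a 10-second buffer, roughly 10 frames are held, but only max_frames=4 evenly sampled frames reach the model per inference. A sketch of that even sampling (a hypothetical helper, not the plugin's implementation):

```python
def evenly_sampled(n_buffered: int, max_frames: int) -> list[int]:
    """Pick up to max_frames frame indices spread evenly across the buffer."""
    if n_buffered <= max_frames:
        return list(range(n_buffered))
    step = (n_buffered - 1) / (max_frames - 1)
    return [round(i * step) for i in range(max_frames)]

# fps=1 over a 10-second buffer -> 10 buffered frames, 4 sent to the model
print(evenly_sampled(10, 4))  # [0, 3, 6, 9]
```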

Object Detection

Run detection models like RT-DETRv2 on live video frames. Emits DetectionCompletedEvent with bounding boxes for each processed frame.
from vision_agents.core import Agent, User
from vision_agents.plugins import huggingface, getstream, deepgram

processor = huggingface.TransformersDetectionProcessor(
    model="PekingU/rtdetr_v2_r101vd",
    conf_threshold=0.5,
    fps=5,
)

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant", id="agent"),
    instructions="You are a visual assistant.",
    llm=...,
    stt=deepgram.STT(),
    tts=deepgram.TTS(),
    processors=[processor],
)

@agent.events.subscribe
async def on_detection(event: huggingface.DetectionCompletedEvent):
    for obj in event.objects:
        print(f"{obj['label']} ({obj['confidence']:.0%})")
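
Each entry in event.objects is a dict with at least "label" and "confidence" keys, per the handler above, so detections can be post-filtered in plain Python. A sketch (any keys beyond those two are assumptions):

```python
def count_label(objects: list[dict], label: str, min_conf: float = 0.5) -> int:
    # Count detections of one class above a confidence threshold.
    return sum(
        1
        for obj in objects
        if obj["label"] == label and obj["confidence"] >= min_conf
    )

sample = [
    {"label": "person", "confidence": 0.91},
    {"label": "person", "confidence": 0.42},
    {"label": "dog", "confidence": 0.88},
]
print(count_label(sample, "person"))  # 1
```

For class-level filtering it is usually cheaper to set the processor's classes parameter (e.g. classes=["person"]) so unwanted detections are dropped before the event fires.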

Detection Parameters

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| model | str | "PekingU/rtdetr_v2_r101vd" | HuggingFace detection model ID |
| conf_threshold | float | 0.5 | Confidence threshold (0-1) |
| fps | int | 10 | Frame processing rate |
| classes | list[str] | None | Filter to specific class names (e.g. ["person"]) |
| device | str | "auto" | "auto", "cuda", "mps", or "cpu" |
| annotate | bool | True | Draw bounding boxes on output video |

Next Steps

- HuggingFace Inference: cloud-based inference (no GPU needed)
- Build a Voice Agent: get started with voice
- Build a Video Agent: add video processing
- Video Processors: process video frames in real time