Moondream is a powerful vision AI model that provides real-time vision capabilities including zero-shot object detection, visual question answering (VQA), and image captioning. The Moondream 3 model enables you to detect any object by simply describing it in natural language, answer questions about video frames, or generate descriptions automatically—all without requiring training or fine-tuning. The Moondream plugin in the Vision Agents SDK provides multiple components:
  • Detection processors: Cloud-hosted API and local on-device versions for object detection
  • Vision Language Models (VLMs): Cloud-hosted and local versions for visual question answering and image captioning

Installation

Install the Moondream plugin with:
uv add vision-agents-plugins-moondream

Quick Start - Detection

Using CloudDetectionProcessor (Hosted)

The CloudDetectionProcessor uses Moondream’s hosted API. By default it has a 2 RPS (requests per second) rate limit and requires an API key. The rate limit can be adjusted by contacting the Moondream team.
from vision_agents.plugins import moondream
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream

# Create a cloud processor with detection
processor = moondream.CloudDetectionProcessor(
    api_key="your-api-key",  # or set MOONDREAM_API_KEY env var
    detect_objects="person",  # or ["person", "car", "dog"] for multiple
    fps=30
)

# Use in an agent
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Vision Assistant"),
    instructions="You are a helpful vision assistant.",
    llm=gemini.Realtime(fps=10),
    processors=[processor],
)
To initialize without passing in the API key, make sure the MOONDREAM_API_KEY is available as an environment variable. You can do this either by defining it in a .env file or exporting it directly in your terminal.
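For example, here is a minimal sketch that loads the key from a .env file before constructing the processor. It assumes the python-dotenv package, which is not part of the plugin; if you export the variable in your shell instead, no extra code is needed.
from dotenv import load_dotenv
from vision_agents.plugins import moondream

# Load MOONDREAM_API_KEY from a .env file in the working directory
load_dotenv()

# No api_key argument needed; the processor reads MOONDREAM_API_KEY from the environment
processor = moondream.CloudDetectionProcessor(detect_objects="person")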

Using LocalDetectionProcessor (On-Device)

If you are running on your own infrastructure or using a service like Digital Ocean’s Gradient AI GPUs, you can use the LocalDetectionProcessor, which downloads the model from HuggingFace and runs it on-device.
The moondream3-preview model is gated, so you need to authenticate with HuggingFace before the model can be downloaded on first use.
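One way to authenticate is shown in the sketch below. It assumes the huggingface_hub package and a token with access to the gated repository stored in the HF_TOKEN environment variable; running huggingface-cli login in your terminal is an equivalent alternative.
import os
from huggingface_hub import login

# Log in with a token that has access to the gated moondream3-preview repository
login(token=os.environ["HF_TOKEN"])
Once authenticated, create the processor just as in the cloud quick start: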
from vision_agents.plugins import moondream
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream

# Create a local processor (no API key needed)
processor = moondream.LocalDetectionProcessor(
    detect_objects=["person", "car", "dog"],
    conf_threshold=0.3,
    device="cuda",  # Auto-detects CUDA, MPS, or CPU
    fps=30
)

# Use in an agent
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Vision Assistant"),
    instructions="You are a helpful vision assistant.",
    llm=gemini.Realtime(fps=10),
    processors=[processor],
)

Quick Start - Vision Language Models

The VLM supports two modes:
  • "vqa" (Visual Question Answering): Answers questions about video frames. Questions come from STT transcripts.
  • "caption" (Image Captioning): Generates descriptions of video frames automatically.

Using CloudVLM (Hosted)

The CloudVLM uses Moondream’s hosted API for visual question answering and captioning. It automatically processes video frames and responds to questions asked via STT (Speech-to-Text).
import asyncio
from vision_agents.core import User, Agent
from vision_agents.plugins import deepgram, getstream, elevenlabs, moondream

async def create_agent():
    # Create a cloud VLM for visual question answering
    llm = moondream.CloudVLM(
        api_key="your-api-key",  # or set MOONDREAM_API_KEY env var
        mode="vqa",  # or "caption" for image captioning
    )
    
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Vision Assistant", id="agent"),
        llm=llm,
        tts=elevenlabs.TTS(),
        stt=deepgram.STT(),
    )
    return agent
To initialize without passing in the API key, make sure the MOONDREAM_API_KEY is available as an environment variable.

Using LocalVLM (On-Device)

The LocalVLM downloads the model from HuggingFace and runs it on-device. It supports both VQA and captioning modes.
The moondream3-preview model is gated and requires HuggingFace authentication (see the authentication sketch in the LocalDetectionProcessor section above):
from vision_agents.plugins import moondream
from vision_agents.core import Agent

# Create a local VLM (no API key needed)
llm = moondream.LocalVLM(
    mode="vqa",  # or "caption" for image captioning
    force_cpu=False,  # auto-detects CUDA, otherwise CPU; set True to force CPU
)

# Use in an agent
agent = Agent(
    llm=llm,
    tts=your_tts,
    stt=your_stt,
    # ... other components
)

Cloud vs. Local

We recommend that most users stick with the Cloud version, since it takes care of hosting, model updates, and the other complexities that come with them. If you are feeling adventurous or would like to host the model yourself, we recommend doing so on CUDA devices for the best experience.
Cloud
  • Use when: You want a simple setup with no infrastructure management
  • Pros: No model download, no GPU required, automatic updates
  • Cons: Requires API key, 2 RPS rate limit by default (can be increased)
  • Best for: Development, testing, low-to-medium volume applications
Local (For Advanced Users)
  • Use when: You need higher throughput, have your own GPU infrastructure, or want to avoid rate limits
  • Pros: No rate limits, no API costs, full control over hardware
  • Cons: Requires GPU for best performance, model download on first use, infrastructure management
  • Best for: Production deployments, high-volume applications, custom infrastructure

Detect Multiple Objects

Both detection processors support zero-shot detection of multiple object types simultaneously:
# Detect multiple object types with zero-shot detection
processor = moondream.CloudDetectionProcessor(
    api_key="your-api-key",
    detect_objects=["person", "car", "dog", "basketball", "laptop"],
    conf_threshold=0.3
)

Configuration

Detection Processor Parameters

CloudDetectionProcessor Parameters

Name | Type | Default | Description
api_key | str or None | None | API key for the Moondream Cloud API. If not provided, will attempt to read from the MOONDREAM_API_KEY environment variable.
detect_objects | str or List[str] | "person" | Object(s) to detect using zero-shot detection. Can be any object name like "person", "car", "basketball".
conf_threshold | float | 0.3 | Confidence threshold for detections.
fps | int | 30 | Frame processing rate.
interval | int | 0 | Processing interval in seconds.
max_workers | int | 10 | Thread pool size for CPU-intensive operations.
By default, the Moondream Cloud API has a 2 RPS (requests per second) rate limit. Contact the Moondream team to request a higher limit.
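Putting these parameters together, a configuration sketch with every documented argument spelled out (the values are mostly the defaults and are illustrative, not recommendations):
from vision_agents.plugins import moondream

processor = moondream.CloudDetectionProcessor(
    api_key=None,                      # falls back to the MOONDREAM_API_KEY environment variable
    detect_objects=["person", "car"],  # zero-shot detection targets
    conf_threshold=0.3,                # drop detections below this confidence
    fps=30,                            # frame processing rate
    interval=0,                        # processing interval in seconds
    max_workers=10,                    # thread pool size for CPU-intensive work
)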

LocalDetectionProcessor Parameters

Name | Type | Default | Description
detect_objects | str or List[str] | "person" | Object(s) to detect using zero-shot detection. Can be any object name like "person", "car", "basketball".
conf_threshold | float | 0.3 | Confidence threshold for detections.
fps | int | 30 | Frame processing rate.
interval | int | 0 | Processing interval in seconds.
max_workers | int | 10 | Thread pool size for CPU-intensive operations.
device | str or None | None | Device to run inference on ("cuda", "mps", or "cpu"). Auto-detects CUDA, then MPS (Apple Silicon), then defaults to CPU.
model_name | str | "moondream/moondream3-preview" | Hugging Face model identifier.
options | AgentOptions or None | None | Model directory configuration. If not provided, uses the default, which stores the model under tempfile.gettempdir().
Performance will vary depending on your hardware configuration. CUDA is recommended for best performance on NVIDIA GPUs. The model will be downloaded from HuggingFace on first use.
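As a companion sketch, the same spelled-out style for the local processor (illustrative values; options is omitted so the default model directory is used):
from vision_agents.plugins import moondream

processor = moondream.LocalDetectionProcessor(
    detect_objects=["person", "car"],
    conf_threshold=0.3,
    fps=30,
    interval=0,
    max_workers=10,
    device=None,  # auto-detect: CUDA, then MPS, then CPU
    model_name="moondream/moondream3-preview",
)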

Vision Language Model Parameters

CloudVLM Parameters

Name | Type | Default | Description
api_key | str or None | None | API key for the Moondream Cloud API. If not provided, will attempt to read from the MOONDREAM_API_KEY environment variable.
mode | Literal["vqa", "caption"] | "vqa" | "vqa" for visual question answering or "caption" for image captioning.
max_workers | int | 10 | Thread pool size for CPU-intensive operations.
By default, the Moondream Cloud API has rate limits. Contact the Moondream team to request higher limits.

LocalVLM Parameters

Name | Type | Default | Description
mode | Literal["vqa", "caption"] | "vqa" | "vqa" for visual question answering or "caption" for image captioning.
max_workers | int | 10 | Thread pool size for async operations.
force_cpu | bool | False | If True, force CPU usage even if CUDA/MPS is available. Auto-detects CUDA, then MPS (Apple Silicon), then defaults to CPU. Note: MPS is automatically converted to CPU due to model compatibility. We recommend running on CUDA for best performance.
model_name | str | "moondream/moondream3-preview" | Hugging Face model identifier.
options | AgentOptions or None | None | Model directory configuration. If not provided, uses default_agent_options().
Performance will vary depending on your hardware configuration. CUDA is recommended for best performance on NVIDIA GPUs. The model will be downloaded from HuggingFace on first use.

Video Publishing

Both detection processors publish annotated video frames with bounding boxes drawn around detected objects:
processor = moondream.CloudDetectionProcessor(
    api_key="your-api-key",
    detect_objects=["person", "car"]
)

# The track will show:
# - Green bounding boxes around detected objects
# - Labels with confidence scores
# - Real-time annotation overlay
The annotated video is automatically sent to your realtime LLM, enabling it to understand what objects are present in the scene and their locations.