Moondream is a powerful vision AI model that provides real-time vision capabilities including zero-shot object detection, visual question answering (VQA), and image captioning. The Moondream 3 model enables you to detect any object by simply describing it in natural language, answer questions about video frames, or generate descriptions automatically—all without requiring training or fine-tuning. The Moondream plugin in the Vision Agents SDK provides multiple components:
  • Detection processors: Cloud-hosted API and local on-device versions for object detection
  • Vision Language Models (VLMs): Cloud-hosted and local versions for visual question answering and image captioning

Installation

Install the Moondream plugin with:
uv add vision-agents-plugins-moondream

Quick Start - Detection

Using CloudDetectionProcessor (Hosted)

The CloudDetectionProcessor uses Moondream’s hosted API. By default it has a 2 RPS (requests per second) rate limit and requires an API key. The rate limit can be adjusted by contacting the Moondream team.
from vision_agents.plugins import moondream
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream

# Create a cloud processor with detection
processor = moondream.CloudDetectionProcessor(
    api_key="your-api-key",  # or set MOONDREAM_API_KEY env var
    detect_objects="person",  # or ["person", "car", "dog"] for multiple
    fps=30
)

# Use in an agent
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Vision Assistant"),
    instructions="You are a helpful vision assistant.",
    llm=gemini.Realtime(fps=10),
    processors=[processor],
)
To initialize without passing in the API key, make sure the MOONDREAM_API_KEY is available as an environment variable. You can do this either by defining it in a .env file or exporting it directly in your terminal.
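For example, here is a minimal sketch that loads the key from a .env file before constructing the processor. It assumes the python-dotenv package, which is not part of the plugin; if you export the variable in your shell instead, no extra code is needed.
from dotenv import load_dotenv
from vision_agents.plugins import moondream

# Load MOONDREAM_API_KEY from a .env file in the working directory
load_dotenv()

# No api_key argument needed; the processor reads MOONDREAM_API_KEY from the environment
processor = moondream.CloudDetectionProcessor(detect_objects="person")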

Using LocalDetectionProcessor (On-Device)

If you are running on your own infrastructure or using a service like Digital Ocean’s Gradient AI GPUs, you can use the LocalDetectionProcessor, which downloads the model from HuggingFace and runs it on-device.
The moondream3-preview model is gated, so you need to authenticate with HuggingFace before the model can be downloaded on first use.
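One way to authenticate is shown in the sketch below. It assumes the huggingface_hub package and a token with access to the gated repository stored in the HF_TOKEN environment variable; running huggingface-cli login in your terminal is an equivalent alternative.
import os
from huggingface_hub import login

# Log in with a token that has access to the gated moondream3-preview repository
login(token=os.environ["HF_TOKEN"])
Once authenticated, create the processor just as in the cloud quick start: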
from vision_agents.plugins import moondream
from vision_agents.core import Agent, User
from vision_agents.plugins import gemini, getstream

# Create a local processor (no API key needed)
processor = moondream.LocalDetectionProcessor(
    detect_objects=["person", "car", "dog"],
    conf_threshold=0.3,
    device="cuda",  # Auto-detects CUDA, MPS, or CPU
    fps=30
)

# Use in an agent
agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Vision Assistant"),
    instructions="You are a helpful vision assistant.",
    llm=gemini.Realtime(fps=10),
    processors=[processor],
)

Quick Start - Vision Language Models

The VLM supports two modes:
  • "vqa" (Visual Question Answering): Answers questions about video frames. Questions come from STT transcripts.
  • "caption" (Image Captioning): Generates descriptions of video frames automatically.

Using CloudVLM (Hosted)

The CloudVLM uses Moondream’s hosted API for visual question answering and captioning. It automatically processes video frames and responds to questions asked via STT (Speech-to-Text).
import asyncio
from vision_agents.core import User, Agent
from vision_agents.plugins import deepgram, getstream, elevenlabs, moondream

async def create_agent():
    # Create a cloud VLM for visual question answering
    llm = moondream.CloudVLM(
        api_key="your-api-key",  # or set MOONDREAM_API_KEY env var
        mode="vqa",  # or "caption" for image captioning
    )
    
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Vision Assistant", id="agent"),
        llm=llm,
        tts=elevenlabs.TTS(),
        stt=deepgram.STT(),
    )
    return agent
To initialize without passing in the API key, make sure the MOONDREAM_API_KEY is available as an environment variable.

Using LocalVLM (On-Device)

The LocalVLM downloads the model from HuggingFace and runs it on-device. It supports both VQA and captioning modes.
The moondream3-preview model is gated and requires HuggingFace authentication (see the authentication sketch in the LocalDetectionProcessor section above):
from vision_agents.plugins import moondream
from vision_agents.core import Agent

# Create a local VLM (no API key needed)
llm = moondream.LocalVLM(
    mode="vqa",  # or "caption" for image captioning
    force_cpu=False,  # auto-detects CUDA, otherwise CPU; set True to force CPU
)

# Use in an agent
agent = Agent(
    llm=llm,
    tts=your_tts,
    stt=your_stt,
    # ... other components
)

Cloud vs. Local

We recommend that most users stick with the Cloud version, since it takes care of hosting, model updates, and the other complexities that come with them. If you are feeling adventurous or would like to host the model yourself, we recommend doing so on CUDA devices for the best experience.
Cloud
  • Use when: You want a simple setup with no infrastructure management
  • Pros: No model download, no GPU required, automatic updates
  • Cons: Requires API key, 2 RPS rate limit by default (can be increased)
  • Best for: Development, testing, low-to-medium volume applications
Local (For Advanced Users)
  • Use when: You need higher throughput, have your own GPU infrastructure, or want to avoid rate limits
  • Pros: No rate limits, no API costs, full control over hardware
  • Cons: Requires GPU for best performance, model download on first use, infrastructure management
  • Best for: Production deployments, high-volume applications, custom infrastructure

Detect Multiple Objects

Both detection processors support zero-shot detection of multiple object types simultaneously:
# Detect multiple object types with zero-shot detection
processor = moondream.CloudDetectionProcessor(
    api_key="your-api-key",
    detect_objects=["person", "car", "dog", "basketball", "laptop"],
    conf_threshold=0.3
)

Configuration

Detection Processor Parameters

CloudDetectionProcessor Parameters

Name | Type | Default | Description
api_key | str or None | None | API key for the Moondream Cloud API. If not provided, will attempt to read from the MOONDREAM_API_KEY environment variable.
detect_objects | str or List[str] | "person" | Object(s) to detect using zero-shot detection. Can be any object name like "person", "car", "basketball".
conf_threshold | float | 0.3 | Confidence threshold for detections.
fps | int | 30 | Frame processing rate.
interval | int | 0 | Processing interval in seconds.
max_workers | int | 10 | Thread pool size for CPU-intensive operations.
By default, the Moondream Cloud API has a 2 RPS (requests per second) rate limit. Contact the Moondream team to request a higher limit.
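Putting these parameters together, a configuration sketch with every documented argument spelled out (the values are mostly the defaults and are illustrative, not recommendations):
from vision_agents.plugins import moondream

processor = moondream.CloudDetectionProcessor(
    api_key=None,                      # falls back to the MOONDREAM_API_KEY environment variable
    detect_objects=["person", "car"],  # zero-shot detection targets
    conf_threshold=0.3,                # drop detections below this confidence
    fps=30,                            # frame processing rate
    interval=0,                        # processing interval in seconds
    max_workers=10,                    # thread pool size for CPU-intensive work
)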

LocalDetectionProcessor Parameters

Name | Type | Default | Description
detect_objects | str or List[str] | "person" | Object(s) to detect using zero-shot detection. Can be any object name like "person", "car", "basketball".
conf_threshold | float | 0.3 | Confidence threshold for detections.
fps | int | 30 | Frame processing rate.
interval | int | 0 | Processing interval in seconds.
max_workers | int | 10 | Thread pool size for CPU-intensive operations.
device | str or None | None | Device to run inference on ("cuda", "mps", or "cpu"). Auto-detects CUDA, then MPS (Apple Silicon), then defaults to CPU.
model_name | str | "moondream/moondream3-preview" | Hugging Face model identifier.
options | AgentOptions or None | None | Model directory configuration. If not provided, uses the default, which stores the model under tempfile.gettempdir().
Performance will vary depending on your hardware configuration. CUDA is recommended for best performance on NVIDIA GPUs. The model will be downloaded from HuggingFace on first use.
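As a companion sketch, the same spelled-out style for the local processor (illustrative values; options is omitted so the default model directory is used):
from vision_agents.plugins import moondream

processor = moondream.LocalDetectionProcessor(
    detect_objects=["person", "car"],
    conf_threshold=0.3,
    fps=30,
    interval=0,
    max_workers=10,
    device=None,  # auto-detect: CUDA, then MPS, then CPU
    model_name="moondream/moondream3-preview",
)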

Vision Language Model Parameters

CloudVLM Parameters

Name | Type | Default | Description
api_key | str or None | None | API key for the Moondream Cloud API. If not provided, will attempt to read from the MOONDREAM_API_KEY environment variable.
mode | Literal["vqa", "caption"] | "vqa" | "vqa" for visual question answering or "caption" for image captioning.
max_workers | int | 10 | Thread pool size for CPU-intensive operations.
By default, the Moondream Cloud API has rate limits. Contact the Moondream team to request higher limits.

LocalVLM Parameters

Name | Type | Default | Description
mode | Literal["vqa", "caption"] | "vqa" | "vqa" for visual question answering or "caption" for image captioning.
max_workers | int | 10 | Thread pool size for async operations.
force_cpu | bool | False | If True, force CPU usage even if CUDA/MPS is available. Auto-detects CUDA, then MPS (Apple Silicon), then defaults to CPU. Note: MPS is automatically converted to CPU due to model compatibility. We recommend running on CUDA for best performance.
model_name | str | "moondream/moondream3-preview" | Hugging Face model identifier.
options | AgentOptions or None | None | Model directory configuration. If not provided, uses default_agent_options().
Performance will vary depending on your hardware configuration. CUDA is recommended for best performance on NVIDIA GPUs. The model will be downloaded from HuggingFace on first use.

Video Publishing

Both detection processors publish annotated video frames with bounding boxes drawn around detected objects:
processor = moondream.CloudDetectionProcessor(
    api_key="your-api-key",
    detect_objects=["person", "car"]
)

# The track will show:
# - Green bounding boxes around detected objects
# - Labels with confidence scores
# - Real-time annotation overlay
The annotated video is automatically sent to your realtime LLM, enabling it to understand what objects are present in the scene and their locations.