- Detection processors: Cloud-hosted API and local on-device versions for object detection
- Vision Language Models (VLMs): Cloud-hosted and local versions for visual question answering and image captioning
## Installation

Install the Moondream plugin with pip.

## Quick Start - Detection
### Using CloudDetectionProcessor (Hosted)

The `CloudDetectionProcessor` uses Moondream’s hosted API. By default it has a 2 RPS (requests per second) rate limit and requires an API key. The rate limit can be adjusted by contacting the Moondream team.
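A minimal sketch of constructing the hosted detector. The import path is an assumption; the class name and parameters come from the tables below:

```python
from vision_agents.plugins import moondream  # import path assumed

# With api_key omitted, the key is read from the MOONDREAM_API_KEY
# environment variable (see the parameter table below).
processor = moondream.CloudDetectionProcessor(
    detect_objects="person",  # zero-shot: any object name works
    conf_threshold=0.3,       # drop low-confidence detections
)
```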
### Using LocalDetectionProcessor (On-Device)

If you are running on your own infrastructure, or using a service like DigitalOcean’s Gradient AI GPUs, you can use the `LocalDetectionProcessor`, which downloads the model from HuggingFace and runs on-device.
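A sketch under the same assumed import; `device` is optional and auto-detected when omitted:

```python
from vision_agents.plugins import moondream  # import path assumed

# Downloads moondream/moondream3-preview from HuggingFace on first use.
processor = moondream.LocalDetectionProcessor(
    detect_objects="person",
    device="cuda",  # or "mps" / "cpu"; auto-detected when omitted
)
```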
## Quick Start - Vision Language Models

The VLM supports two modes:

- `"vqa"` (Visual Question Answering): answers questions about video frames. Questions come from STT transcripts.
- `"caption"` (Image Captioning): generates descriptions of video frames automatically.
### Using CloudVLM (Hosted)

The `CloudVLM` uses Moondream’s hosted API for visual question answering and captioning. It automatically processes video frames and responds to questions asked via STT (Speech-to-Text).
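A hedged sketch (import path assumed, as above):

```python
from vision_agents.plugins import moondream  # import path assumed

# VQA mode: answers questions from the STT transcript about the
# current video frames. Reads MOONDREAM_API_KEY if api_key is omitted.
vlm = moondream.CloudVLM(mode="vqa")
```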
### Using LocalVLM (On-Device)

The `LocalVLM` downloads the model from HuggingFace and runs on-device. It supports both VQA and captioning modes.
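A sketch under the same assumed import, here using caption mode:

```python
from vision_agents.plugins import moondream  # import path assumed

# Caption mode: generates frame descriptions automatically, on-device.
vlm = moondream.LocalVLM(mode="caption")
```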
## Cloud vs. Local

We recommend most users stick with the Cloud version, since it takes care of hosting, model updates, and the various complexities that come with them. If you are feeling adventurous or would like to host the model yourself, we recommend doing so on CUDA devices for the best experience.

### Cloud

- Use when: You want a simple setup with no infrastructure management
- Pros: No model download, no GPU required, automatic updates
- Cons: Requires API key, 2 RPS rate limit by default (can be increased)
- Best for: Development, testing, low-to-medium volume applications
### Local

- Use when: You need higher throughput, have your own GPU infrastructure, or want to avoid rate limits
- Pros: No rate limits, no API costs, full control over hardware
- Cons: Requires GPU for best performance, model download on first use, infrastructure management
- Best for: Production deployments, high-volume applications, custom infrastructure
## Detect Multiple Objects

Both detection processors support zero-shot detection of multiple object types simultaneously:
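```python
from vision_agents.plugins import moondream  # import path assumed

# Pass a list to detect several object types in the same frames;
# LocalDetectionProcessor accepts the same argument.
processor = moondream.CloudDetectionProcessor(
    detect_objects=["person", "car", "basketball"],
)
```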
## Configuration

### Detection Processor Parameters

#### CloudDetectionProcessor Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| `api_key` | `str` or `None` | `None` | API key for the Moondream Cloud API. If not provided, will attempt to read from the `MOONDREAM_API_KEY` environment variable. |
| `detect_objects` | `str` or `List[str]` | `"person"` | Object(s) to detect using zero-shot detection. Can be any object name like “person”, “car”, “basketball”. |
| `conf_threshold` | `float` | `0.3` | Confidence threshold for detections. |
| `fps` | `int` | `30` | Frame processing rate. |
| `interval` | `int` | `0` | Processing interval in seconds. |
| `max_workers` | `int` | `10` | Thread pool size for CPU-intensive operations. |
By default, the Moondream Cloud API has a 2 RPS (requests per second) rate limit. Contact the Moondream team to request a higher limit.
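For instance, to keep request volume down under the default limit, you might lower the processing cadence (a sketch; `interval` semantics as documented above):

```python
from vision_agents.plugins import moondream  # import path assumed

processor = moondream.CloudDetectionProcessor(
    detect_objects="car",
    conf_threshold=0.5,  # keep only higher-confidence detections
    interval=1,          # process roughly one frame per second
)
```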
#### LocalDetectionProcessor Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| `detect_objects` | `str` or `List[str]` | `"person"` | Object(s) to detect using zero-shot detection. Can be any object name like “person”, “car”, “basketball”. |
| `conf_threshold` | `float` | `0.3` | Confidence threshold for detections. |
| `fps` | `int` | `30` | Frame processing rate. |
| `interval` | `int` | `0` | Processing interval in seconds. |
| `max_workers` | `int` | `10` | Thread pool size for CPU-intensive operations. |
| `device` | `str` or `None` | `None` | Device to run inference on (`"cuda"`, `"mps"`, or `"cpu"`). Auto-detects CUDA, then MPS (Apple Silicon), then defaults to CPU. |
| `model_name` | `str` | `"moondream/moondream3-preview"` | Hugging Face model identifier. |
| `options` | `AgentOptions` or `None` | `None` | Model directory configuration. If not provided, uses the default, which stores models under `tempfile.gettempdir()`. |
Performance will vary depending on your hardware configuration. CUDA is recommended for best performance on NVIDIA GPUs. The model will be downloaded from HuggingFace on first use.
### Vision Language Model Parameters

#### CloudVLM Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| `api_key` | `str` or `None` | `None` | API key for the Moondream Cloud API. If not provided, will attempt to read from the `MOONDREAM_API_KEY` environment variable. |
| `mode` | `Literal["vqa", "caption"]` | `"vqa"` | `"vqa"` for visual question answering or `"caption"` for image captioning. |
| `max_workers` | `int` | `10` | Thread pool size for CPU-intensive operations. |
By default, the Moondream Cloud API has rate limits. Contact the Moondream team to request higher limits.
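For example, captioning through the hosted API (sketch, same assumed import as the quick-start examples):

```python
from vision_agents.plugins import moondream  # import path assumed

vlm = moondream.CloudVLM(
    mode="caption",  # describe frames instead of answering questions
    max_workers=4,   # smaller thread pool for a low-volume app
)
```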
#### LocalVLM Parameters
| Name | Type | Default | Description |
|---|---|---|---|
| `mode` | `Literal["vqa", "caption"]` | `"vqa"` | `"vqa"` for visual question answering or `"caption"` for image captioning. |
| `max_workers` | `int` | `10` | Thread pool size for async operations. |
| `force_cpu` | `bool` | `False` | If `True`, force CPU usage even if CUDA/MPS is available. Auto-detects CUDA, then MPS (Apple Silicon), then defaults to CPU. Note: MPS is automatically converted to CPU due to model compatibility. We recommend running on CUDA for best performance. |
| `model_name` | `str` | `"moondream/moondream3-preview"` | Hugging Face model identifier. |
| `options` | `AgentOptions` or `None` | `None` | Model directory configuration. If not provided, uses `default_agent_options()`. |
Performance will vary depending on your hardware configuration. CUDA is recommended for best performance on NVIDIA GPUs. The model will be downloaded from HuggingFace on first use.
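For example, forcing CPU inference on a machine without a supported GPU (sketch, assumed import as above):

```python
from vision_agents.plugins import moondream  # import path assumed

vlm = moondream.LocalVLM(
    mode="vqa",
    force_cpu=True,  # skip CUDA/MPS detection; slower but portable
)
```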

