DataDreamer

Figure: DataDreamer examples

Quickstart

DataDreamer is a Luxonis tool for two related workflows:
  • Generate synthetic datasets from scratch with generative AI and foundation vision models.
  • Pre-annotate existing image datasets to reduce manual labeling work before review and training.
This makes it useful when you do not have enough training data yet, or when you want to speed up dataset preparation before training with Luxonis Train.
To generate a dataset with custom classes, you only need to run two commands:
Command Line
pip install datadreamer
datadreamer --class_names person moon robot
For the complete CLI surface, see the API reference.
Tutorials are available to help you start generating data and training models with DataDreamer.
For more information, visit the DataDreamer GitHub repository.

Overview

Figure: DataDreamer scheme
DataDreamer is a toolkit for developing edge AI models regardless of how much data you start with. Its distinctive features include:
  • Synthetic Data Generation: Remove the dependency on large datasets for training. DataDreamer generates synthetic datasets from scratch, using generative models that produce high-quality, diverse images.
  • Real-data Pre-annotation: Use the same annotation pipeline on images you already have to bootstrap labels for classification, object detection, and instance segmentation tasks.
  • Knowledge Extraction from Foundation Models: DataDreamer draws on the knowledge embedded in large pre-trained models and transfers it from these foundation models to smaller, custom-built models.
  • Efficient and Potent Models: The primary goal of DataDreamer is to enable compact models that are small enough to deploy on edge devices and robust enough for specialized tasks.

Features

  • Prompt Generation: Automate the creation of image prompts using powerful language models.
      Provided class names: ["horse", "robot"]
      Generated prompt: "A photo of a horse and a robot coexisting peacefully in the midst of a serene pasture."
  • Image Generation: Generate synthetic datasets with state-of-the-art generative models.
      Figure: generated image
  • Dataset Annotation: Leverage foundation models to label datasets automatically.
      Figure: annotated image
  • Pre-annotate Existing Data: Run DataDreamer on a directory of real images when you want an initial pass of labels before manual QA.
  • Edge Model Training: Train efficient small-scale neural networks for edge deployment. (not part of this library)

Pre-annotating existing datasets

DataDreamer can skip synthetic image generation entirely and annotate images you already have. This is useful when you want to accelerate labeling without changing the rest of your training workflow.
Command Line
datadreamer \
  --task instance-segmentation \
  --image_annotator owlv2-slimsam \
  --save_dir dataset_path \
  --class_names dumpling \
  --annotate_only
--annotate_only disables prompt generation and image generation, so DataDreamer only runs the annotation stage. Combine it with --task and --image_annotator to switch between classification, detection, and instance segmentation workflows.
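To show how --task and --image_annotator combine with --annotate_only, here is a minimal sketch of a shell wrapper. The function name preannotate and its argument order are hypothetical; only the flags come from the command above.

```shell
# Hypothetical helper (not part of DataDreamer): pre-annotate an existing
# image directory. The directory is passed as --save_dir, as in the example above.
# Usage: preannotate TASK ANNOTATOR IMAGE_DIR CLASS [CLASS...]
preannotate() {
  if [ "$#" -lt 4 ]; then
    echo "usage: preannotate TASK ANNOTATOR IMAGE_DIR CLASS [CLASS...]" >&2
    return 2
  fi
  task="$1"
  annotator="$2"
  image_dir="$3"
  shift 3  # the remaining arguments are the class names

  datadreamer \
    --task "$task" \
    --image_annotator "$annotator" \
    --save_dir "$image_dir" \
    --class_names "$@" \
    --annotate_only
}
```

Calling `preannotate instance-segmentation owlv2-slimsam dataset_path dumpling` reproduces the command above. For other tasks, swap in the matching pair, for example `preannotate detection owlv2 dataset_path dumpling`. The annotator name owlv2 is an assumption here, so check the CLI help for the accepted values.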

Installation

DataDreamer can either be installed from PyPI or run from the published Docker image.
Command Line
pip install datadreamer
Command Line
docker pull ghcr.io/luxonis/datadreamer:latest
docker run --rm -v "$(pwd):/app" ghcr.io/luxonis/datadreamer:latest --save_dir generated_dataset --device cpu
When running the container on a CUDA-capable machine, pass --gpus all to docker run and --device cuda to DataDreamer.
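As a sketch of the GPU variant, the wrapper below adds --gpus all and --device cuda to the CPU command shown above. The function name datadreamer_gpu is hypothetical, and --gpus all generally requires the NVIDIA Container Toolkit on the host.

```shell
# Hypothetical wrapper (not part of DataDreamer): run the published image
# with GPU access. Any extra arguments are forwarded to DataDreamer itself.
datadreamer_gpu() {
  docker run --rm --gpus all \
    -v "$(pwd):/app" \
    ghcr.io/luxonis/datadreamer:latest \
    --device cuda "$@"
}
```

For example, `datadreamer_gpu --save_dir generated_dataset` mirrors the CPU command with the two GPU flags added.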

Hardware requirements

  • A CUDA-compatible GPU with at least 16 GB of memory is recommended for the best experience.
  • At least 16 GB of system RAM is recommended, with 32 GB or more preferred for larger jobs.
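One way to check a machine against these recommendations is to query the NVIDIA driver tools before choosing a --device value. This is a rough sketch: the function names are hypothetical, and a working nvidia-smi does not by itself guarantee that the installed PyTorch build can use the GPU.

```shell
# Print each GPU's name and total memory, if the NVIDIA driver tools are present.
gpu_summary() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
  else
    echo "nvidia-smi not found; no NVIDIA GPU detected" >&2
    return 1
  fi
}

# Print "cuda" when a GPU can be queried, otherwise "cpu".
pick_device() {
  if gpu_summary >/dev/null 2>&1; then
    echo cuda
  else
    echo cpu
  fi
}
```

The result can be passed straight to the CLI, for example `datadreamer --class_names person moon robot --device "$(pick_device)"`.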

Available models

Model Category    | Model Names              | Description/Notes
------------------+--------------------------+----------------------------------------
Prompt Generation | Mistral-7B-Instruct-v0.1 | Semantically rich prompts
                  | TinyLlama-1.1B-Chat-v1.0 | Tiny LM
                  | Qwen2.5-1.5B-Instruct    | Qwen2.5 LM
                  | Simple random generator  | Joins randomly chosen object names
Profanity Filter  | Qwen2.5-1.5B-Instruct    | Fast and accurate LM profanity filter
Image Generation  | SDXL-1.0                 | Slow and accurate (1024x1024 images)
                  | SDXL-Turbo               | Fast and less accurate (512x512 images)
                  | SDXL-Lightning           | Fast and accurate (1024x1024 images)
                  | Shuttle-3-Diffusion      | Fast and accurate (512x512 images)
Image Annotation  | OWLv2                    | Open-Vocabulary object detector
                  | CLIP                     | Zero-shot-image-classification
                  | AIMv2                    | Zero-shot-image-classification
                  | SlimSAM                  | Zero-shot-instance-segmentation
                  | SAM2.1                   | Zero-shot-instance-segmentation

Usage

DataDreamer can be driven either directly from CLI arguments or from a YAML config file.
Command Line
datadreamer --save_dir path/to/save_directory --config configs/det_config.yaml
Some of the most important options are:
  • --task for selecting detection, classification, or instance-segmentation
  • --dataset_format for choosing raw, yolo, coco, voc, luxonis-dataset, or cls-single
  • --image_annotator for selecting the detector, classifier, or segmentation stack
  • --device for choosing between cuda and cpu
In raw format, DataDreamer writes generated images together with prompts.json and annotations.json. Other output formats are designed to plug directly into common training pipelines.
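After a raw-format run, a quick sanity check is to confirm that both JSON files were written. The sketch below assumes they sit at the top level of the --save_dir directory. That location and the function name check_raw_output are assumptions, not documented behavior.

```shell
# Hypothetical helper: report whether prompts.json and annotations.json exist
# directly inside the given directory (assumed location).
# Returns 0 when both are found, 1 otherwise.
check_raw_output() {
  dir="$1"
  status=0
  for f in prompts.json annotations.json; do
    if [ -f "$dir/$f" ]; then
      echo "found:   $dir/$f"
    else
      echo "missing: $dir/$f" >&2
      status=1
    fi
  done
  return "$status"
}
```

Run it as `check_raw_output path/to/save_directory` once generation finishes.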

Example

Command Line
datadreamer \
  --save_dir path/to/save_directory \
  --class_names person moon robot \
  --prompts_number 20 \
  --prompt_generator simple \
  --num_objects_range 1 3 \
  --image_generator sdxl-turbo
This command generates images for the specified objects and saves them, together with their annotations, in the given directory. Here it requests 20 prompts from the simple random generator, with one to three objects per prompt, and renders them with SDXL-Turbo. The other parameters let you adapt generation to different needs and hardware configurations.
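To compare image generators on the same classes, the example command can be repeated with a separate output directory per generator. This is only a sketch: the function name compare_generators and the directory layout are hypothetical, and the two generator names are the ones used elsewhere on this page.

```shell
# Hypothetical helper: run the example command once per image generator,
# writing each run to its own subdirectory of BASE_DIR.
# Usage: compare_generators [BASE_DIR]
compare_generators() {
  base_dir="${1:-comparison}"
  for generator in sdxl-turbo sdxl-lightning; do
    datadreamer \
      --save_dir "$base_dir/$generator" \
      --class_names person moon robot \
      --prompts_number 20 \
      --prompt_generator simple \
      --num_objects_range 1 3 \
      --image_generator "$generator" || return 1
  done
}
```

Calling `compare_generators runs` would leave the results in runs/sdxl-turbo and runs/sdxl-lightning for side-by-side inspection.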

Useful tips

  • Batched generation: Increase --batch_size_prompt, --batch_size_image, and --batch_size_annotation when memory allows to improve throughput.
  • Better image quality: Prefer sdxl-lightning, shuttle-3, or sdxl over sdxl-turbo when quality matters more than speed.
  • Image selection: Use --use_image_tester together with --image_tester_patience when you want DataDreamer to spend more time selecting better generations.
  • Number of objects per image: Keep --num_objects_range modest. Values above 3 are usually harder for current image generators to render reliably.
  • Prompt generation: Use --prompt_generator tiny, lm, or qwen2 when you want more varied prompts than the simple random generator.
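The tips above can be combined in a single invocation, sketched below. The function name generate_tuned is hypothetical and the numeric values are illustrative starting points, not recommendations from the project. The sketch also assumes that --use_image_tester is a switch that takes no value.

```shell
# Hypothetical wrapper: one command that applies the tips above.
# Extra arguments (for example --save_dir) are forwarded unchanged.
generate_tuned() {
  datadreamer \
    --class_names person moon robot \
    --prompt_generator tiny \
    --num_objects_range 1 3 \
    --image_generator sdxl-lightning \
    --use_image_tester \
    --image_tester_patience 2 \
    --batch_size_prompt 8 \
    --batch_size_image 4 \
    --batch_size_annotation 8 \
    "$@"
}
```

For example, `generate_tuned --save_dir path/to/save_directory`. Lower the batch sizes if the run exhausts GPU memory.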