DataDreamer

Quickstart
- Generate synthetic datasets from scratch with generative AI and foundation vision models.
- Pre-annotate existing image datasets to reduce manual labeling work before review and training.
Command Line
pip install datadreamer
datadreamer --class_names person moon robot
- Google Colab notebook with instructions on how to generate a dataset, train a model, and export it for RVC2/RVC3: DataDreamer Quickstart
- Helmet detection example: Helmet detection
- Instance segmentation example: Instance segmentation
Overview

DataDreamer is an advanced toolkit engineered to facilitate the development of edge AI models, irrespective of initial data availability. Distinctive features of DataDreamer include:
- Synthetic Data Generation: Eliminate the dependency on extensive datasets for AI training. DataDreamer empowers users to generate synthetic datasets from the ground up, utilizing advanced AI algorithms capable of producing high-quality, diverse images.
- Real-data Pre-annotation: Use the same annotation pipeline on images you already have to bootstrap labels for classification, object detection, and instance segmentation tasks.
- Knowledge Extraction from Foundational Models: DataDreamer leverages the latent knowledge embedded within sophisticated, pre-trained AI models. This capability allows for the transfer of expansive understanding from these "Foundation models" to smaller, custom-built models, enhancing their capabilities significantly.
- Efficient and Potent Models: The primary objective of DataDreamer is to enable the creation of compact models that are both size-efficient for integration into any device and robust in performance for specialized tasks.
Features
- Prompt Generation: Automate the creation of image prompts using powerful language models.
  Provided class names: ["horse", "robot"]
  Generated prompt: "A photo of a horse and a robot coexisting peacefully in the midst of a serene pasture."
- Image Generation: Generate synthetic datasets with state-of-the-art generative models.

- Dataset Annotation: Leverage foundation models to label datasets automatically.

- Pre-annotate Existing Data: Run DataDreamer on a directory of real images when you want an initial pass of labels before manual QA.
- Edge Model Training: Train efficient small-scale neural networks for edge deployment. (not part of this library)
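To make the prompt-generation step more concrete, here is a minimal Python sketch of what a "simple random" prompt generator could look like: it picks a random number of class names and joins them into one prompt. This is an illustration only, not DataDreamer's implementation. The function name `simple_prompt` and the "A photo of a ..." template are assumptions made for this example.

```python
import random


def simple_prompt(class_names, num_objects_range=(1, 3), rng=None):
    """Pick a random subset of class names and join them into a prompt.

    Hypothetical helper for illustration; not part of the DataDreamer API.
    """
    rng = rng or random.Random()
    low, high = num_objects_range
    # We cannot pick more distinct names than the caller provided.
    high = min(high, len(class_names))
    if low < 1 or low > high:
        raise ValueError("need 1 <= low <= high and low <= number of class names")
    count = rng.randint(low, high)          # inclusive on both ends
    chosen = rng.sample(class_names, count)  # distinct names, random order
    # Assumed template; the wording DataDreamer actually uses may differ.
    return chosen, "A photo of a " + ", a ".join(chosen)


names, prompt = simple_prompt(["person", "moon", "robot"], (1, 3), random.Random(0))
print(names, prompt)
```

Passing a seeded `random.Random` keeps the sketch reproducible, which helps when comparing prompt sets across runs.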
Pre-annotating existing datasets
Command Line
datadreamer \
  --task instance-segmentation \
  --image_annotator owlv2-slimsam \
  --save_dir dataset_path \
  --class_names dumpling \
  --annotate_only
--annotate_only disables prompt generation and image generation, so DataDreamer only runs the annotation stage. Combine it with --task and --image_annotator to switch between classification, detection, and instance segmentation workflows.
Installation
Command Line
pip install datadreamer
Command Line
docker pull ghcr.io/luxonis/datadreamer:latest
docker run --rm -v "$(pwd):/app" ghcr.io/luxonis/datadreamer:latest --save_dir generated_dataset --device cpu
Use --gpus all together with --device cuda when running the container on a CUDA-capable machine.
Hardware requirements
- A CUDA-compatible GPU with at least 16 GB of memory is recommended for the best experience.
- At least 16 GB of system RAM is recommended, with 32 GB or more preferred for larger jobs.
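The recommendations above can be checked before starting a long job. The Python sketch below compares detected GPU memory and system RAM against the 16 GB figures. It is not part of DataDreamer, and it rests on several assumptions: the helper names are made up, PyTorch is used for GPU detection only if it happens to be installed, RAM detection relies on `os.sysconf` (available on Linux, not on Windows), and a 1 GB slack is allowed because reported totals are usually slightly below the nominal size.

```python
import os

RECOMMENDED_GPU_GB = 16  # from the hardware requirements above
RECOMMENDED_RAM_GB = 16  # from the hardware requirements above
SLACK_GB = 1             # assumption: reported totals sit a bit under nominal


def hardware_notes(gpu_memory_gb, ram_gb):
    """Return notes for values below the recommendations (None = not detected)."""
    notes = []
    if gpu_memory_gb is None:
        notes.append("No CUDA GPU detected; consider running with --device cpu.")
    elif gpu_memory_gb < RECOMMENDED_GPU_GB - SLACK_GB:
        notes.append(f"GPU has {gpu_memory_gb:.1f} GB; 16 GB or more is recommended.")
    if ram_gb is not None and ram_gb < RECOMMENDED_RAM_GB - SLACK_GB:
        notes.append(f"System has {ram_gb:.1f} GB RAM; 16 GB or more is recommended.")
    return notes


def detect_gpu_memory_gb():
    """Total memory of GPU 0 in GB, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1024**3


def detect_ram_gb():
    """Physical RAM in GB, or None where os.sysconf cannot report it."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
    except (AttributeError, ValueError, OSError):
        return None


for note in hardware_notes(detect_gpu_memory_gb(), detect_ram_gb()):
    print(note)
```

An empty result means both values meet the recommendations; a note does not mean the job will fail, only that it may be slow or run out of memory.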
Available models
| Model Category | Model Names | Description/Notes |
|---|---|---|
| Prompt Generation | Mistral-7B-Instruct-v0.1 | Semantically rich prompts |
| Prompt Generation | TinyLlama-1.1B-Chat-v1.0 | Tiny LM |
| Prompt Generation | Qwen2.5-1.5B-Instruct | Qwen2.5 LM |
| Prompt Generation | Simple random generator | Joins randomly chosen object names |
| Profanity Filter | Qwen2.5-1.5B-Instruct | Fast and accurate LM profanity filter |
| Image Generation | SDXL-1.0 | Slow and accurate (1024x1024 images) |
| Image Generation | SDXL-Turbo | Fast and less accurate (512x512 images) |
| Image Generation | SDXL-Lightning | Fast and accurate (1024x1024 images) |
| Image Generation | Shuttle-3-Diffusion | Fast and accurate (512x512 images) |
| Image Annotation | OWLv2 | Open-Vocabulary object detector |
| Image Annotation | CLIP | Zero-shot-image-classification |
| Image Annotation | AIMv2 | Zero-shot-image-classification |
| Image Annotation | SlimSAM | Zero-shot-instance-segmentation |
| Image Annotation | SAM2.1 | Zero-shot-instance-segmentation |
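The image-generation rows of the table can be turned into a small lookup when a script needs to choose a generator. The Python sketch below transcribes the speed, accuracy, and resolution notes from the table. The keys are the --image_generator values that appear elsewhere on this page (sdxl, sdxl-turbo, sdxl-lightning, shuttle-3). The helper name `pick_generators` is hypothetical, and reducing "less accurate" to a boolean is a simplification.

```python
# Traits transcribed from the "Image Generation" rows of the table above.
IMAGE_GENERATORS = {
    "sdxl":           {"fast": False, "accurate": True,  "size": (1024, 1024)},
    "sdxl-turbo":     {"fast": True,  "accurate": False, "size": (512, 512)},
    "sdxl-lightning": {"fast": True,  "accurate": True,  "size": (1024, 1024)},
    "shuttle-3":      {"fast": True,  "accurate": True,  "size": (512, 512)},
}


def pick_generators(fast=None, accurate=None, min_side=0):
    """Return generator names matching the given traits, sorted by name.

    A trait left as None is not used for filtering.
    """
    matches = []
    for name, traits in IMAGE_GENERATORS.items():
        if fast is not None and traits["fast"] != fast:
            continue
        if accurate is not None and traits["accurate"] != accurate:
            continue
        if min(traits["size"]) < min_side:
            continue
        matches.append(name)
    return sorted(matches)


print(pick_generators(fast=True, accurate=True))
print(pick_generators(min_side=1024))
```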
Usage
Command Line
datadreamer --save_dir path/to/save_directory --config configs/det_config.yaml
- --task for selecting detection, classification, or instance-segmentation
- --dataset_format for choosing raw, yolo, coco, voc, luxonis-dataset, or cls-single
- --image_annotator for selecting the detector, classifier, or segmentation stack
- --device for choosing between cuda and cpu
prompts.json and annotations.json. Other output formats are designed to plug directly into common training pipelines.
Example
Command Line
datadreamer --save_dir path/to/save_directory --class_names person moon robot --prompts_number 20 --prompt_generator simple --num_objects_range 1 3 --image_generator sdxl-turbo
Useful tips
- Batched generation: Increase --batch_size_prompt, --batch_size_image, and --batch_size_annotation when memory allows to improve throughput.
- Better image quality: Prefer sdxl-lightning, shuttle-3, or sdxl over sdxl-turbo when quality matters more than speed.
- Image selection: Use --use_image_tester together with --image_tester_patience when you want DataDreamer to spend more time selecting better generations.
- Number of objects per image: Keep --num_objects_range modest. Values above 3 are usually harder for current image generators to render reliably.
- Prompt generation: Use --prompt_generator tiny, lm, or qwen2 when you want more varied prompts than the simple random generator.
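When DataDreamer is driven from a script, the flags described on this page can be assembled and checked before the command is launched. The Python sketch below builds an argument list for either a full generation run or an --annotate_only run. It is not a DataDreamer API: `build_command` is a hypothetical helper, the allowed values are copied from this page and may be incomplete, and the defaults mirror the example commands here rather than the tool's own defaults.

```python
import shlex
import warnings

# Allowed values as listed on this page (may not be exhaustive).
TASKS = {"detection", "classification", "instance-segmentation"}
PROMPT_GENERATORS = {"simple", "tiny", "lm", "qwen2"}
DEVICES = {"cuda", "cpu"}


def build_command(save_dir, class_names, *, task="detection", device="cuda",
                  image_annotator=None, annotate_only=False,
                  prompts_number=20, prompt_generator="simple",
                  num_objects_range=(1, 3), image_generator="sdxl-turbo"):
    """Return an argv list for the datadreamer CLI (hypothetical helper)."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    if device not in DEVICES:
        raise ValueError(f"unknown device: {device}")
    argv = ["datadreamer", "--save_dir", save_dir,
            "--class_names", *class_names,
            "--task", task, "--device", device]
    if image_annotator is not None:
        argv += ["--image_annotator", image_annotator]
    if annotate_only:
        # Annotation stage only, so generation flags are left out.
        argv.append("--annotate_only")
        return argv
    if prompt_generator not in PROMPT_GENERATORS:
        raise ValueError(f"unknown prompt generator: {prompt_generator}")
    low, high = num_objects_range
    if not 1 <= low <= high:
        raise ValueError("num_objects_range must satisfy 1 <= low <= high")
    if high > 3:
        # Mirrors the tip above about keeping the object count modest.
        warnings.warn("more than 3 objects per image is usually harder to render")
    argv += ["--prompts_number", str(prompts_number),
             "--prompt_generator", prompt_generator,
             "--num_objects_range", str(low), str(high),
             "--image_generator", image_generator]
    return argv


# Print the pre-annotation command as a shell-safe string.
print(shlex.join(build_command("dataset_path", ["dumpling"],
                               task="instance-segmentation",
                               image_annotator="owlv2-slimsam",
                               annotate_only=True)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with class names that contain spaces.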