# DataDreamer

## Quickstart

[DataDreamer](https://github.com/luxonis/datadreamer) is a Luxonis tool for two related workflows:

 * Generate synthetic datasets from scratch with generative AI and foundation vision models.
 * Pre-annotate existing image datasets to reduce manual labeling work before review and training.

This makes it useful when you do not have enough training data yet, or when you want to speed up dataset preparation before
training with [Luxonis Train](https://docs.luxonis.com/software-v3/ai-inference/model-source/training/luxonis-train.md).

Generating a dataset with custom classes takes only two commands:

```bash
pip install datadreamer
datadreamer --class_names person moon robot
```

For the complete CLI surface, see the [API
reference](https://docs.luxonis.com/software-v3/ai-inference/model-source/training/datadreamer/api-reference.md).

The following tutorials show how to generate data and train models with DataDreamer:

 * Google Colab notebook with instructions on how to generate a dataset, train a model, and export it for RVC2/RVC3: [DataDreamer
   Quickstart](https://colab.research.google.com/github/luxonis/ai-tutorials/blob/main/training/datadreamer/generate_dataset_and_train_yolo.ipynb)
 * Helmet detection example: [Helmet
   detection](https://colab.research.google.com/github/luxonis/ai-tutorials/blob/main/training/datadreamer/helmet_detection.ipynb)
 * Instance segmentation example: [Instance
   segmentation](https://colab.research.google.com/github/luxonis/ai-tutorials/blob/main/training/datadreamer/generate_instance_segmentation_dataset_and_train_yolo.ipynb)

For more information visit the [DataDreamer GitHub repository](https://github.com/luxonis/datadreamer).

## Overview

DataDreamer is a toolkit for developing edge AI models regardless of how much data you start with. Its main capabilities are:

 * Synthetic Data Generation: Remove the dependency on large existing datasets for AI training. DataDreamer generates synthetic
   datasets from scratch, using generative models that produce high-quality, diverse images.

 * Real-data Pre-annotation: Use the same annotation pipeline on images you already have to bootstrap labels for classification,
   object detection, and instance segmentation tasks.

 * Knowledge Extraction from Foundation Models: DataDreamer draws on the knowledge embedded in large pre-trained "foundation
   models" and transfers it to smaller, custom-built models through the datasets it generates and annotates.

 * Efficient and Capable Models: The primary goal of DataDreamer is to enable compact models that are small enough to deploy on
   edge devices and accurate enough for specialized tasks.

## Features

 * Prompt Generation: Automate the creation of image prompts using powerful language models.
   
   Provided class names: ["horse", "robot"]
   
   Generated prompt: "A photo of a horse and a robot coexisting peacefully in the midst of a serene pasture."

 * Image Generation: Generate synthetic datasets with state-of-the-art generative models.

 * Dataset Annotation: Leverage foundation models to label datasets automatically.

 * Pre-annotate Existing Data: Run DataDreamer on a directory of real images when you want an initial pass of labels before manual
   QA.

 * Edge Model Training: Train efficient small-scale neural networks for edge deployment. This step is not part of this library.

## Pre-annotating existing datasets

DataDreamer can skip synthetic image generation entirely and annotate images you already have. This is useful when you want to
accelerate labeling without changing the rest of your training workflow.

```bash
datadreamer \
  --task instance-segmentation \
  --image_annotator owlv2-slimsam \
  --save_dir dataset_path \
  --class_names dumpling \
  --annotate_only
```

`--annotate_only` disables prompt generation and image generation, so DataDreamer only runs the annotation stage. Combine it with
`--task` and `--image_annotator` to switch between classification, detection, and instance segmentation workflows.
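
If you pre-annotate for several tasks, the task and annotator pairing can be scripted. The sketch below is a hypothetical Python helper, not part of DataDreamer, that builds an annotate-only command for a given task. Only the `owlv2-slimsam` pairing comes from the command above; the `owlv2` and `clip` annotator names for detection and classification are assumptions, so check them against the API reference before running anything.

```python
import shlex

# Task -> annotator pairing. Only the instance-segmentation entry is taken
# from the documented command; the other two names are assumptions.
ANNOTATOR_FOR_TASK = {
    "instance-segmentation": "owlv2-slimsam",
    "detection": "owlv2",       # assumed annotator name
    "classification": "clip",   # assumed annotator name
}


def build_annotate_command(task, save_dir, class_names):
    """Return the argv list for an annotate-only DataDreamer run."""
    if task not in ANNOTATOR_FOR_TASK:
        raise ValueError(f"unsupported task: {task!r}")
    return [
        "datadreamer",
        "--task", task,
        "--image_annotator", ANNOTATOR_FOR_TASK[task],
        "--save_dir", save_dir,
        "--class_names", *class_names,
        "--annotate_only",
    ]


if __name__ == "__main__":
    for task in ANNOTATOR_FOR_TASK:
        cmd = build_annotate_command(task, "dataset_path", ["dumpling"])
        # Print the command for review; pass `cmd` to subprocess.run to execute it.
        print(shlex.join(cmd))
```

The helper only assembles arguments and prints them, so it is safe to run on a machine without DataDreamer installed.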

## Installation

DataDreamer can either be installed from PyPI or run from the published Docker image.

```bash
pip install datadreamer
```

```bash
docker pull ghcr.io/luxonis/datadreamer:latest
docker run --rm -v "$(pwd):/app" ghcr.io/luxonis/datadreamer:latest --save_dir generated_dataset --device cpu
```

When running the container on a CUDA-capable machine, pass `--gpus all` to `docker run` and `--device cuda` to DataDreamer.
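
The CPU and GPU variants of the container command differ only in those two options. As a minimal sketch, the hypothetical helper below (the function name and its parameters are illustrative, not part of DataDreamer) assembles the `docker run` arguments for either case, using the image name and `/app` mount point from the commands above.

```python
import os
import shlex

IMAGE = "ghcr.io/luxonis/datadreamer:latest"


def build_docker_command(datadreamer_args, use_gpu=False, workdir=None):
    """Return the `docker run` argv for a containerized DataDreamer job."""
    workdir = workdir or os.getcwd()
    cmd = ["docker", "run", "--rm"]
    if use_gpu:
        cmd += ["--gpus", "all"]  # Docker option: expose the host GPUs
    cmd += ["-v", f"{workdir}:/app", IMAGE]
    cmd += list(datadreamer_args)
    # DataDreamer option: must match what the container can actually see.
    cmd += ["--device", "cuda" if use_gpu else "cpu"]
    return cmd


if __name__ == "__main__":
    args = ["--save_dir", "generated_dataset"]
    print(shlex.join(build_docker_command(args, use_gpu=False)))
    print(shlex.join(build_docker_command(args, use_gpu=True)))
```

Keeping both options behind one `use_gpu` switch avoids the mismatch where the container is started without GPU access but DataDreamer is asked to use `cuda`.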

## Hardware requirements

 * A CUDA-compatible GPU with at least 16 GB of memory is recommended for the best experience.
 * At least 16 GB of system RAM is recommended, with 32 GB or more preferred for larger jobs.

## Available models

| Model Category | Model Names | Description/Notes |
| --- | --- | --- |
| Prompt Generation | [Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) | Semantically rich prompts |
| | [TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) | Tiny LM |
| | [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) | Qwen2.5 LM |
| | Simple random generator | Joins randomly chosen object names |
| Profanity Filter | [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) | Fast and accurate LM profanity filter |
| Image Generation | [SDXL-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) | Slow and accurate (1024x1024 images) |
| | [SDXL-Turbo](https://huggingface.co/stabilityai/sdxl-turbo) | Fast and less accurate (512x512 images) |
| | [SDXL-Lightning](https://huggingface.co/ByteDance/SDXL-Lightning) | Fast and accurate (1024x1024 images) |
| | [Shuttle-3-Diffusion](https://huggingface.co/shuttleai/shuttle-3-diffusion) | Fast and accurate (512x512 images) |
| Image Annotation | [OWLv2](https://huggingface.co/google/owlv2-base-patch16-ensemble) | Open-Vocabulary object detector |
| | [CLIP](https://huggingface.co/openai/clip-vit-base-patch32) | Zero-shot-image-classification |
| | [AIMv2](https://huggingface.co/apple/aimv2-large-patch14-224-lit) | Zero-shot-image-classification |
| | [SlimSAM](https://huggingface.co/Zigeng/SlimSAM-uniform-50) | Zero-shot-instance-segmentation |
| | [SAM2.1](https://huggingface.co/facebook/sam2.1-hiera-large) | Zero-shot-instance-segmentation |

## Usage

DataDreamer can be driven either directly from CLI arguments or from a YAML config file.

```bash
datadreamer --save_dir path/to/save_directory --config configs/det_config.yaml
```

Some of the most important options are:

 * `--task` for selecting `detection`, `classification`, or `instance-segmentation`
 * `--dataset_format` for choosing `raw`, `yolo`, `coco`, `voc`, `luxonis-dataset`, or `cls-single`
 * `--image_annotator` for selecting the detector, classifier, or segmentation stack
 * `--device` for choosing between `cuda` and `cpu`

In `raw` format, DataDreamer writes generated images together with `prompts.json` and `annotations.json`. Other output formats are
designed to plug directly into common training pipelines.
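
A quick sanity check on a raw-format dataset is to count how many instances of each class were annotated. The layout of `annotations.json` is not described on this page, so the sketch below rests on an assumption: each image path maps to an object with a `labels` list of class indices, and a top-level `class_names` list holds the label names. The function name `count_labels` is hypothetical. Inspect your own file and adjust the keys if it differs.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_labels(save_dir):
    """Count annotated instances per class in a raw-format dataset directory."""
    data = json.loads((Path(save_dir) / "annotations.json").read_text())
    class_names = data.get("class_names", [])  # assumed top-level key
    counts = Counter()
    for key, entry in data.items():
        if key == "class_names" or not isinstance(entry, dict):
            continue  # metadata, not an image entry
        for label in entry.get("labels", []):  # assumed per-image key
            # Labels are assumed to be integer indices into class_names;
            # anything else is counted under its string form.
            if isinstance(label, int) and 0 <= label < len(class_names):
                counts[class_names[label]] += 1
            else:
                counts[str(label)] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample that follows the assumed layout, for illustration only.
    sample = {
        "image_0.jpg": {"boxes": [[0, 0, 10, 10], [5, 5, 20, 20]], "labels": [0, 2]},
        "image_1.jpg": {"boxes": [[1, 1, 4, 4]], "labels": [0]},
        "class_names": ["person", "moon", "robot"],
    }
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "annotations.json").write_text(json.dumps(sample))
        print(count_labels(tmp))
```

A class with a count of zero or far below the others is a hint that the generator or annotator struggled with it and the prompts or class name may need adjusting.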

## Example

```bash
datadreamer --save_dir path/to/save_directory --class_names person moon robot --prompts_number 20 --prompt_generator simple --num_objects_range 1 3 --image_generator sdxl-turbo
```

This command generates 20 prompts with the simple random generator, each naming one to three of the specified classes, renders
them with SDXL-Turbo, and saves the images and their annotations in the given directory. Further parameters let you customize
each stage of the process for different needs and hardware configurations.
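
When you launch many runs with different settings, it can help to keep the options in a dictionary and derive the command from it. The sketch below is a hypothetical helper, not part of DataDreamer. It assumes only that every option is passed as `--name value`, that multi-value options such as `--class_names` take space-separated values, and that switches such as `--annotate_only` take no value, as in the commands on this page.

```python
import shlex


def to_cli_args(options):
    """Turn {"flag_name": value} pairs into a flat list of CLI arguments."""
    args = []
    for name, value in options.items():
        flag = f"--{name}"
        if value is True:
            args.append(flag)                 # bare switch, e.g. --annotate_only
        elif value is False or value is None:
            continue                          # option left out entirely
        elif isinstance(value, (list, tuple)):
            args += [flag, *map(str, value)]  # multi-value option
        else:
            args += [flag, str(value)]
    return args


# The same settings as the example command above.
options = {
    "save_dir": "path/to/save_directory",
    "class_names": ["person", "moon", "robot"],
    "prompts_number": 20,
    "prompt_generator": "simple",
    "num_objects_range": [1, 3],
    "image_generator": "sdxl-turbo",
}
command = ["datadreamer", *to_cli_args(options)]
# Print for review; pass `command` to subprocess.run to execute it.
print(shlex.join(command))
```

Changing a single dictionary entry, for example `image_generator`, then yields a new command without retyping the rest.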

## Useful tips

 * Batched generation: Increase `--batch_size_prompt`, `--batch_size_image`, and `--batch_size_annotation` when memory allows to
   improve throughput.
 * Better image quality: Prefer `sdxl-lightning`, `shuttle-3`, or `sdxl` over `sdxl-turbo` when quality matters more than speed.
 * Image selection: Use `--use_image_tester` together with `--image_tester_patience` when you want DataDreamer to spend more time
   selecting better generations.
 * Number of objects per image: Keep `--num_objects_range` modest. Values above 3 are usually harder for current image generators
   to render reliably.
 * Prompt generation: Use `--prompt_generator` with `tiny`, `lm`, or `qwen2` when you want more varied prompts than the simple
   random generator.
