DataDreamer

Quickstart
Command Line
pip install datadreamer
datadreamer --class_names person moon robot

- Google Colab notebook with instructions on how to generate a dataset, train a model, and export it for RVC2/RVC3: DataDreamer Quickstart
- Helmet detection example: Helmet detection
- Instance segmentation example: Instance segmentation
Overview

DataDreamer is an advanced toolkit engineered to facilitate the development of edge AI models, irrespective of initial data availability. Distinctive features of DataDreamer include:

- Synthetic Data Generation: Eliminate the dependency on extensive datasets for AI training. DataDreamer empowers users to generate synthetic datasets from the ground up, utilizing advanced AI algorithms capable of producing high-quality, diverse images.
- Knowledge Extraction from Foundation Models: DataDreamer leverages the latent knowledge embedded within sophisticated, pre-trained AI models. This capability allows for the transfer of expansive understanding from these foundation models to smaller, custom-built models, enhancing their capabilities significantly.
- Efficient and Potent Models: The primary objective of DataDreamer is to enable the creation of compact models that are both size-efficient for integration into any device and robust in performance for specialized tasks.
Features
- Prompt Generation: Automate the creation of image prompts using powerful language models.
  - Provided class names: ["horse", "robot"]
  - Generated prompt: "A photo of a horse and a robot coexisting peacefully in the midst of a serene pasture."
- Image Generation: Generate synthetic datasets with state-of-the-art generative models.

- Dataset Annotation: Leverage foundation models to label datasets automatically.

- Edge Model Training: Train efficient small-scale neural networks for edge deployment. (not part of this library)
Installation
Command Line
pip install datadreamer

Available models

| Model Category | Model Names | Description/Notes |
|---|---|---|
| Prompt Generation | Mistral-7B-Instruct-v0.1 | Semantically rich prompts |
| | TinyLlama-1.1B-Chat-v1.0 | Tiny LM |
| | Qwen2.5-1.5B-Instruct | Qwen2.5 LM |
| | Simple random generator | Joins randomly chosen object names |
| Profanity Filter | Qwen2.5-1.5B-Instruct | Fast and accurate LM profanity filter |
| Image Generation | SDXL-1.0 | Slow and accurate (1024x1024 images) |
| | SDXL-Turbo | Fast and less accurate (512x512 images) |
| | SDXL-Lightning | Fast and accurate (1024x1024 images) |
| | Shuttle-3-Diffusion | Fast and accurate (512x512 images) |
| Image Annotation | OWLv2 | Open-Vocabulary object detector |
| | CLIP | Zero-shot-image-classification |
| | AIMv2 | Zero-shot-image-classification |
| | SlimSAM | Zero-shot-instance-segmentation |
| | SAM2.1 | Zero-shot-instance-segmentation |

Example
Command Line
datadreamer --save_dir path/to/save_directory --class_names person moon robot --prompts_number 20 --prompt_generator simple --num_objects_range 1 3 --image_generator sdxl-turbo

Useful tips
- Batched generation: To speed up the generation process, consider increasing the batch size with the `--batch_size_prompt`, `--batch_size_image` and `--batch_size_annotation` parameters. If you are running out of memory, try reducing the batch size.
- Better image quality: For better image quality, consider tuning the following parameters:
  - `--image_generator`: Choose a model with higher image quality. SDXL-Turbo -> SDXL-Lightning -> Shuttle-3-Diffusion -> SDXL (from fastest to slowest, and from lowest to highest quality).
  - `--use_image_tester` and `--image_tester_patience`: Enable iterative image generation and use the CLIP model to select the best images. Consider increasing the patience to get better results.
- Number of objects per image: To generate images with a different number of objects, use the `--num_objects_range` parameter. For example, `--num_objects_range 1 3` generates images with 1, 2, or 3 objects. Values higher than 3 are not recommended due to the limited ability of the current models to generate complex scenes.
- Prompt generation: To generate more diverse prompts, consider using the `--prompt_generator tiny` generator, which uses a small language model to generate prompts.
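
The tips above can be combined into a single run. The following is a minimal sketch under stated assumptions, not a recommended configuration. The batch sizes and the patience value are illustrative only. The `run` and `generate` helpers and the `DRY_RUN` variable are hypothetical names introduced for this example. Passing `--use_image_tester` as a bare switch is an assumption about the CLI. The loop halves the image batch size after each failed attempt, assuming the failure came from running out of memory.

```shell
#!/bin/sh
# Sketch: one generation run that combines the tips above.
# Numeric values are illustrative assumptions, not tuned defaults.

# DRY_RUN=1 (the default in this sketch) prints the command instead of
# running it. Set DRY_RUN=0 to actually invoke datadreamer.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: build the command for a given image batch size ($1).
generate() {
    # --use_image_tester is passed as a bare switch (assumption).
    run datadreamer \
        --save_dir path/to/save_directory \
        --class_names person moon robot \
        --prompts_number 20 \
        --prompt_generator tiny \
        --num_objects_range 1 3 \
        --image_generator sdxl-turbo \
        --batch_size_prompt 8 \
        --batch_size_image "$1" \
        --batch_size_annotation 8 \
        --use_image_tester \
        --image_tester_patience 2
}

# Retry with half the image batch size after each failure. This treats any
# non-zero exit status as a possible out-of-memory error, which is an
# assumption: check the actual error message before relying on it.
batch=8
until generate "$batch"; do
    if [ "$batch" -le 1 ]; then
        echo "generation still fails at batch size 1" >&2
        exit 1
    fi
    batch=$((batch / 2))
    echo "retrying with --batch_size_image $batch" >&2
done
```

With `DRY_RUN` left at its default the script only prints the assembled command, which makes it easy to check the flags before starting a long generation run. Each retry restarts the whole command, so this pattern suits short trial runs better than large datasets.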