# LuxonisDataset

## Overview

The LuxonisDataset class offers a simple API for creating and managing data in the Luxonis Data Format (LDF). It acts as an
abstraction layer and provides methods for dataset:

 * initialization,
 * ingestion,
 * splitting,
 * merging,
 * export,
 * cloud synchronization, and
 * deletion.

The following sections guide you through the creation of an LDF dataset. We have prepared a simple toy dataset you can use to
follow along with the examples ([ParkingLot.zip](https://drive.google.com/uc?export=download&id=1OAuLlL_4wRSzZ33BuxM6Uw2QYYgv_19N)). It
consists of images of cars and motorcycles in a parking lot, each annotated with a bounding box, keypoints, and a segmentation mask.

## Dataset initialization

The dataset creation process starts by initializing a `LuxonisDataset` object:

```python
from luxonis_ml.data.datasets import LuxonisDataset

dataset_name: str = ... # e.g. "parking_lot"
dataset = LuxonisDataset(dataset_name)
```

> Datasets can be stored locally or using one of the supported cloud storage providers, including GCS, S3, and Azure Blob storage.
> By default, the initialized dataset is stored locally.

> If a dataset with the provided `dataset_name` already exists, it is loaded automatically instead of a new one being initialized.
> Therefore, use a unique name for each new dataset, or pass `delete_local=True` to the `LuxonisDataset` constructor to overwrite
> an existing one.

If you need more control over storage behavior, the constructor also accepts the `team_id`, `bucket_type`, `bucket_storage`, and
`delete_remote` parameters. These are useful when the dataset should live in shared or remote object storage instead of only on the
local machine.

## Adding Data

After initializing the dataset, we can start ingesting data. We must first define a generator function that yields individual
data instances. Each data instance stores the path to an image and a single annotation (e.g. a bounding box), so an image with
multiple annotations must be yielded as multiple separate data instances.

We define data instances as a Python dictionary with the following structure:

```python
{
    "file": str,  # path to the image file
    "annotation": Optional[dict]  # single image annotation
}
```

where the content of the `annotation` field depends on the task type. The following task types are supported:

 * [Classification](https://github.com/luxonis/luxonis-ml/tree/main/luxonis_ml/data#classification)
 * [Bounding Box](https://github.com/luxonis/luxonis-ml/tree/main/luxonis_ml/data#bounding-box)
 * [Keypoints](https://github.com/luxonis/luxonis-ml/tree/main/luxonis_ml/data#keypoints)
 * [Segmentation Mask](https://github.com/luxonis/luxonis-ml/tree/main/luxonis_ml/data#segmentation-mask)
 * [Instance Segmentation](https://github.com/luxonis/luxonis-ml/tree/main/luxonis_ml/data#instance-segmentation)
 * [Array](https://github.com/luxonis/luxonis-ml/tree/main/luxonis_ml/data#array)
 * [Metadata](https://github.com/luxonis/luxonis-ml/tree/main/luxonis_ml/data#metadata)
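
Because each data instance carries a single annotation, an image with several objects produces several instances that share the
same `file`. The following is a minimal sketch of that expansion for bounding boxes. The helper `records_for_image`, the image
path, and the box values are made up for illustration and are not part of `luxonis_ml`.

```python
from pathlib import Path
from typing import Iterator


def records_for_image(image_path: Path, boxes: list[dict]) -> Iterator[dict]:
    """Yield one data instance per bounding box (hypothetical helper)."""
    for box in boxes:
        yield {
            "file": image_path,
            "annotation": {
                "class": box["class"],
                # coordinates are assumed to be normalized already
                "boundingbox": {k: box[k] for k in ("x", "y", "w", "h")},
            },
        }


# Illustrative, made-up annotations for a single image.
boxes = [
    {"class": "car", "x": 0.10, "y": 0.20, "w": 0.30, "h": 0.25},
    {"class": "motorcycle", "x": 0.55, "y": 0.40, "w": 0.15, "h": 0.20},
]
records = list(records_for_image(Path("data/parking_lot/0001/image.jpg"), boxes))
```

Here `records` holds two instances, one per box, both pointing at the same image file.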

Below we provide an example generator function for the parking lot dataset, yielding the data instances for bounding box
annotations.

```python
import json
from pathlib import Path

# path to the dataset, replace it with the actual path on your system
dataset_root = Path("data/parking_lot")

def generator():
    for annotation_dir in dataset_root.iterdir():
        with open(annotation_dir / "annotations.json") as f:
            data = json.load(f)

        # get the width and height of the image
        W = data["dimensions"]["width"]
        H = data["dimensions"]["height"]

        image_path = annotation_dir / data["filename"]

        for instance_id, bbox in data["BoundingBoxAnnotation"].items():

            # get unnormalized bounding box coordinates
            x, y = bbox["origin"]
            w, h = bbox["dimension"]

            # get the class name of the bounding box
            class_ = bbox["labelName"]
            yield {
                "file": image_path,
                "annotation": {
                    "class": class_,
                    # normalized bounding box
                    "boundingbox": {
                        "x": x / W,
                        "y": y / H,
                        "w": w / W,
                        "h": h / H,
                    },
                },
            }
```

The generator is then passed to the `add` method of the dataset.

```python
dataset.add(generator())
```

> The `add` method accepts any iterable, not only generators.
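
Since any iterable works, the data instances can also be collected in a plain list first, which makes it easy to sanity-check
them before ingestion. The sketch below normalizes pixel-space boxes and clamps them to the image bounds. The helper
`normalize_bbox`, the file path, and the box values are illustrative assumptions, and the `dataset.add` call is left as a
comment so the snippet runs without a dataset.

```python
def normalize_bbox(x, y, w, h, img_w, img_h):
    """Convert a pixel-space box to normalized coordinates clamped to [0, 1]."""
    if img_w <= 0 or img_h <= 0:
        raise ValueError("image dimensions must be positive")

    def clamp(value):
        return min(max(value, 0.0), 1.0)

    # Clamp both corners, then derive width and height from them.
    x0, y0 = clamp(x / img_w), clamp(y / img_h)
    x1, y1 = clamp((x + w) / img_w), clamp((y + h) / img_h)
    return {"x": x0, "y": y0, "w": x1 - x0, "h": y1 - y0}


# Made-up pixel-space boxes (class, x, y, w, h) for one 1000x500 image.
raw_boxes = [("car", 100, 50, 200, 100), ("motorcycle", 900, 400, 200, 150)]

records = [
    {
        "file": "data/parking_lot/0001/image.jpg",  # hypothetical path
        "annotation": {
            "class": name,
            "boundingbox": normalize_bbox(x, y, w, h, 1000, 500),
        },
    }
    for name, x, y, w, h in raw_boxes
]

# A list is an iterable, so it can be passed directly:
# dataset.add(records)
```

The second box extends past the image border in the raw data, so its normalized width and height are cut off at the edge.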

## Metadata and sources

Luxonis datasets can store more than just images and labels. Dataset metadata keeps track of:

 * task definitions and class mappings
 * keypoint skeletons
 * categorical metadata encodings
 * dataset source structure

The source structure is represented by `LuxonisSource` and `LuxonisComponent`, which let one dataset describe multi-component or
multi-sensor inputs as well. For example, one source can contain multiple image components instead of only a single RGB image.

Useful metadata-related methods include:

 * `set_tasks(...)` to define task groups explicitly
 * `set_classes(...)` to register class mappings
 * `get_source_names()` to inspect the available dataset sources
 * `update_source(...)` to update source/component metadata

## Defining Splits

After adding data to the dataset, we can define its splits. There are no restrictions on the split names, but in most cases one
should stick to `train`, `val`, and `test`. Splits are defined by calling the `make_splits` method on the `LuxonisDataset` object
and passing the desired split ratios as its argument (by default, the data is split in an 80:10:10 ratio between the train, val,
and test sets).

```python
dataset.make_splits({
  "train": 0.7,
  "val": 0.2,
  "test": 0.1,
})
```

For finer control over the splits, you can pass a dictionary with the split names as keys and lists of file names as
values:

```python
dataset.make_splits({
  "train": ["file1.jpg", "file2.jpg", ...],
  "val": ["file3.jpg", "file4.jpg", ...],
  "test": ["file5.jpg", "file6.jpg", ...],
})
```
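
Explicit file lists are handy when the same split must be reproduced across runs or machines. The sketch below builds such
lists with a seeded shuffle. The helper `split_files` is hypothetical (not part of `luxonis_ml`), it assumes the ratios are
meant to sum to 1, and the `make_splits` call is left as a comment so the snippet runs without a dataset.

```python
import random


def split_files(files, ratios, seed=42):
    """Shuffle `files` reproducibly and cut them into named splits by ratio."""
    if abs(sum(ratios.values()) - 1.0) > 1e-9:
        raise ValueError("split ratios must sum to 1")

    # Sort first so the result does not depend on the input order.
    shuffled = sorted(files)
    random.Random(seed).shuffle(shuffled)

    splits, start = {}, 0
    names = list(ratios)
    for i, name in enumerate(names):
        if i == len(names) - 1:
            # The last split takes the remainder so that no file is dropped.
            end = len(shuffled)
        else:
            end = start + round(ratios[name] * len(shuffled))
        splits[name] = shuffled[start:end]
        start = end
    return splits


files = [f"file{i}.jpg" for i in range(1, 11)]
splits = split_files(files, {"train": 0.7, "val": 0.2, "test": 0.1})

# The resulting dictionary has the shape expected by the call above:
# dataset.make_splits(splits)
```

With a fixed seed, the same input files always end up in the same splits.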

Once splits are made, calling the `make_splits` method again will raise an error. If you wish to redefine them, pass
`redefine_splits=True` to the method call.

## Cloud sync and dataset discovery

For remote workflows, LuxonisDataset can also:

 * list datasets with `LuxonisDataset.list_datasets(...)`
 * pull missing or all media locally with `pull_from_cloud(...)`
 * push local data to remote object storage with `push_to_cloud(...)`

These methods are especially useful when the same dataset is shared across training machines or team environments.

## Dataset Cloning

You can clone an existing dataset to create a copy with a new name. This is useful for testing changes without affecting the
original dataset. Cloning is done by calling the `clone` method on the `LuxonisDataset` object and passing the desired name of the
new dataset.

```python
dataset_clone = dataset.clone(new_dataset_name="dataset_clone")
```

## Dataset Merging

Datasets can also be merged. This is useful for combining multiple datasets into a larger, unified dataset for
training or analysis. Merging is done by calling the `merge_with` method on the first `LuxonisDataset` object and
passing the second one as an argument. You can choose between two merging modes:

 * inplace: the first dataset is modified to include data from the second dataset
 * out-of-place: a new dataset is created from the combination of two existing datasets

```python
# inplace merging
dataset1.merge_with(dataset2, inplace=True)
# OR out-of-place merging
dataset_merge = dataset1.merge_with(dataset2, inplace=False, new_dataset_name="dataset_merge")
```

## Dataset export

LuxonisDataset can export data back out of LDF into common dataset formats. This is useful when you want to prepare data in
LuxonisML, but train or inspect it in another toolchain.

```python
from luxonis_ml.enums import DatasetType

dataset.export("exports/coco", dataset_type=DatasetType.COCO)
```

The current exporters cover native LDF export as well as several standard formats, including COCO, Pascal VOC, Darknet, YOLOv4,
YOLOv6, YOLOv8 task-specific exporters, TensorFlow CSV, CreateML, FiftyOne Classification, Classification Directory, and
Segmentation Mask Directory.

## CLI Reference

The `luxonis_ml` CLI provides a set of commands for managing datasets. These commands are accessible via the
`luxonis_ml data` command.

The available commands are:

 * `luxonis_ml data ls` - lists all datasets
 * `luxonis_ml data info <dataset_name>` - prints information about the dataset
 * `luxonis_ml data inspect <dataset_name>` - renders the data in the dataset on screen using cv2
 * `luxonis_ml data delete <dataset_name>` - deletes the dataset

For more information, run `luxonis_ml data --help` or pass the `--help` flag to any of the above commands.
