• Performance Optimization
  • Understanding Performance Metrics
  • Hardware Considerations
  • RVC2 NN Performance
  • Model Optimization Techniques
  • Luxonis-Specific Optimizations
  • Number of SHAVES
  • Lowering camera FPS to match NN FPS
  • NN input queue size and blocking behavior
  • Quantization (only applicable for RVC3 conversion)
  • Profiling and Benchmarking
  • DepthAI Pipeline Graph
  • Troubleshooting Common Performance Issues

Performance Optimization

This section is dedicated to optimizing and evaluating the performance of AI models deployed on Luxonis devices, such as the OAK-D series. Optimizing model performance through techniques like quantization and hardware acceleration is key on powerful yet resource-constrained edge devices, and it ensures responsive, scalable AI applications suitable for real-world use.

Understanding Performance Metrics

Latency, throughput, and accuracy are three distinct metrics used to assess different aspects of performance in various systems.
  • Latency: refers to the delay or lag between the input (such as an image or video frame) and the corresponding output or response from a neural network.
  • Throughput: refers to the amount of data or the number of tasks a computer vision system can process within a given period.
  • Accuracy: measures how well a system or model performs in terms of correctness.
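To make the first two metrics concrete, here is a minimal, self-contained sketch in plain Python that measures average per-frame latency and overall throughput. A dummy callable stands in for a real NN inference; with an actual model you would time the device round-trip instead.

```python
import time

def measure(infer, frames):
    """Measure average per-frame latency (ms) and throughput (FPS)
    of a callable that stands in for a single NN inference."""
    start = time.perf_counter()
    latencies = []
    for frame in frames:
        t0 = time.perf_counter()
        infer(frame)
        latencies.append((time.perf_counter() - t0) * 1000.0)  # ms per frame
    elapsed = time.perf_counter() - start
    fps = len(latencies) / elapsed           # throughput: frames per second
    avg_ms = sum(latencies) / len(latencies)  # latency: average delay per frame
    return avg_ms, fps

# Dummy "inference" that takes about 5 ms per frame.
avg_ms, fps = measure(lambda frame: time.sleep(0.005), range(20))
```

Note that latency and throughput are related but not interchangeable: a pipelined system can have high throughput while each individual frame still experiences noticeable latency.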

Hardware Considerations

Currently, our devices are built on top of either the second or third generation of our Robotics Vision Core (RVC2 and RVC3). You can find more specifics in the hardware sections for RVC2 and RVC3.

RVC2 NN Performance

| Model name   | Size    | FPS   | Latency [ms] |
| ------------ | ------- | ----- | ------------ |
| MobileOne S0 | 224x224 | 165.5 | 11.1         |
| DeepLab V3   | 256x256 | 36.5  | 48.1         |
| DeepLab V3   | 513x513 | 6.3   | 253.1        |
| YoloV6n R2   | 416x416 | 65.5  | 29.3         |
| YoloV6n R2   | 640x640 | 29.3  | 66.4         |
| YoloV6t R2   | 416x416 | 35.8  | 54.1         |
| YoloV6t R2   | 640x640 | 14.2  | 133.6        |
| YoloV6m R2   | 416x416 | 8.6   | 190.2        |

Model Optimization Techniques

Luxonis-Specific Optimizations

Number of SHAVES

When exporting a given model using, e.g., our tools, setting the correct number of SHAVES can increase performance. The SHAVES are vector processors in DepthAI/OAK devices. They are used for operations that the NCEs (neural compute engines) are not implemented to handle, but also for other tasks on the device, such as handling and reformatting images, doing some of the ISP work, etc. To read about setting the number of SHAVES in tools, please refer to here.

Lowering camera FPS to match NN FPS

Lowering FPS so that it does not exceed NN capabilities typically provides the best latency performance since the NN can start the inference as soon as a new frame is available.
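A small sketch of the sizing arithmetic, using the DeepLab V3 (256x256) throughput from the RVC2 table above. The 0.9 safety margin is an assumption for illustration, not a Luxonis recommendation:

```python
# Pick a camera FPS the NN can actually sustain, based on the measured
# NN throughput from the RVC2 benchmark table (DeepLab V3 at 256x256).
nn_fps = 36.5          # measured NN throughput for the chosen model
safety_margin = 0.9    # headroom for other pipeline work (assumption)

camera_fps = int(nn_fps * safety_margin)
# In a real pipeline this value would be passed to the camera node,
# e.g. via its setFps() method; here we only compute it.
```

Setting the camera above this rate only fills queues with frames the NN can never process, which adds latency without improving throughput.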

NN input queue size and blocking behavior

By default, queues in DepthAI are blocking: once the queue size is reached (you can set it with the setQueueSize() method), any additional messages from the device are blocked, and the library waits until it can add new messages to the queue. With non-blocking queues, in the same scenario, the library instead discards the oldest message, adds the new one to the queue, and continues its processing loop. If your network has a higher latency and cannot process that many frames, setting the input queue of your neural network to non-blocking might improve performance.
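Conceptually (this is plain Python, not the DepthAI API), a non-blocking queue of fixed size behaves like a bounded buffer that evicts the oldest message when full:

```python
from collections import deque

# A maxlen-bounded deque mimics a *non-blocking* queue of size 4:
# when full, appending silently evicts the oldest message instead of
# blocking the producer.
queue = deque(maxlen=4)
for msg in range(10):   # the producer runs faster than the consumer
    queue.append(msg)

latest = list(queue)    # only the 4 newest messages survive: [6, 7, 8, 9]
```

This is why non-blocking queues suit live preview use cases: you always consume the freshest frames at the cost of dropping older ones.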

Quantization (only applicable for RVC3 conversion)

To speed models up, one can quantize them. Quantization refers to reducing the precision of the weights and activations in the model from high precision (e.g., 32-bit floating point) to lower precision (e.g., 8-bit integers). The goal is to achieve computational and memory efficiency without significantly compromising the model's performance. To learn more about quantizing a model, please refer to OpenVINO's documentation. To quantize a model, one can use the Post-Training Optimization Tool (POT). We recommend default quantization for quick testing. For this, you will need some images, e.g., coco128.zip (download from here). However, for the best quantization and lowest accuracy drop, we suggest using images from your training or validation set. To install POT, use the following commands inside your Python environment:
Command Line
python -m pip install --upgrade pip
pip install openvino-dev==2022.1
Next, define a pot-config.json file:
JSON
{
    "model": {
        "model_name": "yolov6n",
        "model": "path/to/model.xml",
        "weights": "path/to/model.bin"
    },
    "engine": {
        "device": "CPU",
        "type": "simplified",
        "data_source": "/path/to/coco128/images/train2017/"
    },
    "compression": {
        "target_device": "VPU",
        "algorithms": [
            {
                "name": "DefaultQuantization",
                "params": {
                    "stat_subset_size": 300
                }
            }
        ]
    }
}
Don't forget to specify proper paths to the XML and BIN files and to the dataset. Finally, call:
Command Line
pot -c pot-config.json -d
This will create a directory with the results, storing your quantized XML and BIN.
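To make the precision-reduction idea concrete, here is a minimal sketch of affine (scale and zero-point) int8 quantization in plain Python. This is the textbook scheme; real tools like POT additionally collect activation statistics over a dataset, which this sketch skips:

```python
def quantize_int8(values):
    """Map floats to int8 via an affine scale + zero-point scheme."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255.0                 # one int8 step in float units
    zero_point = round(-128 - lo / scale)     # int8 code representing 0.0
    q = [max(-128, min(127, round(v / scale + zero_point))) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float values from int8 codes."""
    return [(v - zero_point) * scale for v in q]

weights = [-1.0, -0.5, 0.0, 0.5, 1.0]
q, scale, zp = quantize_int8(weights)
restored = dequantize(q, scale, zp)
# The round trip loses at most about half a quantization step per value.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The accuracy drop of a quantized network comes from exactly this kind of rounding error accumulating across layers, which is why calibration images matter.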

Profiling and Benchmarking

By enabling the info log level (or lower), DepthAI will print the usage of hardware resources, specifically CPU/RAM consumption, temperature, CMX slices, and SHAVE core allocation. You can set the logging level like this:
Command Line
DEPTHAI_LEVEL=info python3 script.py
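If you prefer to set the level from Python rather than the shell (for instance on platforms where the inline VAR=value form is unavailable), one pattern, sketched here under the assumption that the library reads the variable when it is imported, is:

```python
import os

# Set the log level before importing depthai, so the library can pick
# the variable up (assumption: it is read at import/initialization time).
os.environ["DEPTHAI_LEVEL"] = "info"

# import depthai as dai  # would follow here in a real script
```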

DepthAI Pipeline Graph

An excellent tool for debugging and providing insight into a DepthAI pipeline and its inner workings is DepthAI Pipeline Graph. It visualizes DepthAI pipelines together with FPS outputs. To install it, run the following command:
Command Line
pip install git+https://github.com/luxonis/depthai_pipeline_graph.git
Next, to visualize the DepthAI pipeline defined inside a main.py script, use this command:
Command Line
pipeline_graph "python main.py -cam"
For more examples, please visit the DepthAI Pipeline Graph repository.

Troubleshooting Common Performance Issues

If you set the DepthAI log level to trace, DepthAI will log operation times for each node/process. To visualize the network, you can use netron.app. You can investigate which operation could be causing a bottleneck and then test your hypothesis by pruning the model before this operation. To prune a model, set the --output flag of the Model Optimizer (mo) to the given node. To read more about pruning, please see the documentation. After pruning the model, compile it, measure its latency, and compare it with the latency of the original (unpruned) model to see whether your hypothesis was correct.
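The measure-and-compare step can be sketched as follows, with plain callables standing in for inference on the original and pruned blobs (the sleep durations are arbitrary stand-ins, not real measurements):

```python
import time

def time_model(run, frames=30):
    """Average per-frame latency (ms) of a model, here a plain callable
    standing in for inference on a compiled blob."""
    t0 = time.perf_counter()
    for _ in range(frames):
        run()
    return (time.perf_counter() - t0) / frames * 1000.0

# Dummy stand-ins: the "pruned" model skips the suspected-slow tail ops.
full_model = lambda: time.sleep(0.004)
pruned_model = lambda: time.sleep(0.001)

full_ms = time_model(full_model)
pruned_ms = time_model(pruned_model)
# If most of the latency disappears once the tail is pruned away, the
# hypothesis that those operations were the bottleneck is supported.
```

Averaging over many frames, as above, reduces the influence of one-off warm-up and scheduling noise on the comparison.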