Spatial AI

Spatial AI allows robots or computers to perceive the world as a human can - what objects or features are - and where they are in the physical world. DepthAI platform leverages Spatial AI by fusing neural networks with depth perception on the OAK camera itself.There are a few different approaches to achieve AI + depth fusion:

Neural inference fused with depth map
Semantic depth
Stereo neural inference

Neural inference fused with depth map

DepthAI can fuse neural inference (object/landmark detection) results with a depth map to estimate spatial coordinates (XYZ) of all objects/landmarks in the scene.This technique is excellent for existing (pre-trained) 2D object/feature detectors as it runs inference on color/mono frames, and uses resulting bounding boxes to determine ROIs (regions-of-interest). DepthAI then averages depth from depth map inside these ROIs and calculates spatial coordinates from that

3D Object Localization

First, let us define what Object Detection is. It is the technical term for finding the bounding box of an object of interest, in an image's pixel space (i.e., pixel coordinates).

3D Object Localization (or 3D Object Detection) is all about finding objects in physical space instead of pixel space. It is useful when measuring or interacting with the physical world in real-time.Below is a visualization to showcase the difference between Object Detection and 3D Object Localization:

DepthAI extends these 2D neural networks (eg. MobileNet, Yolo) with spatial information to give them 3D context.

In the image above, a DepthAI application runs a MobileNet object detector on RGB camera stream and fuses them with the depth map [RGB-D] to estimate the 3D position of each detected object (see MobileNetSpatialDetectionNetwork node for more details).

NeuralNetwork decoding

The DepthAI API provides a simple way to decode the neural network results, including the bounding boxes, labels, and confidence scores. This is possible for Yolo and MobileNet neural network architectures. For any custom neural network, you can use the standard NeuralNetwork node, for which you will need to decode the results yourself.

3D Landmark Localization

Similar process happens for landmark localization. Here, instead of separate objects, the neural network detects landmarks (hand/position/features). An example would be a hand landmark detector on DepthAI (demo below). With a regular camera, this network returns the 2D (XY) coordinates of all 21 hand landmarks (contours of all joints in fingers). Using this same network with DepthAI, these 21 hand landmarks are now 3D points in physical space instead of 2D points in pixel space.

Demos: hand landmark (above), human pose landmark, and facial landmark detection demos.

Semantic depth

One of the classic problems in autonomous robotic navigation/actuation are unknown objects. Known objects are specified before the installation to be encountered - such as tools, other machines, workers, equipment, and facilities.We cannot anticipate unknown objects - including those unknowable or never-before-seen. Training an object detector is sufficient for known objects as this is a "positive" form of object detection: "Pile in the path, stop." "Shovel in the path, stop." etc.Such generic obstacle avoidance scenarios require a "negative" object detection system, and a very effective technique is to use semantic segmentation of RGB, Depth, or RGB+Depth.

In such a "negative" system, the semantic segmentation system is trained on all the surfaces that are not objects. So anything that is not that surface is considered an object - allowing the navigation to know its location and take commensurate action (stop, go around, turn around, etc.). So the semantic depth is extremely valuable for object avoidance and navigation planning applications.In the image above, a person semantic segmentation model is running on RGB frames, and, based on the results, it crops depth maps only to include the person's depth.

In the example above, autonomous lawn mower would only navigate around grass areas, and avoid everything else (trees, roots, manholes, paths, etc).

Stereo neural inference

In this mode, the neural inference (landmark detection) is run on the left and right cameras to produce stereo inference results. Unlike monocular neural inference fused with stereo depth - there is no max disparity search limit - so the minimum distance is purely limited by the greater of (a) horizontal field of view (HFOV) of the stereo cameras themselves and (b) the hyperfocal distance of the cameras (minimal distance for objects to be in focus).After we have 2D positions of landmarks from both left/right cameras, we can calculate the disparity of the results, which are then triangulated with the calibrated camera intrinsics to give the 3D position of all the detected features.

For more information, check out the Stereo neural inference demo.Examples include finding the 3D locations of:

Facial landmarks (eyes, ears, nose, edges of the mouth, etc.)
Features on a product (screw holes, blemishes, etc.)
Joints on a person (e.g., elbow, knees, hips, etc.)
Features on a vehicle (e.g. mirrors, headlights, etc.)
Pests or disease on a plant (i.e., features that are too small for object detection + stereo depth)

This mode does not require the neural networks to be trained with depth data. DepthAI takes standard, off-the-shelf 2D networks (which are significantly more common) and uses this stereo inference to produce accurate 3D results.

Need assistance?

Head over to Discussion Forum for technical support or any other questions you might have.

ON THIS PAGE