Dexgrasp

This study presents a novel multi-modal approach for precise object manipulation with robotic arms, integrating advanced computer vision techniques.

1. System Setup

The system employs four Kinect sensors arranged around a tabletop workspace. This setup captures RGB and depth information from multiple angles, ensuring comprehensive visibility and data collection across the tabletop environment. The sensors are calibrated and their feeds synchronized, so the scene is captured simultaneously from all four viewpoints.

System setup with four Kinect sensors
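The four views are fused into a single cloud in a shared world frame. Below is a minimal sketch of this fusion with Open3D; the file paths and the per-camera extrinsic matrices are placeholders for the outputs of the multi-camera calibration.

```python
# Minimal sketch: fuse four calibrated Kinect views into one point cloud.
# Paths and extrinsics are placeholders for the calibration outputs.
import numpy as np
import open3d as o3d

# Hypothetical 4x4 camera-to-world extrinsics from calibration.
extrinsics = [np.load(f"calib/cam{i}_to_world.npy") for i in range(4)]

fused = o3d.geometry.PointCloud()
for i, T in enumerate(extrinsics):
    pcd = o3d.io.read_point_cloud(f"capture/cam{i}.ply")  # placeholder path
    pcd.transform(T)   # move this view into the shared world frame
    fused += pcd       # accumulate all four views

fused = fused.voxel_down_sample(voxel_size=0.005)  # merge overlapping points
```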

2. Experiments

First, we use the Kinect sensors to capture point clouds with color information. Each capture consists of an RGB image and an aligned depth image, which are fused into a colored point cloud.

Point cloud with color information
RGB and depth images
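Below is a minimal sketch of fusing one camera's RGB and depth images into a colored point cloud with Open3D; the paths, depth scale, and intrinsics are assumptions standing in for the calibrated values.

```python
# Minimal sketch: build a colored point cloud from one RGB-D capture.
import open3d as o3d

color = o3d.io.read_image("capture/cam0_rgb.png")    # placeholder path
depth = o3d.io.read_image("capture/cam0_depth.png")  # placeholder path

rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
    color, depth,
    depth_scale=1000.0,              # depth in millimeters (assumption)
    depth_trunc=2.0,                 # drop points beyond 2 m
    convert_rgb_to_intensity=False)  # keep the full RGB color

# Kinect-like default intrinsics; replace with the calibrated values.
intrinsic = o3d.camera.PinholeCameraIntrinsic(
    o3d.camera.PinholeCameraIntrinsicParameters.PrimeSenseDefault)

pcd = o3d.geometry.PointCloud.create_from_rgbd_image(rgbd, intrinsic)
```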

We feed the captured RGB image, together with a prompt describing the possible semantics of the objects, into the LLaVA vision-language model. The model analyzes the image and tells us what object it contains.

LLaVA model output
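A minimal sketch of such a query through the Hugging Face transformers interface to LLaVA is shown below; the checkpoint, prompt, and image path are assumptions, and the original pipeline may invoke LLaVA differently.

```python
# Minimal sketch: ask LLaVA to name the object in the RGB image.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # hypothetical checkpoint choice
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("capture/cam0_rgb.png")  # placeholder path
prompt = "USER: <image>\nWhat is the object on the table? ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
object_name = processor.decode(output[0], skip_special_tokens=True)
```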

Next, we input the image and the object name into the Grounding DINO model, which segments the named object in the image.

Grounding DINO segmentation
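Below is a minimal sketch of this step through transformers' zero-shot object detection interface to Grounding DINO; the checkpoint and paths are assumptions, and the original pipeline may use the official Grounding DINO repository instead. Grounding DINO itself returns boxes around the named object; a box-prompted segmenter such as SAM would then turn each box into a pixel mask.

```python
# Minimal sketch: localize the named object with Grounding DINO.
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-base"  # hypothetical checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("capture/cam0_rgb.png")  # placeholder path
text = "a mug."  # object name from LLaVA; lower case, ends with a period

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, target_sizes=[image.size[::-1]])
boxes = results[0]["boxes"]  # boxes to hand to a mask predictor such as SAM
```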

This gives us the mask of the target object in the image.

Object mask

Using this mask, we segment the object's points out of the original point cloud.

Segmented point cloud
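A minimal sketch of this step: back-project only the masked depth pixels through the pinhole camera model to obtain the object's 3D points. The intrinsics, depth scale, and file paths below are placeholder assumptions.

```python
# Minimal sketch: lift the 2D object mask into a 3D object point cloud.
import numpy as np
import open3d as o3d

depth = np.asarray(o3d.io.read_image("capture/cam0_depth.png"), dtype=np.float32)
mask = np.load("capture/object_mask.npy")    # boolean HxW mask (placeholder)

fx, fy, cx, cy = 525.0, 525.0, 319.5, 239.5  # assumed intrinsics
v, u = np.nonzero(mask & (depth > 0))        # masked pixels with valid depth
z = depth[v, u] / 1000.0                     # millimeters -> meters (assumption)
x = (u - cx) * z / fx                        # pinhole back-projection
y = (v - cy) * z / fy

obj = o3d.geometry.PointCloud()
obj.points = o3d.utility.Vector3dVector(np.stack([x, y, z], axis=1))
```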

We then match a previously scanned mesh of the object against the segmented point cloud to recover the object's accurate volume and pose in space. (The matching consists of a coarse global-registration stage followed by ICP refinement.)

Mesh matching process
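Below is a minimal sketch of the two-stage matching with Open3D: RANSAC-based global registration on FPFH features for the coarse alignment, followed by point-to-plane ICP refinement. The voxel size, thresholds, and file paths are assumptions.

```python
# Minimal sketch: coarse global registration + ICP refinement in Open3D.
import open3d as o3d

def preprocess(pcd, voxel):
    """Downsample and compute normals and FPFH features."""
    down = pcd.voxel_down_sample(voxel)
    down.estimate_normals(
        o3d.geometry.KDTreeSearchParamHybrid(radius=voxel * 2, max_nn=30))
    fpfh = o3d.pipelines.registration.compute_fpfh_feature(
        down, o3d.geometry.KDTreeSearchParamHybrid(radius=voxel * 5, max_nn=100))
    return down, fpfh

voxel = 0.005
mesh = o3d.io.read_triangle_mesh("models/object.ply")   # scanned mesh (placeholder)
source = mesh.sample_points_uniformly(number_of_points=20000)
target = o3d.io.read_point_cloud("capture/object.ply")  # segmented cloud

src_down, src_fpfh = preprocess(source, voxel)
tgt_down, tgt_fpfh = preprocess(target, voxel)

# Stage 1: coarse alignment by RANSAC over FPFH feature matches.
coarse = o3d.pipelines.registration.registration_ransac_based_on_feature_matching(
    src_down, tgt_down, src_fpfh, tgt_fpfh, True, voxel * 1.5,
    o3d.pipelines.registration.TransformationEstimationPointToPoint(False), 3,
    [o3d.pipelines.registration.CorrespondenceCheckerBasedOnDistance(voxel * 1.5)],
    o3d.pipelines.registration.RANSACConvergenceCriteria(100000, 0.999))

# Stage 2: ICP refinement starting from the coarse transform.
fine = o3d.pipelines.registration.registration_icp(
    src_down, tgt_down, voxel, coarse.transformation,
    o3d.pipelines.registration.TransformationEstimationPointToPlane())
# fine.transformation is the mesh-to-scene pose of the object.
```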

The object point cloud captured by the cameras is always incomplete, but once the mesh is matched we can use it to supplement the observed points and obtain a complete point cloud of the object.

Complete point cloud
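A minimal sketch of the completion step, continuing from the registration sketch above: sample the registered mesh densely and merge the samples with the observed partial cloud. The point counts and paths are assumptions.

```python
# Minimal sketch: complete the partial cloud with the registered mesh.
import open3d as o3d

mesh = o3d.io.read_triangle_mesh("models/object.ply")    # placeholder path
mesh.transform(fine.transformation)  # pose from the registration sketch above
sampled = mesh.sample_points_uniformly(number_of_points=50000)

observed = o3d.io.read_point_cloud("capture/object.ply") # partial observed cloud
complete = observed + sampled        # fill unseen regions with mesh samples
complete = complete.voxel_down_sample(voxel_size=0.003)  # even out the density
```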

3. Visualization

We use the point cloud obtained by this method as input to the algorithm from our paper "Towards Human-like Grasping for Robotic Dexterous Hand" to perform motion planning for the robot, as shown in the video.