Human-centric Scene Understanding for 3D Large-scale Scenarios

1ShanghaiTech University, 2The University of Hong Kong

3Shanghai AI Laboratory 4The Chinese University of Hong Kong

Figure 1. The left shows several scenes captured in HuCenLife, covering diverse human-centric daily-life scenarios. The right demonstrates the rich annotations of HuCenLife, which can benefit many tasks in 3D scene understanding.

Abstract

Human-centric scene understanding is significant for real-world applications, but it is extremely challenging due to the existence of diverse human poses and actions, complex human-environment interactions, severe occlusions in crowds, etc. In this paper, we present a large-scale multi-modal dataset for human-centric scene understanding, dubbed HuCenLife, which is collected in diverse daily-life scenarios with rich and fine-grained annotations. Our HuCenLife can benefit many 3D perception tasks, such as segmentation, detection, action recognition, etc., and we also provide benchmarks for these tasks to facilitate related research. In addition, we design novel modules for LiDAR-based segmentation and action recognition, which are more applicable for large-scale human-centric scenarios and achieve state-of-the-art performance.

Method For Instance Segmentation

Figure 3. The architecture of our segmentation method. In particular, the HHOI module extracts correlations among different persons as well as human-object relationships, which benefit both point-wise and instance-wise classification.

Method For Action Recognition

Figure 5. Pipeline of our method for human-centric action recognition. We first utilize a 3D detector to obtain a set of person bounding boxes. Then, for each person, we extract multi-resolution features and fuse them into a hierarchical feature FHF. Next, we leverage relationships with neighboring persons to enhance the ego-feature and obtain a comprehensive feature FIE for the final action classification.
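Below is a highly simplified, hypothetical sketch of the pipeline described in this caption. It is not the authors' implementation: the module structure, feature dimension (128), choice of resolutions, attention-based neighbor enhancement, and class count are all assumptions made only to illustrate the data flow from per-person points to FHF, FIE, and the action logits.

import torch
import torch.nn as nn

class HierarchicalFeature(nn.Module):
    """Toy stand-in: per-person features at several point resolutions, fused into FHF."""
    def __init__(self, out_dim=128, resolutions=(256, 512, 1024)):
        super().__init__()
        self.resolutions = resolutions
        self.mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, out_dim))
            for _ in resolutions
        )
        self.fuse = nn.Linear(out_dim * len(resolutions), out_dim)

    def forward(self, person_points):  # (N, 3) points cropped by one detected box
        feats = []
        for n, mlp in zip(self.resolutions, self.mlps):
            idx = torch.randint(0, person_points.shape[0], (min(n, person_points.shape[0]),))
            feats.append(mlp(person_points[idx]).max(dim=0).values)  # pool per resolution
        return self.fuse(torch.cat(feats))  # hierarchical fusion feature FHF

class NeighborEnhance(nn.Module):
    """Toy stand-in: enhance each person's ego-feature with its neighbors (FIE)."""
    def __init__(self, dim=128):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, person_feats):  # (P, dim), one row per detected person
        x = person_feats.unsqueeze(0)
        enhanced, _ = self.attn(x, x, x)       # aggregate context from neighboring persons
        return (x + enhanced).squeeze(0)       # interaction-enhanced feature FIE

classifier = nn.Linear(128, 20)  # 20 is a placeholder for the number of action classes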

Dataset

Dataset file structure

HCL_Full
|── 09-23-13-44-53-1
|   |── bin (LiDAR)
|   |   |── 1663912046.036171264.bin
|   |   |── 1663912046.135965440.bin
|   |   |── ...
|   |── imu_csv (IMU)
|   |   |── 1663912046015329873.csv
|   |   |── 1663912046025090565.csv
|   |   |── ...
|   |── img_blur (Camera)
|       |── cam1
|       |   |── 1663912046.036171264.jpg
|       |   |── 1663912046.135965440.jpg
|       |   |── ...
|       |── cam2
|       |── cam6
|── 09-23-13-44-53-2
|── ...
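A minimal traversal sketch, assuming the tree above is rooted in a local HCL_Full folder and that every sequence contains the bin, imu_csv, and img_blur subfolders:

from pathlib import Path

root = Path("HCL_Full")  # path to the dataset root; adjust as needed

# Each sequence folder (e.g. "09-23-13-44-53-1") holds LiDAR, IMU, and camera data.
for seq_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    lidar_files = sorted((seq_dir / "bin").glob("*.bin"))
    imu_files = sorted((seq_dir / "imu_csv").glob("*.csv"))
    cam_dirs = sorted((seq_dir / "img_blur").glob("cam*"))
    print(seq_dir.name, len(lidar_files), "LiDAR frames,",
          len(imu_files), "IMU samples,", len(cam_dirs), "cameras")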

Specification

  1. Point clouds are stored in binary format (bin). Use np.fromfile(file_path, dtype=np.float32).reshape(-1, 5) to load a file. Columns 0-4 represent x, y, z, reflectivity, and timestamp (t), respectively (see the loading sketch after this list).
  2. Images are downsampled to 10 Hz and temporally aligned with the LiDAR data. After alignment, each image is renamed to match the corresponding LiDAR file's name. The original 32 Hz images will be released soon.
  3. To preserve its high density, the IMU data is neither downsampled nor temporally aligned.
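A minimal loading sketch for items 1 and 2, assuming the example sequence and file names from the tree above; the corresponding camera frame is found simply by reusing the LiDAR file stem:

import numpy as np
from pathlib import Path

seq = Path("HCL_Full/09-23-13-44-53-1")              # example sequence from the tree above
bin_path = seq / "bin" / "1663912046.036171264.bin"

# Each point cloud is a flat float32 array with 5 values per point:
# x, y, z, reflectivity, timestamp (item 1).
points = np.fromfile(bin_path, dtype=np.float32).reshape(-1, 5)
xyz, reflectivity, t = points[:, :3], points[:, 3], points[:, 4]

# After alignment the images share the LiDAR file's name (item 2),
# so the matching cam1 frame has the same stem.
img_path = seq / "img_blur" / "cam1" / (bin_path.stem + ".jpg")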

Annotation structure

09-23-13-44-53-1.json
{
  "data": "09-23-13-44-53-1",//corresponding data folder
  "frames_number": 44,//frame number
  "frame": [
    {
      "frameId": 0,//frame id
      "timestamp": 1663912047.0359857,//timestamp
      "pc_name": "09-23-13-44-53-1/bin/1663912047.035985664.bin",//point cloud path
      "instance_number": 5,//instance number
      "instance": [
        {
          "id": "a5f9185a-5719-4414-9ce5-1ba9316d7050",//unique uuid
          "number": 1,//globel id
          "category": "person",//category
          "action": "moving boxes,walking",//action
          "pointCount": 358,//number of point
          "seg_points": [
                          119855,
                          119856,
                          ...
                          ]//index for points
          "occlusion": 0,//occlusion level(0-1)
          "position": {
            "x": 8.642838478088379,
            "y": -1.170599341392517,
            "z": -0.4981747269630432
          },//bbox position
          "rotation": 1.1941385296061557,//bbox rotation(Yaw)
          "boundingbox3d": {
            "x": 0.7874413728713989,
            "y": 0.4814544916152954,
            "z": 1.5814100503921509
          }//bbox dimensions
        },
        {...},
        ...
      ]
    },
    {...},
    ...
  ]

}

Specification

  1. Annotation files are named to match the corresponding sequence folders in the dataset.
  2. Note: an instance 'id' is unique only within the JSON file it originates from.
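A minimal parsing sketch for one annotation file, assuming the JSON layout above and that pc_name paths are relative to the HCL_Full root; the box is read as a center (position), a yaw angle (rotation), and dimensions (boundingbox3d):

import json
import numpy as np

with open("09-23-13-44-53-1.json") as f:
    anno = json.load(f)

for frame in anno["frame"]:
    # Point cloud that this frame's segmentation indices refer to.
    points = np.fromfile("HCL_Full/" + frame["pc_name"], dtype=np.float32).reshape(-1, 5)
    for inst in frame["instance"]:
        mask = np.zeros(len(points), dtype=bool)
        mask[inst["seg_points"]] = True               # point-level instance mask
        center = np.array([inst["position"][k] for k in ("x", "y", "z")])
        size = np.array([inst["boundingbox3d"][k] for k in ("x", "y", "z")])
        yaw = inst["rotation"]                        # heading around the vertical axis
        print(frame["frameId"], inst["category"], inst.get("action"), int(mask.sum()))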

Split file structure

HCL_split.json
{
  "train": [        
    "10-01-18-42-05-2.json",
    "10-01-18-55-50-2.json",
    ...
  ],
  "test": [
    "10-03-16-35-25-1.json",
    "10-03-16-35-25-2.json",
    ...
  ]
}
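A minimal sketch for reading the split file and recovering the matching sequence names, assuming HCL_split.json sits next to the annotation files:

import json
from pathlib import Path

split = json.loads(Path("HCL_split.json").read_text())

# Each entry names an annotation file; the data folder shares its stem
# (e.g. "10-01-18-42-05-2.json" -> "HCL_Full/10-01-18-42-05-2").
train_sequences = [Path(name).stem for name in split["train"]]
test_sequences = [Path(name).stem for name in split["test"]]
print(len(train_sequences), "training sequences,", len(test_sequences), "test sequences")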

Instance segmentation results on HuCenLife.

Action recognition results on HuCenLife.

BibTeX

@article{xu2023human,
  title={Human-centric Scene Understanding for 3D Large-scale Scenarios},
  author={Xu, Yiteng and Cong, Peishan and Yao, Yichen and Chen, Runnan and Hou, Yuenan and Zhu, Xinge and He, Xuming and Yu, Jingyi and Ma, Yuexin},
  journal={arXiv preprint arXiv:2307.14392},
  year={2023}
}