WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language

¹ShanghaiTech University, ²The University of Hong Kong, ³Shanghai AI Laboratory, ⁴The Chinese University of Hong Kong
Introduction to the 3DVGW task and related applications. The assistive robot observes the dynamic scene and locates the 3D object in the physical world according to the natural language description. Then, the robot moves to the target object to provide service. WildRefer provides a LiDAR-camera multi-sensor solution that can perform 3D visual grounding in large-scale unconstrained environments.

Abstract

We introduce the task of 3D visual grounding in large-scale dynamic scenes based on natural linguistic descriptions and online-captured multi-modal visual data, including 2D images and 3D LiDAR point clouds. We present a novel method, dubbed WildRefer, for this task, which fully utilizes the rich appearance information in images, the positional and geometric clues in point clouds, and the semantic knowledge of language descriptions. In addition, we propose two novel datasets, i.e., STRefer and LifeRefer, which focus on large-scale human-centric daily-life scenarios and come with abundant 3D object and natural language annotations. Our datasets are significant for research on 3D visual grounding in the wild and have great potential to boost the development of autonomous driving and service robots. Extensive experiments and ablation studies demonstrate that our method achieves state-of-the-art performance on the proposed benchmarks.

Method

Pipeline of WildRefer. The inputs are multi-frame synchronized point clouds and images as well as a natural language description. After feature extraction, we obtain two types of visual features and a text feature. Through two dynamic visual encoders, we extract dynamic-enhanced image and point features. A triple-modal feature interaction module then fuses valuable information from the different modalities. Finally, a DETR-like decoder predicts the location and size of the target object. SA and CA denote self-attention and cross-attention, respectively.
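To make the dataflow concrete, below is a minimal PyTorch-style sketch of this pipeline. The backbones are omitted (image, point, and text tokens are assumed to be already extracted), and all module names, dimensions, and layer counts are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of the WildRefer pipeline described above (assumptions only).
import torch
import torch.nn as nn


class DynamicEncoder(nn.Module):
    """Fuses per-frame visual tokens across time with self-attention (SA)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.sa = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):                  # (B, T*N, dim): tokens of T frames
        out, _ = self.sa(feats, feats, feats)
        return out


class TriModalFusion(nn.Module):
    """Cross-attention (CA) between point, image, and text tokens."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.ca_pts_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ca_img_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ca_pts_img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, pts, img, txt):
        pts, _ = self.ca_pts_txt(pts, txt, txt)   # point tokens attend to words
        img, _ = self.ca_img_txt(img, txt, txt)   # image tokens attend to words
        pts, _ = self.ca_pts_img(pts, img, img)   # points gather appearance cues
        return pts


class WildReferSketch(nn.Module):
    def __init__(self, dim=256, num_queries=64):
        super().__init__()
        self.dyn_img = DynamicEncoder(dim)
        self.dyn_pts = DynamicEncoder(dim)
        self.fusion = TriModalFusion(dim)
        self.queries = nn.Embedding(num_queries, dim)
        layer = nn.TransformerDecoderLayer(dim, 8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.head = nn.Linear(dim, 7)           # (x, y, z, l, w, h, score)

    def forward(self, img_tokens, pts_tokens, txt_tokens):
        img = self.dyn_img(img_tokens)          # dynamic-enhanced image features
        pts = self.dyn_pts(pts_tokens)          # dynamic-enhanced point features
        mem = self.fusion(pts, img, txt_tokens)
        q = self.queries.weight.unsqueeze(0).expand(mem.size(0), -1, -1)
        return self.head(self.decoder(q, mem))  # per-query box + confidence
```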

Dataset

STRefer

LifeRefer

Dataset Structure

Dataset file structure

STRefer & LifeRefer
|── group_id # strefer or liferefer
|── scene_id # unique scene id that can match STCrowd & HucenLife
|── object_id # unique object id (for LifeRefer, it is kept the same as in HucenLife;
|               for STRefer, it consists of the scene id, frame id, and object id)
|── point_cloud
|   |── point_cloud_name # the frame name of point cloud for the scene
|   |── bbox             # bounding box of the object
|   |── category         # category of the object
|── language
|   |── description      # language description of the object
|   |── token            # token of the description
|   |── ann_id           # annotation id of the object
|── image
|   |── image_name       # the frame name of image for the scene 
|── calibration 
|   |── ex_matrix        # extrinsic matrix of the calibration
|   |── in_matrix        # intrinsic matrix of the calibration
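As a usage illustration, the sketch below reads one annotation record with the fields above and projects the object's box center into the image using the calibration matrices. The JSON serialization, the file name `strefer.json`, and the matrix shapes are assumptions; only the field names follow the structure listed here.

```python
# Hypothetical walkthrough of one annotation record with the fields above.
import json
import numpy as np

with open("strefer.json") as f:          # assumed serialization and file name
    annotations = json.load(f)

sample = annotations[0]
print(sample["group_id"], sample["scene_id"], sample["object_id"])
print(sample["language"]["description"])

# Project the LiDAR box center into the image with the calibration data.
box = np.array(sample["point_cloud"]["bbox"])        # (x, y, z, l, w, h, ...)
ex = np.array(sample["calibration"]["ex_matrix"])    # assumed 3x4 LiDAR-to-camera
K = np.array(sample["calibration"]["in_matrix"])     # assumed 3x3 intrinsics

center_h = np.append(box[:3], 1.0)                   # homogeneous LiDAR point
cam = ex @ center_h                                  # camera coordinates
uv = K @ cam
uv = uv[:2] / uv[2]                                  # pixel coordinates
print("object center projects to pixel", uv)
```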

Specification

    We strongly recommend using our preprocessed data of STCrowd and HucenLife.


Quantitative comparisons

    Comparison results on STRefer and LifeRefer. * denotes the one-stage version without the pretrained 3D decoder.

| Method | Publication | Type | STRefer Acc@0.25 | STRefer Acc@0.5 | STRefer mIoU | LifeRefer Acc@0.25 | LifeRefer Acc@0.5 | LifeRefer mIoU | Time cost (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ScanRefer | ECCV-2020 | Two-Stage | 32.93 | 30.39 | 25.21 | 22.76 | 14.89 | 12.61 | 156 |
| ReferIt3D | ECCV-2020 | Two-Stage | 34.05 | 31.61 | 26.05 | 26.18 | 17.06 | 14.38 | 194 |
| 3DVG-Transformer | ICCV-2021 | Two-Stage | 40.53 | 37.71 | 30.63 | 25.71 | 15.94 | 13.99 | 160 |
| MVT | CVPR-2022 | Two-Stage | 45.12 | 42.40 | 35.03 | 21.65 | 13.17 | 11.71 | 242 |
| 3DJCG | CVPR-2022 | Two-Stage | 50.47 | 47.47 | 38.85 | 27.82 | 16.87 | 15.40 | 161 |
| BUTD-DETR | ECCV-2022 | Two-Stage | 57.60 | 47.47 | 35.22 | 30.81 | 11.66 | 14.80 | 252 |
| EDA | CVPR-2023 | Two-Stage | 55.91 | 47.28 | 34.32 | 31.44 | 11.18 | 15.00 | 291 |
| 3D-SPS | CVPR-2022 | One-Stage | 44.47 | 42.40 | 30.43 | 28.01 | 18.20 | 15.78 | 130 |
| BUTD-DETR* | ECCV-2022 | One-Stage | 56.66 | 45.12 | 33.52 | 32.46 | 12.82 | 15.96 | 138 |
| EDA* | CVPR-2023 | One-Stage | 57.41 | 45.59 | 34.03 | 29.32 | 12.25 | 14.41 | 154 |
| Ours | – | One-Stage | 62.01 | 54.97 | 38.77 | 38.89 | 18.42 | 19.47 | 151 |
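For reference, Acc@0.25 and Acc@0.5 count predictions whose 3D IoU with the ground-truth box exceeds 0.25 and 0.5, and mIoU averages the IoU over all samples. The sketch below computes these metrics for axis-aligned boxes; ignoring box heading is a simplifying assumption, not the benchmark's exact evaluation code.

```python
# Sketch of the Acc@IoU and mIoU metrics for axis-aligned 3D boxes
# given as (cx, cy, cz, l, w, h); ignoring box heading is an assumption.
import numpy as np

def iou_3d(a, b):
    """3D IoU of two axis-aligned boxes given as (center, size)."""
    a_min, a_max = a[:3] - a[3:] / 2, a[:3] + a[3:] / 2
    b_min, b_max = b[:3] - b[3:] / 2, b[:3] + b[3:] / 2
    inter = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None)
    inter_vol = inter.prod()
    union = a[3:].prod() + b[3:].prod() - inter_vol
    return inter_vol / union

def evaluate(preds, gts, thr=0.25):
    """Returns (Acc@thr, mIoU) over paired predicted and ground-truth boxes."""
    ious = np.array([iou_3d(p, g) for p, g in zip(preds, gts)])
    return (ious > thr).mean(), ious.mean()

pred = np.array([0.0, 0.0, 0.9, 0.8, 0.8, 1.7])
gt   = np.array([0.1, 0.0, 0.9, 0.8, 0.8, 1.8])
print(evaluate([pred], [gt], thr=0.25))
```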