Human motion prediction is crucial for human-centric multimedia understanding and interaction. Current methods typically rely on ground-truth human poses as observed input, which is impractical for real-world scenarios where only raw visual sensor data is available. To deploy these methods in practice, a preceding pose estimation stage is essential. However, such two-stage approaches often suffer performance degradation due to error accumulation. Moreover, reducing raw visual data to sparse keypoint representations significantly diminishes information density and discards fine-grained features. In this paper, we propose LiDAR-HMP, the first single-LiDAR-based 3D human motion prediction approach, which takes raw LiDAR point clouds as input and directly forecasts future 3D human poses. Building upon our novel structure-aware body feature descriptor, LiDAR-HMP adaptively maps the observed motion manifold to future poses and effectively models the spatial-temporal correlations of human motions to further refine the prediction results. Extensive experiments show that our method achieves state-of-the-art performance on two public benchmarks and demonstrates remarkable robustness and efficacy in real-world deployments.
The pipeline of our LiDAR-HMP. First, we extract the structure-aware body feature descriptor from the observed LiDAR point cloud frames. Then, we adaptively predict human motion with learnable queries to obtain initial predictions, and explicitly model the spatial-temporal correlations among them to refine the predicted motions. Finally, we decode the joint-wise and point-wise results for auxiliary supervision.
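For readers who want a concrete picture of these stages, below is a minimal PyTorch sketch. Everything in it is an illustrative assumption rather than the paper's actual architecture: the class name `LiDARHMPSketch`, the per-point MLP standing in for the structure-aware body feature descriptor, the transformer decoder with learnable queries, the transformer-encoder refiner, and all dimensions and hyperparameters.

```python
import torch
import torch.nn as nn


class LiDARHMPSketch(nn.Module):
    """Illustrative skeleton of the pipeline described above (not the authors' code)."""

    def __init__(self, feat_dim=256, num_joints=15, future_frames=10):
        super().__init__()
        # 1) Structure-aware body feature descriptor, approximated here by a
        #    shared per-point MLP followed by max-pooling over points.
        self.point_encoder = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, feat_dim)
        )
        # 2) Adaptive motion prediction: one learnable query per future frame
        #    cross-attends to the observed frame features.
        self.future_queries = nn.Parameter(torch.randn(future_frames, feat_dim))
        self.query_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(feat_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # 3) Explicit spatial-temporal modeling to refine the initial predictions.
        self.refiner = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # 4) Joint-wise output head; a point-wise head for auxiliary supervision
        #    would sit alongside it during training (omitted from forward here).
        self.joint_head = nn.Linear(feat_dim, num_joints * 3)
        self.num_joints = num_joints

    def forward(self, frames):
        # frames: (B, T_obs, N_points, 3) raw LiDAR point cloud sequence
        B = frames.size(0)
        per_point = self.point_encoder(frames)           # (B, T_obs, N, C)
        frame_feats = per_point.max(dim=2).values        # (B, T_obs, C)
        queries = self.future_queries.expand(B, -1, -1)  # (B, T_fut, C)
        initial = self.query_decoder(queries, frame_feats)
        refined = self.refiner(initial)                  # (B, T_fut, C)
        joints = self.joint_head(refined)                # (B, T_fut, J*3)
        return joints.view(B, -1, self.num_joints, 3)


if __name__ == "__main__":
    model = LiDARHMPSketch()
    obs = torch.randn(2, 8, 1024, 3)  # 2 sequences, 8 frames, 1024 points each
    print(model(obs).shape)           # torch.Size([2, 10, 15, 3])
```

One design point this sketch highlights: with learnable queries, the number of forecast frames is a property of the model rather than of the observation length, which is one plausible reading of "adaptively predict the human motion with learnable queries" in the caption.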
@article{han2024towards,
  title={Towards Practical Human Motion Prediction with LiDAR Point Clouds},
  author={Han, Xiao and Ren, Yiming and Yao, Yichen and Sun, Yujing and Ma, Yuexin},
  journal={arXiv preprint arXiv:2408.08202},
  year={2024}
}