Abstract

The generation of temporally consistent, high-fidelity driving videos over extended horizons is a fundamental challenge in autonomous driving world modeling. Existing approaches often suffer from error accumulation and feature misalignment because they do not adequately decouple spatio-temporal dynamics and provide only limited cross-frame feature propagation. To address these limitations, we present STAGE (Streaming Temporal Attention Generative Engine), a novel auto-regressive framework that pioneers hierarchical feature coordination and multi-phase optimization for sustained, long-horizon video synthesis. Specifically, we introduce Hierarchical Temporal Feature Transfer (HTFT) and a novel multi-stage training strategy. HTFT improves temporal consistency across frames by modeling the temporal and denoising processes separately and transferring denoising features between frames throughout generation. The multi-stage training strategy divides training into three stages, combining model decoupling with simulation of the auto-regressive inference process, thereby accelerating convergence and reducing error accumulation. Experiments on the nuScenes dataset show that STAGE significantly surpasses existing methods on the long-horizon driving video generation task. We further explore STAGE's ability to generate driving videos of unbounded length: it produces 600 frames of high-quality driving video on nuScenes, far exceeding the maximum length achievable by existing methods.
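To make the "auto-regressive inference process simulation" idea concrete, the sketch below occasionally conditions the model on its own rollouts rather than on ground-truth frames during training, so that train-time inputs resemble what the model will see at test time. This is a minimal illustration under stated assumptions, not the released STAGE code: `model.sample`, `model.add_noise`, `model.num_timesteps`, and the conditioning interface are hypothetical placeholders.

```python
# Hedged sketch: simulate auto-regressive inference during training so the
# model learns to condition on its own (imperfect) generations. All model
# methods used here are hypothetical placeholders, not the STAGE API.
import random
import torch
import torch.nn.functional as F

def train_step(model, clip, num_condition=4, sim_prob=0.5):
    # clip: (batch, frames, ...) ground-truth video clip.
    cond, target = clip[:, :num_condition], clip[:, num_condition:]

    if random.random() < sim_prob:
        with torch.no_grad():
            # Replace ground-truth condition frames with the model's own
            # rollout, exposing training to inference-time distribution shift.
            cond = model.sample(cond, num_frames=num_condition)

    # Standard diffusion training objective on the target frames.
    noise = torch.randn_like(target)
    t = torch.randint(0, model.num_timesteps, (clip.size(0),), device=clip.device)
    noisy = model.add_noise(target, noise, t)
    pred = model(noisy, t, condition=cond)
    return F.mse_loss(pred, noise)
```

Mixing ground-truth and self-generated conditions (controlled by `sim_prob` here) is one plausible way to reduce the train-test mismatch that drives error accumulation in long auto-regressive rollouts.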

Method

Overview of STAGE. “AF”, “CF”, and “NF” denote Anchor Frame, Condition Frame, and Noise Frame, respectively. (I) illustrates the hierarchical structuring of time and denoising steps, with the horizontal axis representing time and the vertical axis representing the denoising steps; T denotes the T-th frame and t the t-th denoising step. (II) illustrates the framework of our model, in which HTFT transfers features along the temporal dimension, refining the generation at each step. (III) presents our multi-stage training strategy and the procedure for infinite-length generation.
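As described above, the HTFT pathway transfers denoising features from frame T−1 to frame T at the same denoising step, keeping the temporal axis and the denoising axis decoupled. Below is a minimal PyTorch sketch of that feature-transfer pattern, assuming token-shaped features fused via cross-attention; `HTFTBlock`, its layer choices, and the caching scheme are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch of the HTFT feature-transfer pattern: each block denoises the
# current frame, then attends to features cached from the previous frame at
# the same denoising step. Layer choices are assumptions for illustration.
import torch
import torch.nn as nn

class HTFTBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Per-frame denoising update (stand-in for the backbone's block).
        self.denoise = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        # Cross-attention: current frame queries the previous frame's
        # features cached at the same denoising step (keys/values).
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, prev_feat=None):
        x = self.denoise(x)                       # update along the denoising axis
        if prev_feat is not None:                 # transfer along the time axis
            fused, _ = self.temporal_attn(self.norm(x), prev_feat, prev_feat)
            x = x + fused
        return x

# Usage sketch: features flow frame-to-frame at a fixed denoising step.
block = HTFTBlock(dim=64)
prev = None
for frame_tokens in torch.randn(3, 1, 16, 64):    # 3 frames, 16 tokens, dim 64
    prev = block(frame_tokens, prev_feat=prev)    # frame T reads frame T-1
```

In a full model, one such cache would be kept per denoising step, matching the grid in panel (I) where time runs horizontally and denoising steps run vertically.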


Visualization

Qualitative Comparison between Vista and STAGE on the long video generation task. We generated 201 frames and selected frames 41, 81, 121, 161, and 201 for comparison.

Qualitative Comparison between Vista and STAGE on the short video generation task. We generated 16 frames and selected frames 2, 5, 9, and 16 for comparison.

Visualization of Longer Video Generation. We generated 601 frames and selected frames 121, 241, 361, 481, and 601 for visualization.

Videos

240 frames generated by STAGE.


600 frames generated by STAGE.