ReAL-AD: Towards Human-Like Reasoning in End-to-End Autonomous Driving

Yuhang Lu1, Jiadong Tu1, Yuexin Ma1,†, Xinge Zhu2,†
1ShanghaiTech University, 2The Chinese University of Hong Kong
†Corresponding authors.

Abstract

End-to-end autonomous driving has emerged as a promising approach to unify perception, prediction, and planning within a single framework, reducing information loss and improving adaptability. However, existing methods often rely on fixed and sparse trajectory supervision, limiting their ability to capture the hierarchical reasoning process that human drivers naturally employ. To bridge this gap, we propose ReAL-AD, a Reasoning-Augmented Learning framework that structures decision-making in autonomous driving based on the three-tier human cognitive model: Driving Strategy, Driving Decision, and Driving Operation, where Vision-Language Models (VLMs) are incorporated to enhance situational awareness and structured reasoning across these levels. Specifically, we introduce: (1) the Strategic Reasoning Injector, which formulates high-level driving strategies by interpreting complex traffic contexts from VLM-generated insights; (2) the Tactical Reasoning Integrator, which refines strategic intent into interpretable tactical choices such as lane changes, overtaking, and speed adjustments; and (3) the Hierarchical Trajectory Decoder, which progressively translates tactical decisions into precise control actions for smooth and human-like trajectory execution. Extensive evaluations show that integrating our framework improves planning accuracy and safety by over 30%, making end-to-end autonomous driving more interpretable and aligned with human-like hierarchical reasoning.

Motivation

  • Structural Limitations of End-to-End Systems: Existing end-to-end autonomous driving frameworks predominantly rely on fixed trajectory supervision, which constrains their ability to model the hierarchical reasoning processes inherent to human drivers.
  • Interpretability Challenges: Current approaches lack alignment with the three-tier cognitive model observed in human driving, namely strategic navigation, tactical maneuvers, and operational control, resulting in limited interpretability.
  • Hierarchical Reasoning via Vision-Language Models: The proposed ReAL-AD framework leverages Vision-Language Models to reconstruct hierarchical reasoning, encompassing strategic decision-making, tactical planning, and trajectory decoding.

Pipeline

Overall pipeline of ReAL-AD (a code sketch of this flow follows the steps below):
  1. Multi-view images are processed by the Scene Encoder to extract environmental features.
  2. The Strategic Reasoning Injector generates high-level driving decisions from structured prompts and uses them to enhance the ego query.
  3. The Tactical Reasoning Integrator outputs reactive- and regulatory-level command features.
  4. These features are then fed into the Hierarchical Trajectory Decoder, which progressively refines the latent trajectory space to generate the final planning trajectory.
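The data flow above can be summarized as a single forward pass. The sketch below is illustrative only: the module names and call signatures (SceneEncoder, the injector, integrator, and decoder passed in as constructor arguments) are assumptions made for clarity, not the released implementation.

import torch.nn as nn

class ReALADSketch(nn.Module):
    """Illustrative composition of the four ReAL-AD stages (hypothetical interfaces)."""

    def __init__(self, scene_encoder, vlm, injector, integrator, decoder):
        super().__init__()
        self.scene_encoder = scene_encoder  # multi-view image backbone (step 1)
        self.vlm = vlm                      # vision-language model producing strategy text
        self.injector = injector            # Strategic Reasoning Injector (step 2)
        self.integrator = integrator        # Tactical Reasoning Integrator (step 3)
        self.decoder = decoder              # Hierarchical Trajectory Decoder (step 4)

    def forward(self, multi_view_images, ego_query, prompt):
        # 1. Extract environmental features from multi-view images.
        scene_feats = self.scene_encoder(multi_view_images)

        # 2. The VLM interprets the scene via a structured prompt; the injector
        #    fuses the resulting high-level strategy into the ego query.
        strategy = self.vlm(multi_view_images, prompt)
        ego_query = self.injector(ego_query, strategy, scene_feats)

        # 3. The integrator emits reactive- and regulatory-level command features.
        reactive_feats, regulatory_feats = self.integrator(ego_query, scene_feats)

        # 4. The decoder progressively refines the latent trajectory space
        #    into the final planned trajectory.
        return self.decoder(ego_query, reactive_feats, regulatory_feats)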

Experimental Results

We evaluated ReAL-AD on the Bench2Drive dataset, comparing it with leading methods.

Our approach achieves:

  • Over 30% improvement in planning accuracy and safety metrics (L2 error, collision rate) compared to strong baselines; a sketch of how these metrics are computed follows this list.
  • Best-in-class performance among methods using vision-language models.
  • Significant gains in driving score and route completion in closed-loop tests.
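For reference, the open-loop planning metrics cited above are commonly computed as follows. This is a minimal sketch using standard definitions (mean waypoint L2 distance and the fraction of colliding plans); the exact Bench2Drive evaluation protocol may differ, and the occupancy-check callable is a hypothetical stand-in for a BEV occupancy test.

import numpy as np

def planning_l2_error(pred_traj, gt_traj):
    """Mean Euclidean distance between predicted and ground-truth waypoints.

    pred_traj, gt_traj: arrays of shape (T, 2) holding (x, y) waypoints
    over the planning horizon T.
    """
    return float(np.linalg.norm(pred_traj - gt_traj, axis=-1).mean())

def collision_rate(pred_trajs, is_colliding):
    """Fraction of planned trajectories whose waypoints hit an obstacle.

    pred_trajs: list of (T, 2) waypoint arrays.
    is_colliding: callable returning True if any waypoint of a trajectory
    overlaps an occupied cell (e.g., in a BEV occupancy map).
    """
    hits = sum(1 for traj in pred_trajs if is_colliding(traj))
    return hits / max(len(pred_trajs), 1)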

Why does ReAL-AD perform better?
  • By introducing a human-like, hierarchical reasoning process, our model learns to make decisions more like an experienced driver.
  • The structured integration of strategic, tactical, and operational reasoning enables better generalization and safer planning.
  • Leveraging vision-language models enhances scene understanding and makes the decision process more interpretable.

Visualization of VLM-generated information

Visualization of VLM-generated driving strategies and tactical commands, showing their alignment with the final planned trajectory.

Our approach brings human-like reasoning into autonomous driving, making the decision process more transparent and easier to understand. The visualizations below highlight two challenging scenarios:

  • Scenario 1 (Left): The model clearly identifies the need for a lane change and generates a command to execute it, resulting in a safe and smooth trajectory.
  • Scenario 2 (Right): The model detects pedestrians ahead, recommends slowing down and emergency braking, and plans a trajectory that prioritizes safety.

These examples demonstrate how our system's intermediate reasoning steps, visible in the visualizations, directly influence the final driving decisions, making the entire process more interpretable and trustworthy.

BibTeX


      @misc{lu2025realadhumanlikereasoningendtoend,
        title={ReAL-AD: Towards Human-Like Reasoning in End-to-End Autonomous Driving}, 
        author={Yuhang Lu and Jiadong Tu and Yuexin Ma and Xinge Zhu},
        year={2025},
        eprint={2507.12499},
        archivePrefix={arXiv},
        primaryClass={cs.RO},
        url={https://arxiv.org/abs/2507.12499}, 
      }