Can LVLMs Obtain a Driver's License? A Benchmark Towards Reliable AGI for Autonomous Driving

Yuhang Lu1,*, Yichen Yao1,*, Jiadong Tu1,*, Jiangnan Shao1,*, Yuexin Ma1,†, Xinge Zhu2,†
1ShanghaiTech University, 2The Chinese University of Hong Kong
*These authors contributed equally. †Corresponding authors.

Abstract

Large Vision-Language Models (LVLMs) have recently garnered significant attention, with many efforts aimed at harnessing their general knowledge to enhance the interpretability and robustness of autonomous driving models. However, LVLMs typically rely on large, general-purpose datasets and lack the specialized expertise required for professional, safe driving. Existing vision-language driving datasets focus primarily on scene understanding and decision-making without providing explicit guidance on traffic rules and driving skills, both of which are critical to driving safety. To bridge this gap, we propose IDKB, a large-scale dataset containing over one million data items collected from various countries, including driving handbooks, theory test data, and simulated road test data. Much like the process of obtaining a driver's license, IDKB encompasses nearly all the explicit knowledge needed for driving, from theory to practice. We conducted comprehensive tests on 15 LVLMs using IDKB to assess their reliability in the context of autonomous driving and provide extensive analysis. We also fine-tuned popular models and achieved notable performance improvements, further validating the significance of our dataset.

Data Overview

The Intelligent Driving Knowledge Base (IDKB) is structured as a driving knowledge resource, mirroring the process individuals follow to acquire expertise when obtaining a driver's license. This process typically involves studying driving handbooks, taking theory tests, and practicing on the road, corresponding to the Driving Handbook Data, Driving Test Data, and Driving Road Data components of our dataset.
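
For illustration, a single IDKB entry can be pictured as the record below. This schema is a sketch for exposition only; the field names and types are assumptions, not the dataset's actual release format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative schema only; field names are assumptions, not IDKB's release format.
@dataclass
class IDKBEntry:
    source: str                  # "handbook", "test", or "road" (the three data types)
    country: str                 # e.g., "Germany"
    language: str                # e.g., "de"
    vehicle_type: str            # "car", "truck", "bus", or "moto"
    category: str                # one of the four semantic categories listed below
    question: str                # question or knowledge statement text
    options: List[str] = field(default_factory=list)   # choices for theory-test items
    answer: Optional[str] = None                       # ground-truth answer/explanation
    image_path: Optional[str] = None                   # associated image, if any
```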

Data Collection

Data construction pipeline of the IDKB dataset. For Driving Handbook Data and Driving Test Data, we collect comprehensive driving knowledge resources from the internet, followed by data extraction and post-processing to obtain the final data. For Driving Road Data, we utilize CARLA to generate simulated road scenarios focused on traffic sign comprehension.
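
To make the simulated branch concrete, the snippet below shows how such a scenario could be captured with the standard CARLA Python API: an autopiloted ego vehicle with a front-facing RGB camera saving frames for later traffic-sign question construction. It is a minimal sketch; the town, sensor pose, and the choice to keep every frame are assumptions, not the actual generation code.

```python
import carla

# Minimal sketch, not the actual generation code: spawn an ego vehicle with a
# front-facing RGB camera and record frames as it drives past traffic signs.
client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.load_world("Town03")  # the town choice is an assumption

blueprints = world.get_blueprint_library()
vehicle_bp = blueprints.filter("vehicle.tesla.model3")[0]
spawn_point = world.get_map().get_spawn_points()[0]
ego = world.spawn_actor(vehicle_bp, spawn_point)
ego.set_autopilot(True)  # let the traffic manager drive the route

camera_bp = blueprints.find("sensor.camera.rgb")
camera_bp.set_attribute("image_size_x", "1280")
camera_bp.set_attribute("image_size_y", "720")
cam_tf = carla.Transform(carla.Location(x=1.5, z=2.4))  # roughly a dashcam pose
camera = world.spawn_actor(camera_bp, cam_tf, attach_to=ego)

# Save every captured frame; a real pipeline would keep only frames with
# visible signs and pair them with question-answer annotations.
camera.listen(lambda image: image.save_to_disk("out/%06d.png" % image.frame))
```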

Data Statistics

The IDKB dataset contains 1,016,956 entries: 84.0% Driving Test Data, 11.1% Driving Road Data (generated with CARLA), and 5.0% Driving Handbook Data. It spans multiple countries and languages, with contributions from regions such as China, Italy, and Germany.

The data can be divided into four groups according to vehicle type:

  • Car (61.6% - sedan, jeep, etc.)
  • Truck (19.3% - minivan, commercial, LGV, etc.)
  • Bus (7.9% - minibus, coaches, etc.)
  • Moto (11.2% - motorbike, motorcycle)

We utilized proprietary LVLMs to classify the data into four semantic categories (a sketch of this labeling step follows the list):

  • Laws & Regulations (22.2%)
  • Road Signs & Signals (38.6%)
  • Driving Techniques (22.0%)
  • Defensive Driving (17.1%)
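
The exact labeling prompt is not reproduced here, so the snippet below only sketches how such LVLM-based tagging could be done. The model name (gpt-4o), the prompt wording, and the fallback rule are all assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CATEGORIES = ["Laws & Regulations", "Road Signs & Signals",
              "Driving Techniques", "Defensive Driving"]

def classify_entry(question_text: str) -> str:
    """Ask a proprietary LVLM to pick one of the four semantic categories."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # stand-in; the specific proprietary model is not named here
        messages=[{
            "role": "user",
            "content": "Classify this driving-knowledge item into exactly one "
                       f"category from {CATEGORIES}. Reply with the category "
                       f"name only.\n\nItem: {question_text}",
        }],
    )
    label = resp.choices[0].message.content.strip()
    return label if label in CATEGORIES else "Laws & Regulations"  # crude fallback
```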

Benchmark

Quantitative results for multiple LVLMs on several driving knowledge understanding tasks.
Results of proprietary LVLMs are shaded in grey. The best result for each metric among the proprietary LVLMs is highlighted in dark brown, while the best result among the open-source LVLMs is shown in bold.

Overall, the evaluated LVLMs did not demonstrate strong driving domain knowledge, highlighting the need for high-quality, structured, and diverse driving knowledge data for effective application in autonomous driving.
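
As a reference point for how such knowledge tests can be scored, the snippet below computes multiple-choice accuracy from free-form model outputs. It is a minimal sketch, not the benchmark's official scoring script; the answer-extraction heuristic is an assumption.

```python
import re
from typing import Iterable, Optional

def first_choice_letter(model_output: str) -> Optional[str]:
    """Extract the first option letter (A-D) mentioned in a model's reply."""
    m = re.search(r"\b([A-D])\b", model_output.strip().upper())
    return m.group(1) if m else None

def multiple_choice_accuracy(outputs: Iterable[str], answers: Iterable[str]) -> float:
    """Fraction of items where the parsed letter matches the ground truth."""
    pairs = list(zip(outputs, answers))
    correct = sum(first_choice_letter(o) == a.upper() for o, a in pairs)
    return correct / len(pairs) if pairs else 0.0

# Toy example: the first reply parses to "B" (correct), the second to "A" (wrong).
print(multiple_choice_accuracy(["The answer is B.", "A"], ["B", "C"]))  # 0.5
```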

Significant Improvement through Fine-Tuning

We fine-tuned four LVLMs, each with a different visual encoder or LLM, to assess the impact of our structured driving knowledge data.
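
The fine-tuning recipe is not spelled out here, so the snippet below shows one plausible parameter-efficient setup: attaching LoRA adapters to Qwen-VL-Chat via Hugging Face peft. The target module names and hyperparameters are assumptions, not the settings used in the paper.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# One plausible setup, not the paper's recipe: LoRA fine-tuning of
# Qwen-VL-Chat on IDKB-style instruction data.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", trust_remote_code=True
)

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection in Qwen; an assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable
```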

As shown in the table, these fine-tuned models achieved IDKB scores comparable to larger proprietary models. MiniCPM-Llama3-V2.5 doubled its Test Data Score, underscoring the data's value in mastering driving laws, rules, and special situations.

The improvement in the Road Data Score reflects a better grasp of traffic signs and regulations in road scenarios, reinforcing our dataset's role in advancing safer and more reliable autonomous driving systems.

Driving Knowledge Boosts Downstream Autonomous Driving Tasks

We applied the Qwen-VL-Chat model, fine-tuned on the IDKB dataset, to the planning task on the nuScenes dataset, demonstrating how driving knowledge can boost downstream task performance.

Using a prompt-based method akin to DriveVLM, the model identifies traffic signs and predicts the ego trajectory for the next three seconds.
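
The exact prompt is not reproduced here; the snippet below sketches the general shape of such prompt-based planning, assuming the model is asked for (x, y) waypoints at 0.5 s intervals. The prompt text, waypoint format, and parser are all hypothetical.

```python
import re
from typing import List, Tuple

# Hypothetical prompt format; the actual prompt used in the paper is not public.
PROMPT = (
    "You are the planner of an autonomous vehicle. "
    "Describe any traffic signs you see, then output the ego trajectory for "
    "the next 3 seconds as six (x, y) waypoints in meters, one per 0.5 s."
)

def parse_waypoints(reply: str) -> List[Tuple[float, float]]:
    """Pull '(x, y)' pairs out of the model's free-form reply."""
    pairs = re.findall(r"\(\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\)", reply)
    return [(float(x), float(y)) for x, y in pairs][:6]

reply = "I see a ROAD WORK AHEAD sign, so I slow down: (0.0, 2.1), (0.0, 3.9)"
print(parse_waypoints(reply))  # [(0.0, 2.1), (0.0, 3.9)]
```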

The IDKB dataset aligns with human driving knowledge acquisition, enabling the model to better understand traffic laws and plan safer routes.

Our results confirm this:

  • A 32% reduction in average L2 distance and a 65% drop in the collision metric, reflecting improved safety (the L2 metric is sketched below).
  • The model correctly identifies the 'ROAD WORK AHEAD' sign, decelerates, and plans a safer trajectory, as evidenced by the decreasing offset between frames.
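
For reference, average L2 distance is computed below, assuming the predicted and ground-truth trajectories are arrays of (x, y) waypoints. This follows common open-loop planning practice on nuScenes and is not necessarily the exact evaluation code used here.

```python
import numpy as np

def average_l2(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Euclidean distance between predicted and ground-truth waypoints.

    Both arrays have shape (T, 2): T future timesteps of (x, y) positions.
    """
    return float(np.linalg.norm(pred - gt, axis=1).mean())

# Toy example: a prediction that hugs the ground truth scores a small L2.
gt = np.array([[0.0, 2.0], [0.0, 4.0], [0.0, 6.0]])
pred = np.array([[0.1, 1.9], [0.2, 3.8], [0.2, 5.7]])
print(average_l2(pred, gt))  # ≈ 0.26 m
```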

BibTeX

@misc{lu2024lvlmsobtaindriverslicense,
      title={Can LVLMs Obtain a Driver's License? A Benchmark Towards Reliable AGI for Autonomous Driving}, 
      author={Yuhang Lu and Yichen Yao and Jiadong Tu and Jiangnan Shao and Yuexin Ma and Xinge Zhu},
      year={2024},
      eprint={2409.02914},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2409.02914}, 
}