From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model
Abstract
The TAD benchmark evaluates Vision-Language Models on temporal understanding in ego-centric autonomous driving footage, reveals substandard performance, and motivates two training-free solutions that improve accuracy.
Temporal understanding in autonomous driving (AD) remains a significant challenge, even for recent state-of-the-art (SoTA) Vision-Language Models (VLMs). Prior work has introduced datasets and benchmarks aimed at improving temporal reasoning, but these emphasize other video domains, such as sports, cooking, and movies. No existing benchmark focuses exclusively on the unique challenges of temporal understanding in ego-centric AD footage. To fill this gap, the Temporal Understanding in Autonomous Driving (TAD) benchmark is presented, which evaluates VLMs' ability to capture the dynamic relationships between actions in AD. TAD comprises nearly 6,000 question-answer (QA) pairs spanning 7 human-designed tasks. In addition, an evaluation of 9 closed- and open-source generalist models, as well as SoTA AD specialist models, is performed. On TAD, current SoTA models achieve substandard accuracies, largely due to imperfect fine-grained motion understanding. To improve motion understanding and overall accuracy on TAD, two novel training-free solutions are proposed: Scene-CoT, which leverages Chain-of-Thought (CoT) prompting, and TCogMap, which incorporates an ego-centric temporal cognitive map. The proposed approaches integrate with existing VLMs and improve average accuracy on TAD by up to 17.72%. By introducing TAD, benchmarking multiple SoTA models, and proposing effective enhancements, this work aims to catalyze future research on temporal understanding in AD. The benchmark and evaluation code are available on Hugging Face (https://huggingface.co/datasets/vbdai/TAD) and GitHub (https://github.com/vbdi/tad_bench), respectively.
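As a rough illustration of the Scene-CoT idea, the sketch below wraps a TAD-style multiple-choice question in a prompt that asks the model to narrate scene dynamics before answering. The prompt wording, option format, and question fields are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of a Scene-CoT-style prompt wrapper. The exact prompt
# used in the paper is not reproduced here; this only illustrates the general
# idea of eliciting a description of scene dynamics before the final answer.

def scene_cot_prompt(question: str, options: list[str]) -> str:
    """Wrap a TAD-style multiple-choice question in a chain-of-thought prompt."""
    option_block = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return (
        "You are watching ego-centric driving footage.\n"
        "Step 1: Describe how the ego vehicle and nearby agents move over time.\n"
        "Step 2: Summarize the temporal order of the key actions you observed.\n"
        "Step 3: Using that summary, answer the question below.\n\n"
        f"Question: {question}\n{option_block}\n"
        "Answer with a single option letter."
    )

if __name__ == "__main__":
    print(scene_cot_prompt(
        "What does the ego vehicle do immediately after the pedestrian crosses?",
        ["Accelerates", "Brakes", "Turns left", "Remains stopped"],
    ))
```

The resulting prompt string would be passed to a VLM together with the sampled video frames; the model-specific inference call is omitted here.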
Community
This work introduces TAD, the first benchmark targeting temporal understanding in ego-centric autonomous-driving videos, evaluates SoTA VLMs, and boosts their performance with two training-free motion-reasoning methods (Scene-CoT and TCogMap).
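For intuition, the sketch below shows one hypothetical way an ego-centric temporal cognitive map could be represented and serialized into text for a VLM prompt. The field names, units, and serialization are assumptions for illustration, not the TCogMap format used in the paper.

```python
# Hypothetical sketch of an ego-centric temporal cognitive map: per-timestamp
# ego motion plus agent positions relative to the ego vehicle, flattened into
# text that can be prepended to a temporal-reasoning prompt.

from dataclasses import dataclass

@dataclass
class EgoState:
    t: float            # timestamp in seconds
    speed_mps: float     # ego speed in metres per second
    heading_deg: float   # heading change relative to the first frame

@dataclass
class AgentObservation:
    t: float
    agent_id: str
    relative_pos: tuple[float, float]  # (forward, lateral) offset from ego, metres

def serialize_cogmap(ego: list[EgoState], agents: list[AgentObservation]) -> str:
    """Turn the temporal map into plain text for a VLM prompt."""
    lines = ["Ego-centric temporal map:"]
    for s in ego:
        lines.append(f"t={s.t:.1f}s ego speed={s.speed_mps:.1f} m/s heading={s.heading_deg:+.0f} deg")
    for a in agents:
        fwd, lat = a.relative_pos
        lines.append(f"t={a.t:.1f}s {a.agent_id} at {fwd:.1f} m ahead, {lat:+.1f} m lateral")
    return "\n".join(lines)

if __name__ == "__main__":
    ego = [EgoState(0.0, 8.2, 0.0), EgoState(1.0, 5.1, -2.0), EgoState(2.0, 0.3, -2.0)]
    agents = [AgentObservation(1.0, "pedestrian_1", (12.0, 1.5))]
    print(serialize_cogmap(ego, agents))
```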
The following related papers were recommended by the Semantic Scholar API:
- MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models (2025)
- WaymoQA: A Multi-View Visual Question Answering Dataset for Safety-Critical Reasoning in Autonomous Driving (2025)
- Robust Driving QA through Metadata-Grounded Context and Task-Specific Prompts (2025)
- LAST: LeArning to Think in Space and Time for Generalist Vision-Language Models (2025)
- Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning (2025)
- Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders (2025)
- Enhancing End-to-End Autonomous Driving with Risk Semantic Distillation from VLM (2025)