
ORIGINAL RESEARCH article

Front. Robot. AI

Sec. Computational Intelligence in Robotics

Volume 12 - 2025 | doi: 10.3389/frobt.2025.1684845

AniDriveQA: A VQA Dataset for Driving Scenes with Animal Presence

Provisionally accepted
Rui Wang1, Ruiqi Wang2, Hao Hu1,3*, Huai Yu4
  • 1The Institute of Computing Technologies, China Academy of Railway Sciences Corporation Limited, Beijing, China
  • 2School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China
  • 3The Center of National Railway Intelligent Transportation System Engineering and Technology, Beijing, China
  • 4Signal & Communication Research Institute, China Academy of Railway Sciences Corporation Limited, Beijing, China

The final, formatted version of the article will be published soon.

Animal-involved scenarios present significant challenges for autonomous driving systems due to their rarity, unpredictability, and safety-critical nature. However, existing vision-language datasets for autonomous driving largely neglect these long-tail situations. This paper introduces AniDriveQA, a novel visual question answering (VQA) dataset specifically designed to evaluate vision-language models (VLMs) in driving scenarios involving animals. It aims to advance the reasoning, perception, and decision-making capabilities of VLMs in rare yet safety-critical autonomous driving scenarios. The dataset is built through a scalable pipeline that collects diverse animal-related traffic scenes from internet videos, filters and annotates the data using object detection and scene classification models, and generates multi-task VQA labels with a large vision-language model. AniDriveQA encompasses three core task types: scene description, animal description, and driving suggestion. For evaluation, this paper adopts a hybrid scheme that combines classification accuracy for structured tasks with LLM-based scoring for open-ended responses. Extensive experiments on open-source VLMs reveal substantial performance disparities across models and tasks, highlighting the difficulty and diagnostic value of the dataset.
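As a rough illustration of the hybrid evaluation described above, the sketch below combines exact-match accuracy for structured answers with a judge score in [0, 1] for open-ended answers. It is a minimal sketch under assumptions not stated in the abstract: which tasks count as structured versus open-ended (here, driving suggestions are treated as open-ended), the grading prompt, and all function and field names are illustrative, not taken from the paper.

```python
# Hypothetical sketch of a hybrid VQA evaluation: exact-match accuracy for
# structured tasks plus an LLM-judge score for open-ended responses.
# Task names, sample fields, and the judge interface are assumptions.
from typing import Callable, Dict, List


def classification_accuracy(predictions: List[str], references: List[str]) -> float:
    """Exact-match accuracy for structured tasks (e.g. scene/animal categories)."""
    if not references:
        return 0.0
    correct = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predictions, references))
    return correct / len(references)


def llm_judge_score(question: str, prediction: str, reference: str,
                    judge: Callable[[str], float]) -> float:
    """Score an open-ended answer with an LLM judge.

    `judge` is any callable mapping a grading prompt to a score in [0, 1];
    the prompt wording and the grading model are assumptions here.
    """
    prompt = (
        "Rate how well the candidate answer matches the reference on a 0-1 scale.\n"
        f"Question: {question}\nReference: {reference}\nCandidate: {prediction}\nScore:"
    )
    return max(0.0, min(1.0, judge(prompt)))


def evaluate(samples: List[Dict], judge: Callable[[str], float]) -> Dict[str, float]:
    """Aggregate per-task metrics over VQA samples.

    Each sample is assumed to carry: task ("scene_description",
    "animal_description", or "driving_suggestion"), question, prediction, reference.
    """
    structured = [s for s in samples if s["task"] != "driving_suggestion"]
    open_ended = [s for s in samples if s["task"] == "driving_suggestion"]
    open_scores = [llm_judge_score(s["question"], s["prediction"], s["reference"], judge)
                   for s in open_ended]
    return {
        "structured_accuracy": classification_accuracy(
            [s["prediction"] for s in structured],
            [s["reference"] for s in structured]),
        "open_ended_llm_score": sum(open_scores) / len(open_scores) if open_scores else 0.0,
    }
```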

Keywords: vision-language models, visual question answering (VQA), autonomous driving, animal-involved scenarios, benchmark dataset

Received: 13 Aug 2025; Accepted: 26 Sep 2025.

Copyright: © 2025 Wang, Wang, Hu and Yu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Hao Hu, hhcars11@163.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.