Robusto-1 Dataset: Comparing Humans and VLMs on Real Out-Of-Distribution Autonomous Driving VQA from Peru

Dunant Cusipuma, David Ortega, Victor Flores-Benites, Arturo Deza

Research output: Chapter in book/report/conference proceedings, Conference contribution, Peer-reviewed

Abstract

As multimodal foundation models begin to be deployed experimentally in self-driving cars, a reasonable question to ask is how similarly to humans these systems respond in certain driving situations, especially out-of-distribution ones. To study this, we create the Robusto-1 dataset. It uses dashcam video data from Peru, a country with some of the 'worst' (most aggressive) drivers in the world, a high traffic index, and a high ratio of bizarre to non-bizarre street objects that were likely never seen in training. In particular, we want to preliminarily test at a cognitive level how well foundational Visual-Language Models (VLMs) compare to humans in driving. To do so, we move away from bounding boxes, segmentation maps, occupancy maps, or trajectory estimation toward multimodal Visual Question Answering (VQA). We compare humans and machines through Representational Similarity Analysis (RSA), a popular method in systems neuroscience. Depending on the type of questions we ask and the answers these systems give, we show in which cases VLMs and humans converge or diverge, which allows us to probe their cognitive alignment. We find that the degree of alignment varies significantly depending on the type of question asked to each type of system (humans vs. VLMs), highlighting a gap in their alignment.
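The abstract names RSA but does not spell out its pipeline. What follows is a minimal sketch of generic RSA, not the authors' implementation. It assumes the following:

* Each system's answers have already been encoded as one numeric vector per question or clip. The encoding and the toy vectors here are hypothetical.
* Dissimilarity is 1 minus the Pearson correlation.
* Two representational dissimilarity matrices (RDMs) are compared with a Spearman rank correlation over their upper triangles.

These are common RSA choices, but they are assumptions here. The helper names `rdm_upper`, `ranks`, and `rsa_score` are illustrative.

```python
from itertools import combinations
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither sequence is constant (else division by zero)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rdm_upper(vectors: List[Sequence[float]]) -> List[float]:
    """Upper triangle of an RDM, using 1 - Pearson r as the (assumed) dissimilarity."""
    return [1.0 - pearson(vectors[i], vectors[j])
            for i, j in combinations(range(len(vectors)), 2)]


def ranks(values: Sequence[float]) -> List[float]:
    """1-based average ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def rsa_score(system_a: List[Sequence[float]], system_b: List[Sequence[float]]) -> float:
    """Spearman correlation between two systems' RDMs (Pearson r on the ranks)."""
    return pearson(ranks(rdm_upper(system_a)), ranks(rdm_upper(system_b)))


# Hypothetical toy data: 4 questions, each answer encoded as a 3-d vector.
human = [[1, 2, 3], [2, 1, 0], [1, 0, 1], [3, 3, 1]]
vlm = [[1, 2, 2], [2, 0, 0], [0, 1, 1], [3, 2, 1]]

if __name__ == "__main__":
    print(f"human-vs-VLM RSA score: {rsa_score(human, vlm):.3f}")
```

Ranking the RDM entries before correlating them makes the comparison depend only on the *ordering* of dissimilarities. This is why Spearman is a frequent default when the two systems' representations live on different scales, as human and model answers typically do.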

Original language: English
Host publication title: Proceedings - 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2025
Publisher: IEEE Computer Society
Pages: 3817-3828
Number of pages: 12
ISBN (electronic): 9798331599942
DOI
Status: Published - 2025
Event: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2025 - Nashville, United States
Duration: 11 Jun 2025 to 12 Jun 2025

Publication series

Name: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
ISSN (print): 2160-7508
ISSN (electronic): 2160-7516

Conference

Conference: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2025
Country/Territory: United States
City: Nashville
Period: 11/06/25 to 12/06/25
