TY - GEN
T1 - Robusto-1 Dataset
T2 - 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2025
AU - Cusipuma, Dunant
AU - Ortega, David
AU - Flores-Benites, Victor
AU - Deza, Arturo
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - As multimodal foundational models start being deployed experimentally in self-driving cars, a reasonable question to ask is: how similarly to humans do these systems respond in certain driving situations, especially those that are out-of-distribution? To study this, we create the Robusto-1 dataset, which uses dashcam video data from Peru, a country with some of the 'worst' (most aggressive) drivers in the world, a high traffic index, and a high ratio of bizarre to non-bizarre street objects likely never seen in training. In particular, to preliminarily test at a cognitive level how well Foundational Visual-Language Models (VLMs) compare to humans in driving, we move away from bounding boxes, segmentation maps, occupancy maps, and trajectory estimation to multimodal Visual Question Answering (VQA), comparing humans and machines through a popular method in systems neuroscience known as Representational Similarity Analysis (RSA). Depending on the type of questions we ask and the answers these systems give, we show in which cases VLMs and humans converge or diverge, allowing us to probe their cognitive alignment. We find that the degree of alignment varies significantly depending on the type of questions asked of each type of system (humans vs. VLMs), highlighting a gap in their alignment.
AB - As multimodal foundational models start being deployed experimentally in self-driving cars, a reasonable question to ask is: how similarly to humans do these systems respond in certain driving situations, especially those that are out-of-distribution? To study this, we create the Robusto-1 dataset, which uses dashcam video data from Peru, a country with some of the 'worst' (most aggressive) drivers in the world, a high traffic index, and a high ratio of bizarre to non-bizarre street objects likely never seen in training. In particular, to preliminarily test at a cognitive level how well Foundational Visual-Language Models (VLMs) compare to humans in driving, we move away from bounding boxes, segmentation maps, occupancy maps, and trajectory estimation to multimodal Visual Question Answering (VQA), comparing humans and machines through a popular method in systems neuroscience known as Representational Similarity Analysis (RSA). Depending on the type of questions we ask and the answers these systems give, we show in which cases VLMs and humans converge or diverge, allowing us to probe their cognitive alignment. We find that the degree of alignment varies significantly depending on the type of questions asked of each type of system (humans vs. VLMs), highlighting a gap in their alignment.
KW - autonomous driving
KW - neuroai
KW - representational similarity analysis
KW - vlms
UR - https://www.scopus.com/pages/publications/105017855504
U2 - 10.1109/CVPRW67362.2025.00367
DO - 10.1109/CVPRW67362.2025.00367
M3 - Conference contribution
AN - SCOPUS:105017855504
T3 - IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
SP - 3817
EP - 3828
BT - Proceedings - 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2025
PB - IEEE Computer Society
Y2 - 11 June 2025 through 12 June 2025
ER -