VGGSounder: Audio-Visual Evaluations for Foundation Models

Jun 9, 2025
Daniil Zverev
,
Thaddäus Wiedemer
,
Ameya Prabhu
,
Matthias Bethge
,
Wieland Brendel
,
A. Sophia Koepke
Abstract
The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The classification dataset VGGSound is commonly used as a benchmark for evaluating audio-visual understanding. However, our analysis identifies several critical issues in VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These flaws lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance and revealing previously unnoticed model limitations. VGGSounder offers a robust benchmark supporting the future development of audio-visual foundation models. Our dataset and project page are available at https://vggsounder.github.io/.
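To make the idea of modality-specific evaluation concrete, here is a minimal Python sketch of how per-modality metrics could be computed over multi-label annotations that tag each class with the modalities in which it is perceivable. The annotation schema, the field names (`label`, `modalities`), and the recall metric are illustrative assumptions, not VGGSounder's actual file format or official evaluation code.

```python
# A minimal sketch of modality-aware, multi-label evaluation in the spirit of
# VGGSounder. The annotation schema, field names, and metric below are
# illustrative assumptions, not the dataset's actual format.

# Each clip carries multiple labels, each tagged with the modality (or
# modalities) in which the class is actually perceivable.
annotations = {
    "clip_0001": [
        {"label": "dog barking", "modalities": {"audio", "visual"}},
        {"label": "wind noise", "modalities": {"audio"}},
    ],
    "clip_0002": [
        # Visible but inaudible, e.g. the instrument is muted in the soundtrack.
        {"label": "playing violin", "modalities": {"visual"}},
    ],
}

# Hypothetical model outputs: a set of predicted labels per clip.
predictions = {
    "clip_0001": {"dog barking"},
    "clip_0002": {"playing violin"},
}

def modality_recall(modality):
    """Fraction of ground-truth labels perceivable in `modality` that the
    model recovered, pooled over all clips."""
    hits, total = 0, 0
    for clip, entries in annotations.items():
        predicted = predictions.get(clip, set())
        for entry in entries:
            if modality in entry["modalities"]:
                total += 1
                hits += entry["label"] in predicted
    return hits / total if total else 0.0

for modality in ("audio", "visual"):
    print(f"{modality} recall: {modality_recall(modality):.2f}")
```

Splitting the ground truth by modality in this way is what lets a benchmark distinguish a model that genuinely hears a class from one that merely sees it, which is the kind of previously hidden limitation the re-annotation is designed to expose.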
Type
Publication
ICCV 2025