VGGSounder: Audio-Visual Evaluations for Foundation Models

Jun 9, 2025
Daniil Zverev
,
Thaddäus Wiedemer
,
Ameya Prabhu
,
Matthias Bethge
,
Wieland Brendel
,
A. Sophia Koepke
Abstract
The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The classification dataset VGGSound is commonly used as a benchmark for evaluating audio-visual understanding. However, our analysis identifies several critical issues in VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These flaws lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance and revealing previously unnoticed model limitations. VGGSounder offers a robust benchmark supporting the future development of audio-visual foundation models. Our dataset and project page are available at https://vggsounder.github.io/.
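To make the idea of modality-specific evaluation concrete, here is a minimal Python sketch of how per-modality metrics could be computed over multi-label annotations that tag each class with the modalities in which it is perceivable. The annotation schema, the field names (`label`, `modalities`), and the recall metric are illustrative assumptions, not VGGSounder's actual file format or official evaluation code.

```python
# A minimal sketch of modality-aware, multi-label evaluation in the spirit of
# VGGSounder. The annotation schema, field names, and metric below are
# illustrative assumptions, not the dataset's actual format.

# Each clip carries multiple labels, each tagged with the modality (or
# modalities) in which the class is actually perceivable.
annotations = {
    "clip_0001": [
        {"label": "dog barking", "modalities": {"audio", "visual"}},
        {"label": "wind noise", "modalities": {"audio"}},
    ],
    "clip_0002": [
        # Visible but inaudible, e.g. the instrument is muted in the soundtrack.
        {"label": "playing violin", "modalities": {"visual"}},
    ],
}

# Hypothetical model outputs: a set of predicted labels per clip.
predictions = {
    "clip_0001": {"dog barking"},
    "clip_0002": {"playing violin"},
}

def modality_recall(modality):
    """Fraction of ground-truth labels perceivable in `modality` that the
    model recovered, pooled over all clips."""
    hits, total = 0, 0
    for clip, entries in annotations.items():
        predicted = predictions.get(clip, set())
        for entry in entries:
            if modality in entry["modalities"]:
                total += 1
                hits += entry["label"] in predicted
    return hits / total if total else 0.0

for modality in ("audio", "visual"):
    print(f"{modality} recall: {modality_recall(modality):.2f}")
```

Splitting the ground truth by modality in this way is what lets a benchmark distinguish a model that genuinely hears a class from one that merely sees it, which is the kind of previously hidden limitation the re-annotation is designed to expose.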
Type
Publication
ICCV 2025