Researchers led by Marek Kubis, Paweł Skórzewski and Iwona Christop from Adam Mickiewicz University, together with Mateusz Czyżnikiewicz, Jakub Kubiak and Łukasz Bondaruk from Samsung R&D Institute Poland, present C3T, the Cross-modal Capabilities Conservation Test. The benchmark is designed to assess whether large language models retain the same level of language understanding when accessed through speech as through text. Rather than focusing on speech recognition, C3T adapts textual language understanding tasks into audio form and evaluates models in both modalities to reveal any loss of comprehension introduced by speech input.
C3T uses automated selection and filtering of textual tasks to keep only those plausible for voice interaction, then converts them to audio via a voice cloning text-to-speech pipeline. The authors generate diverse speaker voices, drawing on datasets such as GLOBE, to test how models perform across accents, genders and named identities. Fairness is quantified by aggregating worst-case outcomes across demographic groups, realised by substituting group-indicating lexical cues, while robustness is measured by the consistency between text and speech performance. This methodology allows direct comparison of raw accuracy as well as per-group and worst-case behaviour that plain accuracy metrics can mask.
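To make the two scores concrete, here is a minimal sketch in Python of how a worst-case group aggregate and a text-versus-speech consistency score could be computed. The function names, the group labels and the per-item results are illustrative assumptions, not the authors' exact formulas or data:

```python
# Illustrative sketch (not the authors' exact metrics): scoring a model's
# per-item correctness by demographic group and across modalities.

def accuracy(results):
    """Fraction of correct answers; results is a list of booleans."""
    return sum(results) / len(results)

def worst_case_group_accuracy(results_by_group):
    """Fairness-style aggregate: the minimum accuracy over demographic
    groups, so the worst-served group dominates the score."""
    return min(accuracy(r) for r in results_by_group.values())

def modality_consistency(text_correct, speech_correct):
    """Robustness-style score: fraction of items where the text and
    speech variants of the same question receive the same outcome."""
    agree = sum(t == s for t, s in zip(text_correct, speech_correct))
    return agree / len(text_correct)

# Hypothetical per-group results (True = the model answered correctly).
by_group = {
    "female_speakers": [True, True, False, True],
    "male_speakers":   [True, True, True, True],
}
text_results   = [True, True, False, True]
speech_results = [True, False, False, True]

print(worst_case_group_accuracy(by_group))            # 0.75
print(modality_consistency(text_results, speech_results))  # 0.75
```

Under this kind of scoring, a model with high average accuracy can still surface a low worst-case or consistency value, which is exactly the behaviour the benchmark is built to expose.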
Experimental results reported by the team show that even high-performing models can exhibit noticeable accuracy drops and inconsistencies when switching from text to speech input. The benchmark highlights cases where a model answers correctly on text but fails on speech for particular demographic voices, demonstrating that fair performance across speakers does not guarantee modality consistency. The authors note limitations in scope and suggest expanding the range of tasks and speaker groups in future work to further refine the evaluation of speech-aware large language models. More details and the full paper are available via the linked arXiv entry.
