C3T tests whether speech-aware large language models preserve understanding in artificial intelligence speech interfaces

Researchers from Adam Mickiewicz University and Samsung R&D Institute Poland introduced C3T, the Cross-modal Capabilities Conservation Test, to measure whether large language models preserve language understanding when accessed via speech. The benchmark uses voice cloning to generate diverse speakers and quantifies fairness and robustness across text and speech modalities.

Researchers led by Marek Kubis, Paweł Skórzewski and Iwona Christop from Adam Mickiewicz University, together with Mateusz Czyżnikiewicz, Jakub Kubiak and Łukasz Bondaruk from Samsung R&D Institute Poland, present C3T, the Cross-modal Capabilities Conservation Test. The benchmark is designed to assess whether large language models retain the same level of language understanding when accessed through speech as they do with text. Rather than focusing on speech recognition, C3T adapts textual language understanding tasks into audio format and evaluates models across both modalities to reveal any loss of comprehension introduced by speech input.

C3T uses automated selection and filtering of textual tasks to ensure they are plausible for voice interaction, then converts them to audio via a voice cloning text-to-speech pipeline. The authors generate diverse speaker voices, drawing on datasets such as GLOBE, to test how models perform across accents, genders and named identities. Fairness is quantified by aggregating worst-case outcomes among substituted lexical indicators for demographic groups, while robustness is measured by consistency between text and speech performance. This methodology allows direct comparison of raw accuracy as well as per-group and worst-case behaviour that plain accuracy metrics can mask.
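As a rough illustration of how such metrics can be computed, the sketch below derives per-group speech accuracy, a worst-case fairness score, and a text-speech consistency score from per-example results. The record layout, field names and the `c3t_scores` function are hypothetical conveniences for this example, not the authors' code or exact formulas.

```python
# Hypothetical sketch of C3T-style fairness and robustness scoring.
# Assumes per-example records with: correctness on the text version of a
# task, correctness on the same item rendered as speech, and the
# demographic group of the cloned speaker voice.
from collections import defaultdict

def c3t_scores(records):
    """records: iterable of dicts with keys
    'text_correct' (bool), 'speech_correct' (bool), 'group' (str)."""
    by_group = defaultdict(list)
    consistent = 0
    for r in records:
        by_group[r["group"]].append(r["speech_correct"])
        # Robustness counts items where text and speech outcomes agree.
        if r["text_correct"] == r["speech_correct"]:
            consistent += 1
    n = sum(len(v) for v in by_group.values())
    # Per-group accuracy on speech input.
    group_acc = {g: sum(v) / len(v) for g, v in by_group.items()}
    # Fairness: worst-case accuracy across demographic groups, so one
    # badly served speaker group drags the score down even if the
    # overall average looks healthy.
    fairness = min(group_acc.values())
    # Robustness: cross-modal consistency between text and speech.
    robustness = consistent / n
    return group_acc, fairness, robustness

# Toy usage with three hypothetical speaker groups.
records = [
    {"text_correct": True,  "speech_correct": True,  "group": "accent_a"},
    {"text_correct": True,  "speech_correct": False, "group": "accent_b"},
    {"text_correct": False, "speech_correct": False, "group": "accent_a"},
    {"text_correct": True,  "speech_correct": True,  "group": "accent_c"},
]
group_acc, fairness, robustness = c3t_scores(records)
print(group_acc, fairness, robustness)
```

Taking the minimum over groups rather than the mean is what lets this kind of score surface the failure mode described above: a model that answers correctly for most speakers but fails for one demographic voice scores poorly on fairness even when its raw accuracy looks strong.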

Experimental results reported by the team show that even high-performing models can exhibit noticeable drops and inconsistencies when switching from text to speech input. The benchmark highlights cases where a model answers correctly on text but fails on speech for particular demographic voices, demonstrating that fair performance across speakers does not guarantee modality consistency. The authors note limitations in scope and suggest expanding task ranges and speaker groups in future work to further refine the evaluation of speech-aware large language models. More details and the full paper are available via the linked arXiv entry.

Impact Score: 50

Kazakhstan expands artificial intelligence network

Kazakhstan is scaling its artificial intelligence infrastructure and services with a new national supercomputing center, a publicly available Kazakh language model, and an e-government platform powered by more than 100 artificial intelligence agents. The country is also investing in training programs and a school curriculum to build long-term capacity.
