New ABench-Physics benchmark exposes limits of LLMs in physical reasoning

A new benchmark, ABench-Physics, reveals major shortcomings in large language models when tackling challenging, dynamic physics problems, emphasizing the need for advanced evaluation tools in artificial intelligence.

Large language models (LLMs) are increasingly utilized for complex reasoning across disciplines, yet their true proficiency in physical reasoning remains uncertain. Addressing this gap, researchers from Zhejiang University and Ant Group have introduced ABench-Physics, a rigorous evaluation framework that diagnoses how well LLMs grasp and apply fundamental physics principles to high-difficulty, dynamic scenarios. Comprising graduate and Olympiad-level problems, ABench-Physics challenges models to deliver precise numerical solutions and adapt to variations rather than relying on memorized patterns.

The benchmark consists of two main components. Phy_A is a static set of 400 graduate- and Olympiad-level physics questions that demand exact numerical answers in a strict format, providing a harder alternative to common multiple-choice datasets. Phy_B adds an automatic variation engine that generates unlimited variants of 100 base problems by altering parameters and conditions, as sketched below. This design tests whether LLMs can generalize their knowledge to novel scenarios, which is crucial for preventing inflated scores caused by prior exposure during a model's training. Together, the high problem difficulty and the dynamic engine probe core reasoning ability, uncovering significant weaknesses that static or simplistic question sets have not exposed.
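The paper's variation engine is not reproduced here, but the underlying idea can be sketched: a base problem is stored as a parameterized template, and each evaluation run samples fresh parameter values while recomputing the reference answer, so memorized numbers from training data cannot help. The Python snippet below is a minimal, hypothetical illustration; the function name `make_projectile_variant` and the specific problem template are assumptions, not the benchmark's actual items.

```python
import math
import random

def make_projectile_variant(rng: random.Random) -> dict:
    """Hypothetical sketch: generate one variant of a projectile-range problem.

    Each call draws new parameter values and recomputes the reference answer,
    mirroring the idea of a dynamic variation engine (not ABench-Physics' code).
    """
    v0 = rng.uniform(10.0, 40.0)         # launch speed, m/s
    angle_deg = rng.uniform(20.0, 70.0)  # launch angle, degrees
    g = 9.81                             # gravitational acceleration, m/s^2

    theta = math.radians(angle_deg)
    answer = v0 ** 2 * math.sin(2 * theta) / g  # range on level ground, m

    prompt = (
        f"A projectile is launched at {v0:.1f} m/s at {angle_deg:.1f} degrees "
        f"above the horizontal on level ground. Ignoring air resistance, "
        f"what horizontal distance does it travel? Answer in meters."
    )
    return {"prompt": prompt, "answer": answer}

if __name__ == "__main__":
    rng = random.Random(0)
    variant = make_projectile_variant(rng)
    print(variant["prompt"])
    print(f"reference answer: {variant['answer']:.3f} m")
```

Because the reference answer is recomputed for every sampled variant, a model that has merely memorized a canonical version of the problem gains nothing from that exposure.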

Testing current state-of-the-art LLMs with ABench-Physics reveals notable gaps, especially on the dynamic variants. While many models excel at mathematics and code, they often falter when required to flexibly apply physical concepts outside familiar problem structures. The evaluation's strict tolerance criteria and emphasis on exact output mirror authentic scientific practice and further surface the models' struggles with both precision and conceptual understanding. These findings suggest that LLMs, as currently trained, favor rote memorization and pattern matching over genuine generalization and can be tripped up by minor changes in problem context.
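The benchmark's exact tolerance thresholds are not reproduced here, but grading a strict numerical answer typically reduces to a relative-error check like the hedged sketch below; the 1% `rel_tol` default is an assumption for illustration, not the benchmark's published criterion.

```python
def grade_numeric(model_answer: str, reference: float, rel_tol: float = 1e-2) -> bool:
    """Illustrative grader: accept a numeric answer within a relative tolerance.

    Assumed 1% tolerance; malformed output fails the strict-format requirement.
    """
    try:
        value = float(model_answer.strip())
    except ValueError:
        return False
    return abs(value - reference) <= rel_tol * abs(reference)

# Example: with a 1% relative tolerance, 41.9 passes against a reference of 42.0,
# while non-numeric output fails outright.
assert grade_numeric("41.9", 42.0) is True
assert grade_numeric("forty-two", 42.0) is False
```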

The development of ABench-Physics reflects a growing trend towards more sophisticated, reality-aligned benchmarks such as UgPhysics, SciBench, and PuzzleBench, each contributing to nuanced assessments of analytical skill and reasoning depth. Looking forward, researchers highlight the importance of integrating multimodal data, expanding dynamic problem sets, and refining metrics to differentiate true reasoning from learned responses. By pinpointing current models' shortcomings and guiding future improvements, these frameworks represent a substantial step towards more robust, scientifically capable artificial intelligence systems.
