New ABench-Physics benchmark exposes limits of LLMs in physical reasoning

A new benchmark, ABench-Physics, reveals major shortcomings in large language models when tackling challenging, dynamic physics problems, emphasizing the need for advanced evaluation tools in artificial intelligence.

Large language models (LLMs) are increasingly utilized for complex reasoning across disciplines, yet their true proficiency in physical reasoning remains uncertain. Addressing this gap, researchers from Zhejiang University and Ant Group have introduced ABench-Physics, a rigorous evaluation framework that diagnoses how well LLMs grasp and apply fundamental physics principles to high-difficulty, dynamic scenarios. Comprising graduate and Olympiad-level problems, ABench-Physics challenges models to deliver precise numerical solutions and adapt to variations rather than relying on memorized patterns.

The benchmark consists of two main components. Phy_A is a static set of 400 graduate- or Olympiad-level physics questions that demand exact numerical answers in a strict format, providing a harder alternative to common multiple-choice datasets. Phy_B goes further with an automatic variation engine that generates new variants of 100 dynamic problems by altering their parameters and conditions. This approach checks whether LLMs can generalize knowledge and tackle novel scenarios, which is crucial for preventing scores inflated by prior exposure to the problems during the model's training. Together, the high problem difficulty and the dynamic engine probe core reasoning ability, uncovering significant weaknesses not previously exposed by static or simplistic question sets.
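To make the idea of a variation engine concrete, here is a minimal Python sketch of how a single problem template could be re-parameterized and graded numerically. It is an illustration only, not the ABench-Physics implementation: the projectile template, the function names `make_variant` and `is_correct`, the parameter ranges, and the 1% relative tolerance are all assumptions chosen for the example.

```python
import math
import random

G = 9.81  # gravitational acceleration in m/s^2 (assumed constant for this sketch)


def make_variant(rng):
    """Build one variant of a hypothetical projectile-range problem.

    Returns the question text and the reference answer in metres.
    """
    v0 = rng.uniform(5.0, 50.0)          # launch speed in m/s, redrawn per variant
    theta_deg = rng.uniform(15.0, 75.0)  # launch angle in degrees, redrawn per variant
    # Range on level ground without air resistance: R = v0^2 * sin(2*theta) / g
    answer = v0 ** 2 * math.sin(2.0 * math.radians(theta_deg)) / G
    question = (
        f"A projectile is launched at {v0:.2f} m/s at {theta_deg:.1f} degrees "
        "above the horizontal on level ground. Ignoring air resistance, "
        "what is its range in metres?"
    )
    return question, answer


def is_correct(predicted, reference, rel_tol=0.01):
    """Accept a numerical answer only if it lies within a relative tolerance."""
    return math.isclose(predicted, reference, rel_tol=rel_tol)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the variants are reproducible
    for _ in range(3):
        question, reference = make_variant(rng)
        print(question)
        print(f"  reference answer: {reference:.3f} m")
```

Because every variant draws fresh parameters, a model that has memorized the answer to one instance gains nothing on the next, while the tolerance check rejects answers that are only approximately right.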

Testing current state-of-the-art LLMs with ABench-Physics reveals notable gaps, especially when faced with dynamic variants. While many models excel at mathematics and code, they often flounder when required to flexibly apply physical concepts outside familiar problem structures. The evaluation's strict tolerance criteria and emphasis on accurate output mirror authentic scientific challenges, further surfacing the models' struggles with both precision and conceptual understanding. These findings suggest that LLMs, as currently trained, favor rote memorization and pattern matching over genuine generalization and can be easily tripped up by minor changes in problem context.

The development of ABench-Physics reflects a growing trend towards more sophisticated, reality-aligned benchmarks such as UgPhysics, SciBench, and PuzzleBench, each contributing to nuanced assessments of analytical skill and reasoning depth. Looking forward, researchers highlight the importance of integrating multimodal data, expanding dynamic problem sets, and refining metrics to differentiate true reasoning from learned responses. By pinpointing current models' shortcomings and guiding future improvements, these frameworks represent a substantial step towards more robust, scientifically capable artificial intelligence systems.


