Apple research exposes reasoning model collapse on complex problems

A new Apple study finds that today's reasoning-focused Artificial Intelligence models fail catastrophically when faced with sufficiently complex logic puzzles.

Apple Machine Learning Research's latest preprint, "The Illusion of Thinking," scrutinizes how so-called reasoning models—dubbed Large Reasoning Models (LRMs)—handle logical problem solving. The team designed a controlled puzzle environment to bypass industry-standard, potentially misleading benchmarks. On simpler puzzles, the plain language models actually outperformed their reasoning-enhanced versions; only at intermediate complexity did the more advanced models briefly pull ahead of their standard counterparts.

The study reveals a stark limitation: as tasks turn truly challenging, both simple and advanced models experience a dramatic plunge in accuracy and effort. These models not only fail to produce correct answers but also demonstrate reduced output, effectively abandoning attempts to solve more complex logic puzzles. Even explicit guidance—providing the models with the precise algorithm necessary for a solution—did not overcome this barrier at high complexity levels.
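One of the puzzles used in the study is Tower of Hanoi, which illustrates why such environments allow controlled complexity scaling: a single parameter, the number of disks, determines the length of the optimal solution. The sketch below is an illustrative implementation of the standard recursive algorithm (the kind of explicit procedure the researchers supplied to the models), not code from the paper itself.

```python
def hanoi(n, src="A", dst="C", aux="B", moves=None):
    """Return the optimal move sequence for n-disk Tower of Hanoi."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, src, aux, dst, moves)   # clear the way for the largest disk
    moves.append((src, dst))             # move the largest disk directly
    hanoi(n - 1, aux, dst, src, moves)   # restack the smaller disks on top
    return moves

# Complexity grows exponentially with one knob: n disks need 2**n - 1 moves.
print(len(hanoi(3)))   # 7 moves
print(len(hanoi(10)))  # 1023 moves
```

Because the minimum solution length is known in closed form, a model's output can be checked move by move at any difficulty level, with no risk of benchmark contamination from training data.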

This work builds on previous warnings from the same research group, underscoring that current language models do not perform genuine reasoning but instead mimic patterns learned from their training data. The Apple team's critique is echoed by other researchers, notably Subbarao Kambhampati, who argues against equating intermediate token generation with actual thinking. The consensus: marketing claims and benchmarks that mask these weaknesses do little to change the reality that neural networks, however sophisticated, remain bounded by their training data and lack genuine reasoning capability when confronted with novel, truly difficult problems.


India’s top 5 Artificial Intelligence startups and what their LLMs do

Five homegrown companies are building domain-focused Artificial Intelligence models and platforms, from medical imaging diagnostics to sovereign multilingual large language models and enterprise conversational agents. The piece summarises each startup’s core focus, flagship models, and target use cases.

New subsea habitat and cloning pets

Vanguard will become the first new subsea habitat in nearly 40 years, hosting teams of scientists on the seabed for weeklong missions. The newsletter also examines recent high-profile pet cloning and debates over responsible uses of cloning technology.
