Apple Machine Learning Research's latest preprint, "The Illusion of Thinking," scrutinizes how so-called reasoning models, dubbed Large Reasoning Models (LRMs), handle logical problem solving. The team designed a controlled puzzle environment to sidestep industry-standard benchmarks, which they argue can be misleading. On simpler puzzles, plain language models outperformed their reasoning-enhanced counterparts; as puzzle complexity increased, the reasoning models briefly pulled ahead of the standard ones.
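One of the paper's puzzles is the Tower of Hanoi, where difficulty scales cleanly with the number of disks and every proposed solution can be checked exactly rather than graded against a benchmark. A minimal sketch of such a complexity-controlled setup (the class and function names here are illustrative choices, not taken from Apple's code) might look like this:

```python
# Illustrative sketch of a complexity-controlled puzzle environment,
# loosely modeled on the Tower of Hanoi task used in the paper.
# HanoiEnv and verify are hypothetical names, not from the paper's code.

def min_moves(n_disks: int) -> int:
    """Optimal solution length grows exponentially: 2^n - 1 moves."""
    return 2 ** n_disks - 1

class HanoiEnv:
    """Tower of Hanoi with n disks; complexity is a single tunable knob."""

    def __init__(self, n_disks: int):
        self.n_disks = n_disks
        # Peg 0 holds all disks, largest (n) at the bottom, smallest (1) on top.
        self.pegs = [list(range(n_disks, 0, -1)), [], []]

    def apply(self, src: int, dst: int) -> bool:
        """Apply one move; return False if it violates the rules."""
        if not self.pegs[src]:
            return False
        disk = self.pegs[src][-1]
        if self.pegs[dst] and self.pegs[dst][-1] < disk:
            return False  # cannot place a larger disk on a smaller one
        self.pegs[dst].append(self.pegs[src].pop())
        return True

    def solved(self) -> bool:
        return len(self.pegs[2]) == self.n_disks

def verify(n_disks: int, moves: list[tuple[int, int]]) -> bool:
    """Exact, benchmark-free check of a model-proposed move sequence."""
    env = HanoiEnv(n_disks)
    return all(env.apply(s, d) for s, d in moves) and env.solved()

# Example: the 1-disk puzzle is solved by a single move from peg 0 to peg 2.
assert verify(1, [(0, 2)])
print(min_moves(10))  # 1023 moves needed at 10 disks; complexity scales fast
```

Because the verifier is exact, correctness can be measured directly as the number of disks grows, with no risk of training-set contamination.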
The study reveals a stark limitation: as tasks become truly hard, both standard and reasoning models suffer a steep drop in accuracy and, notably, in effort. Not only do they fail to produce correct answers, they also spend fewer tokens on reasoning, effectively abandoning the harder puzzles. Even explicit guidance, in which the models were handed the precise algorithm needed for a solution, did not overcome this barrier at high complexity.
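For Tower of Hanoi, the guidance in question is the textbook recursive procedure, sketched below (the function name and move format are my own choices). The point of the experiment is that faithfully executing a known recipe is exactly what breaks down at scale.

```python
# The standard recursive Tower of Hanoi procedure -- a textbook sketch of the
# kind of explicit algorithm the paper reports supplying to the models.

def hanoi_moves(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> list[tuple[int, int]]:
    """Optimal move list for n disks: move n-1 disks aside,
    move the largest disk, then move the n-1 disks back on top of it."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux)
            + [(src, dst)]
            + hanoi_moves(n - 1, aux, src, dst))

# Following this recipe mechanically always succeeds, yet the paper finds that
# models given the recipe still collapse once the disk count grows large.
print(len(hanoi_moves(8)))  # 255 moves, i.e. 2^8 - 1
```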
This work builds on previous warnings from the same research group, underscoring that current language models do not perform genuine reasoning but instead mimic patterns learned from their training data. The Apple team's critique is echoed by other researchers, notably Subbarao Kambhampati, who argues against equating intermediate token generation with actual thinking. The consensus: marketing claims and benchmarks that mask these weaknesses do little to change the harsh reality that neural networks, however sophisticated, remain bounded by their training data and lack authentic reasoning capability when confronted with new, truly difficult problems.