Apple has published new research challenging assumptions about the reasoning capabilities of large language models (LLMs) and large reasoning models (LRMs). The paper, "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity," details rigorously structured experiments using well-known puzzles such as the Tower of Hanoi and River Crossing challenges. These tasks were carefully chosen for their transparency, progressive difficulty, and lack of overlap with training datasets, ensuring a clear assessment of genuine reasoning ability rather than memorization or data leakage.
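To make the "progressive difficulty" point concrete: in Tower of Hanoi, a natural complexity knob is the number of disks, and the shortest solution grows exponentially, requiring 2^n - 1 moves for n disks. Below is a minimal illustrative sketch of the classic recursive solver (not Apple's evaluation code), showing how sharply the planning burden rises each time a disk is added.

```python
# Minimal Tower of Hanoi sketch: the classic recursion, used here only to
# illustrate how puzzle difficulty scales with disk count. Not the paper's
# evaluation harness.

def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Return the optimal move list for n disks; its length is 2**n - 1."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, src, dst, aux, moves)   # park the top n-1 disks on the spare peg
    moves.append((src, dst))             # move the largest disk to the target peg
    hanoi(n - 1, aux, src, dst, moves)   # restack the n-1 disks on top of it
    return moves

if __name__ == "__main__":
    for n in (3, 7, 10, 15):
        print(f"{n:2d} disks -> {len(hanoi(n)):6d} moves")  # 7, 127, 1023, 32767
```

Each added disk roughly doubles the length of a correct solution, which is what makes such puzzles a clean, controllable dial for locating the point at which a model's reasoning breaks down.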
In testing a variety of high-profile models, including OpenAI's o1 and o3, Google's Gemini Thinking, Anthropic's Claude 3.7 Sonnet, and DeepSeek-R1, Apple found that these models perform well on straightforward and moderately complex problems. Once problem complexity passes a critical threshold, however, accuracy drops precipitously, often to near zero. The research identifies three distinct regimes: on easy tasks, standard LLMs sometimes outperform advanced LRMs; at medium difficulty, LRMs take the lead; at high complexity, both model types collapse and exhibit strikingly similar limitations. Counterintuitively, as tasks grow more intricate, LRMs reduce the reasoning effort they expend even though compute budget remains available, suggesting a lack of meta-reasoning and adaptability in allocating computational effort, a limitation that could hinder progress on demanding real-world challenges.
The study goes further, showing that even when LRMs are given an explicit, step-by-step algorithm, their performance falters at the same complexity boundaries. This pattern indicates that current models rely predominantly on sophisticated pattern matching rather than systematic logical reasoning. On simpler tasks, moreover, LRMs often keep exploring after they have already reached a correct solution, hinting at inefficiencies in their reasoning mechanisms. Apple's findings upend the widely held belief that scaling models and training data alone will produce true reasoning; instead, they suggest that current advances are superficial and point to a need for fundamentally new architectures and hybrid systems, possibly integrating external memory or symbolic engines, to achieve more human-like problem-solving.
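One way to picture why handing a model the algorithm does not solve the problem: the model must still emit a long, internally consistent move sequence, and a single illegal move invalidates the whole trace. The sketch below is a plausible stand-in for the kind of puzzle simulator used to grade such traces; it is an illustrative assumption, not the authors' actual harness.

```python
# Hypothetical grading sketch: checks whether a proposed move sequence legally
# solves n-disk Tower of Hanoi. Illustrative stand-in for a puzzle simulator,
# not Apple's actual evaluation code.

def is_valid_solution(n, moves):
    """True if `moves` (a list of (src, dst) peg labels) solves n-disk Hanoi."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # peg A holds disks n..1
    for src, dst in moves:
        if not pegs[src]:
            return False                       # illegal: moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                       # illegal: larger disk onto a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))  # solved only if all disks end on peg C

if __name__ == "__main__":
    print(is_valid_solution(2, [("A", "B"), ("A", "C"), ("B", "C")]))  # True
    print(is_valid_solution(2, [("A", "C"), ("A", "C")]))              # False
```

Under this kind of all-or-nothing check, knowing the recipe is not the same as executing it: the failure mode the paper describes is that models stop producing consistent long traces past a complexity threshold, even with the recipe in the prompt.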
The implications are significant for the technology industry and for artificial intelligence research as a whole. Apple's work cautions organizations against deploying LLMs in high-stakes settings without robust human oversight and urges the field to prioritize transparency, interpretability, and new paradigms over brute-force model scaling. Ultimately, the research reframes the pursuit of Artificial General Intelligence, advocating an approach that blends neural and symbolic reasoning to move toward the elusive goal of true machine understanding.