Apple research exposes critical limits in large language model reasoning

June 14, 2025

Apple's new study uncovers why large language models may only be mimicking reasoning, and what this means for the future of Artificial Intelligence.

Apple has published new research challenging assumptions about the reasoning capabilities of large language models (LLMs) and large reasoning models (LRMs). The paper, 'The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity,' describes structured experiments built on well-known puzzles such as the Tower of Hanoi and River Crossing. These tasks were chosen for their transparency, their progressively adjustable difficulty, and their lack of overlap with training datasets, so that results reflect genuine reasoning ability rather than memorization or data leakage.
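To illustrate why a puzzle like the Tower of Hanoi offers transparent, progressively harder test cases, here is a minimal Python sketch. It is not taken from Apple's paper; the function names `hanoi_moves` and `is_valid_solution` and the peg labels are assumptions made for this example. It generates the shortest solution for n disks, which takes 2^n - 1 moves, and checks any proposed move list by simulating it step by step.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Return the shortest move list for n disks as (disk, from_peg, to_peg) tuples."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)    # clear the n-1 smaller disks
        + [(n, source, target)]                      # move the largest disk
        + hanoi_moves(n - 1, spare, target, source)  # restack the smaller disks
    )


def is_valid_solution(n, moves):
    """Simulate the moves; every step must be legal and all disks must end on peg C."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for disk, src, dst in moves:
        # The disk being moved must be on top of its source peg.
        if not pegs[src] or pegs[src][-1] != disk:
            return False
        # A larger disk may never be placed on a smaller one.
        if pegs[dst] and pegs[dst][-1] < disk:
            return False
        pegs[dst].append(pegs[src].pop())
    return pegs["C"] == list(range(n, 0, -1))


if __name__ == "__main__":
    for n in (3, 5, 10):
        moves = hanoi_moves(n)
        print(n, len(moves), is_valid_solution(n, moves))
```

Each added disk roughly doubles the length of the shortest solution, so difficulty can be raised one step at a time, and because every move can be checked mechanically, a model's answer can be scored without ambiguity.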

In testing a range of high-profile models, including OpenAI's o1 and o3, Google's Gemini Thinking, Anthropic's Claude 3.7 Sonnet, and DeepSeek-R1, Apple found that these models perform well on straightforward and moderately complex problems. However, once problem complexity passes a critical threshold, accuracy drops precipitously, often to near zero. The research identifies three distinct regimes: on easy tasks, standard LLMs sometimes outperform advanced LRMs; on medium-difficulty tasks, LRMs take the lead; and on high-complexity tasks, both model types collapse and show strikingly similar limitations. Surprisingly, as tasks grow more intricate, LRMs actually reduce the effort they expend despite having resources available. This suggests a lack of meta-reasoning and of adaptability in allocating computational focus, a weakness that could hinder progress on demanding real-world challenges.

The study goes further, revealing that even when LRMs are given explicit, step-by-step algorithms, their reasoning falters at the same complexity boundaries. This pattern indicates that current models rely predominantly on advanced pattern matching rather than authentic, systematic logical reasoning. On simpler tasks, LRMs also often keep exploring after they have reached a solution, which points to inefficiencies in their reasoning mechanisms. Apple's findings challenge the widely held belief that scaling models and training data alone will produce true reasoning. Instead, they suggest that current advances are superficial and highlight an urgent need for fundamentally new architectures and hybrid systems, possibly integrating external memory or symbolic engines, to achieve more human-like problem-solving.

The implications are significant for the technology industry and artificial intelligence research as a whole. Apple's work cautions organizations against deploying LLMs in high-stakes settings without robust human oversight and urges the field to prioritize transparency, interpretability, and new paradigms over brute-force model scaling. Ultimately, the research reframes the pursuit of Artificial General Intelligence, advocating for a fresh approach that blends neural and symbolic reasoning to approach the elusive goal of true machine understanding.


