Apple study finds artificial intelligence reasoning models collapse on complex puzzles

Apple researchers report that leading artificial intelligence reasoning models excel at some tasks but abruptly fail on classic logic puzzles once complexity crosses a model-specific threshold, raising fresh questions about how much these systems really 'think.'

Apple researchers have released new artificial intelligence research showing that large reasoning models, or LRMs, such as OpenAI o1 and o3, DeepSeek R1, Claude 3.7 Sonnet Thinking, and Google Gemini Flash Thinking, can completely collapse when confronted with increasingly complex logic problems. The paper, published just before Apple's WWDC event and written by the same team that previously identified reasoning flaws in large language models, reports that while these models outperform conventional language models on medium-difficulty puzzles, they actually perform worse on simple puzzles and then abruptly give up on hard ones. The findings have been received as sobering for artificial general intelligence optimists and encouraging for skeptics, suggesting that current reasoning models may not be as cognitively capable as some marketing implies.

The researchers evaluated LRMs using classic logic puzzles, including the Tower of Hanoi, jumping checker pieces into empty spaces, river crossing problems involving items like a fox, a chicken, and a bag of grain, and block stacking tasks that must match specific configurations. These puzzles are familiar tools for testing human reasoning, since once the core strategy is understood, people can usually scale up to more complex variants by following the same logic with more discs, checkers, animals, or blocks. The paper reports that “results show that all reasoning models exhibit a similar pattern with respect to complexity: accuracy progressively declines as problem complexity increases until reaching complete collapse (zero accuracy) beyond a model-specific complexity threshold.” In the Tower of Hanoi experiments, Claude 3.7 Sonnet Thinking and DeepSeek R1 start to fail when a fifth disc is added, and even boosting compute does not rescue performance on harder puzzles.
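The scaling behind those experiments is easy to see in code: the Tower of Hanoi has a short recursive solution, and the optimal move count roughly doubles with each added disc. A minimal sketch in Python (function and peg names are my own, not the paper's):

```python
def hanoi(n, source, target, spare, moves=None):
    """Return the optimal move list for n discs from source to target."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)  # clear the top n-1 discs
    moves.append((source, target))              # move the largest disc
    hanoi(n - 1, spare, target, source, moves)  # restack the n-1 discs on it
    return moves

# The optimal solution takes 2**n - 1 moves, so each extra disc
# doubles the work: the jump from four to five discs (where the
# paper reports collapse) takes the solution from 15 to 31 moves.
for n in range(1, 6):
    print(n, len(hanoi(n, "A", "C", "B")))
```

The same structure holds for the other puzzles: the strategy is fixed, and only the length of its execution grows with problem size, which is exactly the scaling property the researchers exploit.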

Apple’s team also examined how much “thinking” effort the models expend as puzzles get harder, measured in reasoning tokens. They found that reasoning models initially increase their thinking tokens as complexity rises, but “upon approaching a critical threshold — which closely corresponds to their accuracy collapse point — models counterintuitively begin to reduce their reasoning effort despite increasing problem difficulty,” effectively giving up as tasks become more complex. Accuracy did not improve when researchers supplied the algorithmic solution in the prompt, meaning the models continued to fail even when all they had to do was follow explicit step-by-step instructions. Commenting on the work, artificial intelligence expert Gary Marcus argued that the study reinforces that large language models and reasoning variants are “no substitute for good well-specified conventional algorithms” and noted that many humans also struggle on harder versions of these puzzles. The article concludes that Apple’s findings should be treated as important but limited data within a broader research landscape, neither definitive proof that artificial intelligence progress is hollow nor that artificial general intelligence breakthroughs are imminent.
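The finding that an explicit algorithm in the prompt did not help is all the more striking given how compact such an algorithm is. For the Tower of Hanoi, the classic iterative formulation fits in a few lines: move i (1-indexed) goes from peg (i & (i - 1)) % 3 to peg ((i | (i - 1)) + 1) % 3. The sketch below (my own peg numbering, with a simulation to confirm the moves are legal) illustrates just how short the step-by-step procedure is:

```python
def hanoi_moves(n):
    """Iterative Tower of Hanoi via the classic bit trick.

    The full tower ends on peg 2 when n is odd and peg 1 when n is even.
    """
    return [((i & (i - 1)) % 3, ((i | (i - 1)) + 1) % 3)
            for i in range(1, 2 ** n)]

def is_legal(n, moves):
    """Simulate the moves; no larger disc may land on a smaller one."""
    pegs = [list(range(n, 0, -1)), [], []]  # disc n at the bottom of peg 0
    for frm, to in moves:
        if not pegs[frm]:
            return False
        disc = pegs[frm].pop()
        if pegs[to] and pegs[to][-1] < disc:
            return False
        pegs[to].append(disc)
    # exactly one peg holds the whole tower at the end
    return sorted(map(len, pegs)) == [0, 0, n]
```

Following such a procedure is pure bookkeeping, which is why the models' failure to execute it even when it was handed to them is taken as evidence against genuine algorithmic reasoning.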

Impact Score: 55

How Artificial Intelligence is reshaping financial services oversight

Financial services regulators are largely treating Artificial Intelligence as another technology governed by existing rules rather than building new securities-specific frameworks. History suggests that clearer expectations will emerge through examinations, enforcement, and supervisory guidance.

Nvidia faces gamer backlash over Artificial Intelligence shift

Nvidia is facing growing frustration from gamers as memory supply is steered toward data center chips and DLSS 5 becomes more central to game performance. The dispute highlights how far the company’s priorities have shifted toward enterprise Artificial Intelligence.

Executives see limited Artificial Intelligence productivity gains so far

Corporate enthusiasm around Artificial Intelligence has yet to translate into broad gains in employment or productivity, reviving comparisons to the long lag between early computing breakthroughs and measurable economic impact. Recent surveys and studies show mixed results, with strong expectations for future benefits but little consensus on present gains.

Nvidia skips a new GeForce generation as Artificial Intelligence chips dominate

Nvidia is set to go a year without a new GeForce GPU generation for the first time since the 1990s as memory shortages and higher margins in Artificial Intelligence hardware reshape the market. AMD and Intel are also struggling to capitalize because the same supply constraints are hitting gaming products across the industry.

Where GPU debt starts to break

Stress in GPU-backed infrastructure financing is emerging around deals that lack the structural protections seen in the strongest transactions. Oracle, the Abilene Stargate project, and older CoreWeave debt illustrate different ways residual risk can surface when contracts, collateral, and counterparties fall short.
