Apple study finds artificial intelligence reasoning models collapse on complex puzzles

Apple researchers report that leading artificial intelligence reasoning models excel at some tasks but abruptly fail on classic logic puzzles once complexity crosses a model-specific threshold, raising fresh questions about how much these systems really 'think.'

Apple researchers have released new artificial intelligence research showing that large reasoning models, or LRMs, such as OpenAI o1 and o3, DeepSeek R1, Claude 3.7 Sonnet Thinking, and Google Gemini Flash Thinking, can completely collapse when confronted with increasingly complex logic problems. The paper, published just before Apple's WWDC event and written by the same team that previously identified reasoning flaws in large language models, reports that while these models outperform conventional language models on medium-difficulty puzzles, they actually perform worse on simple puzzles and then abruptly give up on hard ones. The findings have been received as sobering for artificial general intelligence optimists and encouraging for skeptics, suggesting that current reasoning models may not be as cognitively capable as some marketing implies.

The researchers evaluated LRMs using classic logic puzzles, including the Tower of Hanoi, jumping checker pieces into empty spaces, river crossing problems involving items like a fox, a chicken, and a bag of grain, and block stacking tasks that must match specific configurations. These puzzles are familiar tools for testing human reasoning, since once the core strategy is understood, people can usually scale up to more complex variants by following the same logic with more discs, checkers, animals, or blocks. The paper reports that “results show that all reasoning models exhibit a similar pattern with respect to complexity: accuracy progressively declines as problem complexity increases until reaching complete collapse (zero accuracy) beyond a model-specific complexity threshold.” In the Tower of Hanoi experiments, Claude 3.7 Sonnet Thinking and DeepSeek R1 start to fail when a fifth disc is added, and even boosting compute does not rescue performance on harder puzzles.
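The scaling behind those experiments is easy to see in code: the Tower of Hanoi has a short recursive solution, and the optimal move count roughly doubles with each added disc. A minimal sketch in Python (function and peg names are my own, not the paper's):

```python
def hanoi(n, source, target, spare, moves=None):
    """Return the optimal move list for n discs from source to target."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)  # clear the top n-1 discs
    moves.append((source, target))              # move the largest disc
    hanoi(n - 1, spare, target, source, moves)  # restack the n-1 discs on it
    return moves

# The optimal solution takes 2**n - 1 moves, so each extra disc
# doubles the work: the jump from four to five discs (where the
# paper reports collapse) takes the solution from 15 to 31 moves.
for n in range(1, 6):
    print(n, len(hanoi(n, "A", "C", "B")))
```

The same structure holds for the other puzzles: the strategy is fixed, and only the length of its execution grows with problem size, which is exactly the scaling property the researchers exploit.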

Apple’s team also examined how much “thinking” effort the models expend as puzzles get harder, measured in reasoning tokens. They found that reasoning models initially increase their thinking tokens as complexity rises, but “upon approaching a critical threshold — which closely corresponds to their accuracy collapse point — models counterintuitively begin to reduce their reasoning effort despite increasing problem difficulty,” effectively giving up as tasks become more complex. Accuracy did not improve when researchers supplied the algorithmic solution in the prompt, meaning the models continued to fail even when all they had to do was follow explicit step-by-step instructions. Commenting on the work, artificial intelligence expert Gary Marcus argued that the study reinforces that large language models and reasoning variants are “no substitute for good well-specified conventional algorithms” and noted that many humans also struggle on harder versions of these puzzles. The article concludes that Apple’s findings should be treated as important but limited data within a broader research landscape, neither definitive proof that artificial intelligence progress is hollow nor that artificial general intelligence breakthroughs are imminent.
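The finding that an explicit algorithm in the prompt did not help is all the more striking given how compact such an algorithm is. For the Tower of Hanoi, the classic iterative formulation fits in a few lines: move i (1-indexed) goes from peg (i & (i - 1)) % 3 to peg ((i | (i - 1)) + 1) % 3. The sketch below (my own peg numbering, with a simulation to confirm the moves are legal) illustrates just how short the step-by-step procedure is:

```python
def hanoi_moves(n):
    """Iterative Tower of Hanoi via the classic bit trick.

    The full tower ends on peg 2 when n is odd and peg 1 when n is even.
    """
    return [((i & (i - 1)) % 3, ((i | (i - 1)) + 1) % 3)
            for i in range(1, 2 ** n)]

def is_legal(n, moves):
    """Simulate the moves; no larger disc may land on a smaller one."""
    pegs = [list(range(n, 0, -1)), [], []]  # disc n at the bottom of peg 0
    for frm, to in moves:
        if not pegs[frm]:
            return False
        disc = pegs[frm].pop()
        if pegs[to] and pegs[to][-1] < disc:
            return False
        pegs[to].append(disc)
    # exactly one peg holds the whole tower at the end
    return sorted(map(len, pegs)) == [0, 0, n]
```

Following such a procedure is pure bookkeeping, which is why the models' failure to execute it even when it was handed to them is taken as evidence against genuine algorithmic reasoning.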

Impact Score: 55

How Artificial Intelligence is reshaping financial services oversight

Financial services regulators are largely treating Artificial Intelligence as another technology governed by existing rules rather than building new securities-specific frameworks. History suggests that clearer expectations will emerge through examinations, enforcement, and supervisory guidance.

Nvidia faces gamer backlash over Artificial Intelligence shift

Nvidia is facing growing frustration from gamers as memory supply is steered toward data center chips and DLSS 5 becomes more central to game performance. The dispute highlights how far the company’s priorities have shifted toward enterprise Artificial Intelligence.

Executives see limited Artificial Intelligence productivity gains so far

Corporate enthusiasm around Artificial Intelligence has yet to translate into broad gains in employment or productivity, reviving comparisons to the long lag between early computing breakthroughs and measurable economic impact. Recent surveys and studies show mixed results, with strong expectations for future benefits but little consensus on present gains.

Nvidia skips a new GeForce generation as Artificial Intelligence chips dominate

Nvidia is set to go a year without a new GeForce GPU generation for the first time since the 1990s as memory shortages and higher margins in Artificial Intelligence hardware reshape the market. AMD and Intel are also struggling to capitalize because the same supply constraints are hitting gaming products across the industry.

Where GPU debt starts to break

Stress in GPU-backed infrastructure financing is emerging around deals that lack the structural protections seen in the strongest transactions. Oracle, the Abilene Stargate project, and older CoreWeave debt illustrate different ways residual risk can surface when contracts, collateral, and counterparties fall short.
