Apple study finds artificial intelligence reasoning models collapse on complex puzzles

Apple researchers report that leading artificial intelligence reasoning models excel at some tasks but abruptly fail on classic logic puzzles once complexity crosses a model-specific threshold, raising fresh questions about how much these systems really 'think.'

Apple researchers have released new artificial intelligence research showing that large reasoning models, or LRMs, such as OpenAI o1 and o3, DeepSeek R1, Claude 3.7 Sonnet Thinking, and Google Gemini Flash Thinking, can completely collapse when confronted with increasingly complex logic problems. The paper, published just before Apple’s WWDC event and written by the same team that previously identified reasoning flaws in large language models, reports that while these models outperform conventional language models on medium-difficulty puzzles, they actually perform worse on simple puzzles and then abruptly give up on hard ones. The findings have been received as sobering for artificial general intelligence optimists and encouraging for skeptics, suggesting that current reasoning models may not be as cognitively capable as some marketing implies.

The researchers evaluated LRMs using classic logic puzzles, including the Tower of Hanoi, jumping checker pieces into empty spaces, river-crossing problems involving items like a fox, a chicken, and a bag of grain, and block-stacking tasks that must match specific configurations. These puzzles are familiar tools for testing human reasoning: once the core strategy is understood, people can usually scale up to more complex variants by following the same logic with more discs, checkers, animals, or blocks. The paper reports that “results show that all reasoning models exhibit a similar pattern with respect to complexity: accuracy progressively declines as problem complexity increases until reaching complete collapse (zero accuracy) beyond a model-specific complexity threshold.” In the Tower of Hanoi experiments, Claude 3.7 Sonnet Thinking and DeepSeek R1 start to fail when a fifth disc is added, and even boosting compute does not rescue performance on harder puzzles.
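
To make the complexity scaling concrete, here is a minimal Python sketch, not taken from Apple’s paper (the function name hanoi_moves and the peg labels are illustrative assumptions), of the classic recursive Tower of Hanoi solution. The optimal solution for n discs takes 2**n - 1 moves, so each added disc roughly doubles the length of the move sequence a model must produce, which is one concrete way to see why a fifth disc marks such a jump in difficulty.

# Minimal illustrative sketch (not from Apple's paper): the classic recursive
# Tower of Hanoi solution, showing how the optimal move count grows as 2**n - 1.
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Return the optimal move sequence for n discs as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)    # move n-1 discs out of the way
        + [(source, target)]                         # move the largest disc
        + hanoi_moves(n - 1, spare, target, source)  # move n-1 discs back on top
    )

if __name__ == "__main__":
    for n in range(3, 8):
        print(f"{n} discs: {len(hanoi_moves(n))} moves")  # prints 7, 15, 31, 63, 127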

Apple’s team also examined how much “thinking” effort the models expend as puzzles get harder, measured in reasoning tokens. They found that reasoning models initially increase their reasoning tokens as complexity rises, but “upon approaching a critical threshold — which closely corresponds to their accuracy collapse point — models counterintuitively begin to reduce their reasoning effort despite increasing problem difficulty,” effectively giving up as tasks become more complex. Accuracy did not improve when researchers supplied the algorithmic solution in the prompt, meaning the models continued to fail even when all they had to do was follow explicit step-by-step instructions. Commenting on the work, artificial intelligence expert Gary Marcus argued that the study reinforces that large language models and reasoning variants are “no substitute for good well-specified conventional algorithms” and noted that many humans also struggle on harder versions of these puzzles. The article concludes that Apple’s findings should be treated as important but limited data within a broader research landscape, neither definitive proof that artificial intelligence progress is hollow nor evidence that artificial general intelligence breakthroughs are imminent.
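
As an illustration of how accuracy on these puzzles can be scored mechanically, independent of how many reasoning tokens a model spends, the following hypothetical Python sketch (the function replay_hanoi and its interface are assumptions, not the paper’s actual evaluation harness) replays a model-proposed move list against a Tower of Hanoi simulator and accepts the instance only if every move is legal and the goal state is reached.

# Hypothetical evaluation sketch (not the paper's actual harness): score a
# proposed Tower of Hanoi move list by replaying it in a simulator. An instance
# counts as solved only if every move is legal and the final state matches the
# goal, which is how "zero accuracy" collapse can be measured mechanically.
def replay_hanoi(moves, n_discs, pegs=("A", "B", "C"), target="C"):
    """Return True if `moves` (a list of (from_peg, to_peg) pairs) legally solves the puzzle."""
    state = {p: [] for p in pegs}
    state[pegs[0]] = list(range(n_discs, 0, -1))   # largest disc at the bottom
    for src, dst in moves:
        if not state[src]:
            return False                            # illegal: source peg is empty
        disc = state[src][-1]
        if state[dst] and state[dst][-1] < disc:
            return False                            # illegal: larger disc on smaller
        state[dst].append(state[src].pop())
    return state[target] == list(range(n_discs, 0, -1))

if __name__ == "__main__":
    # The optimal 3-disc solution (7 moves), written out explicitly for the demo.
    moves_3 = [("A", "C"), ("A", "B"), ("C", "B"), ("A", "C"),
               ("B", "A"), ("B", "C"), ("A", "C")]
    print(replay_hanoi(moves_3, 3))  # True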
