Apple researchers have released new artificial intelligence research showing that large reasoning models, or LRMs, such as OpenAI o1 and o3, DeepSeek R1, Claude 3.7 Sonnet Thinking, and Google Gemini Flash Thinking, can collapse completely when confronted with increasingly complex logic problems. The paper, published just before Apple’s WWDC event and written by the same team that previously identified reasoning flaws in large language models, reports that while these models outperform conventional language models on medium-difficulty puzzles, they actually perform worse than standard models on simple puzzles and then abruptly give up on hard ones. The findings have been received as sobering for artificial general intelligence optimists and encouraging for skeptics, suggesting that current reasoning models may not be as cognitively capable as some marketing implies.
The researchers evaluated LRMs using classic logic puzzles, including the Tower of Hanoi, checker-jumping puzzles in which pieces hop into empty spaces, river crossing problems involving items such as a fox, a chicken, and a bag of grain, and block stacking tasks that must match specific configurations. These puzzles are familiar tools for testing human reasoning: once the core strategy is understood, people can usually scale up to more complex variants by applying the same logic to more discs, checkers, animals, or blocks. The paper reports that “results show that all reasoning models exhibit a similar pattern with respect to complexity: accuracy progressively declines as problem complexity increases until reaching complete collapse (zero accuracy) beyond a model specific complexity threshold.” In the Tower of Hanoi experiments, Claude 3.7 Sonnet Thinking and DeepSeek R1 start to fail when a fifth disc is added, and even boosting compute does not rescue performance on harder puzzles.
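To make the scaling point concrete, here is a minimal sketch, not taken from Apple’s paper, of the textbook recursive Tower of Hanoi solution in Python. The strategy is identical at every puzzle size; only the number of moves grows, as 2^n - 1 for n discs, which is why a solver that has genuinely internalised the rule should handle five discs as readily as three.

```python
# Textbook recursive Tower of Hanoi (illustrative sketch, not from the study).
# The same three-step rule applies at every size; only the move count grows.
def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Return the list of (from_peg, to_peg) moves that shifts n discs."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)   # clear the n-1 smaller discs out of the way
    moves.append((source, target))               # move the largest remaining disc
    hanoi(n - 1, spare, target, source, moves)   # restack the smaller discs on top of it
    return moves

for discs in range(3, 8):
    # Move counts grow as 2**n - 1: 7, 15, 31, 63, 127
    print(discs, "discs:", len(hanoi(discs)), "moves")
```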
Apple’s team also examined how much “thinking” effort the models expend as puzzles get harder, measured in reasoning tokens. They found that reasoning models initially increase their thinking tokens as complexity rises, but “upon approaching a critical threshold — which closely corresponds to their accuracy collapse point — models counterintuitively begin to reduce their reasoning effort despite increasing problem difficulty,” effectively giving up as tasks become more complex. Accuracy did not improve when the researchers supplied the solution algorithm in the prompt: the models continued to fail even when all they had to do was follow explicit step-by-step instructions. Commenting on the work, artificial intelligence expert Gary Marcus argued that the study reinforces that large language models and their reasoning variants are “no substitute for good well-specified conventional algorithms,” while noting that many humans also struggle with harder versions of these puzzles. The article concludes that Apple’s findings should be treated as important but limited data within a broader research landscape: they are neither definitive proof that artificial intelligence progress is hollow nor a sign that artificial general intelligence breakthroughs are imminent.
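As an illustration of the contrast Marcus draws, the sketch below, an assumed example rather than code from the study, solves the classic fox, chicken, and grain river crossing with an ordinary breadth-first search. A conventional, well-specified algorithm of this kind executes the step-by-step logic deterministically and never “gives up,” which is the behaviour the reasoning models failed to reproduce even when handed explicit instructions.

```python
# Fox/chicken/grain river crossing solved by breadth-first search
# (illustrative sketch, not from Apple's paper).
from collections import deque

# Index 0 is the farmer; each value is 0 (near bank) or 1 (far bank).
ITEMS = ("farmer", "fox", "chicken", "grain")

def unsafe(state):
    """True if something gets eaten on a bank the farmer is not on."""
    farmer, fox, chicken, grain = state
    return (fox == chicken != farmer) or (chicken == grain != farmer)

def solve():
    """Search from everything on the near bank to everything on the far bank."""
    start, goal = (0, 0, 0, 0), (1, 1, 1, 1)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        farmer = state[0]
        for i, name in enumerate(ITEMS):
            # The farmer crosses alone (i == 0) or with one item on his bank.
            if i != 0 and state[i] != farmer:
                continue
            nxt = list(state)
            nxt[0] = 1 - farmer
            if i != 0:
                nxt[i] = 1 - farmer
            nxt = tuple(nxt)
            if nxt in seen or unsafe(nxt):
                continue
            seen.add(nxt)
            move = "crosses alone" if i == 0 else f"takes the {name} across"
            queue.append((nxt, path + [f"The farmer {move}."]))

# Prints the familiar seven-crossing solution, starting with the chicken.
for step in solve():
    print(step)
```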
