Apple study finds artificial intelligence reasoning models collapse on complex puzzles

Apple researchers report that leading artificial intelligence reasoning models excel at some tasks but abruptly fail on classic logic puzzles once complexity crosses a model-specific threshold, raising fresh questions about how much these systems really 'think.'

Apple researchers have released new artificial intelligence research showing that large reasoning models, or LRMs, such as OpenAI o1 and o3, DeepSeek R1, Claude 3.7 Sonnet Thinking, and Google Gemini Flash Thinking, can completely collapse when confronted with increasingly complex logic problems. The paper, published just before Apple’s WWDC event and written by the same team that previously identified reasoning flaws in large language models, reports that while these models outperform conventional language models on medium-difficulty puzzles, they actually perform worse on simple puzzles and then abruptly give up on hard ones. The findings have been received as sobering for artificial general intelligence optimists and encouraging for skeptics, suggesting that current reasoning models may not be as cognitively capable as some marketing implies.

The researchers evaluated LRMs using classic logic puzzles, including the Tower of Hanoi, jumping checker pieces into empty spaces, river-crossing problems involving items like a fox, a chicken, and a bag of grain, and block-stacking tasks that must match specific configurations. These puzzles are familiar tools for testing human reasoning: once the core strategy is understood, people can usually scale up to more complex variants by following the same logic with more discs, checkers, animals, or blocks. The paper reports that “results show that all reasoning models exhibit a similar pattern with respect to complexity: accuracy progressively declines as problem complexity increases until reaching complete collapse (zero accuracy) beyond a model-specific complexity threshold.” In the Tower of Hanoi experiments, Claude 3.7 Sonnet Thinking and DeepSeek R1 start to fail when a fifth disc is added, and even boosting compute does not rescue performance on harder puzzles.
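
To make the complexity scaling concrete, here is a minimal Python sketch, not taken from Apple’s paper (the function name hanoi_moves and the peg labels are illustrative assumptions), of the classic recursive Tower of Hanoi solution. The optimal solution for n discs takes 2**n - 1 moves, so each added disc roughly doubles the length of the move sequence a model must produce, which is one concrete way to see why a fifth disc marks such a jump in difficulty.

# Minimal illustrative sketch (not from Apple's paper): the classic recursive
# Tower of Hanoi solution, showing how the optimal move count grows as 2**n - 1.
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Return the optimal move sequence for n discs as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)    # move n-1 discs out of the way
        + [(source, target)]                         # move the largest disc
        + hanoi_moves(n - 1, spare, target, source)  # move n-1 discs back on top
    )

if __name__ == "__main__":
    for n in range(3, 8):
        print(f"{n} discs: {len(hanoi_moves(n))} moves")  # prints 7, 15, 31, 63, 127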

Apple’s team also examined how much “thinking” effort the models expend as puzzles get harder, measured in reasoning tokens. They found that reasoning models initially increase their reasoning tokens as complexity rises, but “upon approaching a critical threshold — which closely corresponds to their accuracy collapse point — models counterintuitively begin to reduce their reasoning effort despite increasing problem difficulty,” effectively giving up as tasks become more complex. Accuracy did not improve when researchers supplied the algorithmic solution in the prompt, meaning the models continued to fail even when all they had to do was follow explicit step-by-step instructions. Commenting on the work, artificial intelligence expert Gary Marcus argued that the study reinforces that large language models and reasoning variants are “no substitute for good well-specified conventional algorithms” and noted that many humans also struggle on harder versions of these puzzles. The article concludes that Apple’s findings should be treated as important but limited data within a broader research landscape, neither definitive proof that artificial intelligence progress is hollow nor evidence that artificial general intelligence breakthroughs are imminent.
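
As an illustration of how accuracy on these puzzles can be scored mechanically, independent of how many reasoning tokens a model spends, the following hypothetical Python sketch (the function replay_hanoi and its interface are assumptions, not the paper’s actual evaluation harness) replays a model-proposed move list against a Tower of Hanoi simulator and accepts the instance only if every move is legal and the goal state is reached.

# Hypothetical evaluation sketch (not the paper's actual harness): score a
# proposed Tower of Hanoi move list by replaying it in a simulator. An instance
# counts as solved only if every move is legal and the final state matches the
# goal, which is how "zero accuracy" collapse can be measured mechanically.
def replay_hanoi(moves, n_discs, pegs=("A", "B", "C"), target="C"):
    """Return True if `moves` (a list of (from_peg, to_peg) pairs) legally solves the puzzle."""
    state = {p: [] for p in pegs}
    state[pegs[0]] = list(range(n_discs, 0, -1))   # largest disc at the bottom
    for src, dst in moves:
        if not state[src]:
            return False                            # illegal: source peg is empty
        disc = state[src][-1]
        if state[dst] and state[dst][-1] < disc:
            return False                            # illegal: larger disc on smaller
        state[dst].append(state[src].pop())
    return state[target] == list(range(n_discs, 0, -1))

if __name__ == "__main__":
    # The optimal 3-disc solution (7 moves), written out explicitly for the demo.
    moves_3 = [("A", "C"), ("A", "B"), ("C", "B"), ("A", "C"),
               ("B", "A"), ("B", "C"), ("A", "C")]
    print(replay_hanoi(moves_3, 3))  # True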
