Apple study reveals major artificial intelligence flaw in OpenAI, Google, and Meta LLMs

Apple researchers tested more than 20 large language models and found widespread fragility on reasoning benchmarks when problem variables were changed or irrelevant details were added.

Researchers at Apple published a study that challenges common assumptions about large language models, arguing that many models’ apparent reasoning skill reflects sophisticated pattern matching rather than genuine logical understanding. The paper builds on concerns about contamination of popular benchmarks such as GSM8K and introduces a new test called GSM-Symbolic. GSM-Symbolic preserves the structure of the reasoning problems while altering names, numbers, and complexity, and it injects irrelevant information to probe whether models truly understand the tasks or merely recall patterns from training data.
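To make the methodology concrete, here is a minimal sketch of how symbolic templating of a GSM8K-style problem might work. The template wording, names, and number ranges are invented for illustration and are not drawn from the paper’s actual templates:

```python
import random

# Hypothetical GSM-Symbolic-style template: the logical structure is fixed,
# while surface details (name, quantities) vary per instance.
TEMPLATE = ("{name} picks {x} apples in the morning and {y} more in the "
            "afternoon. How many apples does {name} have in total?")

def make_variant(seed: int) -> tuple[str, int]:
    """Instantiate one symbolic variant along with its ground-truth answer."""
    rng = random.Random(seed)
    name = rng.choice(["Ava", "Liam", "Noah", "Mia"])
    x, y = rng.randint(5, 60), rng.randint(5, 60)
    return TEMPLATE.format(name=name, x=x, y=y), x + y

# A small evaluation set: same reasoning problem, different surface forms.
variants = [make_variant(seed) for seed in range(5)]
```

A model with genuine understanding should score the same on every variant; systematic drops across variants point to memorized surface patterns instead.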

The study evaluated more than 20 models, including OpenAI’s o1 and GPT-4o, Google’s Gemma 2, Meta’s Llama 3, and Microsoft’s Phi 3. Every model tested lost performance when variables were changed. Simple edits to names and values reduced accuracy by a few percentage points, variance the authors called non-negligible. The effect was more pronounced when researchers added seemingly relevant but ultimately inconsequential statements to problems. In one illustrative example, a math problem about counting kiwis included a note that five of them were smaller than average. Many models subtracted those five from the total, treating the irrelevant size comment as an operational cue. OpenAI’s o1 Preview showed the smallest decline, with accuracy falling 17.5 percent, while Microsoft’s Phi 3 fared far worse, dropping roughly 65 percent from its baseline on the unchanged problems.
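As a toy reconstruction of that failure mode, consider the arithmetic below. The counts are invented for illustration; only the “five smaller than average” detail comes from the study:

```python
# Toy version of the kiwi distractor problem (counts are invented).
picked_friday = 44
picked_saturday = 58
smaller_than_average = 5   # irrelevant: smaller kiwis still count as kiwis

correct_total = picked_friday + picked_saturday            # 102
distracted_answer = correct_total - smaller_than_average   # 97, the pattern-matched mistake
```

The subtraction is the signature error: the model maps “but five were smaller” onto a learned subtract-this-number pattern rather than judging whether the detail affects the count.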

The paper concludes that exposing models to altered variables and superfluous details reveals a critical limitation in how LLMs handle formal reasoning and relevance. The authors stress that these failures undermine claims of genuine mathematical understanding and that benchmark scores can mislead when a test is popular enough to appear in training data. It is worth noting the authors’ affiliation with Apple, which competes with Google, Meta, and OpenAI while also partnering with OpenAI and developing models of its own. The findings are a reminder to temper hype around artificial intelligence and to scrutinize reasoning claims with more robust, contamination-resistant evaluations.
