ARC-AGI-3 exposes limits in Artificial Intelligence reasoning

ARC-AGI-3 introduces interactive, instruction-free environments designed to test whether frontier Artificial Intelligence systems can adapt to genuinely novel situations. Early results show top models performing near zero, highlighting a sharp gap between pattern recognition and open-ended exploration.

ARC-AGI-3 is a new benchmark from the ARC Prize Foundation, co-founded by François Chollet and Mike Knoop, aimed at testing whether frontier Artificial Intelligence systems can adapt to unfamiliar environments without instructions. Released on March 24, 2026, it departs from earlier ARC benchmarks by focusing on interactive exploration rather than static puzzle solving. In early results, the best-performing frontier model, Google’s Gemini 3.1 Pro Preview, scored 0.37%. The benchmark is framed as a test of skill-acquisition efficiency on novel tasks, rather than of performance on narrow tasks that models may have effectively memorized through training.

Earlier ARC benchmarks laid the groundwork for this shift. ARC-AGI-1 used small grids of colored cells (up to 30×30, using 10 colors) and asked systems to infer transformation rules from a few examples. OpenAI’s o3 model scored 53.5% on the private test set, marking a major jump in benchmark performance. ARC-AGI-2, released in March 2025, increased difficulty with multi-step reasoning and symbolic interpretation, and NVIDIA’s team took first place in the 2025 competition with 24% accuracy. Yet the benchmark creators concluded that static formats had become vulnerable to overfitting: labs could generate synthetic variants at scale and train models to recognize a puzzle's structure rather than reason through novelty.

ARC-AGI-3 replaces fixed input-output puzzles with interactive turn-based environments that resemble small video games. The agent must explore, infer the rules, identify the goal, and plan a sequence of actions without being told what success looks like. The observation space is a 64×64 grid with 16 possible colors. Environments are arranged into multiple levels that build on earlier concepts, and scoring is based on RHAE, or Relative Human Action Efficiency, which compares model performance to the second-best human attempt and heavily penalizes inefficient action sequences. The full benchmark contains 135 environments: 25 public, 55 semi-private, and 55 fully private.
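To make the scoring idea concrete, here is a minimal Python sketch of an efficiency-style score in the spirit of RHAE. It is an illustration only, not the official formula: the function names, the squared penalty (`exponent=2.0`), the cap at 1.0, and the simple averaging across levels are all assumptions chosen to reflect the description above (a baseline taken from the second-best human attempt, with a steep penalty for extra actions).

```python
from typing import Optional, Sequence


def second_best(human_action_counts: Sequence[int]) -> int:
    """Return the second-lowest action count among human attempts.

    Assumption: "second-best human attempt" means the attempt with the
    second-fewest actions on a given level.
    """
    if len(human_action_counts) < 2:
        raise ValueError("need at least two human attempts")
    return sorted(human_action_counts)[1]


def level_efficiency(
    baseline_actions: int,
    agent_actions: Optional[int],
    exponent: float = 2.0,  # hypothetical penalty strength, not an official value
) -> float:
    """Score one level in [0, 1].

    An unsolved level (agent_actions is None) scores 0. Otherwise the score
    is (baseline / agent) ** exponent, capped at 1 so that beating the human
    baseline earns no extra credit in this sketch.
    """
    if agent_actions is None:
        return 0.0
    if baseline_actions <= 0 or agent_actions <= 0:
        raise ValueError("action counts must be positive")
    ratio = min(1.0, baseline_actions / agent_actions)
    return ratio ** exponent


def environment_score(level_scores: Sequence[float]) -> float:
    """Average the per-level scores (a simple unweighted mean, by assumption)."""
    if not level_scores:
        raise ValueError("no levels to score")
    return sum(level_scores) / len(level_scores)


if __name__ == "__main__":
    baseline = second_best([30, 22, 41])       # second-fewest actions: 30
    scores = [
        level_efficiency(baseline, 60),        # twice the baseline actions
        level_efficiency(baseline, 20),        # better than baseline, capped
        level_efficiency(baseline, None),      # level never solved
    ]
    print(scores, environment_score(scores))
```

Under these assumptions, an agent that needs twice as many actions as the human baseline keeps only a quarter of the credit for that level, which shows how a superlinear penalty can push scores toward zero even when an agent eventually stumbles onto the goal.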

The benchmark is also designed to resist gaming. The ARC Prize team distinguishes between task-specific overfitting and domain-specific overfitting, and its official leaderboard accepts only raw frontier models using the same minimal prompt, without hand-crafted tools or specialized harnesses. An open-source harness was released to show that perfect scores on public tasks are easy if the environment is known in advance, reinforcing the view that memorization should not count as intelligence. The release results suggest that current systems remain strong in domains with broad prior knowledge and exact feedback, such as coding and math, but struggle with open-ended, hypothesis-driven exploration in unfamiliar settings.

ARC-AGI-3 is positioned as a benchmark to watch because many other major tests have already been saturated or can be gamed through data generation. The 2026 ARC Prize competition runs on Kaggle with a $2M prize pool across two tracks: ARC-AGI-3 and ARC-AGI-2. Early competition previews showed that state-of-the-art CNN + reinforcement learning approaches and directed state-graph approaches topped out around 12% on a restricted set of public environments, while some academic harnesses succeeded only with substantial tooling to help models maintain a working representation of the environment over time.
