ARC-AGI-3 is a new benchmark from the ARC Prize Foundation, co-founded by François Chollet and Mike Knoop, designed to test whether frontier artificial intelligence systems can adapt to unfamiliar environments without instructions. Released on March 24, 2026, it departs from earlier ARC benchmarks by focusing on interactive exploration rather than static puzzle solving. At release, the best-performing frontier model, Google's Gemini 3.1 Pro Preview, scores just 0.37%. The benchmark is framed as a test of skill-acquisition efficiency on novel tasks, rather than of performance on narrow tasks that models may have effectively memorized through training.
Earlier ARC benchmarks laid the groundwork for this shift. ARC-AGI-1 used small grids of colored cells (up to 30×30, with 10 colors) and asked systems to infer transformation rules from a few examples. OpenAI's o3 model reached 53.5% on the private test set, marking a major jump in benchmark performance. ARC-AGI-2, released in March 2025, increased the difficulty with multi-step reasoning and symbolic interpretation, and NVIDIA's team took first place in the 2025 competition with 24% accuracy. Yet the benchmark's creators concluded that static formats had become vulnerable to overfitting: labs could generate synthetic variants at scale and train models to recognize the structure rather than reason through novelty.
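To make the static format concrete, here is a minimal sketch of ARC-AGI-1-style rule induction, assuming for illustration that the hidden rule is a simple per-color substitution. Real tasks involve far richer spatial transformations, and the function names below are hypothetical:

```python
def infer_color_map(examples):
    """Infer a color substitution from example (input, output) grid pairs --
    a toy version of the few-shot rule induction ARC-AGI-1 asks for."""
    mapping = {}
    for inp, out in examples:
        for row_in, row_out in zip(inp, out):
            for a, b in zip(row_in, row_out):
                # setdefault records a->b the first time; a conflict means
                # the rule is not a plain substitution.
                if mapping.setdefault(a, b) != b:
                    raise ValueError("rule is not a simple color substitution")
    return mapping

def apply_map(grid, mapping):
    """Apply the inferred substitution to an unseen test grid."""
    return [[mapping.get(c, c) for c in row] for row in grid]

# Two demonstration pairs in which color 1 becomes 2 and color 3 becomes 4.
examples = [
    ([[1, 0], [0, 3]], [[2, 0], [0, 4]]),
    ([[3, 1], [1, 0]], [[4, 2], [2, 0]]),
]
rule = infer_color_map(examples)
print(apply_map([[1, 3], [0, 1]], rule))  # [[2, 4], [0, 2]]
```

The point of the static format is exactly this pipeline: induce a rule from a handful of pairs, then apply it once to a held-out input.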
ARC-AGI-3 replaces fixed input-output puzzles with interactive turn-based environments that resemble small video games. The agent must explore, infer the rules, identify the goal, and plan a sequence of actions without being told what success looks like. The observation space is a 64×64 grid with 16 possible colors. Environments are arranged into multiple levels that build on earlier concepts, and scoring is based on RHAE, or Relative Human Action Efficiency, which compares model performance to the second-best human attempt and heavily penalizes inefficient action sequences. The full benchmark contains 135 environments: 25 public, 55 semi-private, and 55 fully private.
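The interactive setup can be sketched as follows. Everything here is hypothetical: `ToyEnvironment`, its `step` API, and the `relative_action_efficiency` helper are illustrative stand-ins, not the real ARC-AGI-3 interface or the official RHAE formula. The sketch only captures the core idea that an agent acts without being told the goal and is scored against a human action budget:

```python
import random

GRID_SIZE = 64   # observations are 64x64 grids
NUM_COLORS = 16  # each cell holds one of 16 color indices

class ToyEnvironment:
    """Hypothetical stand-in for an ARC-AGI-3 environment: the agent
    receives only a grid observation and a done flag, never a textual
    description of the rules or the goal."""

    def __init__(self, target=(3, 3)):
        self.grid = [[0] * GRID_SIZE for _ in range(GRID_SIZE)]
        self.pos = (0, 0)
        self.target = target
        self.done = False
        self.grid[0][0] = 1  # mark the agent's cell with color 1

    def step(self, action):
        """Apply one of four movement actions; even their meaning is
        something the agent must discover through exploration."""
        dr, dc = {"up": (-1, 0), "down": (1, 0),
                  "left": (0, -1), "right": (0, 1)}[action]
        r, c = self.pos
        self.grid[r][c] = 0
        r = max(0, min(GRID_SIZE - 1, r + dr))
        c = max(0, min(GRID_SIZE - 1, c + dc))
        self.pos = (r, c)
        self.grid[r][c] = 1
        self.done = self.pos == self.target
        return self.grid, self.done

def relative_action_efficiency(agent_actions, human_actions):
    """RHAE-style ratio (an assumption, not the published formula):
    matching or beating the human action budget scores 1.0, and every
    wasted action drags the score toward 0."""
    if agent_actions == 0:
        return 0.0
    return min(1.0, human_actions / agent_actions)

# A goal-blind random agent: the weakest possible explorer.
random.seed(0)
env = ToyEnvironment()
actions = 0
while not env.done:
    env.step(random.choice(["up", "down", "left", "right"]))
    actions += 1

# Pretend the second-best human needed 6 actions on this toy task.
print(actions, round(relative_action_efficiency(actions, 6), 4))
```

A random walker typically needs orders of magnitude more actions than the 6-step optimum, so its efficiency score collapses toward zero, which is the behavior the metric is meant to punish.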
The benchmark is also designed to resist gaming. The ARC Prize team distinguishes between task-specific and domain-specific overfitting, and its official leaderboard accepts only raw frontier models using the same minimal prompt, without hand-crafted tools or specialized harnesses. An open-source harness was released to demonstrate that perfect scores on the public tasks are easy when the environment is known in advance, reinforcing the view that memorization should not count as intelligence. The release results suggest that current systems remain strong in domains with broad prior knowledge and exact feedback, such as coding and math, but struggle with open-ended, hypothesis-driven exploration in unfamiliar settings.
ARC-AGI-3 is positioned as a benchmark to watch because many other major tests have already been saturated or can be gamed through data generation. The 2026 ARC Prize competition runs on Kaggle with a $2M prize pool across two tracks: ARC-AGI-3 and ARC-AGI-2. Early competition previews showed that state-of-the-art CNN + reinforcement learning approaches and directed state-graph approaches topped out around 12% on a restricted set of public environments, while some academic harnesses succeeded only with substantial tooling to help models maintain a working representation of the environment over time.
