Why METR’s time horizon graph keeps reshaping artificial intelligence debates

A complex graph from nonprofit METR, built around a custom time horizon metric for coding tasks, has become a focal point in arguments about how fast frontier artificial intelligence models are advancing and what that means for real-world work and risk.

Each major release from OpenAI, Google, or Anthropic now triggers intense scrutiny of a single plot produced by Model Evaluation & Threat Research, or METR, a nonprofit focused on evaluating the risks of frontier artificial intelligence systems. The graph tracks how the most advanced large language models perform on software engineering tasks and appears to show an exponential rise in capabilities, with newer systems outperforming even that steep trajectory. Claude Opus 4.5, Anthropic’s latest top-tier model, became a flashpoint when METR estimated that it could independently complete a task that would have taken a human about five hours, sparking alarmed reactions inside and outside the company. Yet METR stresses that its model estimates carry substantial error bars, and that Opus 4.5 might regularly complete only tasks that take humans about two hours, or might succeed on tasks that take humans as long as 20 hours.

The core of the graph is METR’s custom y-axis metric, the model “time horizon,” which is widely misunderstood. METR built a suite of software engineering tasks ranging from quick multiple-choice questions to multi-hour coding problems. Human experts attempted most of the tasks, and METR measured or estimated how long they took, which gives each task a human baseline time. When large language models are tested on this suite, they easily solve short tasks but succeed less often as the tasks take humans longer. From these results, METR calculates the point on the human time scale where a model succeeds at about 50% of tasks, which defines its time horizon. Contrary to frequent misreadings on social media, the hours on the plot’s y-axis, such as around five hours for Claude Opus 4.5, do not indicate how long the model can operate autonomously. They indicate how long it takes humans to complete tasks that the model succeeds at about half the time.
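The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not METR’s code: it assumes that success probability follows a logistic curve in the logarithm of human task time, the entries in `runs` are invented for the example, and the helper names `fit_logistic` and `time_horizon_minutes` are hypothetical.

```python
import math

# Hypothetical results: (human time in minutes, 1 if the model succeeded else 0).
runs = [
    (1, 1), (2, 1), (4, 1), (8, 1), (15, 1), (15, 0), (30, 1),
    (60, 1), (60, 0), (120, 0), (120, 1), (240, 0), (480, 0), (960, 0),
]

def sigmoid(z):
    # Clamp the input so exp() cannot overflow on extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, iters=50):
    """Fit P(success) = sigmoid(a + b * x) by Newton's method."""
    a = b = 0.0
    for _ in range(iters):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g_a += y - p            # gradient of the log-likelihood
            g_b += (y - p) * x
            h_aa += w               # entries of the negative Hessian
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        if det < 1e-12:
            break
        # Newton step: solve the 2x2 system explicitly.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

def time_horizon_minutes(runs):
    """Human task time at which the fitted success rate is 50%."""
    xs = [math.log2(minutes) for minutes, _ in runs]
    ys = [ok for _, ok in runs]
    a, b = fit_logistic(xs, ys)
    # sigmoid(a + b * x) equals 0.5 exactly where a + b * x = 0.
    return 2.0 ** (-a / b)

horizon = time_horizon_minutes(runs)
print(f"50% time horizon: about {horizon:.0f} human-minutes")
```

The output is measured in human minutes, not model runtime, which is the distinction the paragraph above draws. With so few data points the fitted value is very uncertain, which is one way to see why the published estimates carry wide error bars.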

Although METR’s early 2025 analysis suggested that the time horizons of leading models were growing exponentially, doubling roughly every seven months, the team is cautious about overinterpreting the trend. According to that analysis, the most advanced models could complete tasks that took humans nine seconds in mid-2020, four minutes in early 2023, and 40 minutes in late 2024. Enthusiasts and pessimists alike have used the graph to forecast radically different futures, from superintelligent extinction scenarios by 2030 to near-term artificial intelligence employees predicted in a Sequoia Capital post titled “2026: This is AGI.” Critics such as University of Illinois computer scientist Daniel Kang and UC Berkeley researcher Inioluwa Deborah Raji question whether human task time is the right proxy for broader capability, especially since METR’s evaluations focus heavily on coding and on idealized benchmarks that lack the “messiness” of real work. Even so, many experts, including vocal skeptics of current large language model hype, praise the study’s care and rigor. METR staff emphasize that the graph is an imperfect but valuable tool in a fast-changing field: an attempt to quantify intuitive impressions of artificial intelligence progress that, flaws and all, has become one of the most influential metrics of its kind.
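As a rough check on the doubling claim, the implied doubling time can be computed from the first and last figures quoted above. The calendar dates are assumptions: the text gives only approximate periods, which this sketch reads as July 2020 for the nine-second figure and November 2024 for the 40-minute figure. The function names are hypothetical.

```python
import math

def months_between(start, end):
    """Whole months between two (year, month) pairs."""
    (y0, m0), (y1, m1) = start, end
    return (y1 - y0) * 12 + (m1 - m0)

def doubling_time_months(h_start_s, h_end_s, months):
    """Months per doubling, assuming steady exponential growth."""
    doublings = math.log2(h_end_s / h_start_s)
    return months / doublings

# Assumed dates: mid-2020 read as July 2020, late 2024 as November 2024.
months = months_between((2020, 7), (2024, 11))
dt = doubling_time_months(9, 40 * 60, months)  # 9 seconds -> 40 minutes
print(f"{months} months elapsed, doubling every {dt:.1f} months")
```

Under these assumed dates the result is about six and a half months, consistent with a doubling roughly every seven months. Shifting either date by a few months moves the estimate only slightly, because the growth spans about eight doublings.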
