Why METR’s time horizon graph keeps reshaping artificial intelligence debates

February 7, 2026

A complex graph from nonprofit METR, built around a custom time horizon metric for coding tasks, has become a focal point in arguments about how fast frontier artificial intelligence models are advancing and what that means for real-world work and risk.

Each major release from OpenAI, Google, or Anthropic now triggers intense scrutiny of a single plot produced by Model Evaluation & Threat Research, or METR, a nonprofit focused on evaluating the risks of frontier artificial intelligence systems. The graph tracks how the most advanced large language models perform on software engineering tasks and appears to show an exponential rise in capabilities, with newer systems outperforming even that steep trajectory. Claude Opus 4.5, Anthropic’s latest top-tier model, became a flashpoint when METR estimated that it could independently complete a task that would have taken a human about five hours, sparking alarmed reactions inside and outside the company. Yet METR stresses that its model estimates carry substantial error bars, and that Opus 4.5 might regularly complete only tasks that take humans about two hours, or might succeed on tasks that take humans as long as 20 hours.

The core of the graph is its y-axis, a custom METR metric called the model “time horizon,” and that metric is widely misunderstood. METR built a task suite ranging from quick multiple-choice questions to multi-hour coding problems, all linked to software engineering. Human experts attempted most tasks, and METR measured or estimated the time they took to provide a baseline. When large language models are tested on this suite, they easily solve short tasks but lose accuracy on tasks that took humans longer. From these results, METR calculates the point on the human time scale where a model succeeds at about 50% of tasks, which defines its time horizon. Contrary to frequent misreadings on social media, the hours on the plot’s y-axis, such as around five hours for Claude Opus 4.5, do not indicate how long the model can operate autonomously. They indicate how long humans take to complete tasks that the model completes successfully about half the time.
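
To make the 50% definition concrete, here is a minimal sketch of one way such a crossing point could be estimated. It assumes a logistic curve fitted to success against the base-2 logarithm of human task time. The article does not describe METR's actual fitting procedure, so the curve choice, the function names, and the task data below are all hypothetical illustrations.

```python
import math


def fit_logistic(xs, ys, lr=0.1, steps=20000):
    """Fit p(success) = sigmoid(a + b * x) by gradient ascent on the log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def time_horizon_minutes(task_minutes, successes):
    """Return the human task time (minutes) at which fitted success is 50%."""
    log_times = [math.log2(m) for m in task_minutes]
    mean_log = sum(log_times) / len(log_times)
    # Center the feature so plain gradient ascent converges quickly.
    centered = [x - mean_log for x in log_times]
    a, b = fit_logistic(centered, successes)
    # sigmoid(a + b * x) equals 0.5 exactly where a + b * x = 0.
    crossing = mean_log - a / b
    return 2 ** crossing


# Hypothetical results: human time per task (minutes) and whether the model succeeded.
task_minutes = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
successes = [1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0]

horizon = time_horizon_minutes(task_minutes, successes)
print(f"Estimated 50% time horizon: {horizon:.0f} human-minutes")
```

The number printed is a length of human working time, not a duration for which the model runs, which is the distinction the paragraph above draws.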

Although METR’s early 2025 analysis suggested that the time horizons of leading models were growing exponentially, doubling roughly every seven months, the team is cautious about overinterpreting the trend. According to that analysis, the most advanced models could complete tasks that took humans nine seconds in mid-2020, four minutes in early 2023, and 40 minutes in late 2024. Enthusiasts and pessimists alike have used the graph to forecast radically different futures, from superintelligent extinction scenarios by 2030 to near-term artificial intelligence employees predicted in a Sequoia Capital post titled “2026: This is AGI.” Critics such as University of Illinois computer scientist Daniel Kang and UC Berkeley researcher Inioluwa Deborah Raji question whether human task time is the right proxy for broader capability, especially since METR’s evaluations focus heavily on coding and idealized benchmarks that lack the “messiness” of real work. Even so, many experts, including vocal skeptics of current large language model hype, praise the study’s care and rigor. METR staff emphasize that the graph is an imperfect but valuable tool in a fast-changing field, a flawed attempt to quantify intuitive impressions of artificial intelligence progress that has nonetheless become one of the most influential metrics of its kind.
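
The doubling figure can be checked roughly against the three data points quoted above. The sketch below converts “mid 2020,” “early 2023,” and “late 2024” into approximate decimal years; those conversions are assumptions made for illustration, not dates taken from METR, and the helper name is hypothetical.

```python
import math

# (approximate decimal year, time horizon in seconds)
# The decimal years are assumed readings of the article's loose dates.
points = [
    (2020.5, 9),        # "mid 2020": nine seconds
    (2023.1, 4 * 60),   # "early 2023": four minutes
    (2024.9, 40 * 60),  # "late 2024": 40 minutes
]


def doubling_time_months(earlier, later):
    """Months per doubling implied by two (year, horizon) points."""
    (t0, h0), (t1, h1) = earlier, later
    doublings = math.log2(h1 / h0)
    return (t1 - t0) * 12 / doublings


overall = doubling_time_months(points[0], points[-1])
first_leg = doubling_time_months(points[0], points[1])
second_leg = doubling_time_months(points[1], points[2])

print(f"Overall: {overall:.1f} months per doubling")
print(f"Mid 2020 to early 2023: {first_leg:.1f} months per doubling")
print(f"Early 2023 to late 2024: {second_leg:.1f} months per doubling")
```

Under these assumed dates, all three estimates land between six and seven months, in line with the roughly seven-month doubling the analysis reports. Shifting any date by a few months moves the result, so this is a sanity check rather than a reproduction of METR's fit.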
