Why METR’s time horizon graph keeps reshaping artificial intelligence debates

A complex graph from nonprofit METR, built around a custom time horizon metric for coding tasks, has become a focal point in arguments about how fast frontier artificial intelligence models are advancing and what that means for real-world work and risk.

Each major release from OpenAI, Google, or Anthropic now triggers intense scrutiny of a single plot produced by Model Evaluation & Threat Research, or METR, a nonprofit focused on evaluating the risks of frontier artificial intelligence systems. The graph tracks how the most advanced large language models perform on software engineering tasks and appears to show an exponential rise in capabilities, with newer systems outperforming even that steep trajectory. Claude Opus 4.5, Anthropic’s latest top-tier model, became a flashpoint when METR estimated that it could independently complete a task that would have taken a human about five hours, sparking alarmed reactions inside and outside the company. Yet METR stresses that its model estimates carry substantial error bars, and that Opus 4.5 might regularly complete only tasks that take humans about two hours, or might succeed on tasks that take humans as long as 20 hours.

The core of the graph is METR’s custom y-axis metric, the model “time horizon,” which is widely misunderstood. METR built a suite of software engineering tasks ranging from quick multiple-choice questions to multi-hour coding problems. Human experts attempted most of the tasks, and METR measured or estimated how long they took, which gives each task a human baseline time. When large language models are tested on this suite, they easily solve short tasks but lose accuracy on problems that took humans longer. From these results, METR calculates the point on the human time scale where a model succeeds at about 50% of tasks, and that point defines its time horizon. Contrary to frequent misreadings on social media, the hours on the plot’s y-axis, such as around five hours for Claude Opus 4.5, do not indicate how long the model can operate autonomously. They indicate how long humans take to complete tasks that the model succeeds at about half the time.
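To make the 50% definition concrete, here is a minimal sketch of one plausible way to compute such a horizon. It assumes a logistic curve fitted to success against the base-2 logarithm of human task time, which the article does not confirm as METR's exact procedure. The task times and outcomes are invented for illustration, and the function names are hypothetical.

```python
import math

# Invented example data (not METR's): human baseline time per task, in minutes,
# and whether a hypothetical model solved the task (1) or not (0).
HUMAN_MINUTES = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
SUCCESSES = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]


def fit_logistic(xs, ys, steps=5000, lr=0.1):
    """Fit p(success) = sigmoid(a + b * (x - mean_x)) by gradient ascent
    on the mean log-likelihood. Centering x keeps the updates well behaved."""
    mean_x = sum(xs) / len(xs)
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * (x - mean_x))))
            grad_a += y - p
            grad_b += (y - p) * (x - mean_x)
        a += lr * grad_a / len(xs)
        b += lr * grad_b / len(xs)
    return a, b, mean_x


def time_horizon_minutes(human_minutes, successes):
    """Human task length (minutes) at which the fitted success rate is 50%."""
    xs = [math.log2(t) for t in human_minutes]
    a, b, mean_x = fit_logistic(xs, successes)
    # sigmoid(z) = 0.5 exactly when z = 0, so solve a + b * (x - mean_x) = 0.
    x_half = mean_x - a / b
    return 2 ** x_half


print(f"Estimated 50% time horizon: {time_horizon_minutes(HUMAN_MINUTES, SUCCESSES):.1f} minutes")
```

The result is a task length measured in human minutes. It says nothing about how long the model itself ran, which is the distinction the paragraph above draws.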

METR’s early 2025 analysis suggested that the time horizons of leading models were growing exponentially, doubling roughly every seven months, but the team is cautious about overinterpreting the trend. According to that analysis, the most advanced models could complete tasks that took humans about nine seconds in mid-2020, four minutes in early 2023, and 40 minutes in late 2024. Enthusiasts and pessimists alike have used the graph to forecast radically different futures, from superintelligent extinction scenarios by 2030 to near-term artificial intelligence employees predicted in a Sequoia Capital post titled “2026: This is AGI.” Critics such as University of Illinois computer scientist Daniel Kang and UC Berkeley researcher Inioluwa Deborah Raji question whether human task time is the right proxy for broader capability, especially since METR’s evaluations focus heavily on coding and idealized benchmarks that lack the “messiness” of real work. Even so, many experts, including vocal skeptics of current large language model hype, praise the study’s care and rigor. METR staff emphasize that the graph is an imperfect but valuable tool in a fast-changing field, a flawed attempt to quantify intuitive impressions of artificial intelligence progress that has nonetheless become one of the most influential metrics of its kind.
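The roughly seven-month doubling figure can be sanity-checked against the three data points quoted above. The sketch below is back-of-the-envelope arithmetic, not METR's fitting method, and the month counts are assumptions, since the article gives only approximate dates ("mid 2020", "early 2023", "late 2024").

```python
import math


def doubling_time_months(start_seconds, end_seconds, months_elapsed):
    """Average months per doubling between two time-horizon measurements."""
    doublings = math.log2(end_seconds / start_seconds)
    return months_elapsed / doublings


# Time horizons quoted in the article, converted to seconds.
MID_2020 = 9
EARLY_2023 = 4 * 60
LATE_2024 = 40 * 60

# Assumed gaps in months; the article's dates are approximate.
MONTHS_2020_TO_2023 = 32
MONTHS_2023_TO_2024 = 20

overall = doubling_time_months(
    MID_2020, LATE_2024, MONTHS_2020_TO_2023 + MONTHS_2023_TO_2024
)
first_leg = doubling_time_months(MID_2020, EARLY_2023, MONTHS_2020_TO_2023)
second_leg = doubling_time_months(EARLY_2023, LATE_2024, MONTHS_2023_TO_2024)

print(f"overall: {overall:.1f} months per doubling")
print(f"mid-2020 to early 2023: {first_leg:.1f} months per doubling")
print(f"early 2023 to late 2024: {second_leg:.1f} months per doubling")
```

Under these assumed dates, each estimate falls between six and seven months per doubling, which is consistent with the figure the article reports. Shifting the assumed month counts by a few months moves the result noticeably, one reason to treat any single doubling-time number with caution.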
