Why METR’s time horizon graph keeps reshaping artificial intelligence debates

A complex graph from nonprofit METR, built around a custom time horizon metric for coding tasks, has become a focal point in arguments about how fast frontier artificial intelligence models are advancing and what that means for real-world work and risk.

Each major release from OpenAI, Google, or Anthropic now triggers intense scrutiny of a single plot produced by Model Evaluation & Threat Research, or METR, a nonprofit focused on evaluating the risks of frontier artificial intelligence systems. The graph tracks how the most advanced large language models perform on software engineering tasks and appears to show an exponential rise in capabilities, with the newest systems landing above even that steep fitted curve. Claude Opus 4.5, Anthropic’s latest top-tier model, became a flashpoint when METR estimated that it could independently complete a task that would have taken a human about five hours, sparking alarmed reactions inside and outside the company. Yet METR stresses that its estimates carry substantial error bars: Opus 4.5 might reliably complete only tasks that take humans about two hours, or might succeed on tasks that take humans as long as 20 hours.

The core of the graph is METR’s custom y-axis metric, the model’s “time horizon,” which is widely misunderstood. METR built a task suite ranging from quick multiple-choice questions to multi-hour coding problems, all tied to software engineering. Human experts attempted most tasks, and METR measured or estimated how long they took, establishing a human baseline. When large language models are tested on the same suite, they easily solve the short tasks but lose accuracy on problems that took humans longer. From these results, METR calculates the point on the human time scale at which a model succeeds on about 50% of tasks; that point defines its time horizon. Contrary to frequent misreadings on social media, the hours on the plot’s y-axis, such as roughly five hours for Claude Opus 4.5, do not indicate how long the model can operate autonomously. They indicate how long it takes humans to complete the tasks that the model can usually perform successfully.
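To make the definition concrete, here is a minimal sketch in Python of how a 50% time horizon can be estimated: fit a logistic curve to task success against the logarithm of human completion time, then solve for the time at which predicted success crosses 50%. The task data below is invented for illustration, and the fitting details are an assumption rather than METR’s exact procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical task results: how long each task took a human (minutes)
# and whether the model solved it (1) or failed it (0).
human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480, 960])
model_solved = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0])

# Model success probability as a function of log2(human time), so each
# doubling of task length shifts the odds by a constant amount.
X = np.log2(human_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, model_solved)

# The 50% time horizon is where the fitted logit b0 + b1*log2(t) equals zero.
b0, b1 = clf.intercept_[0], clf.coef_[0][0]
horizon_minutes = 2 ** (-b0 / b1)
print(f"Estimated 50% time horizon: {horizon_minutes:.0f} human-minutes")
```

On this toy data the horizon lands on the order of an hour or two, which is also the sense in which a “five-hour” model does not run for five hours: it succeeds, about half the time, on tasks that take humans five hours.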

Although METR’s early 2025 analysis suggested that the time horizons of leading models were increasing at an accelerating rate, doubling roughly every seven months, the team is cautious about overinterpreting the trend. According to that analysis, the most advanced models could complete tasks that took humans nine seconds in mid-2020, four minutes in early 2023, and 40 minutes in late 2024. Enthusiasts and pessimists alike have used the graph to forecast radically different futures, from superintelligent extinction scenarios by 2030 to the near-term artificial intelligence employees predicted in a Sequoia Capital post titled “2026: This is AGI.” Critics such as University of Illinois computer scientist Daniel Kang and UC Berkeley researcher Inioluwa Deborah Raji question whether human task time is the right proxy for broader capability, especially since METR’s evaluations focus heavily on coding and on idealized benchmarks that lack the “messiness” of real work. Even so, many experts, including vocal skeptics of current large language model hype, praise the study’s care and rigor. METR staff emphasize that the graph is an imperfect but valuable tool in a fast-changing field: a flawed attempt to quantify intuitive impressions of artificial intelligence progress that has nonetheless become one of the most influential metrics of its kind.
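As a rough sanity check on the seven-month figure, the three data points quoted above can be fit on a log scale. The dates below are numeric approximations of “mid 2020,” “early 2023,” and “late 2024,” so the result is indicative rather than a reproduction of METR’s fit.

```python
import numpy as np

# (approximate year, time horizon in seconds) as quoted in the article
points = [
    (2020.5, 9),        # mid 2020: about nine seconds
    (2023.0, 4 * 60),   # early 2023: about four minutes
    (2024.8, 40 * 60),  # late 2024: about 40 minutes
]

# Fit a line to log2(horizon) versus time; the slope is doublings per year.
years = np.array([p[0] for p in points])
log2_horizon = np.log2([p[1] for p in points])
doublings_per_year, _ = np.polyfit(years, log2_horizon, 1)

print(f"Implied doubling time: {12 / doublings_per_year:.1f} months")
```

The fit lands in the neighborhood of six to seven months per doubling, consistent with the trend the article describes.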


