Anthropic has unveiled two new artificial intelligence models, spearheaded by the advanced Claude Opus 4, which the company claims significantly expands the capabilities of AI agents in autonomous task management. This latest model demonstrates the ability to execute complex jobs requiring thousands of steps over several hours. A notable example cited by Anthropic is Claude Opus 4’s creation of a comprehensive guide for Pokémon Red, which it produced while playing the game continuously for more than 24 hours. In comparison, the company’s earlier flagship, Claude 3.7 Sonnet, could operate for only about 45 minutes on similar tasks. Additionally, Rakuten, a technology company in Japan, recently utilized Claude Opus 4 to autonomously code for nearly seven hours on a complex open-source assignment.
The key to these advances lies in Anthropic’s improvements to the model’s “memory files” system, allowing for better information retention and task progression over time. Dianne Penn, product lead for research at Anthropic, explains that the leap between model generations transforms artificial intelligence from an assistant that requires frequent human feedback into a genuine agent capable of independent decision-making. This evolution allows human users to delegate complex assignments and take on more supervisory roles rather than providing step-by-step guidance.
Claude Opus 4 will initially be accessible only to paying Anthropic customers, while the second release, Claude Sonnet 4, will be available to both paid and free users. Both models are hybrid, offering either instant or deeply reasoned responses, and both can search the web or use external tools while generating a response. Across the industry, companies are racing to build AI agents that can plan, reason, and execute intricate tasks without supervision, but safety issues such as “reward hacking” remain. Anthropic reports that its new models exhibit such behaviors 65% less often than the previous generation, a reduction it attributes to enhanced training oversight and refined evaluation processes.