OpenAI is reorganizing its research agenda around what it calls an Artificial Intelligence researcher, a fully automated agent-based system designed to tackle large, complex problems with minimal human guidance. The effort is now the company’s stated “North Star” for the next few years and is intended to unify work on reasoning models, agents, and interpretability. OpenAI plans to build “an autonomous AI research intern” by September, a system that can take on a small number of specific research problems by itself. That system is meant to lead to a fully automated multi-agent research system that the company plans to debut in 2028.
The intended scope is broad. OpenAI envisions systems that could work on math and physics problems, contribute to biology and chemistry, and address business or policy questions, as long as the task can be expressed in text, code, or whiteboard sketches. Chief scientist Jakub Pachocki argues that recent progress suggests models are approaching the ability to work coherently for extended periods, with humans still setting goals and remaining in charge. He points to Codex as an early version of the broader concept, noting that OpenAI claims most of its technical staffers now use the tool in their work. Pachocki says the near-term target is a system that can handle delegated tasks that would take a person a few days.
That ambition builds on recent advances in coding agents and reasoning models. Pachocki points to the leap from 2020’s GPT-3 to 2023’s GPT-4 as evidence that greater general capability also improves how long models can work without help. He also says reasoning models, which step through problems and backtrack when needed, have extended how long systems can stay effective on difficult tasks. OpenAI is also training models on complex examples such as hard math and coding puzzles so they learn to manage large amounts of text and break work into multiple subtasks. Researchers have used GPT-5 to discover new solutions to a number of unsolved math problems and to push through dead ends in biology, chemistry, and physics puzzles, though Pachocki acknowledges the technology is not yet reliable enough to hand complete control to the system.
Outside researchers see promise but also significant obstacles. Doug Downey of the Allen Institute for AI says coding agents have made the idea of automated scientific work more plausible, but warns that multi-step research workflows remain fragile because errors compound when tasks must be chained together. He notes that OpenAI’s latest model, GPT-5, performed best in his group’s testing on scientific tasks but still made lots of errors, and that OpenAI released GPT-5.4 two weeks ago, which could already change those results.
Safety and governance remain unresolved. Pachocki says a system capable of running an entire research program raises serious questions about misuse, hacking, and misunderstanding instructions. OpenAI’s main current safeguard is chain-of-thought monitoring, in which reasoning models record intermediate notes that researchers can inspect to judge whether behavior is aligned with expectations. He says highly capable systems should operate in sandboxes and under restrictions for a long time because fully trustworthy control is still out of reach. Pachocki also warns that such systems could concentrate unprecedented power in a small number of hands, making government involvement essential even as the broader debate over military and other sensitive uses remains unsettled.
