Sakana AI, the University of British Columbia, the Vector Institute, and the University of Oxford have published an open-access paper in Nature describing The AI Scientist, a system built to execute the full machine-learning research process end to end. The system is designed to generate research ideas, search and read relevant literature, design and run experiments, and write complete papers in LaTeX, with feedback on figures provided by a foundation model with vision capabilities. The publication consolidates earlier open-source releases and adds new architectural details, scaling results, and a discussion of the opportunities and risks of AI-generated science.
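At a high level, the described loop can be pictured as a sequence of stages from idea to manuscript. The Python sketch below is purely illustrative: every helper is a hypothetical toy stub standing in for a model-driven component, not part of the project's actual codebase.

```python
# Illustrative sketch of an automated research loop. All helpers below are
# hypothetical stand-ins, not the project's real API.

def generate_ideas(template: str, n: int) -> list[str]:
    return [f"idea-{i}" for i in range(n)]        # a model would propose ideas here

def search_literature(idea: str) -> list[str]:
    return []                                     # would query a literature search API

def is_novel(idea: str, related: list[str]) -> bool:
    return True                                   # would compare against prior work

def run_experiments(idea: str, template: str) -> dict:
    return {"metric": 0.0}                        # would edit and execute experiment code

def make_figures(results: dict) -> list[str]:
    return ["fig1.png"]                           # plots later checked by a vision model

def write_latex_paper(idea: str, results: dict, figures: list[str]) -> str:
    return "\\documentclass{article} % draft for " + idea

def run_pipeline(template: str, n_ideas: int = 5) -> list[str]:
    papers = []
    for idea in generate_ideas(template, n_ideas):
        if not is_novel(idea, search_literature(idea)):
            continue                              # skip ideas too close to prior work
        results = run_experiments(idea, template)
        figures = make_figures(results)
        papers.append(write_latex_paper(idea, results, figures))
    return papers

print(len(run_pipeline("seed experiment code")))  # prints 5
```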
The work is presented as the result of a roughly 1.5-year effort. In its first phase, the system was given a starting code template and autonomously generated ideas, ran experiments, and wrote a full paper, while an Automated Reviewer was built to score paper quality. In a later phase, the system was granted broader freedom across AI research topics and submitted unedited, fully AI-generated papers to the blind peer-review process of the ICLR 2025 "I Can't Believe It's Not Better" workshop. One manuscript received an average score of 6.33 (individual scores of 6, 7, and 6), surpassing the workshop's average acceptance threshold and scoring higher than 55% of human-authored submissions. The submission was made with the organizers' permission and, as planned in advance, was withdrawn after acceptance but before publication.
The Nature paper also emphasizes evaluation at scale through the Automated Reviewer. The model was prompted to act as an Area Chair, aggregating five independent reviews into a final accept/reject decision following official NeurIPS reviewer guidelines. Benchmarked against thousands of human decisions from OpenReview, it achieved a balanced accuracy of 69% and an F1-score exceeding the inter-human agreement reported in the NeurIPS 2021 consistency experiment. The reported results suggest the reviewer matches human-level performance, including on papers published after the model's knowledge cutoff. Using this reviewer, the team says it observed a scaling law in which stronger foundation models produce higher-quality generated papers.
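The ensembling step can be sketched in a few lines. The snippet below is an illustration under loose assumptions, not the team's implementation: `complete()` is a hypothetical stand-in for a foundation-model API call, and the acceptance threshold of 6 is chosen only to mirror the workshop score mentioned above.

```python
import json
from statistics import mean

def complete(prompt: str) -> str:
    # Hypothetical stand-in for a chat-completion API call; a real system
    # would send the prompt to a foundation model. Returns canned JSON here
    # so the sketch runs end to end.
    return json.dumps({"score": 6, "critique": "placeholder review"})

def review_once(paper_text: str) -> dict:
    # One independent review in the style of a conference review form.
    prompt = (
        "You are a NeurIPS reviewer. Following the official reviewer "
        "guidelines, score this paper from 1-10 and justify briefly. "
        "Reply as JSON with keys 'score' and 'critique'.\n\n" + paper_text
    )
    return json.loads(complete(prompt))

def area_chair_decision(paper_text: str, n_reviews: int = 5,
                        threshold: float = 6.0) -> dict:
    # Collect several independent reviews, then aggregate them into a
    # single decision, analogous to an Area Chair weighing a committee.
    reviews = [review_once(paper_text) for _ in range(n_reviews)]
    avg = mean(r["score"] for r in reviews)
    return {"average_score": avg,
            "decision": "accept" if avg >= threshold else "reject"}

print(area_chair_decision("...paper text..."))
# {'average_score': 6, 'decision': 'accept'}
```

Balanced accuracy, the headline metric, averages recall over the accept and reject classes, which avoids rewarding a reviewer that simply rejects everything on an acceptance-rate-imbalanced benchmark.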
Several limitations remain. The system can produce naive or underdeveloped ideas, struggle with methodological rigor and complex code implementation, and make errors such as inaccurate citations or duplicated figures in appendices. The current setup is also limited to computational experiments. The team argues that these weaknesses should be viewed alongside a broader trend in machine learning, where emerging capabilities can improve rapidly with scale and stronger core models.
The publication also frames the project as an ethical and institutional challenge for science. Risks include overwhelming peer-review systems and inflating research credentials with machine-generated output. In response, the team says it obtained IRB approval, withdrew accepted submissions, and watermarks all generated papers to make their origin clear. It also calls for community norms on how AI-generated research should be handled as such systems become more capable.
