OpenAI is making a focused move into scientific research with a dedicated OpenAI for Science team aimed at adapting its large language models into tools for working scientists. Led by vice president Kevin Weil, a former product chief at Twitter and Instagram who started his career in particle physics, the group is tasked with figuring out how models such as GPT-5 can meaningfully speed up discovery rather than simply function as generic productivity tools. Weil argues that as OpenAI pursues artificial general intelligence, one of the technology’s biggest positive impacts could be in areas like new medicines, materials, and devices, and in helping researchers think through open problems in fundamental science.
Weil says the inflection point came with OpenAI's reasoning models, which can break a problem into multiple steps and work through them sequentially. "You go back a few years and we were all collectively mind-blown that the models could get an 800 on the SAT," he recalls, but newer systems now perform at gold-medal level in the International Math Olympiad and tackle graduate-level physics. Measured against an industry benchmark known as GPQA, which comprises more than 400 multiple-choice questions testing PhD-level knowledge in biology, physics, and chemistry, GPT-4 scores 39%, well below the human-expert baseline of around 70%. According to OpenAI, GPT-5.2 (the latest update to the model, released in December) scores 92%. Weil describes these models as being at the frontier of human abilities in some areas, though he concedes they are not yet consistently producing groundbreaking new discoveries and stresses that the mission is to accelerate science rather than achieve Einstein-level revolutions.
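In outline, a GPQA-style score is simply accuracy on graded multiple-choice questions. The sketch below shows that shape using the public OpenAI Python SDK; the model name and the toy question are placeholders for illustration, not the actual GPQA items or OpenAI's evaluation harness.

```python
# Minimal sketch of a GPQA-style multiple-choice evaluation: the model picks
# one of four options per question, and the score is the fraction it gets right.
# The model name and question data are placeholders, not the real benchmark.
import re
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-5"  # assumed model name; substitute whatever model is being tested

# Toy stand-in for a GPQA item: a question, four options, and the correct index.
QUESTIONS = [
    {
        "question": "Which particle mediates the strong nuclear force?",
        "options": ["Photon", "Gluon", "W boson", "Graviton"],
        "answer": 1,  # index of the correct option (Gluon)
    },
]


def pick_option(item: dict) -> int:
    """Ask the model for a single-letter answer and return the chosen index."""
    letters = "ABCD"
    prompt = (
        item["question"]
        + "\n"
        + "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(item["options"]))
        + "\nReply with a single letter."
    )
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    match = re.search(r"[ABCD]", resp.choices[0].message.content.upper())
    return letters.index(match.group()) if match else -1


correct = sum(pick_option(q) == q["answer"] for q in QUESTIONS)
print(f"Accuracy: {correct / len(QUESTIONS):.0%}")
```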
Case studies published by OpenAI and interviews with researchers suggest that GPT-5’s most immediate value lies in surfacing overlooked prior work, sketching proofs, and designing experiments. Vanderbilt physicist Robert Scherrer reports that GPT-5 Pro, a premium subscription service, solved a problem he and a graduate student had worked on unsuccessfully for several months, while biologist Derya Unutmaz used GPT-5 to reanalyze an old immune-system data set and obtain fresh interpretations. Statistician Nikita Zhivotovskiy likens large language models to foundational tools such as computers and the internet and predicts a long-term disadvantage for scientists who do not adopt them, though he notes that the models mainly recombine existing results and rarely yield ideas that stand alone as publishable innovations. Other scientists are more cautious: chemist Andy Cooper, who is building a so-called artificial intelligence scientist to automate workflows, says large language models have not yet fundamentally changed how his lab does science, though they are starting to prove useful for tasks such as directing robots.
The push into science is happening amid concerns about overclaiming and subtle errors. In one episode, senior OpenAI figures deleted social posts after mathematicians pointed out that GPT-5’s supposed new math solutions were actually rediscoveries of existing work, including papers in German. In another, quantum physicist Jonathan Oppenheim criticized a paper in Physics Letters B where GPT-5’s core proposed test targeted nonlocal theories instead of the nonlinear ones requested, likening it to asking for a COVID test and receiving a chickenpox test. Researchers warn that large language models can exude a flattering, overconfident tone that leads users to drop their guard, with one non-scientist even being convinced that he had created a new branch of mathematics. Weil acknowledges hallucinations but frames a high rate of wrong suggestions as acceptable when models are used like brainstorming partners, echoing a colleague’s comment that being wrong 90% of the time can be part of productive research discussions.
To make the technology safer and more reliable in scientific settings, OpenAI is experimenting with dialing down GPT-5's assertiveness and building workflows in which one instance of the model critiques another before delivering an answer. Weil describes a setup where "you can kind of hook the model up as its own critic," creating a loop in which a second model flags problems and sends improved drafts back, similar in spirit to Google DeepMind's AlphaEvolve, which wraps its Gemini model in a system that filters and refines outputs. Rival offerings from Google DeepMind and Anthropic mean OpenAI faces strong competition for scientists' attention, and OpenAI for Science is partly a way of staking a claim in a rapidly developing field. "I think 2026 will be for science what 2025 was for software engineering," Weil predicts, arguing that within a year, scientists who are not heavily using artificial intelligence in their work will be missing a chance to increase both the pace and quality of their thinking.
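The critic loop Weil describes can be approximated in a few lines with the public OpenAI Python SDK: one call drafts an answer, a second call reviews it, and the draft is revised until the reviewer signs off. The model name, prompts, and stopping rule below are illustrative assumptions, not OpenAI's internal setup.

```python
# Minimal sketch of the "model as its own critic" loop: draft, critique, revise.
# Model name, prompts, and the APPROVED convention are assumptions for illustration.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-5"  # assumed name; substitute any available reasoning model


def ask(prompt: str) -> str:
    """Send a single prompt and return the model's text reply."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def answer_with_self_critique(question: str, rounds: int = 2) -> str:
    """Draft an answer, have a second model instance critique it,
    and revise until the critic approves or the round budget runs out."""
    draft = ask(f"Answer the following research question:\n{question}")
    for _ in range(rounds):
        critique = ask(
            "You are a skeptical reviewer. List concrete errors, unsupported "
            "claims, or gaps in this answer, or reply APPROVED.\n\n"
            f"Question: {question}\n\nAnswer: {draft}"
        )
        if critique.strip().startswith("APPROVED"):
            break
        draft = ask(
            "Revise the answer to address this critique.\n\n"
            f"Question: {question}\n\nAnswer: {draft}\n\nCritique: {critique}"
        )
    return draft


if __name__ == "__main__":
    print(answer_with_self_critique(
        "Does this proof of the lemma still hold in the degenerate case?"
    ))
```

Production systems layer filtering and verification on top of this basic pattern, much as AlphaEvolve does around Gemini, but the core idea is the same: a second pass that pushes back before an answer reaches the scientist.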
