Systematic review maps clinical impact of large language models in medicine

A large-scale, large language model-assisted review finds thousands of clinical-medicine papers on generative models since 2022, but only a small minority use real-world patient data or randomized trials. The study highlights overreliance on exam-style benchmarks, closed-source systems, and small samples, and proposes a tiered roadmap for more rigorous clinical evaluation.

Research on clinical use of large language models has accelerated since late 2022, but the underlying evidence base remains shallow and fragmented. Using an automated pipeline built on a frontier large language model, researchers searched PubMed, Embase and Scopus for records published between 1 January 2022 and 6 September 2025, starting from 23,614 records and deduplicating to 12,894 unique studies. Programmatic screening identified 4,609 studies as directly evaluating large language models on clinical tasks, and human-validated bootstrapping estimated the true number of eligible studies in this period at 4,361 (95% CI 3,838-4,906), or roughly 3.2 studies on large language models in clinical medicine published per day. Human audits showed that the screening model achieved high sensitivity (0.911; 95% CI 0.866-0.952) and specificity (0.921; 95% CI 0.892-0.949), with a Cohen's κ of 0.820 (95% CI 0.765-0.870) against tiebroken human labels.
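The audit statistics above, Cohen's κ against human labels and a bootstrap confidence interval for sensitivity, can be reproduced with a short script. This is a hedged sketch: the labels below are invented for illustration and are not the study's data, and the study's own analysis code is not published.

```python
import random

def cohens_kappa(a, b):
    """Cohen's kappa for two binary (0/1) label lists of equal length."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal positive rate
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    return (observed - expected) / (1 - expected)

def bootstrap_sensitivity_ci(truth, pred, n_boot=2000, seed=0):
    """Percentile-bootstrap 95% CI for sensitivity (recall on positives)."""
    rng = random.Random(seed)
    pairs = list(zip(truth, pred))
    stats = []
    for _ in range(n_boot):
        # Resample screening decisions with replacement
        sample = [rng.choice(pairs) for _ in pairs]
        positives = [(t, p) for t, p in sample if t == 1]
        if positives:
            stats.append(sum(p for _, p in positives) / len(positives))
    stats.sort()
    return stats[int(0.025 * len(stats))], stats[int(0.975 * len(stats))]

# Synthetic audit of 200 screening decisions (illustrative only)
truth = [1] * 90 + [0] * 110
pred  = [1] * 82 + [0] * 8 + [0] * 100 + [1] * 10
print(cohens_kappa(truth, pred))
print(bootstrap_sensitivity_ci(truth, pred))
```

On real audit data, `truth` would hold the tiebroken human labels and `pred` the model's screening decisions; the percentile bootstrap is one common way to obtain the kind of 95% intervals the paper reports.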

To assess methodological rigor, studies were assigned to a four-tier evidence framework spanning randomized, real-world deployments (Tier S), real clinical data analyses (Tier I), simulated but clinically relevant scenarios (Tier II), and exam-style or knowledge tests (Tier III). Human raters and the large language model showed good agreement in tiering, with an interhuman Cohen's κ of 0.645 (95% CI 0.560-0.726) and a model-human κ of 0.695 (95% CI 0.611-0.772). Bayesian modeling estimated that the 4,609 included studies comprise 1,048 (95% CI 847-1,252) Tier S/I studies, 1,857 (95% CI 1,427-2,280) Tier II studies and 1,704 (95% CI 1,273-2,134) Tier III studies, revealing a marked deficit of real-world and randomized evidence. Only 19 studies were confirmed as prospective randomized trials, and the earliest Tier S trial, published on 23 July 2024, reported that a custom cessation chatbot, QuitBot, achieved higher smoking cessation rates over 42 days than a National Cancer Institute text-line control (odds ratio 2.58, 95% CI 1.34-4.99; P = 0.005).
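Extrapolating tier counts from a small human-audited subsample to the full corpus, with credible intervals, can be sketched as below. This is a simplified stand-in under assumed choices (a flat Dirichlet prior and invented audit counts); the paper does not publish the details of its Bayesian model.

```python
import random

def tier_count_intervals(audit_counts, total, n_draws=5000, seed=0):
    """95% credible intervals for per-tier study counts in a corpus of
    `total` studies, given tier counts from a small audited subsample.
    Assumes a flat Dirichlet(1, ..., 1) prior on the tier proportions."""
    rng = random.Random(seed)
    draws = {tier: [] for tier in audit_counts}
    for _ in range(n_draws):
        # Dirichlet posterior sample via normalized Gamma draws
        g = {t: rng.gammavariate(c + 1, 1.0) for t, c in audit_counts.items()}
        s = sum(g.values())
        for t in audit_counts:
            draws[t].append(total * g[t] / s)
    intervals = {}
    for t, d in draws.items():
        d.sort()
        intervals[t] = (d[int(0.025 * n_draws)], d[int(0.975 * n_draws)])
    return intervals

# Hypothetical audit of 100 studies, scaled to the 4,609-study corpus
print(tier_count_intervals({"S/I": 23, "II": 40, "III": 37}, total=4609))
```

With wider audits the Gamma-normalization trick yields tighter intervals, which is why interval widths like 847-1,252 reflect the size of the human-labeled sample rather than the size of the corpus.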

The landscape is dominated by a small set of proprietary models and narrow task types. ChatGPT and related OpenAI models constitute 65.7% of evaluated systems, Gemini/Bard account for 13.1%, and only 12.3% of models are open-source. Patient-facing communication and education comprised 17% of tasks, followed by knowledge retrieval and question answering, and education, assessment and simulation. Across the 1,046 studies where comparative performance could be detected, large language models outperformed human comparators in 33.0% of studies, underperformed in 64.5% and showed mixed results in 2.49%; models outperformed humans significantly more often in Tier III exam-style settings than in Tier I real-data studies (38.4% versus 25.9%; P < 0.001). Performance depended strongly on the level of human experience: models outperformed attending physicians less frequently than unspecified medical doctors or medical students, and residents were outperformed 30% more often than attendings. Evidence quality is further constrained by limited data transparency and small samples: only 42.6% of 2,732 identifiable datasets were open-access, and among the 3,289 studies that reported a sample size, at least 25% of all included studies used fewer than 30 samples, so conclusions about model performance warrant cautious interpretation.

Specialty coverage is uneven: internal medicine is represented in 1,500 studies (32.5%), radiology in 743 (16.1%) and preventive medicine in 657 (14.2%), while many other medical and surgical fields remain comparatively understudied. A quarter of abstracts did not clearly describe their datasets, and a large share of evaluations relied on board and self-assessment questions, patient-facing FAQs, vignettes and guidelines rather than electronic health records or other real-world clinical data. The authors argue that the current literature overemphasizes knowledge-retrieval benchmarks that are weak proxies for clinical practice, and they outline a stepwise roadmap that starts from Tier III knowledge checks, progresses through Tier II simulations and Tier I real-data analyses, and culminates in Tier S randomized deployments. They also call for more work on open-source models and open datasets to ensure reproducibility, caution against studies that primarily compare systems to trainees rather than domain experts, and emphasize that future research should prioritize rigorous, patient-centered designs with adequate sample sizes before large language models are integrated into routine clinical care.

Impact Score: 54

Adobe plans outcome-based pricing for Artificial Intelligence agents

Adobe is positioning its Artificial Intelligence agents around performance-based pricing, charging only when the software completes useful work. The approach points to a more results-oriented model for selling generative Artificial Intelligence tools to business customers.

Tech firms commit billions to Artificial Intelligence infrastructure

Amazon, OpenAI, Nvidia, Meta, Google and others are signing increasingly large cloud, chip and data center agreements as demand for Artificial Intelligence infrastructure accelerates. The latest wave of deals spans investments, compute purchases, chip supply agreements and data center buildouts.

JEDEC outlines LPDDR6 expansion for data centers

JEDEC has previewed planned updates to LPDDR6 aimed at pushing the memory standard beyond mobile devices and into selected data center and accelerated computing use cases. The roadmap includes higher-capacity packaging options, flexible metadata support, 512 GB densities, and a new SOCAMM2 module standard.

TSMC debuts A13 process technology

TSMC has introduced its A13 process at its 2026 North America Technology Symposium as a tighter version of A14, aimed at next-generation Artificial Intelligence, high-performance computing, and mobile designs. The company positions the node as a more compact and efficient option with backward-compatible design rules for faster migration.
