Clinical use of large language models in medicine has accelerated since late 2022, but the underlying evidence base remains shallow and fragmented. Using an automated pipeline built on a frontier large language model, researchers scraped PubMed, Embase and Scopus for records published between 1 January 2022 and 6 September 2025, starting from 23,614 records and deduplicating to 12,894 unique studies. Programmatic screening identified 4,609 studies as directly evaluating large language models on clinical tasks, and human-validated bootstrapping estimated that the true number of eligible studies over this period was 4,361 (95% CI 3,838-4,906), corresponding to approximately 3.2 studies on large language models in clinical medicine published per day. Human audits showed that the screening model achieved high sensitivity (0.911; 95% CI 0.866-0.952) and specificity (0.921; 95% CI 0.892-0.949), with a Cohen’s κ of 0.820 (95% CI 0.765-0.870) against tie-broken human labels.
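To make the audit metrics and the per-day publication rate concrete, the sketch below recomputes sensitivity, specificity and Cohen's κ from a 2×2 confusion matrix and divides the estimated eligible-study count by the length of the search window. The audit cell counts are hypothetical values chosen only to land near the reported point estimates; they are not the paper's actual audit data.

```python
from datetime import date

# Hypothetical audit counts (model screening label vs. tie-broken human label);
# chosen only to land near the reported point estimates, not the paper's data.
tp, fn = 205, 20   # human-eligible studies the model kept / missed
tn, fp = 350, 30   # human-ineligible studies the model excluded / kept

sensitivity = tp / (tp + fn)   # ≈ 0.911
specificity = tn / (tn + fp)   # ≈ 0.921

# Cohen's kappa: observed agreement corrected for chance agreement.
n = tp + fn + tn + fp
p_obs = (tp + tn) / n
p_model_pos = (tp + fp) / n
p_human_pos = (tp + fn) / n
p_chance = p_model_pos * p_human_pos + (1 - p_model_pos) * (1 - p_human_pos)
kappa = (p_obs - p_chance) / (1 - p_chance)   # ≈ 0.82

# Publication rate implied by the abstract: estimated eligible studies divided
# by the number of days in the search window (1 Jan 2022 to 6 Sep 2025).
days = (date(2025, 9, 6) - date(2022, 1, 1)).days   # 1,344 days
rate_per_day = 4361 / days                          # ≈ 3.2 studies per day

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} kappa={kappa:.3f}")
print(f"≈{rate_per_day:.1f} eligible studies per day over {days} days")
```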
To assess methodological rigor, studies were assigned to a four-tier evidence framework spanning randomized, real-world deployments (Tier S), real clinical data analyses (Tier I), simulated but clinically relevant scenarios (Tier II), and exam-style knowledge tests (Tier III). Human raters and the large language model showed good agreement in tiering, with an inter-human Cohen’s κ of 0.645 (95% CI 0.560-0.726) and a model-human κ of 0.695 (95% CI 0.611-0.772). Bayesian modeling estimated that, of the 4,609 included studies, 1,048 (95% CI 847-1,252) are Tier S/I studies, 1,857 (95% CI 1,427-2,280) are Tier II studies and 1,704 (95% CI 1,273-2,134) are Tier III studies, revealing a substantial deficit of real-world and randomized evidence. Only 19 studies were confirmed as prospective randomized trials, and the earliest Tier S trial, published on 23 July 2024, reported that a custom smoking-cessation chatbot, QuitBot, achieved higher cessation rates over 42 days than a National Cancer Institute text-line control (odds ratio 2.58, 95% CI 1.34-4.99; P = 0.005).
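The QuitBot effect size is a standard 2×2 odds ratio, and a Woolf log-scale interval is one common way such a 95% CI is obtained. The cell counts below are hypothetical, chosen only to reproduce an odds ratio near the reported value, since the abstract does not give the underlying trial numbers; the sketch only illustrates the arithmetic.

```python
import math

# Hypothetical trial counts (quit vs. not quit, chatbot arm vs. text-line
# control); chosen only to yield an OR near 2.58, not the trial's actual data.
quit_bot, noquit_bot = 33, 187
quit_ctl, noquit_ctl = 14, 205

odds_ratio = (quit_bot * noquit_ctl) / (quit_ctl * noquit_bot)

# Woolf (log-scale) 95% confidence interval for the odds ratio.
se_log_or = math.sqrt(1/quit_bot + 1/noquit_bot + 1/quit_ctl + 1/noquit_ctl)
log_or = math.log(odds_ratio)
ci_low = math.exp(log_or - 1.96 * se_log_or)
ci_high = math.exp(log_or + 1.96 * se_log_or)

print(f"OR = {odds_ratio:.2f}, 95% CI {ci_low:.2f}-{ci_high:.2f}")
```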
The landscape is dominated by a small set of proprietary models and narrow task types. ChatGPT and related OpenAI models constitute 65.7% of evaluated systems, Gemini/Bard account for 13.1%, and only 12.3% of evaluated models are open-source. Patient-facing communication and education was the most common task type (17%), followed by knowledge retrieval and question answering, and by education, assessment and simulation. Across the 1,046 studies in which comparative performance against humans could be determined, large language models outperformed human comparators in 33.0% of studies, underperformed in 64.5% and showed mixed results in 2.5%, with humans being outperformed significantly more often in Tier III exam-like settings than in Tier I real-data studies (38.4% versus 25.9%; P < 0.001). Performance depended strongly on the level of human experience: models outperformed attending physicians less frequently than unspecified medical doctors or medical students, and residents were outperformed 30% more often than attendings. Evidence quality is further constrained by limited data transparency and small samples: only 42.6% of 2,732 identifiable datasets were open-access, and sample-size reporting (available for 3,289 studies) indicated that at least 25% of all included studies had a sample size below 30, meaning that conclusions about model performance require cautious interpretation.
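The Tier III versus Tier I contrast (38.4% versus 25.9%) is a difference of proportions; a two-proportion z-test is one way such a P value could arise, sketched below with hypothetical per-tier group sizes, since the abstract does not state the test used or the denominators.

```python
import math

# Hypothetical per-tier counts of studies in which the human comparator was
# outperformed; only the proportions (≈38.4% vs ≈25.9%) match the abstract.
wins_tier3, n_tier3 = 154, 400
wins_tier1, n_tier1 = 91, 350

p3, p1 = wins_tier3 / n_tier3, wins_tier1 / n_tier1
p_pool = (wins_tier3 + wins_tier1) / (n_tier3 + n_tier1)

# Two-sided two-proportion z-test on the pooled standard error.
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_tier3 + 1 / n_tier1))
z = (p3 - p1) / se
p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail probability

print(f"Tier III {p3:.1%} vs Tier I {p1:.1%}, z = {z:.2f}, P = {p_value:.4f}")
```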
Specialty coverage is uneven, with internal medicine represented in 1,500 studies (32.5%), radiology in 743 studies (16.1%) and preventive medicine in 657 studies (14.2%), while many other medical and surgical fields remain comparatively understudied. A quarter of abstracts did not clearly describe their datasets, and a large share of evaluations relied on board and self-assessment questions, patient-facing FAQs, vignettes and guidelines rather than electronic health records or other real-world clinical data. The authors argue that the current literature overemphasizes knowledge retrieval benchmarks that are weak proxies for clinical practice, and they outline a stepwise roadmap starting from Tier III knowledge checks, progressing through Tier II simulations, then Tier I real-data analyses, and ultimately Tier S randomized deployments. They also call for more work on open-source models and open datasets to ensure reproducibility, caution against studies that primarily compare systems to trainees rather than domain experts, and emphasize that future research should prioritize rigorous, patient-centered designs with adequate sample sizes before large language models are integrated into routine clinical care.
