Systematic review maps clinical impact of large language models in medicine

A large-scale review, itself assisted by a large language model, finds thousands of clinical medicine papers on generative models since 2022, but only a small minority use real-world patient data or randomized trials. The study highlights overreliance on exam-style benchmarks, closed-source systems and small samples, and proposes a tiered roadmap for more rigorous clinical evaluation.

The use of large language models in clinical medicine has accelerated since late 2022, but the underlying evidence base remains shallow and fragmented. Using an automated pipeline built on a frontier large language model, researchers queried PubMed, Embase and Scopus for records published between 1 January 2022 and 6 September 2025, starting from 23,614 records and deduplicating to 12,894 unique studies. Programmatic screening identified 4,609 studies as directly evaluating large language models on clinical tasks, and human-validated bootstrapping estimated the true number of eligible studies in this period at 4,361 (95% CI 3,838-4,906), corresponding to approximately 3.2 studies on large language models in clinical medicine published per day. Human audits showed that the screening model achieved high sensitivity (0.911; 95% CI 0.866-0.952) and specificity (0.921; 95% CI 0.892-0.949), with a Cohen’s κ of 0.820 (95% CI 0.765-0.870) against tiebroken human labels.
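
To make the audit arithmetic concrete, the sketch below computes sensitivity, specificity and Cohen’s κ from a 2×2 confusion matrix of model screening decisions against human labels. It is illustrative only: the counts are hypothetical, chosen to land near the reported point estimates, and the function name is our own.

```python
# Minimal sketch of the screening-audit arithmetic. The confusion-matrix
# counts below are hypothetical, chosen only to land near the reported
# point estimates (sensitivity 0.911, specificity 0.921, kappa 0.820).

def screening_metrics(tp: int, fp: int, fn: int, tn: int):
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)   # eligible studies the model kept
    specificity = tn / (tn + fp)   # ineligible studies the model excluded
    p_observed = (tp + tn) / n     # raw agreement with human labels
    # Chance agreement from the two raters' marginal label frequencies.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return sensitivity, specificity, kappa

sens, spec, kappa = screening_metrics(tp=182, fp=16, fn=18, tn=184)
print(f"sensitivity={sens:.3f}  specificity={spec:.3f}  kappa={kappa:.3f}")
# -> sensitivity=0.910  specificity=0.920  kappa=0.830
```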

To assess methodological rigor, studies were assigned to a four-tier evidence framework spanning randomized, real-world deployments (Tier S), real clinical data analyses (Tier I), simulated but clinically relevant scenarios (Tier II), and exam-style or knowledge tests (Tier III). Human raters and the large language model showed good agreement in tiering, with an interhuman Cohen’s κ of 0.645 (95% CI 0.560-0.726) and a model-human κ of 0.695 (95% CI 0.611-0.772). Bayesian modeling estimated that, of the 4,609 included studies, 1,048 (95% CI 847-1,252) are Tier S/I, 1,857 (95% CI 1,427-2,280) are Tier II and 1,704 (95% CI 1,273-2,134) are Tier III, revealing a pronounced deficit of real-world and randomized evidence. Only 19 studies were confirmed as prospective randomized trials, and the earliest Tier S trial, published on 23 July 2024, reported that a custom cessation chatbot, QuitBot, achieved higher smoking cessation rates over 42 days than a National Cancer Institute text-line control (odds ratio 2.58, 95% CI 1.34-4.99; P = 0.005).
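
For context on the QuitBot result, the sketch below shows how a trial odds ratio and its Wald 95% confidence interval follow from a two-arm 2×2 outcome table. The quit counts are hypothetical placeholders, not the trial's actual data; the trial itself reported OR 2.58 (95% CI 1.34-4.99).

```python
import math

# Minimal sketch of an odds ratio with a Wald 95% CI from a 2x2 table.
# Counts are hypothetical placeholders; the QuitBot trial itself
# reported OR 2.58 (95% CI 1.34-4.99) versus a text-line control.

def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """a,b = quit/not-quit in the chatbot arm; c,d = quit/not-quit in control."""
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    log_or = math.log(or_)
    lo = math.exp(log_or - z * se_log)
    hi = math.exp(log_or + z * se_log)
    return or_, lo, hi

or_, lo, hi = odds_ratio_ci(a=53, b=247, c=24, d=276)
print(f"OR={or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```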

The landscape is dominated by a small set of proprietary models and narrow task types. ChatGPT and related OpenAI models constitute 65.7% of evaluated systems, Gemini/Bard account for 13.1%, and only 12.3% of models are open-source. Patient-facing communication and education comprised 17% of tasks, followed by knowledge retrieval and question answering, and education, assessment and simulation. Across the 1,046 studies in which comparative performance could be detected, large language models outperformed human comparators in 33.0% of studies, underperformed in 64.5% and showed mixed results in 2.49%; models outperformed humans significantly more often in Tier III exam-like settings than in Tier I real-data studies (38.4% versus 25.9%; P < 0.001). Performance depended strongly on the level of human experience: models outperformed attending physicians less frequently than unspecified medical doctors or medical students, and residents were outperformed 30% more often than attendings. Evidence quality is further constrained by limited data transparency and small samples: only 42.6% of 2,732 identifiable datasets were open-access, and among the 3,289 studies that reported sample size, at least 25% of all included studies had sample sizes below 30, implying that conclusions about model performance require cautious interpretation.
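
The Tier III versus Tier I contrast is a standard two-proportion comparison; a minimal sketch of such a test follows, with hypothetical study counts, since only the proportions (38.4% versus 25.9%) and P < 0.001 are reported above.

```python
import math

# Minimal sketch of a two-proportion z-test like the one that could
# underlie the Tier III vs Tier I comparison. Study counts are
# hypothetical; only 38.4% vs 25.9% and P < 0.001 are reported above.

def two_proportion_z(x1: int, n1: int, x2: int, n2: int):
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)              # pooled rate under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, via normal CDF
    return z, p_value

z, p = two_proportion_z(x1=192, n1=500, x2=104, n2=401)  # ~38.4% vs ~25.9%
print(f"z={z:.2f}  two-sided p={p:.1e}")
```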

Specialty coverage is uneven: internal medicine is represented in 1,500 studies (32.5%), radiology in 743 (16.1%) and preventive medicine in 657 (14.2%), while many other medical and surgical fields remain comparatively understudied. A quarter of abstracts did not clearly describe their datasets, and a large share of evaluations relied on board and self-assessment questions, patient-facing FAQs, vignettes and guidelines rather than electronic health records or other real-world clinical data. The authors argue that the current literature overemphasizes knowledge-retrieval benchmarks that are weak proxies for clinical practice, and they outline a stepwise roadmap starting from Tier III knowledge checks, progressing through Tier II simulations and Tier I real-data analyses, and culminating in Tier S randomized deployments. They also call for more work on open-source models and open datasets to ensure reproducibility, caution against studies that primarily compare systems with trainees rather than domain experts, and emphasize that future research should prioritize rigorous, patient-centered designs with adequate sample sizes before large language models are integrated into routine clinical care.
