This study systematically reviewed 1,012 publicly accessible summaries of safety and effectiveness (SSEDs) for artificial intelligence/machine learning (AI/ML)-enabled medical devices authorized by the U.S. Food and Drug Administration (FDA) through December 2024. The authors developed a 17-point AI Characteristics Transparency Reporting (ACTR) score to quantify disclosure across dataset, model, performance, and clinical reporting elements. Across all devices the mean ACTR score was 3.3 out of 17 (standard deviation 3.1), with annual means ranging from 1.1 to 4.0. The highest single-device ACTR was 12, and 304 devices (30%) scored zero. After publication of the FDA's 2021 Good Machine Learning Practice guidance, ACTR scores increased by 0.88 points (95% confidence interval, 0.54-1.23) after controlling for model complexity and predicate device use, but the absolute change remained small.
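The ACTR score is described as a 17-point count of disclosed transparency elements. A minimal sketch of such a checklist sum follows; the item names are hypothetical stand-ins, not the paper's published rubric.

```python
# Illustrative ACTR-style transparency score: a sum of binary checklist
# items spanning dataset, model, performance, and clinical reporting.
# These item names are assumptions for illustration only.
ACTR_ITEMS = [
    "training_data_source", "test_data_source", "training_sample_size",
    "test_sample_size", "demographics", "model_type", "model_inputs",
    "model_outputs", "sensitivity", "specificity", "auroc",
    "ppv", "npv", "accuracy", "clinical_study", "prospective_design",
    "subgroup_performance",
]  # 17 items in total


def actr_score(disclosures: dict) -> int:
    """Count how many of the 17 transparency items an SSED discloses."""
    return sum(1 for item in ACTR_ITEMS if disclosures.get(item, False))
```

A device summary disclosing nothing scores 0, matching the 30% of devices that scored zero; full disclosure of all 17 elements would score 17, well above the observed single-device maximum of 12.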
The review identified pervasive gaps in public reporting. Of the 1,016 devices on the FDA list, 1,012 had accessible SSEDs; 96.4% (n = 976) were cleared via the 510(k) pathway. Only 53.1% of devices reported a clinical study; among those, 60.5% used retrospective designs, 14% used prospective designs, and 75% reported a sample size. Dataset reporting was sparse: 93.3% did not report training data sources and 75.5% did not report testing data sources, while training dataset size was reported by only 9.4% (n = 95), test dataset size by 23.2% (n = 235), and demographics by 23.7% (n = 240). Performance metrics were absent from 51.6% of device summaries; the most commonly reported metrics were sensitivity (23.9%, n = 242) and specificity (21.7%, n = 220), with fewer devices reporting AUROC (10.9%, n = 110), positive predictive value (6.5%, n = 66), accuracy (6.4%, n = 65), and negative predictive value (5.3%, n = 54). Median reported discrimination metrics were high, but the authors caution that these may reflect optimistically designed premarket studies.
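The metrics most often reported in SSEDs (sensitivity, specificity, PPV, NPV, accuracy) are all derivable from a single 2x2 confusion matrix, which is one reason their frequent omission is notable. A minimal sketch with toy counts (not data from the study):

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard diagnostic metrics from a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),            # true positive rate
        "specificity": tn / (tn + fp),            # true negative rate
        "ppv": tp / (tp + fp),                    # positive predictive value
        "npv": tn / (tn + fn),                    # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Toy example: 90 true positives, 10 false positives,
# 10 false negatives, 90 true negatives.
m = confusion_metrics(tp=90, fp=10, fn=10, tn=90)
# every metric here works out to 0.9
```

Note that AUROC, by contrast, requires the full distribution of model scores rather than a single operating point, which may partly explain why it was reported even less often.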
The authors highlight consequences for generalizability and postmarket surveillance, noting that only 15 devices (1.5%) reported a predetermined change control plan and that 70.9% of 510(k) clearances exceeded the FDA's 90-day review target. ACTR scores correlated only weakly with time to clearance (Pearson r = 0.15). The paper documents modest improvement after the 2021 guidance but persistent under-reporting of model details, data provenance, and subgroup performance. The authors recommend enforceable, standardized public reporting, such as a machine-readable model card appended to each SSED, along with strengthened postmarket monitoring to support trust and equitable performance of deployed devices.
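A machine-readable model card of the kind the authors recommend could be a structured JSON document appended to each SSED. The sketch below is hypothetical: the field names and values are illustrative assumptions, not a published FDA schema.

```python
import json

# Hypothetical model card covering the transparency elements the review
# found most often missing: data provenance, sample sizes, demographics,
# performance metrics, subgroup reporting, and change control.
model_card = {
    "device_name": "Example AI Device",          # illustrative placeholder
    "clearance_pathway": "510(k)",
    "training_data": {"source": "multi-site retrospective", "n": 12000},
    "test_data": {
        "source": "held-out prospective cohort",
        "n": 3000,
        "demographics_reported": True,
    },
    "performance": {
        "sensitivity": 0.91,                     # toy values, not real results
        "specificity": 0.88,
        "auroc": 0.94,
        "subgroups_reported": ["age", "sex", "race"],
    },
    "predetermined_change_control_plan": True,
}

print(json.dumps(model_card, indent=2))
```

Serializing the card as JSON keeps it both human-readable and trivially parseable, which is what would allow regulators, researchers, and purchasers to audit disclosure programmatically rather than by manual review of SSED prose.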
