Evaluating transparency in Artificial Intelligence/machine learning model characteristics for FDA-reviewed medical devices

The authors reviewed publicly available FDA summaries for 1,012 Artificial Intelligence/machine learning-enabled medical devices to measure how well model development and performance were reported. They found low transparency across key elements and only modest improvement after the FDA's 2021 guidance.

This study systematically reviewed 1,012 publicly accessible summaries of safety and effectiveness (SSEDs) for Artificial Intelligence/machine learning-enabled medical devices authorized by the U.S. Food and Drug Administration through December 2024. The authors developed a 17-point AI Characteristics Transparency Reporting (ACTR) score to quantify disclosure across dataset, model, performance, and clinical reporting elements. Across all devices, the mean ACTR score was 3.3 out of 17 (standard deviation 3.1), with annual means ranging from a minimum of 1.1 to a maximum of 4.0. The single-device maximum ACTR was 12, and 304 devices (30%) scored zero. After publication of the FDA's 2021 Good Machine Learning Practice guidance, ACTR scores increased by 0.88 points (95% confidence interval, 0.54-1.23) after controlling for model complexity and predicate device use, but the absolute change was small.
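The ACTR score described above is a checklist tally: each disclosed element adds one point, up to 17. A minimal sketch of that scoring logic follows; the element names here are illustrative assumptions loosely drawn from the reporting gaps the summary discusses, not the paper's actual rubric.

```python
# Hypothetical sketch of a 17-element checklist score in the spirit of the
# ACTR score. Element names are assumptions for illustration only.
ACTR_ELEMENTS = [
    "training_data_source", "training_dataset_size", "testing_data_source",
    "test_dataset_size", "demographics_reported", "model_architecture",
    "clinical_study_reported", "study_design", "sample_size",
    "sensitivity", "specificity", "auroc", "ppv", "npv", "accuracy",
    "subgroup_performance", "change_control_plan",
]  # 17 binary transparency elements

def actr_score(summary: dict) -> int:
    """Count how many transparency elements a device summary discloses."""
    return sum(1 for element in ACTR_ELEMENTS if summary.get(element, False))

# A device reporting only test-set size, sensitivity, and specificity
# would score 3 of 17 under this sketch.
example_summary = {"test_dataset_size": True, "sensitivity": True, "specificity": True}
print(actr_score(example_summary))  # 3
```

A binary checklist like this makes the population-level findings (mean 3.3, 30% scoring zero) straightforward to compute by averaging scores across device summaries.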

The review identified pervasive gaps in public reporting. Of 1,016 devices on the FDA list, 1,012 had accessible SSEDs; 96.4% of devices were cleared via the 510(k) pathway (n = 976). Only 53.1% of devices reported a clinical study; among those, 60.5% used retrospective designs, 14% used prospective designs, and 75% reported a sample size. Reporting on datasets was sparse: 93.3% did not report training data sources, 75.5% did not report testing data sources, training dataset size was reported by only 9.4% (n = 95), test dataset size by 23.2% (n = 235), and demographics by 23.7% (n = 240). Performance metrics were absent in 51.6% of device summaries; the most commonly reported metrics were sensitivity (23.9%, n = 242) and specificity (21.7%, n = 220), with fewer reporting AUROC (10.9%, n = 110), positive predictive value (6.5%, n = 66), accuracy (6.4%, n = 65), and negative predictive value (5.3%, n = 54). Median reported discrimination metrics were high, but the authors caution these may reflect optimistic premarket study designs.

The authors highlight consequences for generalizability and postmarket surveillance, noting only 15 devices (1.5%) reported a predetermined change control plan and that 70.9% of 510(k) clearances exceeded the FDA’s 90-day review target. ACTR scores correlated weakly with time to clearance (Pearson r = 0.15). The paper documents modest improvements after guidance but persistent under-reporting of model, data, and subgroup performance. The authors recommend enforceable, standardized public reporting such as a machine-readable model card appended to SSEDs and strengthened postmarket monitoring to ensure trust and equitable performance in deployed medical devices.
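The authors' recommendation of a machine-readable model card appended to each SSED can be illustrated with a minimal sketch. The field names and values below are assumptions for illustration; the summary does not define a schema, and no standard one is implied here.

```python
# Minimal sketch of a machine-readable model card of the kind the authors
# propose appending to SSEDs. All field names and values are hypothetical.
import json

model_card = {
    "device_name": "ExampleCAD",  # hypothetical device
    "clearance_pathway": "510(k)",
    "training_data": {"source": "multi-site retrospective", "n": 12000},
    "test_data": {"source": "held-out external cohort", "n": 1500},
    "performance": {"sensitivity": 0.91, "specificity": 0.88, "auroc": 0.94},
    "subgroup_performance": {"age_over_65": {"sensitivity": 0.89}},
    "change_control_plan": True,
}

# Serializing to JSON makes the card parseable by registries and auditors.
print(json.dumps(model_card, indent=2))
```

A structured card like this would make the elements the review found missing (data sources, sample sizes, subgroup performance, change control plans) directly queryable rather than buried in free-text summaries.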

Business Artificial Intelligence innovation unveiled at SAP TechEd

At SAP TechEd, SAP announced a broad set of business Artificial Intelligence innovations spanning database capabilities, a new relational foundation model, agent tooling, and regional cloud infrastructure to support data protection and ethical deployment.
