Katherine Kellogg and coauthors from the Massachusetts Institute of Technology, Harvard, and Northeastern present an 18-page paper proposing capability-based monitoring for large language models used in healthcare. Written 5 November 2025 and posted 14 November 2025, the paper critiques existing monitoring approaches, which are task-based and inherited from traditional machine learning. The authors note that task-based monitoring assumes performance degradation is driven by dataset drift, an assumption that does not reliably hold for generalist large language models, which were not trained for specific tasks or populations.
Capability-based monitoring reframes oversight around overlapping internal capabilities that models reuse across many downstream tasks. The paper highlights examples of such capabilities, including summarization, reasoning, translation, and safety guardrails, and argues that organizing monitoring around these shared abilities enables cross-task detection of systemic weaknesses, long-tail errors, and emergent behaviors that task-based approaches may miss. Focusing on capabilities lets organizations detect issues that propagate across multiple applications, rather than evaluating each downstream task in isolation.
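To make the contrast with task-based monitoring concrete, the idea can be sketched as pooling per-task error signals by the capabilities each task relies on, so a weakness in a shared capability surfaces even when no single task looks alarming on its own. This is a minimal illustration, not the paper's method: the task names, capability tags, counts, and 5% threshold are all assumptions invented for the example.

```python
from collections import defaultdict

# Hypothetical task -> capability mapping; the names are illustrative
# assumptions, not taken from the paper.
TASK_CAPABILITIES = {
    "discharge_summary": ["summarization", "safety_guardrails"],
    "visit_note_recap": ["summarization"],
    "triage_advice": ["reasoning", "safety_guardrails"],
    "patient_letter_es": ["translation", "summarization"],
}

def capability_error_rates(task_results):
    """Pool (errors, total) counts from each task into its capabilities.

    task_results maps task name -> (error_count, total_count).
    Returns capability -> pooled error rate across all tasks using it.
    """
    errors = defaultdict(int)
    totals = defaultdict(int)
    for task, (err, total) in task_results.items():
        for cap in TASK_CAPABILITIES[task]:
            errors[cap] += err
            totals[cap] += total
    return {cap: errors[cap] / totals[cap] for cap in totals}

def flag_capabilities(task_results, threshold=0.05):
    """Return capabilities whose pooled error rate exceeds the threshold."""
    rates = capability_error_rates(task_results)
    return sorted(cap for cap, rate in rates.items() if rate > threshold)

# Illustrative counts: discharge_summary alone (3%) would pass a per-task
# check, but pooled summarization errors (3+6+8)/300 ≈ 5.7% exceed 5%.
results = {
    "discharge_summary": (3, 100),
    "visit_note_recap": (6, 100),
    "triage_advice": (2, 100),
    "patient_letter_es": (8, 100),
}
print(flag_capabilities(results))  # → ['summarization', 'translation']
```

In this toy setup, a task-by-task review would likely miss the summarization weakness, while the capability-level pooling flags it because the signal accumulates across every task that reuses that capability.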
The authors describe implementation considerations aimed at developers, organizational leaders, and professional societies. They propose that capability-based monitoring offers a scalable foundation for safe, adaptive, and collaborative oversight of large language models and future generalist artificial intelligence models in healthcare. The paper positions this approach as a practical organizing principle grounded in how these models are developed and used in practice, and as a way to enable broader detection and mitigation of failures that affect multiple clinical and operational use cases.
