Large language models require a new form of oversight: capability-based monitoring

The paper proposes capability-based monitoring for large language models in healthcare, organizing oversight around shared capabilities such as summarization, reasoning, translation, and safety guardrails. The authors argue this approach is more scalable than task-based monitoring inherited from traditional machine learning and can reveal systemic weaknesses and emergent behaviors across tasks.

Katherine Kellogg and coauthors from the Massachusetts Institute of Technology, Harvard, and Northeastern present an 18-page paper proposing capability-based monitoring for large language models used in healthcare. Dated 5 November 2025 and posted 14 November 2025, the paper critiques existing monitoring approaches, which are task-based and inherited from traditional machine learning. The authors note that task-based monitoring assumes performance degradation is driven by dataset drift, an assumption that does not reliably hold for generalist large language models, which were not trained for specific tasks or populations.

Capability-based monitoring reframes oversight around overlapping internal capabilities that models reuse across many downstream tasks. The paper highlights examples of such capabilities, including summarization, reasoning, translation, and safety guardrails, and argues that organizing monitoring around these shared abilities enables cross-task detection of systemic weaknesses, long-tail errors, and emergent behaviors that task-based approaches may miss. Focusing on capabilities lets organizations detect issues that propagate across multiple applications, rather than evaluating each downstream task in isolation, as sketched below.
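The paper does not prescribe an implementation, but the aggregation idea can be illustrated with a minimal sketch in Python. The task names, capability labels, and the `aggregate_by_capability` helper below are hypothetical assumptions for illustration, not the authors' method: each deployed task is mapped to the shared capabilities it exercises, and task-level error events are rolled up to capability-level error rates, so a weakness in, say, summarization surfaces once across every task that depends on it.

```python
from collections import defaultdict

# Hypothetical task -> capability mapping; in practice this would be
# curated by the deploying organization, not hard-coded.
TASK_CAPABILITIES = {
    "discharge_note_summary": ["summarization", "safety_guardrails"],
    "triage_assistant": ["reasoning", "safety_guardrails"],
    "patient_letter_es": ["translation", "summarization"],
}

def aggregate_by_capability(error_events):
    """Roll task-level error events up to the capabilities they exercise.

    error_events: iterable of (task_name, error_count, sample_count) tuples.
    Returns {capability: error_rate}, so a regression shared across tasks
    (e.g. degraded summarization) appears once, at the capability level.
    """
    errors = defaultdict(int)
    samples = defaultdict(int)
    for task, err, n in error_events:
        for cap in TASK_CAPABILITIES.get(task, []):
            errors[cap] += err
            samples[cap] += n
    return {cap: errors[cap] / samples[cap] for cap in samples if samples[cap]}

if __name__ == "__main__":
    # Illustrative counts only: (task, errors observed, outputs sampled).
    events = [
        ("discharge_note_summary", 12, 400),
        ("triage_assistant", 3, 250),
        ("patient_letter_es", 9, 300),
    ]
    for cap, rate in sorted(aggregate_by_capability(events).items()):
        print(f"{cap}: {rate:.3f}")
```

The design choice this sketch illustrates is that the unit of monitoring is the capability rather than the task: onboarding a new downstream application only requires registering its capability dependencies, which is what makes the approach scale across many uses of the same model.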

The authors describe considerations for implementation aimed at developers, organizational leaders, and professional societies. They propose that capability-based monitoring offers a scalable foundation for safe, adaptive, and collaborative oversight of large language models and future generalist artificial intelligence models in healthcare. The paper positions this approach as a practical organizing principle grounded in how these models are developed and used in practice, and as a way to enable broader detection and mitigation of failures that affect multiple clinical and operational use cases.
