Research on introspection and self-knowledge in large language models

Researchers are probing how large language models understand their own knowledge, behavior, and internal states, and how reliably they can report on themselves. Recent work spans calibration, situational awareness, introspective self-modeling, mechanistic interpretability, and debates about the limits of model self-reports.

Work on introspection and self-reports in large language models is rapidly expanding, with researchers asking whether, and how, Artificial Intelligence systems know things about their own internal workings and when they can report reliably on themselves. Early studies focused on calibration and self-knowledge, examining whether models can track what they know and do not know and whether their expressed confidence corresponds to accuracy. Kadavath et al. (2022), “Language Models (Mostly) Know What They Know,” showed that larger language models were reasonably well calibrated on multiple-choice and true/false questions and could be fine-tuned to be more accurate at this self-assessment. Lin, Hilton & Evans (2022), “Teaching Models to Express Their Uncertainty in Words,” fine-tuned GPT-3 to express uncertainty in natural language (e.g., saying “maybe”), and this calibration training generalized moderately across contexts.

Despite large capability gains, calibration remains inconsistent, with striking cases of both underconfidence and overconfidence familiar to many users. Barkan, Black, and Sourbut (2025), “Do Large Language Models Know What They Are Capable Of?” found that then-current language models were “systematically overconfident but have better-than-random ability to discriminate between tasks they can and cannot accomplish…We also find that LLMs with greater general capability often have neither better-calibrated confidence nor better discriminatory power.” Related work on situational awareness looks at how models reason about their own position in the world and deployment context; Laine et al. (2024), “The Situational Awareness Dataset (SAD)” evaluates whether models can answer questions and perform tasks that rely on facts about themselves, their outputs, and their environment.
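The calibration studies above compare a model's expressed confidence to its empirical accuracy; a common summary statistic for this is expected calibration error (ECE). The sketch below is illustrative only (the binning scheme and toy data are assumptions, not taken from any of the cited papers):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence, then average the gap
    between per-bin accuracy and per-bin mean confidence,
    weighted by bin size."""
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # half-open bins (lo, hi], with 0.0 included in the first bin
        in_bin = [i for i, c in enumerate(confidences)
                  if (lo < c <= hi) or (b == 0 and c == lo)]
        if not in_bin:
            continue
        acc = sum(correct[i] for i in in_bin) / len(in_bin)
        conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        ece += (len(in_bin) / total) * abs(acc - conf)
    return ece

# Well-calibrated toy data: 0.8 stated confidence, 80% actually correct
confs = [0.8] * 10
hits = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
print(round(expected_calibration_error(confs, hits), 6))  # → 0.0
```

An overconfident model (say, 0.9 stated confidence but 50% accuracy) would score an ECE of about 0.4 here, matching the "systematically overconfident" pattern described above.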

A second cluster of research examines whether models have privileged access to information about their own behavioral tendencies, beyond what is explicitly in training data. Binder et al. (2024), “Looking Inward: Language Models Can Learn About Themselves by Introspection,” fine-tunes models to predict their own behavior in hypothetical scenarios and to extract properties of that behavior. Song, Hu & Mahowald (2025), “Language Models Fail to Introspect About Their Knowledge of Language,” argue that this measure is not stringent enough and may overstate capabilities. Betley, Balesni, Hariharan, Meinke & Evans (2025), “Tell Me About Yourself: LLMs Are Aware of Their Learned Behaviors,” fine-tune models on implicit behavioral policies, such as making risk-seeking decisions or steering users toward certain words, then ask them to describe these policies without in-context behavioral examples, and find that models can identify these patterns and even recognize backdoor-like behaviors.

Building on this, Plunkett, Morris, Reddy & Morales (2025), “Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions,” fine-tune models on novel quantitative preferences in the form of specific attribute weights for decisions, and then ask models to report those weights, showing that models can do this and that training on self-report accuracy generalizes across decision contexts. Mechanistic interpretability work adds another angle. Lindsey (2025), “Emergent Introspective Awareness in Large Language Models,” injects concept vectors into model activations, such as patterns associated with “bread,” and tests whether models can notice these interventions and exercise intentional control over internal states; models can do both, albeit unreliably, with signs of improvement at larger scales. Lindsey et al. (2025), “Biology of a Large Language Model,” suggests that mechanisms used to assess familiarity, such as a “do I know this entity?” circuit, are distinct from mechanisms used to retrieve information, and that models are not aware of some of the complex procedures they use for mathematical tasks.
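The injection experiments can be pictured schematically: a "concept vector" (for instance, the mean activation difference between concept-related and neutral prompts) is added to a layer's hidden states during a forward pass. The sketch below is purely illustrative, with a random linear "layer" standing in for a real model; the vector construction and scaling factor are assumptions modeled on common activation-steering setups, not the paper's actual procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hidden dimension of the toy "layer"

# Toy stand-ins: activations for concept prompts vs. neutral prompts
concept_acts = rng.normal(size=(32, d)) + 2.0  # e.g., "bread"-related
neutral_acts = rng.normal(size=(32, d))

# Steering vector: difference of means between the two prompt sets
steer = concept_acts.mean(axis=0) - neutral_acts.mean(axis=0)

def forward(hidden, inject=False, alpha=4.0):
    """One toy layer; optionally add the scaled concept vector first."""
    if inject:
        hidden = hidden + alpha * steer
    return np.tanh(hidden)  # placeholder nonlinearity

h = rng.normal(size=d)
plain = forward(h)
steered = forward(h, inject=True)

# The intervention shifts the output in the direction of the concept
print(np.dot(steered - plain, steer) > 0)
```

The experimental question is then whether the model itself can verbally report that such a shift occurred, which is what distinguishes introspection tests from ordinary steering.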

Ferrando et al. (2024), “Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models,” identifies internal circuits that distinguish entities the model knows about from those it does not, influencing refusals and other behaviors. A set of overview and conceptual papers frames these findings within broader debates about introspection. The Stanford Encyclopedia of Philosophy entry on introspection by Eric Schwitzgebel offers background on introspective processes. Eleos AI Research (2025), “Why Model Self-Reports Are Insufficient – and Why We Studied Them Anyway,” argues that there are three central problems with taking model self-reports of consciousness and other states at face value: current models probably lack the relevant states; even if such states exist, there is no obvious introspective mechanism that could reliably report on them; and even if such mechanisms exist, it is unclear whether reports are genuinely introspective rather than artifacts of training. The authors nevertheless defend welfare interviews as scalable red-flag detectors and guides to improvement.

Kammerer & Frankish (2023), “What forms could introspective systems take? A research programme,” proposes a minimal definition of introspection as “a process by which a cognitive system represents its own current mental states, in a manner that allows the information to be used for online behavioural control.” Lindsey, Song and colleagues, and Comșa & Shanahan (2025) offer competing definitions, and there is no consensus on what “Artificial Intelligence introspection” should mean; some prefer labels like “self-modeling” or “self-knowledge” for access to behavioral facts rather than internal states. Perez & Long (2023), “Towards Evaluating AI Systems for Moral Status Using Self-Reports,” expresses hope that self-reports might eventually provide evidence about morally significant states if made reliable, proposing to train models to answer verifiable questions about themselves and then test whether such introspective-like capacities generalize to harder, unverifiable questions. The discussion emphasizes that introspection and self-reports must be understood in light of the distinctive nature of language models, whose identities and personas may not map neatly onto human concepts, and encourages integrating work on model identity and personas when evaluating what self-knowledge and introspection could mean for Artificial Intelligence systems.

Impact Score: 68

Mustafa Suleyman says Artificial Intelligence compute growth is still accelerating

Mustafa Suleyman argues that Artificial Intelligence development is being propelled by simultaneous advances in chips, memory, networking, and software efficiency rather than nearing a hard limit. He contends that rising compute capacity and falling deployment costs will push systems beyond chatbots toward more capable agents.

China and the US are leading different Artificial Intelligence races

The US leads in large language models and advanced chips, while China has built a major advantage in robotics and humanoid manufacturing. That balance is shifting as Chinese developers narrow the gap in model performance and both countries push to combine software and machines.

Congress weighs Artificial Intelligence transparency rules

Bipartisan lawmakers are pushing a federal transparency standard for the largest Artificial Intelligence models as Congress works on a broader national framework. The proposal aims to increase public trust while avoiding stricter state-by-state requirements and heavier regulation.

Report finds California creative job losses are not driven by Artificial Intelligence

New research from Otis College of Art and Design finds California’s recent creative industry job losses stem from cost pressures and structural shifts, not direct worker displacement by generative Artificial Intelligence. The technology is changing workflows and expectations, but it is largely replacing tasks rather than entire jobs.

U.S. senators propose broader chip tool export ban for Chinese firms

A bipartisan proposal in the U.S. Senate would shift semiconductor equipment controls from specific fabs to targeted Chinese companies and their affiliates. The measure is aimed at cutting off access to advanced lithography and other wafer fabrication tools for firms such as Huawei, SMIC, YMTC, CXMT, and Hua Hong.
