Research on introspection and self-knowledge in large language models

Researchers are probing how large language models understand their own knowledge, behavior, and internal states, and how reliably they can report on themselves. Recent work spans calibration, situational awareness, introspective self-modeling, mechanistic interpretability, and debates about the limits of model self-reports.

Work on introspection and self-reports in large language models is rapidly expanding, with researchers asking whether and how Artificial Intelligence systems know things about their internal workings, and when they can report reliably on themselves. Early studies focused on calibration and self-knowledge, examining whether models can track what they know and do not know, and whether their expressed confidence corresponds to their accuracy. Kadavath et al. (2022), “Language Models (Mostly) Know What They Know,” showed that larger language models were reasonably well calibrated on multiple-choice and true/false questions, and could be trained to predict whether they would answer a given question correctly. Lin, Hilton & Evans (2022), “Teaching Models to Express Their Uncertainty in Words,” fine-tuned GPT-3 to express uncertainty in natural language using verbalized confidence levels, and found that this calibration training generalized moderately across contexts.
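As a concrete illustration of how such calibration is typically scored (a hedged sketch with made-up data, not code from either paper): bucket the model’s stated confidences, then compare each bucket’s mean confidence to its empirical accuracy, giving the expected calibration error (ECE).

```python
# Hypothetical records: the model's stated confidence on each question,
# paired with whether its answer was actually correct.
records = [
    (0.95, True), (0.90, True), (0.85, False), (0.70, True),
    (0.60, False), (0.55, True), (0.30, False), (0.20, False),
]

def expected_calibration_error(records, n_bins=5):
    """ECE: the sample-weighted gap between mean confidence and
    empirical accuracy within each confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in records:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / len(records)) * abs(mean_conf - accuracy)
    return ece

print(f"ECE: {expected_calibration_error(records):.3f}")  # 0 = perfectly calibrated
```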

Despite large capability gains, calibration remains inconsistent, with striking cases of both underconfidence and overconfidence familiar to many users. Barkan, Black, and Sourbut (2025), “Do Large Language Models Know What They Are Capable Of?”, found that then-current language models were “systematically overconfident but have better-than-random ability to discriminate between tasks they can and cannot accomplish…We also find that LLMs with greater general capability often have neither better-calibrated confidence nor better discriminatory power.” Related work on situational awareness looks at how models reason about their own position in the world and their deployment context; Laine et al. (2024), “The Situational Awareness Dataset (SAD),” evaluates whether models can answer questions and perform tasks that rely on facts about themselves, their outputs, and their environment.
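The calibration/discrimination distinction that Barkan, Black, and Sourbut draw can be made concrete: a model can be systematically overconfident (its mean stated confidence exceeds its success rate) while still ranking tasks it will solve above tasks it will not. A minimal sketch with hypothetical data, using scikit-learn’s AUROC:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical data: the model's stated probability of completing each task,
# and whether it actually succeeded.
stated_conf = [0.9, 0.9, 0.8, 0.8, 0.7, 0.6]
succeeded = [1, 1, 0, 1, 0, 0]

# Calibration gap: positive values indicate systematic overconfidence.
gap = sum(stated_conf) / len(stated_conf) - sum(succeeded) / len(succeeded)
print(f"Overconfidence gap: {gap:+.2f}")  # +0.28: overconfident on average

# Discrimination: AUROC > 0.5 means the model ranks tasks it can do
# above tasks it cannot, even if its absolute confidences are off.
print(f"AUROC: {roc_auc_score(succeeded, stated_conf):.2f}")  # 0.94
```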

A second cluster of research examines whether models have privileged access to information about their own behavioral tendencies, beyond what is explicit in their training data. Binder et al. (2024), “Looking Inward: Language Models Can Learn About Themselves by Introspection,” fine-tunes models to predict their own behavior in hypothetical scenarios and to extract properties of that behavior. Song, Hu & Mahowald (2025), “Language Models Fail to Introspect About Their Knowledge of Language,” argue that this measure is not stringent enough and may overstate introspective capabilities. Betley, Balesni, Hariharan, Meinke & Evans (2025), “Tell Me About Yourself: LLMs Are Aware of Their Learned Behaviors,” fine-tune models on implicit behavioral policies, such as making risk-seeking decisions or steering users toward certain words, and then ask the models to describe these policies without any in-context behavioral examples. They find that models can identify the trained patterns and even recognize backdoor-like behaviors.
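The self-prediction paradigm behind Binder et al.’s experiments can be sketched as a data-generation loop: observe the model’s actual behavior on a scenario, derive a property of that behavior as the training label, and train the model to answer a hypothetical question about itself. This is a schematic reconstruction, not their code; `query_model` stands in for whatever inference API is used.

```python
def query_model(prompt: str) -> str:
    """Placeholder for an actual model call (any chat/completion API)."""
    raise NotImplementedError

def build_self_prediction_example(scenario: str) -> dict:
    # 1. Observe the model's actual behavior on the scenario.
    behavior = query_model(scenario)
    # 2. Derive a property of that behavior to serve as the label,
    #    e.g. the second word of the response.
    label = behavior.split()[1]
    # 3. The training input asks a *hypothetical* question about the same
    #    scenario, with no behavioral examples in context.
    question = (
        f"Suppose you were given the prompt {scenario!r}. "
        "What would the second word of your response be?"
    )
    return {"input": question, "target": label}

# Fine-tuning on many such pairs and evaluating on held-out scenarios tests
# whether a model predicts its own behavior better than other models can.
```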

Building on this, Plunkett, Morris, Reddy & Morales (2025), “Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions,” fine-tune models on novel quantitative preferences, in the form of specific attribute weights for decisions, and then ask the models to report those weights; the models can do so, and training for self-report accuracy generalizes across decision contexts. Mechanistic interpretability work adds another angle. Lindsey (2025), “Emergent Introspective Awareness in Large Language Models,” injects concept vectors into model activations, such as the activation pattern associated with “bread,” and tests whether models can notice these interventions and exercise intentional control over their internal states; models can do both, albeit unreliably, with signs of improvement in more capable models. Lindsey et al. (2025), “On the Biology of a Large Language Model,” suggests that the mechanisms used to assess familiarity, such as a “do I know this entity?” circuit, are distinct from the mechanisms used to retrieve information, and that models are unaware of some of the complex procedures they use for mathematical tasks.
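The injection methodology Lindsey describes is a form of activation steering: add a concept direction to the residual stream at some layer, then ask the model whether it notices anything unusual. Below is a minimal PyTorch sketch of the mechanical part; the layer path, vector source, and `generate` helper are assumptions for illustration, not details from the paper.

```python
import torch

def make_injection_hook(concept_vector: torch.Tensor, strength: float = 8.0):
    """Forward hook that adds a scaled concept direction to a layer's
    output activations (the 'injected thought')."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * concept_vector  # broadcasts over positions
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Usage sketch (hypothetical names, not a real checkpoint):
#   layer = model.transformer.layers[20]       # some middle layer
#   vec = concept_vectors["bread"]             # e.g. a mean activation difference
#   handle = layer.register_forward_hook(make_injection_hook(vec))
#   reply = generate(model, "Do you notice an injected thought? What is it about?")
#   handle.remove()
```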

Ferrando et al. (2024), “Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models,” identifies internal circuits that distinguish entities the model knows about from those it does not, and shows that these influence refusals and other behaviors.

A set of overview and conceptual papers frames these findings within broader debates about introspection. The Stanford Encyclopedia of Philosophy entry on introspection, by Eric Schwitzgebel, offers background on introspective processes. Eleos AI Research (2025), “Why Model Self-Reports Are Insufficient – and Why We Studied Them Anyway,” argues that there are three central problems with taking model self-reports of consciousness and other states at face value: (1) current models probably lack the relevant states; (2) even if such states exist, there is no obvious introspective mechanism for reliably reporting them; and (3) even if such mechanisms exist, it is unclear whether the reports are genuinely introspective rather than artifacts of training. The authors nonetheless defend welfare interviews as scalable red-flag detectors and guides to improvement.

Kammerer & Frankish (2023), “What forms could introspective systems take? A research programme,” proposes a minimal definition of introspection as “a process by which a cognitive system represents its own current mental states, in a manner that allows the information to be used for online behavioural control.” Lindsey, Song and colleagues, and Comșa & Shanahan (2025) offer competing definitions, and there is no consensus on what “Artificial Intelligence introspection” should mean; some prefer labels like “self-modeling” or “self-knowledge” for access to behavioral facts rather than internal states. Perez & Long (2023), “Towards Evaluating AI Systems for Moral Status Using Self-Reports,” expresses hope that self-reports might eventually provide evidence about morally significant states if they can be made reliable, proposing to train models to answer verifiable questions about themselves and then test whether such introspective-like capacities generalize to harder, unverifiable questions. The discussion emphasizes that introspection and self-reports must be understood in light of the distinctive nature of language models, whose identities and personas may not map neatly onto human concepts, and it encourages integrating work on model identity and personas when evaluating what self-knowledge and introspection could mean for Artificial Intelligence systems.
