Research on introspection and self-knowledge in large language models

Researchers are probing how large language models understand their own knowledge, behavior, and internal states, and how reliably they can report on themselves. Recent work spans calibration, situational awareness, introspective self-modeling, mechanistic interpretability, and debates about the limits of model self-reports.

Work on introspection and self-reports in large language models is rapidly expanding, with researchers asking whether, and how, Artificial Intelligence systems know things about their own internal workings and when they can report reliably on themselves. Early studies focused on calibration and self-knowledge, examining whether models can track what they know and do not know and whether their expressed confidence corresponds to accuracy. Kadavath et al. (2022), “Language Models (Mostly) Know What They Know,” showed that larger language models were reasonably well calibrated on multiple-choice and true/false questions and could be fine-tuned to be more accurate at this self-assessment. Lin, Hilton & Evans (2022), “Teaching Models to Express Their Uncertainty in Words,” fine-tuned GPT-3 to express uncertainty in natural language (e.g., saying “maybe”), and this calibration training generalized moderately across contexts.

Despite large capability gains, calibration remains inconsistent, with striking cases of both underconfidence and overconfidence familiar to many users. Barkan, Black, and Sourbut (2025), “Do Large Language Models Know What They Are Capable Of?” found that then-current language models were “systematically overconfident but have better-than-random ability to discriminate between tasks they can and cannot accomplish…We also find that LLMs with greater general capability often have neither better-calibrated confidence nor better discriminatory power.” Related work on situational awareness looks at how models reason about their own position in the world and deployment context; Laine et al. (2024), “The Situational Awareness Dataset (SAD)” evaluates whether models can answer questions and perform tasks that rely on facts about themselves, their outputs, and their environment.
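The calibration studies above compare a model's expressed confidence to its empirical accuracy; a common summary statistic for this is expected calibration error (ECE). The sketch below is illustrative only (the binning scheme and toy data are assumptions, not taken from any of the cited papers):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence, then average the gap
    between per-bin accuracy and per-bin mean confidence,
    weighted by bin size."""
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # half-open bins (lo, hi], with 0.0 included in the first bin
        in_bin = [i for i, c in enumerate(confidences)
                  if (lo < c <= hi) or (b == 0 and c == lo)]
        if not in_bin:
            continue
        acc = sum(correct[i] for i in in_bin) / len(in_bin)
        conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        ece += (len(in_bin) / total) * abs(acc - conf)
    return ece

# Well-calibrated toy data: 0.8 stated confidence, 80% actually correct
confs = [0.8] * 10
hits = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
print(round(expected_calibration_error(confs, hits), 6))  # → 0.0
```

An overconfident model (say, 0.9 stated confidence but 50% accuracy) would score an ECE of about 0.4 here, matching the "systematically overconfident" pattern described above.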

A second cluster of research examines whether models have privileged access to information about their own behavioral tendencies, beyond what is explicitly in training data. Binder et al. (2024), “Looking Inward: Language Models Can Learn About Themselves by Introspection,” fine-tunes models to predict their own behavior in hypothetical scenarios and to extract properties of that behavior. Song, Hu & Mahowald (2025), “Language Models Fail to Introspect About Their Knowledge of Language,” argue that this measure is not stringent enough and may overstate capabilities. Betley, Balesni, Hariharan, Meinke & Evans (2025), “Tell Me About Yourself: LLMs Are Aware of Their Learned Behaviors,” fine-tune models on implicit behavioral policies, such as making risk-seeking decisions or steering users toward certain words, then ask them to describe these policies without in-context behavioral examples, and find that models can identify these patterns and even recognize backdoor-like behaviors.

Building on this, Plunkett, Morris, Reddy & Morales (2025), “Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions,” fine-tune models on novel quantitative preferences in the form of specific attribute weights for decisions, and then ask models to report those weights, showing that models can do this and that training on self-report accuracy generalizes across decision contexts. Mechanistic interpretability work adds another angle. Lindsey (2025), “Emergent Introspective Awareness in Large Language Models,” injects concept vectors into model activations, such as patterns associated with “bread,” and tests whether models can notice these interventions and exercise intentional control over internal states; models can do both, albeit unreliably, with signs of improvement at larger scales. Lindsey et al. (2025), “Biology of a Large Language Model,” suggests that mechanisms used to assess familiarity, such as a “do I know this entity?” circuit, are distinct from mechanisms used to retrieve information, and that models are not aware of some of the complex procedures they use for mathematical tasks.
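The injection experiments can be pictured schematically: a "concept vector" (for instance, the mean activation difference between concept-related and neutral prompts) is added to a layer's hidden states during a forward pass. The sketch below is purely illustrative, with a random linear "layer" standing in for a real model; the vector construction and scaling factor are assumptions modeled on common activation-steering setups, not the paper's actual procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hidden dimension of the toy "layer"

# Toy stand-ins: activations for concept prompts vs. neutral prompts
concept_acts = rng.normal(size=(32, d)) + 2.0  # e.g., "bread"-related
neutral_acts = rng.normal(size=(32, d))

# Steering vector: difference of means between the two prompt sets
steer = concept_acts.mean(axis=0) - neutral_acts.mean(axis=0)

def forward(hidden, inject=False, alpha=4.0):
    """One toy layer; optionally add the scaled concept vector first."""
    if inject:
        hidden = hidden + alpha * steer
    return np.tanh(hidden)  # placeholder nonlinearity

h = rng.normal(size=d)
plain = forward(h)
steered = forward(h, inject=True)

# The intervention shifts the output in the direction of the concept
print(np.dot(steered - plain, steer) > 0)
```

The experimental question is then whether the model itself can verbally report that such a shift occurred, which is what distinguishes introspection tests from ordinary steering.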

Ferrando et al. (2024), “Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models,” identifies internal circuits that distinguish entities the model knows about from those it does not, influencing refusals and other behaviors. A set of overview and conceptual papers frames these findings within broader debates about introspection. The Stanford Encyclopedia of Philosophy entry on introspection by Eric Schwitzgebel offers background on introspective processes. Eleos AI Research (2025), “Why Model Self-Reports Are Insufficient – and Why We Studied Them Anyway,” argues that there are three central problems with taking model self-reports of consciousness and other states at face value: current models probably lack the relevant states; even if such states exist, there is no obvious introspective mechanism that could reliably report on them; and even if such mechanisms exist, it is unclear whether reports are genuinely introspective rather than artifacts of training. The authors nevertheless defend welfare interviews as scalable red-flag detectors and guides to improvement.

Kammerer & Frankish (2023), “What forms could introspective systems take? A research programme,” proposes a minimal definition of introspection as “a process by which a cognitive system represents its own current mental states, in a manner that allows the information to be used for online behavioural control.” Lindsey, Song and colleagues, and Comșa & Shanahan (2025) offer competing definitions, and there is no consensus on what “Artificial Intelligence introspection” should mean; some prefer labels like “self-modeling” or “self-knowledge” for access to behavioral facts rather than internal states. Perez & Long (2023), “Towards Evaluating AI Systems for Moral Status Using Self-Reports,” expresses hope that self-reports might eventually provide evidence about morally significant states if made reliable, proposing to train models to answer verifiable questions about themselves and then test whether such introspective-like capacities generalize to harder, unverifiable questions. The discussion emphasizes that introspection and self-reports must be understood in light of the distinctive nature of language models, whose identities and personas may not map neatly onto human concepts, and encourages integrating work on model identity and personas when evaluating what self-knowledge and introspection could mean for Artificial Intelligence systems.

Impact Score: 68

Mustafa Suleyman says Artificial Intelligence compute growth is still accelerating

Mustafa Suleyman argues that Artificial Intelligence development is being propelled by simultaneous advances in chips, memory, networking, and software efficiency rather than nearing a hard limit. He contends that rising compute capacity and falling deployment costs will push systems beyond chatbots toward more capable agents.

China and the US are leading different Artificial Intelligence races

The US leads in large language models and advanced chips, while China has built a major advantage in robotics and humanoid manufacturing. That balance is shifting as Chinese developers narrow the gap in model performance and both countries push to combine software and machines.

Congress weighs Artificial Intelligence transparency rules

Bipartisan lawmakers are pushing a federal transparency standard for the largest Artificial Intelligence models as Congress works on a broader national framework. The proposal aims to increase public trust while avoiding stricter state-by-state requirements and heavier regulation.

Report finds California creative job losses are not driven by Artificial Intelligence

New research from Otis College of Art and Design finds California’s recent creative industry job losses stem from cost pressures and structural shifts, not direct worker displacement by generative Artificial Intelligence. The technology is changing workflows and expectations, but it is largely replacing tasks rather than entire jobs.

U.S. senators propose broader chip tool export ban for Chinese firms

A bipartisan proposal in the U.S. Senate would shift semiconductor equipment controls from specific fabs to targeted Chinese companies and their affiliates. The measure is aimed at cutting off access to advanced lithography and other wafer fabrication tools for firms such as Huawei, SMIC, YMTC, CXMT, and Hua Hong.
