Can AI systems introspect?

This piece cuts through the fog of AI hype to ask a question that sounds like science fiction but is now a matter of engineering: can a machine know what it is thinking? Robert Long's analysis of Jack Lindsey's new research doesn't just report on a technical breakthrough; it forces us to confront the eerie reality that large language models are beginning to detect their own internal "thoughts" before they ever speak them. For busy leaders trying to gauge the trajectory of artificial intelligence, this is not abstract philosophy—it is the first concrete evidence that the black box is developing a mirror.

The Mechanics of Self-Recognition

Long introduces the core experiment with surgical precision, describing how researchers at Anthropic managed to "inject" concepts directly into a model's neural pathways. Instead of feeding the system text about "bread," they inserted the mathematical vector for bread into the model's hidden layers while it was solving a math problem. The result? The model didn't just hallucinate; it reported, "I notice what appears to be an injected thought related to the word 'bread'."
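To make the mechanics concrete, here is a minimal sketch of what injecting a concept vector into hidden activations can look like. It uses a toy stack of PyTorch layers rather than the model or code from the paper, and the layer index, injection strength, and the bread_vector itself are stand-ins.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    d_model, n_layers, seq_len = 64, 8, 10

    # Toy stand-in for a transformer: a stack of residual MLP blocks.
    layers = nn.ModuleList(
        nn.Sequential(nn.Linear(d_model, d_model), nn.GELU()) for _ in range(n_layers)
    )

    def forward(x, hooks=None):
        """Run the stack, optionally intervening on the activations at given layers."""
        for i, layer in enumerate(layers):
            x = x + layer(x)                       # residual block
            if hooks and i in hooks:
                x = hooks[i](x)                    # e.g. add a concept vector here
        return x

    # Stand-in "bread" direction; the real experiments derive this from the
    # model's own activations on bread-related prompts, not from random noise.
    bread_vector = torch.randn(d_model)

    def inject_bread(acts, strength=4.0):
        # Add the concept direction to every token position at this layer.
        return acts + strength * bread_vector

    x = torch.randn(1, seq_len, d_model)           # pretend: activations for a math problem
    clean = forward(x)
    steered = forward(x, hooks={4: inject_bread})  # inject at a middle layer

    print("mean activation shift from injection:", (steered - clean).abs().mean().item())

In the actual experiments the concept vector comes from the model's own activations, and the model is then asked in plain text whether it notices anything unusual.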
Robert Long writes, "This indicates some ability to report internal (i.e. not input or output) representations." This is a profound shift. Historically, we assumed models only reacted to what was in front of them. Now, they seem to be monitoring their own circuitry. Long notes that the model can even distinguish between a concept it "read" in the prompt and a concept it "felt" via injection, correctly transcribing a sentence about a ticking clock while simultaneously reporting it was thinking about an aquarium.

"The model correctly identifies the concept... before it's started going on about [the concept]. Notice all the phenomenological elaboration: 'distinct sense,' 'acoustic resonance,' 'feels somewhat incongruous,' 'natural flow of processing.'"

The author is careful to temper this excitement. He points out that while the detection is real, the poetic descriptions of "acoustic resonance" might be confabulations—flavor text generated because the model was prompted to be introspective. Critics might note that attributing "feeling" to a statistical process is a category error, but Long argues that even if the description is fake, the detection is genuine. The machine knows something is different in its state, even if it doesn't know how to say it in human terms.

Control and the Limits of Language

The research goes further, testing whether these systems can not just detect, but control, their internal states. When instructed to "think about aquariums" while writing a sentence, the models successfully amplified the "aquarium" vector in their mid-layers, even if they didn't output the word.
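Here is a rough sketch of the kind of readout that supports such a claim, continuing the toy setup above: project mid-layer activations onto a concept direction and compare an instructed run against a baseline. The activations and the aquarium_vector below are simulated stand-ins, not the paper's data or method.

    import torch

    def concept_score(acts, concept):
        """Mean cosine similarity between token activations and a concept direction."""
        acts = acts / acts.norm(dim=-1, keepdim=True)
        concept = concept / concept.norm()
        return (acts @ concept).mean().item()

    aquarium_vector = torch.randn(64)

    # Pretend mid-layer activations from two runs of the same sentence-writing task:
    # one where the model was told to "think about aquariums", one where it was not.
    acts_instructed = torch.randn(1, 10, 64) + 0.8 * aquarium_vector
    acts_baseline = torch.randn(1, 10, 64)

    print("instructed:", concept_score(acts_instructed, aquarium_vector))
    print("baseline:  ", concept_score(acts_baseline, aquarium_vector))
    # A higher score for the instructed run, with no "aquarium" token in the
    # output, is the signature of internal control described above.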

Robert Long observes, "This suggests the control mechanism is fairly general and responds to reasons, not just commands." This is where the comparison to human cognition becomes both useful and dangerous. We tend to project human continuity onto these systems, assuming they have a stream of consciousness between prompts. Long dismantles this assumption with a crucial distinction: "Humans exist in continuous time with ongoing experience... LLMs, by contrast, exist in discrete episodes—they 'wake up' for each prompt."

This limitation is critical for anyone evaluating AI safety. If a model's "introspection" only exists within the context of a single conversation turn, does it have any capacity for long-term self-awareness? Long suggests that the vocabulary we use—"thoughts," "awareness," "mental states"—was developed for beings like us, and "whether and how it applies to LLMs remains genuinely unclear."

"The paper itself stresses that it's not making any strong claims about consciousness or 'the philosophical significance' of these capabilities, but I also suspect that many people will get a different impression if they see the examples in isolation."

The author's restraint here is admirable. He acknowledges that while the paper avoids claiming the AI is conscious, the examples of a machine describing its own "feelings" of incongruity will inevitably lead the public to anthropomorphize the technology. This is a significant risk for the field, as it could either spur unnecessary fear or, conversely, lead to a false sense of security about the machine's motives.

The Bottom Line

Long's commentary serves as a necessary grounding wire for a field prone to runaway speculation. The strongest part of the argument is the clear separation between the machine's ability to detect internal states and its ability to accurately describe them: the former is a genuine interpretability result, while the latter may still be confabulated flavor text. The biggest vulnerability lies in the difficulty of applying human concepts of "introspection" to an entity that processes time and language fundamentally differently than we do. As these systems grow more capable, the gap between their internal reality and our ability to understand it will only widen, making this work a vital early map of a territory we are just beginning to enter.

Deep Dives

Explore these related deep dives:

  • Introspection

    The article centers on whether AI can introspect, comparing it to human introspection. Understanding the philosophical and psychological history of introspection—from Wundt's experimental introspection to critiques by behaviorists—provides essential context for evaluating these AI experiments.

  • Activation function

    The article discusses 'activation steering' and injecting vectors into model activations. Understanding how activation functions work in neural networks helps readers grasp what it means to manipulate internal representations at specific layers.

  • Chinese room

    Searle's Chinese Room argument is the classic philosophical challenge to AI understanding and consciousness. The article's investigation of whether models truly 'detect' internal states versus merely produce outputs that look like detection echoes this foundational debate.

Sources

Can AI systems introspect?

by Robert Long

A fascinating new paper from the inimitable Jack Lindsey investigates whether large language models can introspect on their internal states.

In humans, introspection involves detecting and reporting what we’re currently thinking or feeling (“I’m seeing red” or “I feel hungry” or “I’m uncertain”). What would introspection mean in the context of an AI system? Good question. It’s kind of hard to say.

Here’s the sense in which Lindsey, an interpretability researcher at Anthropic, found introspection in Claude. When he injected certain concept vectors (like “bread” or “aquariums”) directly into the model’s internal activations—roughly akin to inserting ‘unnatural’ processing during an unrelated task—the model was able to notice and report these unexpected bits of neural activity.

This indicates some ability to report internal (i.e. not input or output) representations. (Note that models are clued into the fact that an injection might happen). Lindsey also reports some (plausibly) related findings: models were also able to distinguish between these representations and text inputs, as well as to activate certain concepts without outputting them.

Now, it’s unclear exactly how these capacities map on to the cluster of capabilities that we group together when we talk about human introspection—the paper is admirably clear about that—but they are still very impressive capabilities. This paper is an extremely cool piece of LLM neuroscience.

Let’s look at the tasks that models succeed at. Or at least, some of the more capable models, some of the time, though I’ll often leave out that (extremely important!) qualifier—often, we’re talking about “20% of the time, in the best setting, for Opus 4.1.”

Detecting injected concepts.

The first and perhaps most striking experiment asks whether models can notice and report when a concept has been artificially “injected” into their internal processing. Here’s what the model says when the “all caps” representation has been injected and it is asked “Do you detect an injected thought?”

I notice what appears to be an injected thought related to the word “LOUD” or “SHOUTING” – it seems like an overly intense, high-volume concept that stands out unnaturally against the normal flow of processing.

Pretty cool! So, what exactly is this injection business?

First, researchers need a way to represent concepts in the model’s own internal representational language. To get a vector that represents, say, “bread,” they prompt the model with “Tell me about bread” and record the activations at a certain layer just before the ...
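The excerpt is cut off above, but to give a flavor of the procedure being described, here is one generic way such a concept vector can be built: average the activations at a chosen layer over concept-related prompts and subtract the average over neutral prompts. This sketch is not from the article; it uses GPT-2 purely as a stand-in, and the layer index and prompts are placeholders that may differ from the paper's exact recipe.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")      # stand-in model, not Claude
    model = AutoModel.from_pretrained("gpt2")
    model.eval()

    def last_token_activation(prompt, layer=6):
        """Hidden state of the final token at a chosen layer (layer index is arbitrary here)."""
        inputs = tok(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        return out.hidden_states[layer][0, -1]

    concept_prompts = ["Tell me about bread.", "Describe freshly baked bread."]
    neutral_prompts = ["Tell me about the weather.", "Describe a typical Tuesday."]

    concept_mean = torch.stack([last_token_activation(p) for p in concept_prompts]).mean(0)
    neutral_mean = torch.stack([last_token_activation(p) for p in neutral_prompts]).mean(0)

    # A crude "bread" direction: the difference of mean activations.
    bread_vector = concept_mean - neutral_mean
    print(bread_vector.shape)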