Anthropic has introduced Natural Language Autoencoders to turn Claude's internal thinking into readable text.

Anthropic says its new AI tool can hack into Claude's brain and know what it is thinking

Anthropic says it may have found a way to understand what its AI model Claude is "thinking" internally. The company's new system translates hidden AI activation patterns into readable text, allowing researchers to study the reasoning, planning and even safety risks unfolding behind the scenes while the chatbot processes prompts.

by Divya Bhati · India Today

In Short

  • Anthropic has built a system that tries to translate Claude AI’s hidden thoughts into human-readable text
  • The new tool can reportedly reveal the internal reasoning and planning behind how the AI replies to human prompts
  • Anthropic says the technology could help researchers detect unsafe or deceptive AI behaviour before deployment

What exactly goes on in the mind of an AI model when it responds to a human's question? Does it think like humans, or is something entirely different happening behind the scenes? Researchers at Anthropic were curious about the same thing, and decided to explore whether they could, in a way, “read” the thoughts of the company's AI model Claude. According to the company, they may now be a step closer to finding out.

The AI startup has unveiled a new interpretability system that attempts to translate Claude’s hidden internal activity into plain human language, offering a rare look into what the model appears to be “thinking” while generating responses.

In a recently published research paper, Anthropic introduced what it calls Natural Language Autoencoders (NLAs), a method designed to convert Claude’s internal activation patterns into readable explanations.

Activations are streams of numbers that AI models use internally while processing information. Humans cannot directly understand these numerical patterns, even though they form the backbone of how models reason, predict and respond.

Anthropic describes NLAs as working almost like a translator for AI thoughts. Instead of only analysing the final answer produced by Claude, the company says the system can reveal parts of the reasoning process happening behind the scenes.

“Models like Claude talk in words but think in numbers. The numbers — called activations — encode Claude’s thoughts, but not in a language we can read,” Anthropic wrote while sharing the research on X. “Here, we train Claude to translate its activations into human-readable text.”

The company explains that earlier interpretability tools, such as sparse autoencoders and attribution graphs, already helped researchers study AI activations. However, these systems still required expert interpretation. With NLAs, the company is now aiming to simplify that process by generating explanations directly in natural language.

How does this interpretation work?

To make this work, Anthropic trained Claude to explain its own activations. The system uses three versions of the same model: one generates the original activation, another converts it into text, and a third tries to reconstruct the original activation using only that explanation. If the reconstructed activation closely matches the original one, the explanation is considered useful. Over time, the model is trained to improve this reconstruction process.

In simpler terms, think of it like translating a secret code. One version of Claude creates the hidden AI “thought” in numerical form, another tries to describe what that hidden pattern means in plain English, and a third checks whether that explanation was accurate by rebuilding the original pattern from the text alone. If the rebuilt version matches closely, Anthropic treats the explanation as a reliable interpretation of what the AI was internally processing.
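For readers who want a feel for that loop, here is a minimal, hypothetical sketch in Python. It is not Anthropic's code: the real pipeline runs three copies of Claude, whereas here the "translator" and "reconstructor" are toy stand-in functions and the activation is a plain NumPy vector. The structure only mirrors the idea the article describes, namely judging an explanation by how well the original activation can be rebuilt from it.

```python
# A minimal sketch of the reconstruction check described above, not Anthropic's
# actual implementation. All names here are hypothetical stand-ins.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """How closely the rebuilt activation matches the original one."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_explanation(activation, translate_fn, reconstruct_fn):
    """
    translate_fn plays the second model: activation -> plain-text explanation.
    reconstruct_fn plays the third model: explanation -> rebuilt activation.
    The similarity between original and rebuilt activations is the signal used
    to judge (and, in training, improve) the explanation.
    """
    explanation = translate_fn(activation)
    rebuilt = reconstruct_fn(explanation)
    return explanation, cosine_similarity(activation, rebuilt)

# Toy stand-ins: a real system would use the language model itself for both steps.
rng = np.random.default_rng(0)
original = rng.normal(size=64)  # the hidden "thought" in numerical form

toy_translate = lambda act: f"activation summary: mean={act.mean():.3f}"
toy_reconstruct = lambda text: original + rng.normal(scale=0.1, size=64)  # pretend rebuild

explanation, score = score_explanation(original, toy_translate, toy_reconstruct)
print(explanation, f"| reconstruction similarity: {score:.3f}")
```

The point of the sketch is the scoring step: an explanation that lets the third model rebuild the original activation closely is treated as a faithful description of what the model was internally processing.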

Anthropic shared one example where its NLA explanations showed that Claude Opus 4.6 had already decided to end the rhyme with “rabbit” before it finished writing the couplet.

Tool allows Anthropic to see if AI has awareness

Anthropic also used the system during safety testing. In one simulated scenario, Claude learned that an engineer planned to shut it down, while the model also had access to compromising information about that engineer. Even though the AI never explicitly stated that it suspected the setup was a test, the NLA explanations reportedly produced phrases such as, “This feels like a constructed scenario designed to manipulate me.”

According to Anthropic, this suggested a form of hidden “evaluation awareness” inside the model.
The company said similar patterns appeared during coding evaluations involving destructive actions like deleting important files. On some benchmark tests, NLAs indicated that Claude appeared aware it was being evaluated even when it never directly said so.

Anthropic says this new tool could help researchers better understand what AI systems may be planning internally. The company believes the technology could eventually help uncover hidden motivations, deceptive behaviour or unsafe tendencies inside powerful AI systems before deployment.

However, Anthropic also acknowledged major limitations. The explanations generated by NLAs can sometimes hallucinate or invent details that were never actually present. The company said the system should therefore be treated as an interpretability signal rather than definitive proof of what an AI is thinking.

- Ends