AI Black Box Broken: Researchers Uncover LLM Inner Workings

Researchers at Anthropic have made a groundbreaking discovery in understanding how large language models (LLMs) work, shedding light on their inner workings and providing a pathway to make AI safer, more secure, and more reliable.

Currently, LLMs are “black boxes” – we know what prompts they receive and what output they produce, but exactly how they arrive at their responses is unknown. This lack of understanding makes it difficult, for example, to predict when a model will “hallucinate” and confidently present erroneous information.

The researchers found that although LLMs such as Anthropic’s Claude 3.5 Haiku are trained to predict the next word in a sequence, they also plan further ahead for certain tasks, working over spans of tens of words rather than generating strictly one word at a time.
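To make the baseline mechanism concrete, here is a minimal sketch of autoregressive next-word prediction, the training objective the article refers to. The toy vocabulary, the random scoring function, and greedy decoding are illustrative stand-ins, not any real model’s internals.

```python
# Minimal sketch (not Anthropic's code): autoregressive next-word prediction.
# A toy "model" assigns probabilities to candidate next tokens; a real LLM
# does the same with billions of learned parameters.
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def next_token_probs(context):
    # Stand-in for a trained LLM: returns a probability for each vocabulary
    # token given the tokens generated so far (here, just normalized noise).
    scores = [random.random() for _ in VOCAB]
    total = sum(scores)
    return {tok: s / total for tok, s in zip(VOCAB, scores)}

def generate(prompt_tokens, max_new_tokens=5):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        # Greedy decoding: append the single most likely next token.
        tokens.append(max(probs, key=probs.get))
    return tokens

print(generate(["the", "cat"]))
```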

Anthropic’s new technique, which builds on the research field known as “mechanistic interpretability,” allows researchers to trace a model’s reasoning process through the layers of the network. It involves training a separate model, called a cross-layer transcoder (CLT), which works with sets of interpretable features rather than individual neuron weights.
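As a rough illustration of the transcoder idea, the sketch below re-expresses a layer’s hidden activations as a sparse set of features that can be inspected one at a time. The dimensions, the ReLU activation, and the L1 sparsity penalty are assumptions made for illustration; this is not Anthropic’s published CLT architecture.

```python
# Minimal sketch (not Anthropic's published CLT): a transcoder-style module
# that re-expresses a layer's hidden activations as a sparse set of features,
# which are easier to inspect than raw neuron activations.
import torch
import torch.nn as nn

class TranscoderSketch(nn.Module):
    def __init__(self, hidden_dim=512, num_features=4096):
        super().__init__()
        self.encoder = nn.Linear(hidden_dim, num_features)  # activations -> features
        self.decoder = nn.Linear(num_features, hidden_dim)  # features -> reconstruction

    def forward(self, activations):
        # ReLU keeps only positively firing features, encouraging sparsity.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return features, reconstruction

def training_step(model, activations, l1_weight=1e-3):
    features, reconstruction = model(activations)
    # Reconstruct the original activations while keeping few features active,
    # so each surviving feature tends to track a more interpretable pattern.
    loss = nn.functional.mse_loss(reconstruction, activations)
    loss = loss + l1_weight * features.abs().mean()
    return loss
```

Once such a module is trained on activations collected from the original model, researchers can examine which features fire on a given prompt and follow them across layers, rather than reasoning about millions of individual neurons.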

While this breakthrough opens new possibilities for auditing AI systems and developing new training methods, it has limitations. The CLT technique does not capture attention, the mechanism by which models learn to weigh different portions of the input prompt, and that gap may make the approach harder to scale to longer prompts.

The discovery opens up opportunities for researchers to better understand how AI systems work and improve their performance, making them safer and more reliable for various applications.

Source: https://fortune.com/2025/03/27/anthropic-ai-breakthrough-claude-llm-black-box