Researchers at Anthropic have made a significant breakthrough in understanding how large language models (LLMs) like Claude handle confabulation, the tendency to make up information that isn't supported by their training data. The study reveals the internal neural network "circuitry" that helps an LLM decide when to provide an answer and when to say it doesn't know.
To gain insight into this process, Anthropic used a system of sparse autoencoders to identify groups of artificial neurons that activate when the model represents concepts like the Golden Gate Bridge or programming errors. The new research builds on those findings by tracing how such features feed into other neuron groups that make up the model's computational decision circuits.
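For readers unfamiliar with the technique, a sparse autoencoder in this setting is a small auxiliary network trained to reconstruct a model's internal activations through a wide, sparsely activated hidden layer, so that each hidden unit tends to correspond to an interpretable "feature." The sketch below is a minimal, hypothetical illustration in PyTorch, not Anthropic's actual implementation; the dimensions, sparsity penalty, and training loop are assumptions made for clarity.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder over captured LLM activations.

    Illustrative only: real interpretability work uses far larger feature
    dictionaries and carefully tuned sparsity penalties.
    """

    def __init__(self, d_model: int = 512, d_dict: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # activations -> feature space
        self.decoder = nn.Linear(d_dict, d_model)   # feature space -> activations

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))   # sparse, non-negative features
        recon = self.decoder(features)
        return recon, features

# Hypothetical training step: reconstruct the activations while penalizing the
# L1 norm of the features so that only a handful of them fire per input.
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_weight = 1e-3

acts = torch.randn(64, 512)                  # stand-in for real model activations
recon, features = sae(acts)
loss = nn.functional.mse_loss(recon, acts) + l1_weight * features.abs().mean()
loss.backward()
opt.step()
```

Once trained, individual feature directions can be inspected by looking at which inputs activate them most strongly, which is how concept-level features such as the Golden Gate Bridge were surfaced in Anthropic's earlier work.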
The study also sheds light on how LLMs "think" across multiple languages, how they can be fooled by certain prompting techniques, and whether their chain-of-thought explanations accurately reflect their internal reasoning. At their core, LLMs are designed to predict the text that follows a given prompt, which can lead to confabulation when they encounter obscure facts or topics.
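The next-token framing is what makes confabulation possible: the model always assigns some probability to possible continuations, even when nothing in its training data supports them. The snippet below is a minimal, invented illustration of greedy decoding over a toy probability distribution; the vocabulary and logit values are made up for the example and do not come from any real model.

```python
import torch

# Toy next-token distribution for a prompt asking about an obscure fact.
# The candidate tokens and logits below are invented purely for illustration.
vocab = ["tennis", "chess", "pickleball", "I don't know"]
logits = torch.tensor([2.1, 1.9, 1.7, 0.4])

probs = torch.softmax(logits, dim=0)
best = int(torch.argmax(probs))

# Greedy decoding commits to the most likely token even though the model's
# probability mass is spread thinly across several guesses, which is the
# seed of a confabulated answer.
print(f"chosen: {vocab[best]!r} with p={probs[best].item():.2f}")
```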
This research could lead to better solutions for the AI confabulation problem by providing a clearer picture of how LLMs decide when to answer and when to decline.
Source: https://arstechnica.com/ai/2025/03/why-do-llms-make-stuff-up-new-research-peers-under-the-hood