A research paper by Apple has dealt a significant blow to the notion that large language models (LLMs) can reason reliably. The paper tests leading models such as ChatGPT, Claude and DeepSeek, showing that they excel at pattern recognition but break down when faced with complexity beyond their training data.
The researchers found that these models struggle with classic puzzles such as the Tower of Hanoi, with accuracy falling below 80% at seven discs. Worse, even when handed the solution algorithm, the models still failed, suggesting that they are not carrying out a genuine logical, step-by-step reasoning process.
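For context, the procedure the models failed to follow is a textbook recursive algorithm. The Python sketch below is illustrative only (it is not taken from the Apple paper) and simply shows how short the full solution is: seven discs require 2^7 − 1 = 127 moves, each dictated mechanically by the recursion.

```python
def hanoi(n, source, target, spare):
    """Print the moves that transfer n discs from source to target, using spare."""
    if n == 1:
        print(f"move disc 1 from {source} to {target}")
        return
    hanoi(n - 1, source, spare, target)   # clear the top n-1 discs out of the way
    print(f"move disc {n} from {source} to {target}")
    hanoi(n - 1, spare, target, source)   # restack them on top of the moved disc

# Seven discs, the size at which accuracy reportedly drops below 80%.
hanoi(7, "A", "C", "B")
```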
This result echoes previous arguments made by Arizona State University computer scientist Subbarao Kambhampati, who has noted that people tend to anthropomorphise these systems, assuming they work through human-like problem-solving steps. Apple's paper shows that this is not the case.
The authors highlight a critical weakness: LLMs generalise well within the distribution of their training data, but that ability breaks down when they face new or unfamiliar problems. This limitation has significant implications for the pursuit of artificial general intelligence (AGI).
While LLMs have genuine uses in tasks like coding and brainstorming, anyone relying on their output should treat it with caution given these inherent limitations. The paper is a stark reminder that leaning solely on generative AI for complex problem-solving is misguided.
In reality, AGI should aim to combine human adaptability with computational brute force and reliability. By acknowledging the strengths and weaknesses of LLMs, we can move towards more effective and responsible AI development.
Source: https://www.theguardian.com/commentisfree/2025/jun/10/billion-dollar-ai-puzzle-break-down