A new study from Apple has found that advanced artificial intelligence (AI) models are not as intelligent as they seem: when problems grow sufficiently complex, these “reasoning” models suffer a complete accuracy collapse. The study, published on June 7, found that even specialized large language models (LLMs), such as OpenAI’s o3 and DeepSeek’s R1, break down once tasks exceed a certain complexity threshold.
Researchers set generic and reasoning models, including OpenAI’s o1 and o3, DeepSeek’s R1, Anthropic’s Claude 3.7 Sonnet, and Google’s Gemini, to solving four classic puzzles at varying levels of complexity. The reasoning models initially held an edge as difficulty rose, but past a certain threshold the performance of both kinds of model “collapsed to zero.” The study suggests that these models rely more heavily on pattern recognition than on emergent logic.
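To make the idea of a complexity threshold concrete, here is a minimal sketch assuming a Tower of Hanoi-style puzzle, a classic of the kind described; the harness, function names, and pass/fail scoring rule below are illustrative assumptions, not Apple’s actual setup. Each extra disk doubles the length of the optimal solution, so the move sequence a model must produce without error grows exponentially:

```python
# Minimal sketch (illustrative, not the study's actual harness) of how
# puzzle complexity scales: Tower of Hanoi with n disks has a unique
# optimal solution of 2**n - 1 moves, so each added disk doubles the
# error-free move sequence a model must produce.

def hanoi_solution(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> list[tuple[str, str]]:
    """Optimal move list (from_peg, to_peg) for n disks, src -> dst."""
    if n == 0:
        return []
    return (hanoi_solution(n - 1, src, dst, aux)
            + [(src, dst)]
            + hanoi_solution(n - 1, aux, src, dst))

def is_valid_solution(moves: list[tuple[str, str]], n: int) -> bool:
    """Replay a candidate move list; reject any illegal move.

    Mirrors an exact-verification benchmark: a single bad step
    fails the whole instance, so accuracy per puzzle is pass/fail.
    """
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # top of peg = end of list
    for frm, to in moves:
        if not pegs[frm] or (pegs[to] and pegs[to][-1] < pegs[frm][-1]):
            return False  # empty source peg, or larger disk placed on smaller
        pegs[to].append(pegs[frm].pop())
    return pegs["C"] == list(range(n, 0, -1))  # all disks stacked on C

if __name__ == "__main__":
    for n in range(3, 11):
        optimal = hanoi_solution(n)
        assert is_valid_solution(optimal, n)
        print(f"{n} disks -> {len(optimal)} moves required")
```

Under this kind of pass/fail scoring, a model with even a small per-move error rate will see whole-puzzle accuracy fall toward zero as the solution length doubles with each added disk, which is one plain-arithmetic reading of the “collapse” the researchers describe.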
The findings challenge claims from big tech firms that their AI tools are on the verge of achieving artificial general intelligence (AGI). Apple’s researchers note that existing evaluations focus largely on established mathematical and coding benchmarks, which reveal little about the structure and quality of the models’ reasoning traces.
While some have accused Apple of sour grapes, other AI researchers have hailed the study as a necessary corrective to grandiose claims about current AI tools. The study underscores the limitations of LLMs and the need for rigorous science in assessing their capabilities and potential risks.
Source: https://www.livescience.com/technology/artificial-intelligence/ai-reasoning-models-arent-as-smart-as-they-were-cracked-up-to-be-apple-study-claims