When Reasoning Models Go Wrong: Limits of Complex Problem-Solving

Recent advancements in language models have led to the development of Large Reasoning Models (LRMs), which can generate detailed thinking processes before providing answers. However, their true capabilities, scalability, and limitations are still unclear.

Existing evaluations rely on established benchmarks that emphasize final-answer accuracy but overlook the structure and quality of the reasoning traces themselves. To address this, the researchers built controllable puzzle environments in which compositional complexity can be manipulated precisely while the underlying logical structure stays fixed.
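To make this concrete, Tower of Hanoi is one of the puzzles used in this line of work: the number of disks acts as a single complexity knob, since the optimal solution length grows as 2^n − 1 moves. The following minimal sketch shows what such a controllable environment and verifier could look like; the class and function names are illustrative, not the authors' code.

```python
# Minimal sketch of a controllable puzzle environment (illustrative, not the
# paper's implementation). Complexity is controlled by the number of disks.

class TowerOfHanoi:
    """Pegs 0, 1, 2; disks numbered 1 (smallest) to n (largest).
    Optimal solution length is 2**n - 1, so the disk count n is a clean
    knob for compositional complexity."""

    def __init__(self, n_disks: int):
        self.n = n_disks
        self.pegs = [list(range(n_disks, 0, -1)), [], []]

    def apply(self, move: tuple[int, int]) -> bool:
        """Apply (source_peg, target_peg); return False if the move is illegal."""
        src, dst = move
        if not self.pegs[src]:
            return False
        disk = self.pegs[src][-1]
        if self.pegs[dst] and self.pegs[dst][-1] < disk:
            return False  # cannot place a larger disk on a smaller one
        self.pegs[dst].append(self.pegs[src].pop())
        return True

    def solved(self) -> bool:
        return len(self.pegs[2]) == self.n


def verify(n_disks: int, moves: list[tuple[int, int]]) -> bool:
    """Check a model-proposed move sequence: every move legal, final state solved."""
    env = TowerOfHanoi(n_disks)
    return all(env.apply(m) for m in moves) and env.solved()


# Example: the optimal 3-disk solution (7 = 2**3 - 1 moves).
solution = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
assert verify(3, solution)
```

Because the rules never change as n grows, any accuracy drop can be attributed to the increase in compositional depth rather than to unfamiliar problem structure.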

The results show that frontier LRMs suffer a complete accuracy collapse beyond certain complexity thresholds. They also exhibit a counter-intuitive scaling limit: reasoning effort increases with problem complexity up to a point, then declines even though an adequate token budget is still available.

Comparing LRMs with standard LLM counterparts under equivalent inference compute reveals three performance regimes:

1. Low-complexity tasks where standard models outperform LRMs.
2. Medium-complexity tasks where the additional thinking in LRMs provides an advantage.
3. High-complexity tasks where both models experience complete collapse.
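To make "equivalent inference compute" concrete, a comparison harness might give both model variants the same generation token budget and bucket accuracy by puzzle complexity. The sketch below is a hedged outline of that idea: `generate` and `solve_and_verify` are hypothetical stand-ins passed in by the caller, not real library calls, and this is not the paper's evaluation code.

```python
# Hedged sketch of a compute-matched comparison: both the reasoning and the
# standard model receive the same max_tokens budget, and accuracy is grouped
# by problem complexity (e.g., disk count in Tower of Hanoi).
from collections import defaultdict

def compare_regimes(models, problems, token_budget, generate, solve_and_verify):
    """models: dict of name -> model handle; problems: list of (complexity, prompt).
    generate(model, prompt, max_tokens) and solve_and_verify(complexity, answer)
    are caller-supplied. Returns accuracy per (model, complexity) pair."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for complexity, prompt in problems:
        for name, model in models.items():
            answer = generate(model, prompt, max_tokens=token_budget)
            total[(name, complexity)] += 1
            if solve_and_verify(complexity, answer):
                correct[(name, complexity)] += 1
    return {key: correct[key] / total[key] for key in total}
```

Read against the three regimes above, the output table would show complexity levels where the standard model scores higher, levels where the reasoning model pulls ahead, and levels where both accuracies fall to near zero.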

LRMs show notable limitations: they fail to use explicit algorithms and reason inconsistently across puzzles. Analysis of the reasoning traces reveals patterns in the solutions explored and in the models' computational behavior, highlighting both the strengths and the weaknesses of these models.

Source: https://machinelearning.apple.com/research/illusion-of-thinking