Is AI Reasoning Hitting a Wall?
Is recent chatter over Apple's paper on the limits of reasoning warranted, or much ado about nothing? Plus, OpenAI's former research lead on the next decade of frontier development.
Post methodology: @Dust: summarize this paper: [the-illusion-of-thinking]; how does this paper's findings compare with the research of the team behind the ARC Prize, François Chollet and Mike Knoop [arcprize research page]; also how does it compare to the work of Subbarao Kambhampati [lab site] and Gary Marcus [A Knockout Blow for LLMs post]; Claude 4 Sonnet via @Dust: can you synthesize all of the output in this chat into an essay asking the question "Is AI Reasoning Already Hitting a Wall?"; please rewrite to come to a more balanced conclusion between the critique and the current reasoning model paradigm. Light editing and reformatting for the Substack editor.
The AI industry stands at a critical juncture. After years of remarkable progress in language models, the latest generation of "reasoning models" like OpenAI's o1, DeepSeek-R1, and Claude 3.7 Sonnet Thinking promised to overcome the fundamental limitations of their predecessors through sophisticated "thinking" mechanisms and inference-time computation. Yet a growing body of rigorous research suggests these systems may be encountering significant barriers—raising important questions about both their limitations and their potential for continued improvement.
In a recent Training Data episode, OpenAI’s former head of research, Bob McGrew, said that “2025 is going to be the year of reasoning.” How do we square the industry’s enthusiasm for the new reasoning models with a growing body of research questioning whether these models are really thinking in the ways we intuitively imagine?
The Promise of Reasoning Models
Large Reasoning Models (LRMs) represent a significant departure from traditional language models. Rather than simply predicting the next token, these systems generate extended "chains of thought," engage in self-reflection, and allocate variable compute resources to problems based on their perceived difficulty. The underlying hypothesis is compelling: if human-level reasoning emerges from deliberate, step-by-step thinking, then models that explicitly engage in such processes should demonstrate more robust problem-solving capabilities.
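To make "inference-time computation" concrete, here is a minimal sketch of one of the simplest recipes, self-consistency: sample several independent chains of thought and return the majority answer. The generate_chain_of_thought function is a hypothetical stand-in for whatever client a given model exposes; reasoning models like o1 are trained to do something more sophisticated internally, but the sketch captures the basic trade of extra compute for, hopefully, better answers.

```python
from collections import Counter

def generate_chain_of_thought(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for a model call that produces a reasoning chain
    and returns a final answer. Not a real API; plug in your own client."""
    raise NotImplementedError

def self_consistency(prompt: str, budget: int) -> str:
    """Spend more inference-time compute by sampling `budget` independent
    reasoning chains and returning the most common final answer."""
    answers = [generate_chain_of_thought(prompt, seed=i) for i in range(budget)]
    return Counter(answers).most_common(1)[0][0]

# A harder problem can simply be given a larger budget:
# easy_answer = self_consistency("What is 17 * 24?", budget=3)
# hard_answer = self_consistency("Prove that ...", budget=32)
```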
Initial results were genuinely impressive. OpenAI's o1 achieved unprecedented scores on mathematical benchmarks, dramatically outperforming previous models on complex reasoning tasks. Other reasoning models showed substantial improvements on coding and scientific reasoning challenges. The AI community began to speak of a new paradigm—one where inference-time compute could unlock capabilities that scaling pre-training alone could not provide.
The Empirical Reality Check
However, three independent lines of research have converged on important limitations that become apparent under rigorous testing conditions, while also revealing the genuine capabilities these models do possess.
Apple's Controlled Experiments
Apple's recent study, "The Illusion of Thinking," provides perhaps the most systematic analysis of reasoning model capabilities and limitations to date. By using controllable puzzle environments—Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World—the researchers could precisely manipulate problem complexity while avoiding the data contamination issues that plague traditional benchmarks.
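Part of what makes these environments attractive is that difficulty is a single dial. In Tower of Hanoi, for instance, the shortest solution for n disks is exactly 2^n - 1 moves, so each added disk roughly doubles the required solution length without changing the rules at all. A minimal sketch of the classical solver:

```python
def hanoi_moves(n: int, src: str = "A", dst: str = "C", aux: str = "B") -> list[tuple[str, str]]:
    """Classical recursive solution: move n disks from src to dst via aux."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, aux, dst)     # clear the way
            + [(src, dst)]                        # move the largest disk
            + hanoi_moves(n - 1, aux, dst, src))  # restack on top

for n in range(3, 13, 3):
    moves = hanoi_moves(n)
    assert len(moves) == 2**n - 1                 # minimal length is 2^n - 1
    print(n, "disks ->", len(moves), "moves")
```

A similar dial exists in the other three puzzles, which is what lets the researchers chart accuracy as a smooth function of complexity rather than benchmark by benchmark.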
Their findings reveal three distinct performance regimes that both challenge and illuminate our understanding of reasoning models:
Low complexity: Standard LLMs surprisingly outperform reasoning models while being more token-efficient—suggesting reasoning overhead isn't always beneficial
Medium complexity: Reasoning models show clear advantages, demonstrating that their approach can unlock genuine capabilities
High complexity: Both model types experience complete performance collapse—indicating fundamental limits rather than just scaling needs
The discovery of a counterintuitive scaling pattern—where reasoning models reduce their computational effort as problems become more difficult—suggests current architectures may have inherent ceilings. However, the substantial improvements in the medium-complexity regime indicate these models can genuinely extend beyond simple pattern matching in certain domains.
The Planning Domain Evidence
Subbarao Kambhampati's work on planning capabilities provides a nuanced view of both progress and limitations. His PlanBench evaluation of o1-preview revealed that while the model achieved 97.8% accuracy on simple Blocksworld problems—a dramatic improvement over previous models—performance degraded rapidly with complexity.
Importantly, Kambhampati's research shows that reasoning models can solve problems that completely stymied their predecessors. The jump from essentially 0% to nearly 98% on basic planning tasks represents genuine progress. However, the failure to benefit from explicit algorithmic guidance suggests the models' reasoning processes, while more sophisticated, may still be fundamentally different from human-like logical reasoning.
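For readers who have never seen the domain, a "simple Blocksworld problem" really is small enough to write down in a few lines. The representation below is an illustrative simplification, not PlanBench's actual PDDL encoding:

```python
# States are tuples of stacks, each stack listed bottom-to-top.
# Initial state: block C sits on A; B is alone on the table.
initial = (("A", "C"), ("B",))

# Goal: a single tower with A on the table, B on A, C on B.
goal = (("A", "B", "C"),)

# One valid plan, as (block, destination) moves:
plan = [
    ("C", "table"),   # unstack C from A and put it on the table
    ("B", "A"),       # stack B on A
    ("C", "B"),       # stack C on B
]
```

Classical planners solve instances like this, and far larger ones, optimally and essentially instantly; the striking result is how quickly reasoning-model accuracy drops as the number of blocks and required moves grows.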
The ARC Prize Perspective
The Abstraction and Reasoning Corpus (ARC), championed by François Chollet and Mike Knoop, offers perhaps the most challenging test of genuine reasoning capabilities. Progress has been substantial but uneven—from 20% in 2020 to 55.5% in 2024, with reasoning models contributing significantly to recent advances.
The ARC Prize results reveal both the promise and limitations of current approaches. Test-time fine-tuning and deep learning-guided program synthesis—techniques pioneered for ARC—have found applications across AI research. Yet the benchmark's resistance to pure scaling suggests that while reasoning models represent progress, they may not be sufficient for human-level general intelligence.
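To give a flavor of what "deep learning-guided program synthesis" means in the ARC setting, here is a deliberately toy version: given a few input/output grid pairs, search a tiny domain-specific language of grid transformations for a program consistent with all of them. Real ARC Prize entries use learned models to guide a vastly larger search over richer primitives; this sketch only illustrates the shape of the approach.

```python
from itertools import product

# A tiny DSL of grid transformations (grids are tuples of row tuples).
def rotate(g):    return tuple(zip(*g[::-1]))           # 90° clockwise
def flip_h(g):    return tuple(row[::-1] for row in g)  # mirror left-right
def identity(g):  return g

PRIMITIVES = {"rotate": rotate, "flip_h": flip_h, "identity": identity}

def synthesize(examples, max_depth=2):
    """Return the first composition of primitives consistent with all examples."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            def program(g, names=names):
                for name in names:
                    g = PRIMITIVES[name](g)
                return g
            if all(program(x) == y for x, y in examples):
                return names
    return None

# Toy task: the hidden rule is "rotate the grid 90° clockwise".
examples = [
    (((1, 0), (0, 0)), ((0, 1), (0, 0))),
    (((0, 2), (0, 0)), ((0, 0), (0, 2))),
]
print(synthesize(examples))  # -> ('rotate',)
```

Test-time fine-tuning is the complementary technique: briefly adapting the model's weights on each task's demonstration pairs before it predicts the held-out output.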
The Convergent Critique
What makes these findings particularly compelling is their convergence with long-standing theoretical arguments from researchers like Gary Marcus. Marcus has argued since 1998 that neural networks excel within their training distributions but collapse outside them. The recent empirical evidence provides striking validation of this decades-old insight.
Marcus's response to the Apple paper—calling it a "knockout blow for LLMs"—reflects not triumphalism but vindication of his persistent concern: that the field has been scaling the wrong architectures. His observation that LLMs "cannot reliably solve Hanoi" despite having access to countless solutions in online training data highlights the fundamental difference between memorization and reasoning.
The Illusion of Progress
The pattern emerging from these studies suggests that reasoning models may be exhibiting what we might call "sophisticated pattern matching"—they appear to reason through problems but are actually applying memorized solution templates to superficially similar situations. When problems deviate sufficiently from their training distribution, performance collapses entirely.
This interpretation explains several puzzling observations:
Why providing explicit algorithms doesn't improve performance
Why models perform inconsistently across seemingly similar problem types
Why reasoning effort decreases rather than increases with problem difficulty
Why classical algorithms consistently outperform expensive reasoning models (made concrete in the sketch below)
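On that last point: the puzzles in the Apple study are small, fully specified state spaces, so a few lines of brute-force search solve them exactly and deterministically at negligible cost. The sketch below solves a classic river-crossing variant (wolf, goat, cabbage); it illustrates the general contrast rather than reimplementing the paper's exact environments.

```python
from collections import deque

ITEMS = ("farmer", "wolf", "goat", "cabbage")
UNSAFE = [{"wolf", "goat"}, {"goat", "cabbage"}]  # pairs that can't be left alone

def safe(state):
    """state maps each item to a bank, 'L' or 'R'; the farmer must supervise."""
    for bank in "LR":
        if state["farmer"] != bank:
            left_alone = {i for i in ITEMS[1:] if state[i] == bank}
            if any(pair <= left_alone for pair in UNSAFE):
                return False
    return True

def solve():
    """Breadth-first search: guaranteed shortest solution, no 'reasoning' needed."""
    start = {i: "L" for i in ITEMS}
    goal = {i: "R" for i in ITEMS}
    frontier = deque([(start, [])])
    seen = {tuple(start.values())}
    while frontier:
        state, path = frontier.popleft()
        if state == goal:
            return path
        for passenger in (None, *ITEMS[1:]):
            if passenger and state[passenger] != state["farmer"]:
                continue  # a passenger must start on the farmer's bank
            nxt = dict(state)
            other = "R" if state["farmer"] == "L" else "L"
            nxt["farmer"] = other
            if passenger:
                nxt[passenger] = other
            key = tuple(nxt.values())
            if safe(nxt) and key not in seen:
                seen.add(key)
                frontier.append((nxt, path + [passenger or "nothing"]))
    return None

print(solve())  # prints a shortest plan: 7 crossings
```

A few lines of exhaustive search are perfectly reliable on puzzles like these, while reasoning models spending many thousands of tokens are not, which is precisely the observation above.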
Understanding the Mixed Evidence
Rather than viewing these results as purely negative, they suggest a more complex picture of reasoning model capabilities:
Genuine Advances: Reasoning models demonstrate clear improvements over their predecessors in specific domains. The ability to solve previously intractable planning problems, set new records on mathematical benchmarks, and show improved performance on some novel tasks indicates real progress in AI reasoning capabilities.
Domain-Dependent Success: The models appear most effective in domains where their training has provided relevant reasoning patterns. Mathematical proofs, coding problems, and structured logical tasks often see substantial improvements, while truly novel problem types remain challenging.
Architectural Insights: The failure modes revealed by rigorous testing provide valuable insights for future development. Understanding why reasoning effort decreases with complexity, or why explicit algorithms don't improve performance, offers directions for architectural improvements.
The Broader Context of Progress
The current situation differs significantly from previous AI limitations. Unlike earlier approaches that hit hard walls, reasoning models show a more complex performance profile with clear successes alongside notable failures.
Economic Viability: Despite higher computational costs, reasoning models are finding commercial applications where their enhanced capabilities justify the expense. Scientific computing, complex code generation, and advanced mathematical problem-solving represent domains where the value proposition is clear.
Research Momentum: The techniques developed for reasoning models—test-time training, inference-time search, and sophisticated prompting strategies—are advancing the broader field. Even if current architectures have limitations, they're generating insights and methods that inform future approaches.
Incremental Improvements: Unlike paradigm shifts that require complete architectural overhauls, many reasoning model limitations might be addressable through incremental improvements. Better training techniques, improved search algorithms, and hybrid approaches could extend current capabilities significantly.
Alternative and Complementary Paths
The research convergence suggests a couple of promising directions that could work alongside or replace current reasoning models:
Hybrid Architectures: Combining reasoning models with classical algorithms, as suggested by Kambhampati's LLM-Modulo framework, could provide both the flexibility of neural approaches and the reliability of symbolic methods (a minimal sketch of the pattern follows this list).
Specialized Reasoning Systems: Rather than pursuing general reasoning capabilities, developing models optimized for specific reasoning domains might provide better performance and reliability.
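To illustrate the hybrid idea, here is a minimal generate-and-verify loop in the spirit of LLM-Modulo, sketched under the assumption of two pluggable components: propose_plan, a hypothetical stand-in for a model call, and verify_plan, a sound symbolic checker (for PDDL planning domains, something like the VAL plan validator). This is a sketch of the pattern, not Kambhampati's implementation.

```python
def propose_plan(problem: str, feedback: list[str]) -> list[str]:
    """Hypothetical LLM call: returns a candidate plan, optionally conditioned
    on critiques from earlier failed attempts. Not a real API."""
    raise NotImplementedError("plug in your model client here")

def verify_plan(problem: str, plan: list[str]) -> list[str]:
    """Sound, domain-specific checker (e.g. a plan validator for PDDL domains).
    Returns a list of violations; an empty list means the plan is valid."""
    raise NotImplementedError("plug in a symbolic verifier here")

def llm_modulo_style_solve(problem: str, max_rounds: int = 10) -> list[str] | None:
    """Generate-test-critique loop: the LLM supplies candidates,
    the verifier supplies guarantees."""
    feedback: list[str] = []
    for _ in range(max_rounds):
        plan = propose_plan(problem, feedback)
        violations = verify_plan(problem, plan)
        if not violations:
            return plan              # only verified plans are ever returned
        feedback.extend(violations)  # critiques steer the next proposal
    return None                      # give up rather than return an unverified plan
```

The key property is that correctness comes from the verifier, not the model; the model only needs to be a good guesser, which makes the approach tolerant of exactly the failure modes the empirical studies document.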
Conclusion: The Complexity of Evaluation
Alex Lawsen of Open Philanthropy noticed methodological problems in the Apple paper and used Claude to write a response paper outlining them. While he has described that response, titled "The Illusion of The Illusion of Thinking," as a “joke,” it highlights real methodological issues in Apple’s paper that complicate our understanding of where reasoning models truly stand.
The critique demonstrates that apparent reasoning failures can sometimes reflect evaluation constraints rather than cognitive limitations. When models explicitly acknowledge output token limits or recognize mathematically impossible puzzles, scoring these as reasoning failures fundamentally mischaracterizes model capabilities. The finding that models achieve high accuracy on Tower of Hanoi problems when asked to generate algorithmic solutions rather than exhaustive move lists suggests that format constraints, not reasoning ability, may drive some reported failures.
Distinguishing between "cannot reason" and "cannot execute under given constraints" requires sophisticated evaluation frameworks. As the critique notes, "The question isn't whether LRMs can reason, but whether our evaluations can distinguish reasoning from typing."
Rather than hitting a wall, AI reasoning appears to be revealing the complexity of evaluating it. Models may possess reasoning capabilities that current testing methodologies fail to capture, while also exhibiting genuine limitations that require careful experimental design to identify.
This suggests that progress in understanding AI reasoning will require parallel advances in both model architectures and evaluation techniques. Success may ultimately depend on developing evaluation frameworks sophisticated enough to distinguish between different types of limitations—whether they stem from architectural constraints, training data boundaries, output format requirements, or fundamental reasoning incapabilities. The current research revealing both reasoning model limitations and evaluation challenges represents not an endpoint but a crucial step toward building a more robust understanding of AI capabilities.