
Apple’s ‘AI can’t reason’ claim seen by 13m+, what you need to know

This piece is worth 15 minutes of your time because it cuts through the noise on one of the most consequential technology stories of the year: a major Apple paper claiming AI models don't actually reason. Unlike the headlines, this piece explains what that claim really means, and why you should still use these tools.

The Paper That Went Viral

Apple's research paper made headlines across mainstream media—reaching over 13 million people. The study tested large language models on classic logic puzzles: Tower of Hanoi, checkers variants, and river crossing challenges (the fox and chicken problem). The findings were striking: as puzzle complexity increased, model performance dropped dramatically.

If these models functioned like traditional software calculators, increasing complexity shouldn't matter—correct answers should remain consistent. Instead, the paper demonstrated something different: these are probabilistic neural networks, not deterministic programs.
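
To make that contrast concrete, here is a minimal deterministic Tower of Hanoi solver (an illustrative Python sketch, not code from the paper). Its answers stay exactly correct at any disk count; only the runtime and the length of the move list grow:

```python
def hanoi(n, source, target, spare, moves):
    """Recursively append the optimal move sequence for n disks."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # clear n-1 disks onto the spare peg
    moves.append((source, target))              # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)  # restack the n-1 disks on top of it

moves = []
hanoi(10, "A", "C", "B", moves)
print(len(moves))  # 1023 moves (2**10 - 1), every one of them legal
```

A program like this never gets "tired" at disk nine; an LLM generating the same move list token by token can drift at any step.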

Why Models Fail on Complex Tasks

The distinction matters enormously. Large language models are built to generate plausible outputs, not guaranteed-accurate ones. When given multiplication problems with enough digits, they don't simply say "I don't know"; they hallucinate wrong answers that look convincing.

Consider this example: when researchers presented Claude Opus and Gemini 2.5 Pro with a complex calculation without tools, both models produced confident but incorrect responses. They generated plausible-sounding numbers ending in similar patterns to the real answer—essentially very convincing BS. The moment these same models were allowed to use code or tools, they delivered correct results.
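
That tool-use fix is easy to picture: instead of predicting digits one token at a time, the model writes a line of code and lets the interpreter do the arithmetic. A minimal sketch (the operands here are hypothetical, chosen only for illustration):

```python
# Digits like these are what a model must otherwise predict token by token.
a = 739_482_615_937  # hypothetical operands, not from the video
b = 884_201_773_456

# Python integers are arbitrary-precision, so the product is exact
# and identical on every run -- no plausible-looking wrong digits.
print(a * b)
```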

This reveals something Apple's researchers documented but apparently didn't anticipate: these models aren't designed to be predictable software. They're generative systems that produce plausible outputs, which is precisely why they hallucinate when asked questions beyond their capacity.

What Researchers Actually Knew

The paper's findings weren't news to serious AI researchers. The study originally intended to compare "thinking" versus "non-thinking" models on math benchmarks, but results contradicted expectations—so they pivoted to puzzles instead.

One detail many readers missed: the paper admitted that even when the correct algorithm was provided directly in the prompt, models still often failed to execute it. The authors expressed surprise that being handed the algorithm didn't guarantee success. But this is exactly what you'd expect from a probabilistic neural network: even at 99.9% accuracy per step, the probability of a flawless trace collapses as the step count grows into the thousands.
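
The compounding is easy to verify with a sketch, assuming a hypothetical 99.9% per-step accuracy and treating errors as independent:

```python
p_step = 0.999  # assumed per-step accuracy; errors treated as independent

for steps in (100, 1_000, 10_000, 100_000):
    print(f"{steps:>7} steps -> P(whole trace correct) = {p_step ** steps:.3g}")

# Roughly 0.905 at 100 steps, 0.368 at 1,000, 4.5e-05 at 10,000,
# and effectively zero well before a million steps.
```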

The token limit issue also matters. Some tested problems required more than 128,000 tokens to answer—which these models simply couldn't output. Rather than attempt impossible calculations, they output shorter traces: "here is the algorithm you need to use."
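
That budget arithmetic is easy to check. Tower of Hanoi with n disks takes 2^n - 1 moves, so under a rough assumed cost of ten tokens per written-out move, the full solution overflows a 128,000-token output window at around 14 disks:

```python
TOKEN_LIMIT = 128_000
TOKENS_PER_MOVE = 10  # rough assumption for spelling out one move

disks = 1
while (2 ** disks - 1) * TOKENS_PER_MOVE <= TOKEN_LIMIT:
    disks += 1
print(disks)  # 14: the first disk count whose full move list cannot fit
```

Under that constraint, printing the algorithm instead of enumerating every move is arguably the sensible response.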

The SimpleBench Problem

The author created a benchmark called SimpleBench to test simple scenarios where models fail. Testing o3 Pro on scenarios involving basic physical causation, like a glove falling onto a road, revealed errors despite 18 minutes of thinking time.

This points to something important: language models are swiftly catching up to human performance across almost all text-based domains, yet, like many humans, they'll confidently assert false information with almost no hesitation.

Model Recommendations

For practical use cases, the recommendation shifts based on budget and needs:

For free users working within usage caps, Google's Gemini 2.5 Pro scores highest on SimpleBench and includes access to a video-generation model. DeepSeek R1 offers very cheap API pricing and readable technical reports, making it well suited to those building production workflows.

OpenAI's $200-per-month tier for o3 Pro is aimed at professionals, not average users. Benchmark results showed 93% accuracy on hard PhD-level science questions and 84% on competitive coding, though the base o3 (not the Pro version) actually posted better numbers at its December 2024 reveal.

Companies often obscure comparison data: they may not disclose that record scores took multiple parallel attempts, or that larger models come with serious usage limitations. Looking beyond headline benchmark results is essential for actual use cases.

Bottom Line

The Apple paper revealed something researchers already understood: large language models are probabilistic generators of plausible outputs, not reliable calculators. Their breakthrough isn't standalone intelligence; it's integration with symbolic systems that correct their BS in real time. The models are catching up to human performance fast, but they'll also confidently lie, much as humans do. For practical use cases, Gemini 2.5 Pro remains the best free option; DeepSeek R1 offers the best value for paid API access. Neither is a supercomputer, and both require tool use for reliable results.

Sources

Apple’s ‘AI can’t reason’ claim seen by 13m+, what you need to know

by AI Explained (video)

Almost no one has the time to investigate headlines like this one, seen by tens of millions of people: that AI models don't actually reason at all, they just memorize patterns; that AGI is mostly hype; and that even the underlying Apple paper says it's an "illusion of thinking." This was picked up in mainstream outlets like the Guardian, which called it a pretty devastating Apple paper.

So, what are people supposed to believe when half the headlines are about an imminent AI job apocalypse and the other half are about LLMs all being fake? Well, hopefully you'll find that I'm not trying to sell a narrative. I'll just say what I found having read the 30-page paper in full and the surrounding analyses. I'll also end with a recommendation on which model you should use and, yes, touch on the brand-new o3 Pro from OpenAI.

Although I would say that the $200 price per month to access that model is not for the unwashed masses like you guys. Some very quick context on why a post like this one gets tens of millions of views and coverage in the mainstream media. And no, it's not just because of the unnecessarily frantic "BREAKING" at the start. It's also because people hear the claims made by the CEOs of these AI labs, like Sam Altman yesterday posting, "Humanity is close to building digital superintelligence.

We're past the event horizon. The takeoff has started." While the definitions of those terms are deliberately vague, you can understand people paying attention. People can see for themselves how quickly large language models are improving, and they can read the headlines generated by the CEO of Anthropic saying there is a white-collar bloodbath coming.

It's almost every week now that we get headlines like this one in the New York Times. So it's no wonder people are paying attention. Now, some would say, cynically, that Apple seems to be producing more papers "debunking" AI than actually improving it. But let's set that cynicism aside.

The paper essentially claimed that large language models don't follow explicit algorithms and struggle with puzzles past a sufficient degree of complexity. Puzzles like the Tower of Hanoi challenge, where you've got to move a tower of discs from one peg to another but never place a larger disc on top of a smaller one. They ...