This piece is worth 15 minutes because it cuts through the noise on one of the most consequential technology stories of the year: a major Apple paper claiming AI models don't actually reason. Unlike the headlines, it explains what that finding really means, and why you should still use these tools.
The Paper That Went Viral
Apple's research paper made headlines across mainstream media, reaching over 13 million people. The study tested large language models on classic logic puzzles: Tower of Hanoi, checker jumping, and river crossing (the classic fox-and-chicken problem). The findings were striking: as puzzle complexity increased, model performance dropped dramatically.
If these models functioned like traditional software calculators, increasing complexity shouldn't matter—correct answers should remain consistent. Instead, the paper demonstrated something different: these are probabilistic neural networks, not deterministic programs.
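To make the contrast concrete, here is a minimal deterministic solver for one of the tested puzzles, Tower of Hanoi. The function name and sizes tried below are illustrative, not from the paper; the point is that a conventional program stays correct at any size, and only its output gets longer.

```python
def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Deterministic Tower of Hanoi: correct at any complexity level."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, src, dst, aux, moves)   # park n-1 disks on the spare peg
    moves.append((src, dst))             # move the largest disk
    hanoi(n - 1, aux, src, dst, moves)   # stack the n-1 disks back on top
    return moves

# Output length grows as 2**n - 1, but every move is provably correct.
for n in (3, 7, 10):
    print(n, len(hanoi(n)))  # prints: 3 7, then 7 127, then 10 1023
```

A probabilistic text generator has no such guarantee: each emitted move is a sample, so reliability decays with depth instead of holding constant.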
Why Models Fail on Complex Tasks
The distinction matters enormously. Large language models generate plausible outputs rather than accurate ones. Given multiplication problems with enough digits, they don't simply say "I don't know"; they hallucinate wrong answers that look convincing.
Consider this example: when researchers presented Claude Opus and Gemini 2.5 Pro with a complex calculation without tools, both models produced confident but incorrect responses. They generated plausible-sounding numbers ending in similar patterns to the real answer—essentially very convincing BS. The moment these same models were allowed to use code or tools, they delivered correct results.
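What "allowed to use code" means in practice: the model writes a few lines, and an interpreter does the exact arithmetic instead of the model sampling digits. The operands below are made up for illustration, not the actual prompt from that test.

```python
# Tool use in practice: delegate the arithmetic rather than generate digits.
a = 3_459_825_734
b = 9_340_587_312
print(a * b)  # Python integers are arbitrary precision: exact, every time
```

The interpreter's answer is exact by construction, which is why the same model flips from confident BS to correct results the moment it can call a tool.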
This is the heart of what Apple documented, whether or not it anticipated it: these models aren't designed to be predictable software. They are generative systems that produce plausible outputs, which is precisely why they hallucinate when asked questions beyond their capacity.
What Researchers Actually Knew
The paper's findings weren't news to serious AI researchers. The study originally intended to compare "thinking" versus "non-thinking" models on math benchmarks, but results contradicted expectations—so they pivoted to puzzles instead.
One detail many readers missed: the paper admitted that even when the correct algorithm was provided directly in the prompt, models still often failed. The authors expressed surprise that executing a given algorithm didn't guarantee success. But this makes sense for probabilistic neural networks: even at 99.9% accuracy per step, a trace of a few thousand steps will almost certainly contain an error.
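The compounding argument is simple arithmetic: if each step independently succeeds with probability p, an n-step trace is flawless with probability p^n. A quick sketch using the 99.9% figure:

```python
# Probability that an entire n-step trace is error-free at p per step.
p = 0.999
for n in (100, 1_000, 10_000, 100_000):
    print(n, p ** n)
# Even at 99.9% per step, a thousand-step trace is flawless only about
# a third of the time, and a ten-thousand-step trace almost never.
```

The independence assumption is a simplification, but it captures why long faithful executions are structurally hard for a sampler.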
The token limit issue also matters. Some tested problems required more than 128,000 tokens to answer—which these models simply couldn't output. Rather than attempt impossible calculations, they output shorter traces: "here is the algorithm you need to use."
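The arithmetic behind that ceiling: Tower of Hanoi needs 2^n - 1 moves, so a fully written-out solution outgrows any fixed output budget quickly. The tokens-per-move figure below is a rough assumption, not a measured value:

```python
TOKENS_PER_MOVE = 7  # assumed cost of one written-out move, e.g. "move disk 3 from A to C"
BUDGET = 128_000     # the output cap cited above

for n in (10, 15, 20):
    moves = 2**n - 1
    tokens = moves * TOKENS_PER_MOVE
    print(n, moves, tokens, "fits" if tokens <= BUDGET else "exceeds budget")
```

By 15 disks the trace alone needs roughly 230,000 tokens under this assumption, so describing the algorithm instead of enumerating every move is the only answer that fits.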
The SimpleBench Problem
The author created a benchmark called SimpleBench to test simple scenarios where models fail. Testing o3 Pro on scenarios involving basic physical causation, like a glove falling onto a road, revealed errors despite up to 18 minutes of thinking time.
This points to something important: language models are swiftly catching up to human performance across almost all text-based domains, but they generate falsehoods with almost no hesitation. Like many humans, they will confidently assert false information.
Model Recommendations
For practical use cases, the recommendation shifts based on budget and needs:
For free users with caps, Google's Gemini 2.5 Pro scores highest on SimpleBench and comes with access to a video-generation model. DeepSeek R1 offers very cheap API pricing and readable technical reports, making it suitable for those building production workflows.
OpenAI's $200 monthly tier for o3 Pro is aimed at professionals, not average users. Benchmark results showed 93% accuracy on hard PhD-level science questions and 84% on competitive coding, but note that the original o3 (not the Pro version) actually posted more impressive numbers at its December 2024 reveal.
Companies often obscure comparison data: they may not disclose how many parallel attempts were taken to achieve record scores, or the serious usage limits on larger models. Looking beyond headline benchmark results is essential for actual use cases.
Bottom Line
The Apple paper confirmed something researchers already understood: large language models are probabilistic generators of plausible outputs, not reliable calculators. Their breakthrough isn't standalone intelligence; it's integration with symbolic systems and tools that correct their BS in real time. The models are catching up to human performance fast, but they will also confidently lie, much as humans do. For practical use, Gemini 2.5 Pro remains the best free option, and DeepSeek R1 offers the best value for paid API access. Neither is a supercomputer, and both need tool use for reliable results.