Sabine Hossenfelder delivers a rare, empirical stress test of artificial intelligence in the realm of theoretical physics, revealing that while these models are brilliant librarians, they remain terrible inventors. By pitting five leading large language models against a decades-old, unsolved Millennium Prize problem in fluid dynamics, she exposes a critical gap between the models' ability to mimic scientific discourse and their capacity for genuine logical reasoning.
The Experiment: A Vague Idea Meets Cold Logic
Hossenfelder approaches the task not with a polished hypothesis, but with a "vague idea" she has carried for twenty years: using the geometry of General Relativity to prove that the Navier-Stokes equations for fluid flow inevitably develop singularities. She acknowledges the high stakes and the likelihood of failure, noting, "More likely there's a good reason for why it can't possibly work that I'm blissfully too stupid to see." This self-deprecating honesty sets the stage for a fair evaluation. She isn't asking the AI to solve the problem instantly; she is asking if it can navigate the complex, contradictory landscape of existing literature to find a path forward that a human might miss.
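For readers unfamiliar with the problem, a brief sketch of the mathematics at stake (standard textbook form, not taken from the article): the incompressible Navier-Stokes equations govern the velocity field $\mathbf{u}$ of a fluid, and the Millennium Prize question asks whether smooth solutions always stay smooth.

```latex
% Incompressible Navier-Stokes: momentum balance and incompressibility
\partial_t \mathbf{u} + (\mathbf{u}\cdot\nabla)\mathbf{u}
  = -\frac{1}{\rho}\nabla p + \nu\,\nabla^2 \mathbf{u},
\qquad
\nabla\cdot\mathbf{u} = 0
```

Here $p$ is pressure, $\rho$ density, and $\nu$ the kinematic viscosity. The open question: given smooth, finite-energy initial data in three dimensions, does $\mathbf{u}$ remain smooth for all time, or can velocity gradients blow up in finite time? Hossenfelder's "vague idea" sits on the blow-up side: arguing that singularities inevitably form.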
The author's methodology is rigorous. She subjects each model to the same prompt, watching how they handle the nuance of her proposal. The results are immediate and telling. When she asks GPT-5 to engage, it initially stumbles, confusing the constraints of the Millennium problem. However, with iterative prompting, it begins to "roughly understand the idea and the steps look kind of reasonable." This suggests that while the model lacks intuition, it possesses a robust ability to align its output with human guidance once the confusion is cleared.
"In practice, I've seen a lot of junk come out of it. Sometimes new, sometimes correct, but rarely both."
This observation cuts to the heart of the current AI hype cycle. Hossenfelder argues that the danger lies not in the models being wrong, but in them being plausibly wrong. They can assemble a sequence of equations that looks like a breakthrough but collapses under scrutiny. Critics might argue that this is simply a limitation of current training data, and that future models will overcome these logical gaps. However, Hossenfelder's experience suggests the issue is deeper: the models are fundamentally designed to predict the next likely word, not to verify the truth of the argument.
The Hallucination of Competence
The performance of the other models varies wildly, often descending into absurdity. Google's "Deep Think" model, despite its expensive subscription price, offers nothing but a rephrasing of Hossenfelder's own text. When pushed, it politely admits defeat, stating, "I cannot generate the novel conceptual breakthroughs that perform the specialized abstract mathematical reasoning required to solve a millennium problem." This is a rare moment of AI honesty, but it highlights a significant limitation: the model knows its boundaries but cannot transcend them.
Gemini, on the other hand, displays a dangerous overconfidence. It initially praises the idea as "brilliant" before declaring it "unworkable" based on a fundamental misunderstanding of the physics involved. Hossenfelder notes that the model "confuses time reversal symmetry with time reversibility" and mistakenly believes the Navier-Stokes equation violates energy conservation. "I think Gemini has a serious self-confidence issue," she writes. This is a crucial finding for busy professionals: an AI that is wrong but sounds authoritative is far more dangerous than one that is silent.
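The physics behind Gemini's two errors can be made precise with one standard identity (textbook material, not quoted from the article). Dotting the Navier-Stokes momentum equation with $\mathbf{u}$ and integrating over space (with decay at infinity) gives the kinetic-energy balance:

```latex
% Kinetic energy is dissipated by viscosity, not destroyed
\frac{d}{dt}\int \tfrac{1}{2}\,|\mathbf{u}|^2 \, dV
  = -\,\nu \int |\nabla\mathbf{u}|^2 \, dV \;\le\; 0
```

The right-hand side shows why both of Gemini's claims miss the mark: the viscous term monotonically drains kinetic energy into heat, so total energy is conserved even though the flow is dissipative; and because that term does not flip sign under $t \to -t$, the equation lacks time-reversal symmetry without being "irreversible" in any energy-violating sense.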
"The LLM idea of a new theory is a plausible looking sequence of arguments, not an actually correct one."
This distinction is the piece's most vital takeaway. The models are excellent at "digging up related work and explaining it," making them powerful tools for literature reviews and brainstorming. However, they fail when asked to synthesize new knowledge. Hossenfelder points out that they constantly conflate similar-sounding concepts, such as "energy" and "free energy," or switch between different mathematical notations mid-argument. If a human student made these errors, they would be corrected immediately. AI models, however, "will bring back these mistakes over and over," creating a feedback loop of error that requires constant human vigilance.
The Verdict: A Tool, Not a Colleague
The final ranking is stark. GPT-5 takes the top spot for its ability to reason through the problem with guidance. Grok 4 follows, offering "cute" but ultimately useless pseudo-Python code. Gemini 2.5 and the expensive Deep Think model trail behind, while Claude Opus 4.1 lands at the bottom for its inability to grasp basic dimensional concepts. "I can't bring up the energy to continue," Hossenfelder admits regarding Claude, a sentiment that speaks volumes about the frustration of dealing with low-quality AI output.
"By my assessment, these models are currently not anywhere near as good as a good student."
This conclusion is a sobering reality check for the scientific community. The models are stuck in the existing literature, unable to generate the "novel conceptual breakthroughs" required for true discovery. They can criticize an idea if asked, but they cannot create one. Hossenfelder's verdict is clear: use these tools for background research and data retrieval, but do not trust them with the generation of new theories. "For the time being, my advice would be to use them for literature research and background information... but don't trust them with new ideas."
Bottom Line
Hossenfelder's rigorous testing confirms that while AI has revolutionized information retrieval, it has not yet cracked the code of scientific innovation. The strongest part of her argument is the demonstration that "plausible" is not the same as "correct," a distinction that is vital for any researcher relying on these tools. The biggest vulnerability in the current landscape is the models' tendency to hallucinate confidence, masking fundamental logical errors with fluent prose. Until these systems can distinguish between a sequence of words and a valid proof, the job of the theoretical physicist remains safe.