Sabine Hossenfelder delivers a rare, empirical stress test of artificial intelligence in the realm of theoretical physics, revealing that while these models are brilliant librarians, they remain terrible inventors. By pitting five leading large language models against a decades-old, unsolved Millennium Prize problem in fluid dynamics, she exposes a critical gap between the models' ability to mimic scientific discourse and their capacity for genuine logical reasoning.
The Experiment: A Vague Idea Meets Cold Logic
Hossenfelder approaches the task not with a polished hypothesis, but with a "vague idea" she has carried for twenty years: using the geometry of General Relativity to prove that the Navier-Stokes equations for fluid flow inevitably develop singularities. She acknowledges the high stakes and the likelihood of failure, noting, "More likely there's a good reason for why it can't possibly work that I'm blissfully too stupid to see." This self-deprecating honesty sets the stage for a fair evaluation. She isn't asking the AI to solve the problem instantly; she is asking if it can navigate the complex, contradictory landscape of existing literature to find a path forward that a human might miss.
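For readers unfamiliar with the problem, a brief sketch of the mathematics at stake (standard textbook form, not taken from the article): the incompressible Navier-Stokes equations govern the velocity field $\mathbf{u}$ of a fluid, and the Millennium Prize question asks whether smooth solutions always stay smooth.

```latex
% Incompressible Navier-Stokes: momentum balance and incompressibility
\partial_t \mathbf{u} + (\mathbf{u}\cdot\nabla)\mathbf{u}
  = -\frac{1}{\rho}\nabla p + \nu\,\nabla^2 \mathbf{u},
\qquad
\nabla\cdot\mathbf{u} = 0
```

Here $p$ is pressure, $\rho$ density, and $\nu$ the kinematic viscosity. The open question: given smooth, finite-energy initial data in three dimensions, does $\mathbf{u}$ remain smooth for all time, or can velocity gradients blow up in finite time? Hossenfelder's "vague idea" sits on the blow-up side: arguing that singularities inevitably form.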
The author's methodology is rigorous. She subjects each model to the same prompt, watching how they handle the nuance of her proposal. The results are immediate and telling. When she asks GPT-5 to engage, it initially stumbles, confusing the constraints of the Millennium problem. However, with iterative prompting, it begins to "roughly understand the idea and the steps look kind of reasonable." This suggests that while the model lacks intuition, it possesses a robust ability to align its output with human guidance once the confusion is cleared.
"In practice, I've seen a lot of junk come out of it. Sometimes new, sometimes correct, but rarely both."
This observation cuts to the heart of the current AI hype cycle. Hossenfelder argues that the danger lies not in the models being wrong, but in them being plausibly wrong. They can assemble a sequence of equations that looks like a breakthrough but collapses under scrutiny. Critics might argue that this is simply a limitation of current training data, and that future models will overcome these logical gaps. However, Hossenfelder's experience suggests the issue is deeper: the models are fundamentally designed to predict the next likely word, not to verify the truth of the argument.
The Hallucination of Competence
The performance of the other models varies wildly, often descending into absurdity. Google's "Deep Think" model, despite its expensive subscription price, offers nothing but a rephrasing of Hossenfelder's own text. When pushed, it politely admits defeat, stating, "I cannot generate the novel conceptual breakthroughs that perform the specialized abstract mathematical reasoning required to solve a millennium problem." This is a rare moment of AI honesty, but it highlights a significant limitation: the model knows its boundaries but cannot transcend them.
Gemini, on the other hand, displays a dangerous overconfidence. It initially praises the idea as "brilliant" before declaring it "unworkable" based on a fundamental misunderstanding of the physics involved. Hossenfelder notes that the model "confuses time reversal symmetry with time reversibility" and mistakenly believes the Navier-Stokes equation violates energy conservation. "I think Gemini has a serious self-confidence issue," she writes. This is a crucial finding for busy professionals: an AI that is wrong but sounds authoritative is far more dangerous than one that is silent.
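The physics behind Gemini's two errors can be made precise with one standard identity (textbook material, not quoted from the article). Dotting the Navier-Stokes momentum equation with $\mathbf{u}$ and integrating over space (with decay at infinity) gives the kinetic-energy balance:

```latex
% Kinetic energy is dissipated by viscosity, not destroyed
\frac{d}{dt}\int \tfrac{1}{2}\,|\mathbf{u}|^2 \, dV
  = -\,\nu \int |\nabla\mathbf{u}|^2 \, dV \;\le\; 0
```

The right-hand side shows why both of Gemini's claims miss the mark: the viscous term monotonically drains kinetic energy into heat, so total energy is conserved even though the flow is dissipative; and because that term does not flip sign under $t \to -t$, the equation lacks time-reversal symmetry without being "irreversible" in any energy-violating sense.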
"The LLM idea of a new theory is a plausible looking sequence of arguments, not an actually correct one."
This distinction is the piece's most vital takeaway. The models are excellent at "digging up related work and explaining it," making them powerful tools for literature reviews and brainstorming. However, they fail when asked to synthesize new knowledge. Hossenfelder points out that they constantly conflate similar-sounding concepts, such as "energy" and "free energy," or switch between different mathematical notations mid-argument. If a human student made these errors, they would be corrected immediately. AI models, however, "will bring back these mistakes over and over," creating a feedback loop of error that requires constant human vigilance.
The Verdict: A Tool, Not a Colleague
The final ranking is stark. GPT-5 takes the top spot for its ability to reason through the problem with guidance. Grok 4 follows, offering "cute" but ultimately useless pseudo-Python code. Gemini 2.5 and the expensive Deep Think model trail behind, while Claude Opus 4.1 lands at the bottom for its inability to grasp basic dimensional concepts. "I can't bring up the energy to continue," Hossenfelder admits regarding Claude, a sentiment that speaks volumes about the frustration of dealing with low-quality AI output.
"By my assessment, these models are currently not anywhere near as good as a good student."
This conclusion is a sobering reality check for the scientific community. The models are stuck in the existing literature, unable to generate the "novel conceptual breakthroughs" required for true discovery. They can criticize an idea if asked, but they cannot create one. Hossenfelder's verdict is clear: use these tools for background research and data retrieval, but do not trust them with the generation of new theories. "For the time being, my advice would be to use them for literature research and background information... but don't trust them with new ideas."
Bottom Line
Hossenfelder's rigorous testing confirms that while AI has revolutionized information retrieval, it has not yet cracked the code of scientific innovation. The strongest part of her argument is the demonstration that "plausible" is not the same as "correct," a distinction that is vital for any researcher relying on these tools. The biggest vulnerability in the current landscape is the models' tendency to hallucinate confidence, masking fundamental logical errors with fluent prose. Until these systems can distinguish between a sequence of words and a valid proof, the job of the theoretical physicist remains safe.