Grok 4 is here, and it's generating more noise than any AI model release before. But beneath the hype cycle, there are real insights worth extracting about where this technology stands—and where it might be heading.
The Smartest Model Around—For Now
According to benchmark scores, Grok 4 may actually be the smartest language model currently available. On certain high school math competitions and a well-known science benchmark called GPQA (Graduate-Level Google-Proof Q&A), it outperforms both OpenAI's best model and Google's best model. It also performs exceptionally on at least one coding benchmark.
Elon Musk went further, claiming Grok 4 is "smarter than almost all graduate students in all disciplines simultaneously." That quote has been picked up everywhere—but it deserves three important caveats.
First, Grok 4 remains a language model, which means it's still prone to hallucinations; it's not a new paradigm of intelligence. Second, we've heard this kind of hype before. Eighteen months ago, Google DeepMind's CEO claimed Gemini 2 was better than almost all human experts, a statement that proved to be an exaggeration then, just as Musk's is likely an exaggeration now. Third, Musk's quote applies specifically to academic questions. Grok 4 performs at a postgraduate level in subjects where it has training data, but real-world expertise involves far more than answering multiple-choice questions.
The Benchmark Problem
The benchmark results are misleading for several reasons. Most notably, the y-axis doesn't begin at zero, which visually exaggerates the differences between models. xAI, the company behind Grok 4, also selectively chooses which models to compare against. In one high school math competition, Grok 4 heavily outperforms Gemini Deep Think, but on a coding benchmark called LiveCodeBench, Gemini Deep Think actually outperforms Grok 4 Heavy, and yet it's not shown in the chart.
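The y-axis trick is easy to quantify. A minimal sketch, using made-up scores, of how the ratio of drawn bar heights changes when the axis starts above zero:

```python
# How a truncated y-axis exaggerates differences between two bars.
# The scores here are hypothetical, purely for illustration.

def apparent_ratio(a, b, y_min):
    """Ratio of bar heights as drawn when the axis starts at y_min."""
    return (a - y_min) / (b - y_min)

score_a, score_b = 88.0, 84.0
print(apparent_ratio(score_a, score_b, 0))   # honest axis: bars look nearly equal (~1.05)
print(apparent_ratio(score_a, score_b, 80))  # truncated axis: one bar looks twice as tall (2.0)
```

A 4-point gap on a 0-to-100 scale is about a 5% visual difference; start the axis at 80 and the same gap is drawn as a 2-to-1 difference.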
When model providers present their own benchmarks, take them with a grain of salt, especially when answers are available online. None of this, however, fully explains Grok 4's brilliant performance on ARC-AGI-2, a semi-private evaluation whose results have gained nearly three million views on X. This is considered a fairly rigorous test of fluid intelligence, and Grok 4 genuinely does beat other models at picking up latent patterns in data.
The Simple Bench Test
There is a benchmark for how smart a model feels: Simple Bench, which tests social intelligence, trick questions, and spatiotemporal reasoning. Running roughly twenty questions gives a good estimate. One question is a spin on a common logic puzzle, and Grok 4 actually sees through the trap; it's the first model not to pick the trap answer.
Grok 4 will feel smart, but if you draw it out of its comfort zone, with spatial reasoning, for example, it can still fall apart. In one question, in common with all other models, Grok 4 doesn't notice that a glove will simply fall onto the road. It also frequently takes an extremely long time to answer.
The author suspects Grok 4 will top the Simple Bench leaderboard, which would suggest its results aren't just benchmark hacking.
Humanity's Last Exam
One more benchmark worth touching on is Humanity's Last Exam, where under certain settings Grok 4 scores over 50%, by far the best performance of any model. However, this is a knowledge-intensive benchmark heavily dependent on training data. Whether a model knows, for example, that hummingbirds have a bilaterally paired oval bone doesn't actually indicate how intelligent it is.
The author predicted last September that this exam would fall sooner than many expected. Note that "with tools" means Grok 4 can write code to perform certain computations.
How Grok 4 Heavy Works
What exactly is Grok 4 Heavy? It spawns multiple agents in parallel. Those agents work independently, then compare their work and decide which has the best answer. It's not a simple majority vote, because often only one agent figures out the trick or solution. Once that agent shares its insight with the others, they essentially compare notes and converge on an answer.
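xAI has not published implementation details, but the described loop can be sketched as two phases: independent parallel drafts, then a note-comparison round where any discovered insight is shared before a final vote. Everything below (`agent_solve`, the toy "trap" insight) is an illustrative stand-in, not xAI's actual system:

```python
import concurrent.futures
from collections import Counter

def agent_solve(question, spots_trick=False, hints=frozenset()):
    """Toy agent: answers 'A' unless it, or a shared hint, reveals the trap."""
    if spots_trick or "trap" in hints:
        return {"answer": "B", "insight": "trap"}
    return {"answer": "A", "insight": None}

def heavy_answer(question, n_agents=4):
    # Phase 1: independent attempts in parallel; only agent 0 spots the trick.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        drafts = list(pool.map(
            lambda i: agent_solve(question, spots_trick=(i == 0)),
            range(n_agents)))
    # A plain majority vote over first drafts would pick the wrong answer.
    majority = Counter(d["answer"] for d in drafts).most_common(1)[0][0]
    # Phase 2: compare notes -- pool every discovered insight, then
    # have all agents re-answer with the shared hints.
    hints = frozenset(d["insight"] for d in drafts if d["insight"])
    finals = [agent_solve(question, hints=hints) for _ in range(n_agents)]
    consensus = Counter(f["answer"] for f in finals).most_common(1)[0][0]
    return majority, consensus

majority, consensus = heavy_answer("toy trick question")
print(majority, consensus)  # A B: one agent's insight flips the whole group
```

The point of the sketch is the contrast between the two votes: three of four first drafts are wrong, so majority voting fails, but sharing the lone agent's insight flips the consensus.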
That is exactly the premise of SmartGPT, released around 18 months ago, which scored a record performance on the MMLU exam at the time. Ironically, that exam was authored by Dan Hendrycks, who is also the lead author of Humanity's Last Exam.
Text vs. Visual Performance
One thing many might have missed: Grok 4 and Grok 4 Heavy's text-based performance is extremely good, but on full benchmarks that include visual segments, the improvement over Gemini 2.5 Pro is more modest. In other words, you probably shouldn't rely on it for decoding Roman inscriptions.
The Pricing Puzzle
Grok 4 Heavy costs $3,000 a year or $300 a month. xAI promises new features like video generation coming in October, but Google's Gemini Ultra tier already offers Veo 3 video generation at a lower price. If your pockets are deep enough, subscribe to everything; but if this would be your only subscription, it's hard to look past the much cheaper $20 Gemini Pro tier.
For developers, Grok 4's API pricing matches Claude Sonnet's at $3 per million input tokens and $15 per million output tokens, a decent price for a frontier model, though there are much cheaper alternatives.
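At those rates, per-request costs are easy to estimate. A back-of-envelope sketch, where the token counts are made up for illustration:

```python
# Cost per request at Grok 4's listed API rates:
# $3 per million input tokens, $15 per million output tokens.
INPUT_RATE = 3.00 / 1_000_000    # dollars per input token
OUTPUT_RATE = 15.00 / 1_000_000  # dollars per output token

def request_cost(input_tokens, output_tokens):
    """Dollar cost of a single API call."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# e.g. a 2,000-token prompt with a 1,000-token reply:
print(f"${request_cost(2_000, 1_000):.4f}")  # $0.0210
```

At roughly two cents per mid-sized request, costs stay modest for casual use but compound quickly for high-volume or long-context workloads.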
What's Coming Next
If you watched the livestream, Musk mentioned repeatedly that new features and new models are coming soon. Grok 5 may be finishing training imminently. However, leaks this week indicate Gemini 3 is coming, and of course there's perennial talk of GPT-5 arriving this month.
It used to be that we'd wait six months between training and actual release due to safety checks, such as whether a model would help with creating a bioweapon. But that seems to have changed.
Safety and Behavior Concerns
Grok 4 may suffer from a similar issue to Grok 3: it seems to get sudden urges to praise certain historical figures or fixate on particular countries. That behavior was caused by an addition to the system prompt stating that responses should not shy away from making claims which are politically incorrect. If such a small change causes wildly erratic behavior, anything could happen with Grok 4.
System prompts aren't the only issue: xAI is reportedly burning through $1 billion a month, so either Grok 4 or Grok 5 almost has to bring in more revenue.
There is an awkward point here too: while it's undeniably impressive how fast xAI has caught up to OpenAI and Google DeepMind, bringing in the generators needed to power its data center came at a cost to the local community in Memphis. And if you thought it was wild how quickly they got to 100,000 GPUs, they're now planning to bring an entire power plant to Memphis to run 1 million AI GPUs.
The Real Value
To end on a positive note: even though Musk said Grok 4 can't generate new scientific discoveries just yet, there is an underrated point demonstrated by a game made with Grok 4's help in just four hours. While models like Grok 4 often struggle to generate new science on their own, what they are optimized for is making existing science and code more accessible.
We probably shouldn't underestimate the impact of allowing everyone to do much more on their own—but you probably shouldn't be using Grok 4 to analyze whether you should vote for a bill.
However, if Grok 4's edge comes from its access to X (formerly Twitter) data, then at least for Grok 5's sake, let's hope X can clean up the bot replies, spam, and clickbait currently flooding the platform.
Bottom Line
Grok 4 represents genuine progress in reasoning capability, particularly on academic benchmarks. Its strongest asset is pattern recognition across multiple disciplines. But the biggest vulnerability is that benchmark performance doesn't translate to real-world expertise—and the hype around these models consistently exceeds their actual utility. Watch for Grok 5 and Gemini 3 arriving soon, but don't expect AI to generate new scientific discoveries anytime soon based on current capabilities.