
Grok 4 - 10 new things to know

Grok 4 is here, and it's generating more noise than any AI model release before. But beneath the hype cycle, there are real insights worth extracting about where this technology stands—and where it might be heading.

The Smartest Model Around—For Now

According to benchmark scores, Grok 4 may actually be the smartest language model currently available. On certain high school math competitions and the well-known science benchmark GPQA (Graduate-Level Google-Proof Q&A), it outperforms both OpenAI's best model and Google's best model. It also performs exceptionally on at least one coding benchmark.


Elon Musk went further, claiming Grok 4 is "smarter than almost all graduate students in all disciplines simultaneously." That quote has been picked up everywhere—but it deserves three important caveats.

First, Grok 4 remains a language model, which means it's still prone to hallucinations. It's not a new paradigm of intelligence. Second, we've heard this kind of hype before. Eighteen months ago, Google DeepMind's CEO Demis Hassabis claimed Gemini 2 was better than almost all human experts, a statement that proved to be an exaggeration then and is likely an exaggeration now. Third, Musk's quote applies specifically to academic questions. Grok 4 performs at a postgrad level in subjects where it has training data, but real-world expertise involves far more than answering multiple-choice questions.

The Benchmark Problem

The benchmark results are misleading for several reasons. Most notably, the y-axis doesn't begin at zero, which visually exaggerates the differences between models. xAI, the company behind Grok 4, also selectively chooses which models to compare against. In one high school math competition, Grok 4 heavily outperforms Gemini Deep Think, but on a coding benchmark called LiveCodeBench, Gemini Deep Think actually outperforms Grok 4 Heavy, and yet it isn't shown in the chart.
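To see how much a truncated axis can inflate a gap, here is a small sketch with made-up scores. The numbers below are purely illustrative, not real benchmark results:

```python
# Illustration of how a truncated y-axis exaggerates differences between bars.
# All scores here are hypothetical, chosen only to show the effect.

def bar_height_ratio(a: float, b: float, axis_start: float) -> float:
    """Ratio of drawn bar heights for scores a > b when the y-axis starts at axis_start."""
    return (a - axis_start) / (b - axis_start)

model_a, model_b = 44.0, 41.0  # two hypothetical benchmark scores

honest = bar_height_ratio(model_a, model_b, axis_start=0.0)      # axis starts at 0
truncated = bar_height_ratio(model_a, model_b, axis_start=40.0)  # axis starts at 40

print(f"honest axis: bar is {honest:.2f}x taller")      # ~1.07x taller
print(f"truncated axis: bar is {truncated:.2f}x taller")  # 4.00x taller
```

A 3-point lead that is barely visible on a zero-based axis looks like a four-fold advantage once the axis starts at 40, which is exactly the distortion at issue.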

When model providers show their own benchmarks, take them with a grain of salt, especially when the answers are available online. None of this explains away Grok 4's brilliant performance on ARC-AGI-2, a semi-private evaluation whose results chart gained nearly three million views on X. It's considered a fairly rigorous test of fluid intelligence, and Grok 4 genuinely does beat other models at picking up latent patterns in data.

The Simple Bench Test

There is a benchmark for how smart a model feels: Simple Bench, which tests social intelligence, trick questions, and spatiotemporal reasoning. Running about twenty questions provides a good estimate. One question is a spin on a common logic puzzle, and Grok 4 actually sees through the trap; it's the first model not to pick the trap answer.

Grok 4 will feel smart, but if you draw it out of its comfort zone, with spatial reasoning for example, it can still fall apart. On one such question, in common with all other models, Grok 4 doesn't notice that a glove would simply fall onto the road. It also often takes an extremely long time to answer.

The author suspects Grok 4 will top the Simple Bench leaderboard, which would suggest its gains aren't just benchmark hacking.

Humanity's Last Exam

One more benchmark worth touching on is Humanity's Last Exam, where in certain settings Grok 4 scores over 50%, by far the best performance of any model. However, this is a knowledge-intensive benchmark heavily dependent on training data: whether a model knows, for example, that hummingbirds have a bilaterally paired oval bone doesn't actually indicate how intelligent it is.

The author predicted last September that this exam would fall sooner than many expected. The "with tools" setting means Grok 4 can write code to perform certain computations rather than relying on recall alone.

How Grok 4 Heavy Works

What exactly is Grok 4 Heavy? It spawns multiple agents in parallel. Those agents work independently, then compare their work and decide which answer is best. It's not a simple majority vote, because often only one agent figures out the trick or the solution; once that agent shares its insight with the others, they essentially compare notes and converge on an answer.
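xAI hasn't published Grok 4 Heavy's internals, but the two-stage pattern described above can be sketched roughly like this, with a placeholder `ask_model` function standing in for real API calls (both the function and its behavior are hypothetical):

```python
# A minimal sketch of the parallel-agents-then-compare-notes pattern.
# `ask_model` is a stand-in for a real model API call, not xAI's actual method.
from concurrent.futures import ThreadPoolExecutor

def ask_model(prompt: str, seed: int) -> str:
    # Placeholder: a real implementation would call a model API here.
    # Different seeds simulate independent agents producing different drafts.
    return f"draft answer from agent {seed}"

def heavy_answer(question: str, n_agents: int = 4) -> str:
    # Stage 1: each agent attempts the question independently, in parallel.
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        drafts = list(pool.map(lambda s: ask_model(question, s), range(n_agents)))

    # Stage 2: instead of a majority vote, all drafts are pooled so an insight
    # found by a single agent can inform the final comparison step.
    notes = "\n".join(drafts)
    return ask_model(f"Compare these attempts and give the best answer:\n{notes}", seed=-1)
```

The key design point is stage 2: a majority vote would bury a correct minority answer, whereas a comparison pass over all drafts lets one agent's breakthrough win.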

That is exactly the premise of SmartGPT, released around 18 months ago, which scored a record performance at the time on a benchmark authored by Dan Hendrycks, who is also the lead author of Humanity's Last Exam.

Text vs. Visual Performance

One thing many might have missed: Grok 4 and Grok 4 Heavy's text-based performance is extremely good, but on full benchmarks that include visual segments, the improvement over Gemini 2.5 Pro is more modest. In other words, you probably shouldn't rely on it for decoding Roman inscriptions.

The Pricing Puzzle

Grok 4 Heavy costs $300 a month, or $3,000 a year. xAI promises new features like video generation coming in October, but Google's Gemini Ultra tier, at a lower price, already includes Veo 3 video generation. If your pockets are deep enough, subscribe to everything; but if this is your only subscription, it's hard to look past the much cheaper $20 Gemini Pro.

For developers, Grok 4's API pricing matches Claude Sonnet at $3 per million input tokens and $15 per million output tokens: a decent price for a frontier model, though there are much cheaper alternatives.

What's Coming Next

If you watched the live stream, Musk repeatedly mentions that new features and new models are coming soon, and Grok 5 may finish training imminently. However, leaks this week indicate Gemini 3 is coming, and of course there's perennial talk of GPT-5 arriving this month.

It used to be that we'd wait six months between a model finishing training and its actual release, due to safety checks: would the model help with creating a bioweapon, for example? That seems to have changed.

Safety and Behavior Concerns

Grok 4 may suffer from a similar issue to Grok 3: sudden urges to praise certain historical figures or fixate on particular countries. That behavior was traced to an addition to the system prompt stating that responses should not shy away from making claims which are politically incorrect. If such a small change can cause wild behavior, anything could happen with Grok 4.

System prompts aren't the only issue: xAI is reportedly burning through $1 billion a month, so either Grok 4 or Grok 5 almost has to start bringing in serious revenue.

There is an awkward point: while it's genuinely impressive how fast xAI has caught up to OpenAI and Google DeepMind, bringing in the generators needed to power its data center came at a local cost in Memphis. And if you thought getting to 100,000 GPUs was fast, they're now planning an entire power plant there to support 1 million AI GPUs.

The Real Value

To end on a positive note: even though Musk said Grok 4 can't generate new scientific discoveries just yet, there's an underrated point demonstrated by a game made with Grok 4's help in just four hours. While models like Grok 4 often struggle to produce new science on their own, what they are optimized for is making existing science and code more accessible.

We probably shouldn't underestimate the impact of allowing everyone to do much more on their own—but you probably shouldn't be using Grok 4 to analyze whether you should vote for a bill.

However, if Grok 4's edge comes from its access to X (formerly Twitter) data, then, at least for Grok 5's sake, let's hope X can clean up the bot replies, spam, and clickbait currently on the platform.


Bottom Line

Grok 4 represents genuine progress in reasoning capability, particularly on academic benchmarks. Its strongest asset is pattern recognition across multiple disciplines. But the biggest vulnerability is that benchmark performance doesn't translate to real-world expertise—and the hype around these models consistently exceeds their actual utility. Watch for Grok 5 and Gemini 3 arriving soon, but don't expect AI to generate new scientific discoveries anytime soon based on current capabilities.


Sources

Grok 4 - 10 new things to know

by AI Explained · AI Explained · Watch video

Grok 4 is out and it's a pretty good AI model, but there is going to be more noise about this language model than possibly any other. So hopefully I can give you a little signal amid the chaos. Let's boil things down to just 10 things to know about the newest and possibly smartest AI model. Point one is that Grok 4 might just be the smartest model around, at least according to the benchmarks.

In certain settings on high school math competitions, it beats out OpenAI's best model and Google's best model. The same is true for a fairly famous science benchmark, the Google-proof Q&A, where it again beats out Anthropic's best model and Google's. Likewise on at least one coding benchmark. But Elon Musk went much further, saying about Grok 4 that, quote, it's smarter than almost all graduate students in all disciplines simultaneously. That quote is of course going to be picked up by everyone, but it needs three important caveats.

First from me is that Grok 4 is still a language model, which means it's still going to be prone to all those hallucinations you're familiar with. It's not a new paradigm of AI. Second, we have heard that kind of hype before, notably from the Google DeepMind CEO Demis Hassabis almost 18 months ago, saying that Gemini 2 was better than almost all human experts: "…amazing about Gemini is that it's so good at so many things.

As we started getting to the end of the training, for example, each of the 50 different subject areas that we tested on, it's as good as the best expert humans in those areas." That was an exaggeration then, and Musk is exaggerating now, because real-world performance doesn't always match up to benchmark performance. Expertise is way more than answering multiple-choice questions. Hence the third bit of context, coming from Musk himself, the CEO of xAI, saying that the quote about being smarter than graduates was "at least with respect to academic questions."

"Grok 4 is at a postgrad level in everything. Some of these things are just worth repeating: Grok 4 is postgraduate, like PhD level, in everything, better than PhD… most PhDs would fail, so it's better. That said, at least with respect to academic questions." Point number two is that I've been highly impressed by Grok …