
Giving your AI a job interview

Ethan Mollick cuts through the noise of AI hype with a provocative, practical thesis: stop trusting public scores and start conducting your own job interviews. While the industry obsesses over whether a model can answer trivia about Homo erectus, Mollick argues we are blind to the actual, dangerous inconsistencies in how these systems think, reason, and advise.

The Trap of the Public Scorecard

Mollick begins by dismantling the reliability of the very metrics the industry uses to measure progress. He points out that "many benchmarks and their answer keys are public, so some AIs end up incorporating them into their basic training, whether by accident or so they can score highly on these benchmarks." This is a critical observation that mirrors Goodhart's law: when a measure becomes a target, it ceases to be a good measure. If the test is known, the system optimizes for the test, not for intelligence.
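A rough sense of how this kind of contamination gets detected in practice: researchers look for benchmark text that appears verbatim in training data. The sketch below is not from Mollick's piece; it is a deliberately simplified n-gram overlap check, with the eight-word window and the toy strings chosen purely for illustration.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Lowercased word n-grams, a crude unit for spotting verbatim overlap."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(benchmark_item: str, training_chunk: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also appear in a training chunk."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_chunk, n)) / len(item_grams)

# A score near 1.0 suggests the question (and possibly its answer key) was seen during
# training, which is the Goodhart failure mode: the model learned the test, not the skill.
print(contamination_score(
    "What is the approximate mean cranial capacity of Homo erectus?",
    "exam dump: What is the approximate mean cranial capacity of Homo erectus? answer: C",
))
```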


He illustrates the absurdity of current evaluations by noting that popular tests include questions like "What is the approximate mean cranial capacity of Homo erectus?" and asks, "What does getting this right tell us? I have no idea." This rhetorical question lands hard because it exposes the disconnect between a high score and real-world utility. The author notes that even when scores rise, "we don't know if moving from 84% correct to 85% is as challenging as moving from 40% to 41% correct." The data is trending up, but the calibration is broken.
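The calibration complaint is easy to make concrete. If you assume, as psychometricians often do, that difficulty is closer to linear on a log-odds scale than on raw percentages, the same one-point gain is a much bigger jump near the top of the range. A minimal sketch (the logit assumption is itself a simplification, not something the article claims):

```python
from math import log

def log_odds(p: float) -> float:
    """Accuracy expressed as log-odds; equal steps here are roughly equal difficulty steps."""
    return log(p / (1 - p))

for low, high in [(0.40, 0.41), (0.84, 0.85)]:
    gain = log_odds(high) - log_odds(low)
    print(f"{low:.0%} -> {high:.0%}: +{gain:.3f} in log-odds")

# Prints roughly +0.042 for 40% -> 41% and +0.076 for 84% -> 85%: the same
# one-point gain is a noticeably bigger jump near the top of the scale.
```

Whether log-odds is the right scale is debatable; the point is that a raw percentage point is not a unit of difficulty.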

You wouldn't hire a VP based solely on their SAT scores. You shouldn't pick the AI that will advise thousands of decisions for your organization based on whether it knows that the mean cranial capacity of Homo erectus is just under 1,000 cubic centimeters.

Critics might argue that standardized testing is the only scalable way to compare models across thousands of parameters, but Mollick's point is that scalability without validity is a liability. The underlying trend is real—models are getting better—but the specific metrics are often measuring the wrong things.

Benchmarking on Vibes

If the formal tests are flawed, Mollick suggests a more intuitive, albeit subjective, approach: "Benchmarking on Vibes." He describes how power users develop idiosyncratic tests to gauge a model's "world model" and internal logic. He shares his own quirk: asking every image model to "create an otter on a plane." While it sounds whimsical, Mollick explains that "these approaches also give you a sense of the AI's understanding of how things relate to each other."

He demonstrates the power of this method with a writing exercise about a character rationing their remaining words. The results reveal stark differences: "Gemini 2.5 Pro, currently the weakest of these four models, doesn't even accurately keep track of the number of words used," while others struggle with coherence. This section is effective because it shifts the focus from abstract accuracy to tangible behavior. It forces the reader to see the AI not as a database, but as an agent with distinct personality traits and failure modes.
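For readers who want to make their own vibe checks slightly more repeatable, a minimal harness might look like the sketch below. It assumes the OpenAI Python SDK and made-up model identifiers; the prompts are stand-ins inspired by Mollick's examples, and the judging remains human.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK; any chat API could be substituted

client = OpenAI()

# Idiosyncratic prompts in the spirit of Mollick's tests; replace with your own quirks.
PROMPTS = [
    "Write a 120-word scene in which a character rations their last 40 spoken words.",
    "Describe an otter boarding a plane, and what it does once seated.",
]

# Hypothetical model identifiers; substitute the models you are actually comparing.
MODELS = ["gpt-5", "gpt-4o"]

for model in MODELS:
    for prompt in PROMPTS:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        text = response.choices[0].message.content
        # The judging stays human: read the outputs side by side and note who keeps
        # track of constraints, who stays coherent, and who just sounds confident.
        print(f"--- {model} ---\n{prompt}\n{text}\n")
```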

However, relying on vibes has a clear weakness. Mollick admits, "we are relying on our feelings rather than real measures." A single creative prompt might not reveal systemic biases or safety failures. It is a useful heuristic for individuals, but it lacks the rigor needed for enterprise deployment.

The Real-World Interview

The piece pivots to its most compelling argument: organizations must treat AI selection like hiring a human executive. Mollick highlights the GDPval paper from OpenAI as the gold standard for this approach. Instead of multiple-choice questions, researchers created "complex and realistic projects that would take human experts an average of four to seven hours to complete." The results were nuanced: the best models beat humans in software development but "pharmacists, industrial engineers, and real estate agents easily beat the best AI."

This reveals the "Jagged Frontier" of AI capability, where performance varies wildly by task. But Mollick pushes further, arguing that even real-world task performance isn't enough. We must also understand the AI's "attitude" when making decisions. He describes an experiment where he asked various models to rate the viability of a "guacamole via drones" startup. The results were chaotic: "Grok thought this was a great idea, and Microsoft Copilot was excited as well. Other models, like GPT-5 and Claude 4.5, were more skeptical."

When your AI is giving advice at scale, consistently rating ideas 3–4 points higher or lower means consistently steering you in a different direction.

This is the crux of the argument. The difference isn't just a score; it's a strategic divergence. An AI that is consistently more risk-seeking or risk-averse will fundamentally alter a company's trajectory. Mollick writes, "You need to systematically test your AI on the actual work it will do and the actual judgments it will make." This reframes the entire conversation from "which model is the smartest?" to "which model fits our specific risk profile and operational needs?"
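That prescription translates into a small amount of tooling. The sketch below, again assuming the OpenAI Python SDK and hypothetical model names, asks each candidate model to score the same decision repeatedly and compares the distributions; the single-number rubric and the regex parsing are illustrative simplifications, not Mollick's protocol.

```python
import re
import statistics
from openai import OpenAI  # assumes the OpenAI Python SDK; any chat API could be substituted

client = OpenAI()

PITCH = ("Rate from 1 to 10 the viability of a startup delivering fresh guacamole "
         "by drone. Reply with a single number.")
MODELS = ["gpt-5", "gpt-4o"]   # hypothetical identifiers; use your actual candidates
RUNS = 10                      # repeat to separate a model's stance from sampling noise

def first_number(text: str):
    """Pull the first integer or decimal out of a free-text reply, if any."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None

for model in MODELS:
    scores = []
    for _ in range(RUNS):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PITCH}],
        )
        score = first_number(response.choices[0].message.content or "")
        if score is not None:
            scores.append(score)
    if scores:
        # A consistently higher or lower mean is the strategic divergence Mollick warns about.
        print(f"{model}: mean={statistics.mean(scores):.1f} "
              f"spread={statistics.pstdev(scores):.1f} (n={len(scores)})")
    else:
        print(f"{model}: no parseable scores")
```

Swap the toy pitch for the actual judgments your organization delegates, and the same comparison shows whether a model leans more risk-seeking or risk-averse than you want.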

Bottom Line

Mollick's strongest contribution is the shift from passive consumption of benchmarks to active, rigorous interrogation of models. The vulnerability in his argument is the sheer resource intensity required; not every organization has the time or expertise to run "GDPval-style" evaluations repeatedly. However, the alternative—blindly trusting a public score on a test that may be contaminated by the training data—is a far greater risk. The era of trusting the leaderboard is over; the era of the AI job interview has begun.

Deep Dives

Explore these related deep dives:

  • Goodhart's law

    The article discusses how AI benchmarks become problematic when models are optimized specifically for them, a direct application of Goodhart's law ('when a measure becomes a target, it ceases to be a good measure'). This concept provides deeper theoretical grounding for the benchmark gaming issues described.

  • Psychometrics

    The article grapples with fundamental measurement challenges: uncalibrated tests, unknown validity, and what scores actually measure. Psychometrics is the scientific discipline that addresses exactly these problems for measuring human abilities, and the same principles apply to AI evaluation.

  • g factor (psychometrics)

    The article explicitly references an 'underlying ability factor' that different benchmarks seem to measure collectively. This directly parallels the g factor theory in intelligence research: the idea that diverse cognitive tests correlate because they tap into a general intelligence factor.

Sources

Giving your AI a job interview

by Ethan Mollick · One Useful Thing

Given how much energy, literal and figurative, goes into developing new AIs, we have a surprisingly hard time measuring how “smart” they are, exactly. The most common approach is to treat AI like a human, by giving it tests and reporting how many answers it gets right. There are dozens of such tests, called benchmarks, and they are the primary way of measuring how good AIs get over time.

There are some problems with this approach.

First, many benchmarks and their answer keys are public, so some AIs end up incorporating them into their basic training, whether by accident or so they can score highly on these benchmarks. But even when that doesn’t happen, it turns out that we often don’t know what these tests really measure. For example, the very popular MMLU-Pro benchmark includes questions like “What is the approximate mean cranial capacity of Homo erectus?” and “What place is named in the title of the 1979 live album by rock legends Cheap Trick?” with ten possible answers for each. What does getting this right tell us? I have no idea. And that is leaving aside the fact that tests are often uncalibrated, meaning we don’t know if moving from 84% correct to 85% is as challenging as moving from 40% to 41% correct. And, on top of all that, for many tests, the actual top score may be unachievable because there are many errors in the test questions and measures are often reported in unusual ways.

Despite these issues, all of these benchmarks, taken together, appear to measure some underlying ability factor. And higher-quality benchmarks like ARC-AGI and METR Long Tasks show the same upward, even exponential, trend. This matches tests of the real-world impact of AI across industries that suggest that this underlying increase in “smarts” translates to actual ability in everything from medicine to finance.

So, collectively, benchmarking has real value, but the few robust individual benchmarks focus on math, science, reasoning, and coding. If you want to measure writing ability or sociological analysis or business advice or empathy, you have very few options. I think that creates a problem, both for individuals and organizations. Companies decide which AIs to use based on benchmarks, and new AIs are released with fanfare about benchmark performance. But what you actually care about is which model would be best for YOUR needs.

To figure this out for yourself, you ...