A pragmatic guide to LLM evals for devs

Gergely Orosz delivers a necessary corrective to the current AI engineering frenzy, arguing that the industry's obsession with "vibes-based development" is a recipe for catastrophic failure. While many treat Large Language Models as magical black boxes, Orosz insists that without rigorous, data-driven evaluation, these systems are nothing more than expensive guesswork. This piece matters because it moves beyond the hype to offer a concrete, repeatable engineering discipline that bridges the gap between experimental AI and reliable software.

The Trap of Subjectivity

The article opens by dismantling the illusion that LLMs can be tested like traditional software. Orosz writes, "LLMs are non-deterministic, meaning there's no guarantee they'll provide the same answer to the same question twice." This fundamental difference breaks the standard testing models engineers rely on. Instead of checking for a single correct output, developers are now navigating a landscape of infinite valid possibilities. Orosz identifies a dangerous pattern he calls the "vibes-based development" trap, where teams ship products based on a superficial "looked good to me" assessment.
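
To make that concrete, here is a minimal sketch in Python of the difference (the ask_llm stub and the support-ticket example are invented for illustration): a traditional exact-match assertion is brittle against non-deterministic output, so checks have to target properties that any acceptable answer must have.

    import re

    def ask_llm(prompt: str) -> str:
        # Stand-in for a real model call; in production the wording varies run to run.
        return "We will refund the duplicate charge within 5 business days."

    def test_exact_match():
        # Traditional unit-test style: breaks as soon as the model rephrases the answer.
        assert ask_llm("Summarize the ticket") == "We will refund the duplicate charge."

    def test_output_properties():
        # Eval-style check: assert characteristics of a valid answer instead.
        out = ask_llm("Summarize the ticket")
        assert len(out) <= 300                              # stays within the length budget
        assert "refund" in out.lower()                      # mentions the key action
        assert not re.search(r"\bguarantee\b", out, re.I)   # avoids disallowed promises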

This framing is effective because it names a behavior many engineers recognize but lack the vocabulary to critique. The author argues that this approach fails because it ignores the "Three Gulfs": the gap between developer understanding and model behavior, the gap between intent and prompt specification, and the gap between instructions and generalization. As Orosz puts it, "You can't test for correctness before you've systematically observed the range of possible outputs and have defined what 'good' even means for your product." This is a crucial pivot point; it suggests that the industry is trying to solve the wrong problem by applying old solutions to new constraints.

"Many teams are tempted to grab a pre-built 'hallucination score' or 'helpfulness' eval, but in my experience, these metrics are often worse than useless."

Critics might argue that off-the-shelf metrics provide a necessary baseline for early-stage startups, but Orosz counters that these generic scores create a false sense of security. He illustrates this with a mental health startup that tracked "helpfulness" on a 1-5 scale, only to find the data unactionable. The argument here is that without context, numbers are just noise. This aligns with the principles of grounded theory, where categories must emerge from the data itself rather than being imposed from the top down. Orosz is right to warn that optimizing for vanity metrics can lead teams to ignore the actual friction points their users face.

From Chaos to Codified Error Analysis

The core of Orosz's proposal is a return to error analysis, a discipline that has been central to machine learning for decades but is often skipped in the rush to deploy generative AI. He details a workflow where teams manually review conversation traces, a process he describes as "unglamorous" but essential. By building custom data viewers, the NurtureBoss team (one of the article's case studies) was able to move from random sampling to a systematic review of hundreds of interactions. This mirrors the historical rigor of Test-Driven Development, yet Orosz acknowledges that TDD fails here because "there isn't one right answer, but thousands."

The methodology relies on a two-step coding process: open coding, where engineers jot down descriptive notes on failures without predefined checklists, and axial coding, where these notes are grouped into themes. Orosz writes, "Let the data speak for itself, and jot down descriptive observations like: 'The agent missed a clear opportunity to re-engage a price-sensitive user.'" This bottom-up approach ensures that the evaluation criteria are rooted in real user pain rather than theoretical ideals. The author's emphasis on identifying the "first upstream failure" is particularly sharp, noting that fixing a single root cause often resolves a cascade of downstream errors.
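
As a rough sketch of how those two passes fit together (the trace notes and theme labels below are invented for the example), open coding produces free-text observations per trace, and axial coding groups them into emergent failure themes whose counts tell you where to focus:

    from collections import Counter

    # Open coding: free-text notes written while reading raw traces, no predefined checklist.
    open_codes = [
        {"trace": "t-101", "note": "missed a clear opportunity to re-engage a price-sensitive user"},
        {"trace": "t-102", "note": "quoted the wrong move-in date from the listing"},
        {"trace": "t-103", "note": "kept answering instead of handing a frustrated user to a human"},
        {"trace": "t-104", "note": "quoted an outdated rent amount"},
    ]

    # Axial coding: after reading the notes, group them under shared themes.
    theme_for_trace = {
        "t-101": "missed re-engagement",
        "t-102": "wrong factual detail",
        "t-103": "missed human handoff",
        "t-104": "wrong factual detail",
    }

    theme_counts = Counter(theme_for_trace[c["trace"]] for c in open_codes)
    print(theme_counts.most_common())  # the most frequent themes become candidates for formal evals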

"This bottom-up process is the single highest-ROI activity in AI development. It ensures you're solving real problems, not chasing vanity metrics."

The distinction between deterministic and subjective failures is where the article offers its most practical engineering advice. For tasks like date extraction, Orosz advocates for code-based evals, which function like traditional unit tests. However, for nuanced decisions like when to hand off a conversation to a human, he proposes an LLM-as-judge approach. The judge itself must be validated against human expertise, and its verdicts measured on data it was not tuned on, so that it is not simply memorizing answers. Orosz warns, "Avoid your LLM judge memorizing answers by partitioning your data and measuring how well the judge generalizes to unfamiliar data." This nuance is vital; it prevents the evaluation system from becoming a self-fulfilling prophecy where the AI just grades itself on criteria it invented.
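
A minimal sketch of the two styles, under assumed interfaces (extract_move_in_date and judge_should_handoff are hypothetical stand-ins for a pipeline step and a judge call): the deterministic check is a plain assertion, while the subjective check is scored by agreement with human labels on a held-out partition the judge was never tuned on.

    import datetime
    import random

    # Code-based eval: deterministic, unit-test style.
    def extract_move_in_date(message: str) -> str:
        # Hypothetical pipeline step under test; assume it returns an ISO date string.
        return "2025-03-01"

    def eval_date_extraction() -> bool:
        got = extract_move_in_date("I'd like to move in on March 1st, 2025.")
        return datetime.date.fromisoformat(got) == datetime.date(2025, 3, 1)

    # LLM-as-judge eval: subjective, so measure agreement with human labels.
    def judge_should_handoff(trace: str) -> bool:
        # Stand-in for the real judge call (prompt plus rubric sent to a model, verdict parsed out).
        return "frustrated" in trace.lower()

    def split_labeled_traces(labeled, dev_fraction=0.5, seed=0):
        # Partition the human-labeled traces: tune the judge prompt on `dev`,
        # report agreement only on `holdout`, so the judge cannot just memorize
        # the examples it was aligned against.
        labeled = list(labeled)
        random.Random(seed).shuffle(labeled)
        cut = int(len(labeled) * dev_fraction)
        return labeled[:cut], labeled[cut:]

    def holdout_agreement(holdout) -> float:
        hits = sum(judge_should_handoff(trace) == human_label for trace, human_label in holdout)
        return hits / len(holdout)

    # Example use: labeled = [("user sounds frustrated about fees", True), ("asks about parking", False)]
    # dev, holdout = split_labeled_traces(labeled); print(holdout_agreement(holdout))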

The Flywheel of Improvement

Ultimately, Orosz frames evaluation not as a one-time checkpoint but as a continuous loop. He describes a "flywheel of improvement" where teams analyze data, measure performance, improve the model, and automate the process before starting again. This approach integrates evaluation directly into the CI/CD pipeline, ensuring that regressions are caught immediately. The argument is that the "vibes" phase must end if AI engineering is to mature into a reliable discipline. As Orosz concludes, the goal is to move from guesswork to a repeatable engineering discipline that can scale.
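
A sketch of what the CI step of that loop could look like, assuming the team maintains a golden set of labeled traces in a JSON file (the file name, schema, and threshold here are illustrative): run the eval suite on every change and fail the build if the pass rate drops below an agreed floor.

    import json
    import sys

    PASS_RATE_FLOOR = 0.95  # illustrative threshold agreed by the team

    def evaluate(case: dict) -> bool:
        # Placeholder: run the pipeline on case["input"] and apply the matching
        # code-based or judge-based eval; here we just read a precomputed label.
        return bool(case.get("passed"))

    def run_eval_suite(golden_path: str) -> float:
        with open(golden_path) as f:
            cases = json.load(f)  # e.g. [{"input": "...", "passed": true}, ...]
        return sum(evaluate(c) for c in cases) / len(cases)

    if __name__ == "__main__":
        rate = run_eval_suite("golden_traces.json")
        print(f"eval pass rate: {rate:.1%}")
        sys.exit(0 if rate >= PASS_RATE_FLOOR else 1)  # a non-zero exit fails the CI job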

"This is the antidote to the problems of generic, off-the-shelf metrics. Many teams are tempted to grab a pre-built 'hallucination score' or 'helpfulness' eval, but in my experience, these metrics are often worse than useless."

The strongest aspect of this piece is its refusal to treat AI as magic. By grounding the discussion in established software engineering principles and qualitative research methods, Orosz provides a roadmap for stability in a chaotic field. However, the approach demands a level of manual labor and tooling investment that may be prohibitive for smaller teams without significant resources. While the methodology is sound, the barrier to entry for building custom data viewers and maintaining large golden datasets remains a practical hurdle.

Bottom Line

Orosz's argument is a vital intervention that shifts the conversation from model capabilities to system reliability. The piece's greatest strength is its insistence that error analysis must be data-driven and bottom-up, rather than relying on generic, off-the-shelf metrics. The biggest vulnerability lies in the execution: the rigorous, manual processes required to build these evaluation systems are resource-intensive and may slow down the rapid iteration cycles many startups crave. Engineers should watch for how the industry adapts these manual workflows into automated tooling, as that will determine whether this discipline becomes standard practice or remains an elite luxury.

Deep Dives

Explore these related deep dives:

  • Grounded theory

    Linked in the article (30 min read)

  • Likert scale

    Linked in the article (13 min read)

  • Test-driven development

    The article discusses how traditional TDD falls short for LLM applications due to non-deterministic outputs. Understanding the formal methodology of TDD - its red-green-refactor cycle, origins in extreme programming, and assumptions about deterministic correctness - provides essential context for why LLM evals represent a paradigm shift in software quality assurance.

Sources

A pragmatic guide to LLM evals for devs

One word that keeps cropping up when I talk with software engineers who build large language model (LLM)-based solutions is “evals”. They use evaluations to verify that LLM solutions work well enough because LLMs are non-deterministic, meaning there’s no guarantee they’ll provide the same answer to the same question twice. This makes it more complicated to verify that things work according to spec than it does with other software, for which automated tests are available.

Evals feel like they are becoming a core part of the AI engineering toolset. And because they are also becoming part of CI/CD pipelines, we, software engineers, should understand them better — especially because we might need to use them sooner rather than later! So, what do good evals look like, and how should this non-deterministic-testing space be approached?

For directions, I turned to an expert on the topic, Hamel Husain. He’s worked as a Machine Learning engineer at companies including Airbnb and GitHub, and teaches the online course AI Evals For Engineers & PMs — the upcoming cohort starts in January. Hamel is currently writing a book, Evals for AI Engineers, to be published by O’Reilly next year.

In today’s issue, we cover:

Vibe-check development trap. An agent appears to work well, but as soon as it is modified, it can’t be established that it’s working correctly.

Core workflow: error analysis. Error analysis has been a key part of machine learning for decades and is useful for building LLM applications.

Building evals: the right tools for the job. Use code-based evals for deterministic failures, and an LLM-as-judge for subjective cases.

Building an LLM-as-judge. Avoid your LLM judge memorizing answers by partitioning your data and measuring how well the judge generalizes to unfamiliar data.

Align the judge, keep trust. The LLM judge’s expertise needs to be validated against human expertise. Consider metrics like True Positive Rate (TPR) and True Negative Rate (TNR); a short sketch of these two metrics follows this list.

Evals in practice: from CI/CD to production monitoring. Use evals in the CI/CD pipeline, but use production data to continuously validate that they work as expected, too.

Flywheel of improvement. Analyze → Measure → Improve → Automate → Start again
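
The TPR/TNR comparison mentioned in the "Align the judge, keep trust" item above can be sketched in a few lines (the labels below are invented): compare the judge's verdicts with human verdicts on the same traces and report how often they agree in each direction.

    # Invented labels for eight traces: did a human expert say the agent should hand off?
    human = [True, True, False, True, False, False, True, False]
    # The LLM judge's verdicts on the same eight traces.
    judge = [True, False, False, True, False, True, True, False]

    tp = sum(h and j for h, j in zip(human, judge))
    fn = sum(h and not j for h, j in zip(human, judge))
    tn = sum((not h) and (not j) for h, j in zip(human, judge))
    fp = sum((not h) and j for h, j in zip(human, judge))

    tpr = tp / (tp + fn)  # of the cases humans flagged, how many did the judge also flag?
    tnr = tn / (tn + fp)  # of the cases humans passed, how many did the judge also pass?
    print(f"TPR={tpr:.2f}  TNR={tnr:.2f}")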

With that, it’s over to Hamel:

1. Vibe-check development trap.

Organizations are embedding LLMs into applications from customer service to content creation. Yet, unlike traditional software, LLM pipelines don’t produce deterministic outputs; their responses are often subjective and context-dependent. A response might be ...