
A real-life framework for RAG evaluation: Beyond the 'looks-good-to-me' metric

Most teams building retrieval-augmented generation systems are flying blind, relying on a subjective 'looks-good-to-me' metric that NO BS AI rightly identifies as a project killer. This piece cuts through the hype of generative AI by arguing that without a rigorous, time-boxed evaluation framework, organizations are doomed to a cycle of endless, directionless iteration. The editors don't just complain about the problem; they offer a specific, five-week roadmap to escape it, making this essential reading for any leader trying to move AI from a demo to a deployed product.

The Trap of Subjectivity

The article opens with a blunt assessment of the current state of the industry: 'Evaluation of RAG systems might seem straightforward in theory, but in practice it's one of the most challenging aspects of building generative AI systems.' The editors note that while the industry focuses on retrieval accuracy and generation quality, the reality of noisy, real-world data makes these metrics slippery. The piece argues that 'it is tempting to rely on a "looks-good-to-me" evaluation metric, which is subjective, inconsistent, and leads to misleading conclusions.'


This framing is crucial because it shifts the blame from the technology to the process. The editors suggest that without structure, teams enter a 'vicious circle of never-ending improvements without a clear understanding of the usefulness of the system.' This is a hard truth for engineering leaders who often prioritize feature velocity over validation rigor. The piece goes so far as to say, 'To be honest, if the team does not want to spend time on developing a solid evaluation framework, I do not want to work on a project.'

Critics might argue that in a fast-moving market, a five-week delay for evaluation is a luxury few can afford. However, the editors counter that skipping this step is a false economy that leads to 'endless delays and dissatisfaction from stakeholders' later on.

Without proper evaluation you enter a vicious circle of never-ending improvements without a clear understanding of the usefulness of the system.

A Structured Path Forward

The core of the argument is a three-phase, five-week plan designed to be 'down to earth' and 'time boxed.' The first phase, lasting three weeks, is dedicated to experimentation. The editors warn that teams must avoid the trap of 'picking a few examples which are "not working" and try to fix them - damaging performance on the majority of data.' Instead, the goal is to reach a point where responses are 'good enough'—not perfect, but coherent. The output here is a collection of 'at least 100 examples of real questions and generated answers' to serve as a baseline.
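To make that baseline concrete, here is a minimal Python sketch of how such a collection might be captured; the field names, the JSONL format, and the 100-example check mirror the article's description but are otherwise assumptions, not code from the piece.

```python
from dataclasses import dataclass, asdict
from typing import List
import json

@dataclass
class BaselineExample:
    """One real question plus the answer a candidate configuration produced."""
    question: str                # actual user question, not a synthetic one
    retrieved_chunks: List[str]  # context the retriever returned
    answer: str                  # answer generated by the RAG pipeline
    config_id: str               # which experimental configuration produced it

def save_baseline(examples: List[BaselineExample], path: str = "baseline.jsonl") -> None:
    """Persist the review set; the article asks for at least 100 real examples."""
    if len(examples) < 100:
        raise ValueError("collect at least 100 real question/answer pairs first")
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(asdict(ex), ensure_ascii=False) + "\n")
```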

The second phase, spanning two weeks, focuses on dataset creation. NO BS AI reports that 'many teams make the critical mistake of skipping this phase or generating synthetic datasets, which often fail to reflect real-world usage.' The editors insist on gathering 'actual user queries from real-life scenarios' to avoid the bias of artificial data. They propose a specific scoring scale, borrowing from Meta's RAG challenge, to categorize responses as 'Perfect,' 'Acceptable,' 'Missing,' or 'Incorrect.'
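A small sketch of how that rubric might be encoded in practice; the label names come from the article, while the numeric weights and the idea of averaging them into a single headline number are illustrative assumptions rather than values the piece specifies.

```python
from enum import Enum

class Grade(Enum):
    """Four-level scale the editors describe, borrowed from Meta's RAG challenge."""
    PERFECT = 1.0      # correct and grounded in the retrieved context
    ACCEPTABLE = 0.5   # useful, but with minor gaps or imprecision
    MISSING = 0.0      # no usable answer was produced
    INCORRECT = -1.0   # confidently wrong: the most damaging outcome

def mean_grade(grades: list[Grade]) -> float:
    """Average grade across a labelled dataset; higher is better."""
    return sum(g.value for g in grades) / len(grades)
```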

This approach is effective because it acknowledges the difficulty of establishing ground truth in customer support, where 'answers can differ significantly between human agents.' By using a human-evaluated scale rather than demanding a single 'correct' answer, the framework becomes more scalable and realistic.

Calibrating the Judge

The final week is dedicated to calibrating an automated 'judge'—an LLM prompted to mimic human grading. The editors describe this step as 'more of an art than a science,' requiring a prompt that correlates with human scores on a test set of 50 responses. The goal is to ensure that 'if evaluator scored response as 1, LLM should also output 1.'
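In code, calibration might look something like the sketch below: run a candidate judge prompt over the human-labelled test set and measure how often it agrees with the human grade. The prompt wording and the `call_llm` helper are placeholders for whatever model client a team already uses; neither comes from the article.

```python
from typing import Callable

# Candidate judge prompt -- the wording here is illustrative, not the article's.
JUDGE_PROMPT = """You are grading a customer-support answer.
Question: {question}
Reference notes from a human agent: {reference}
Answer to grade: {answer}
Reply with exactly one label: PERFECT, ACCEPTABLE, MISSING or INCORRECT."""

def agreement_rate(test_set: list[dict], call_llm: Callable[[str], str]) -> float:
    """Share of the ~50 human-labelled responses where the LLM judge
    reproduces the human grade; iterate on the prompt until this is high."""
    matches = 0
    for row in test_set:
        llm_grade = call_llm(JUDGE_PROMPT.format(**row)).strip().upper()
        if llm_grade == row["human_grade"]:
            matches += 1
    return matches / len(test_set)
```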

Once calibrated, this judge allows teams to 'assess system performance at scale,' measuring retrieval effectiveness across thousands of queries. The piece concludes with a stark warning: 'The key takeaway is: Have the discipline to stop experimenting after three weeks and transition to structured evaluation.'
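Once agreement is acceptable, scaling up is mostly bookkeeping. A possible aggregation, assuming the judge emits the four labels above:

```python
from collections import Counter

def grade_distribution(judge_labels: list[str]) -> dict[str, float]:
    """Run the calibrated judge over the full query log (thousands of queries,
    not 50) and report the share of responses in each grade bucket."""
    counts = Counter(judge_labels)
    total = len(judge_labels)
    return {g: counts[g] / total for g in ("PERFECT", "ACCEPTABLE", "MISSING", "INCORRECT")}
```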

This is the strongest part of the argument: the emphasis on discipline over endless tinkering. It challenges the common engineering instinct to keep optimizing until the code is perfect, suggesting instead that a defined, measurable standard is the only way to prove value to the business.

The key takeaway is: Have the discipline to stop experimenting after three weeks and transition to structured evaluation.

Bottom Line

NO BS AI delivers a pragmatic, actionable framework that exposes the fragility of current AI evaluation practices. Its greatest strength is the insistence on a hard stop to experimentation, forcing teams to confront the reality of their data before scaling. The biggest vulnerability remains the human element—getting stakeholders to agree to a five-week pause in development is a political challenge as much as a technical one, but the piece makes a compelling case that the alternative is failure.

Sources

A real-life framework for RAG evaluation: Beyond the 'looks-good-to-me' metric

by Various · NO BS AI

Evaluation of RAG systems might seem straightforward in theory, but in practice it's one of the most challenging aspects of building generative AI systems, especially in real-world applications like customer support where data is noisy. Many discussions on RAG evaluation center around two primary components:

Retrieval Accuracy – Measuring the percentage of times the system retrieves the correct context required to answer a given question.

Generation Quality – Assessing the correctness and relevance of generated responses, usually with another LLM as a judge.
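The first component above is often operationalised as a hit rate: the share of questions for which a known-relevant chunk appears in the top-k retrieved results. A minimal sketch under that assumption (the article does not fix a formula, and the `retriever.search` interface and field names are hypothetical):

```python
def retrieval_hit_rate(labelled_queries: list[dict], retriever, k: int = 5) -> float:
    """Fraction of queries for which a chunk a human marked as relevant
    shows up in the top-k retrieved results (hit rate @ k)."""
    hits = 0
    for q in labelled_queries:
        retrieved_ids = {c.id for c in retriever.search(q["question"], top_k=k)}
        if q["relevant_chunk_id"] in retrieved_ids:
            hits += 1
    return hits / len(labelled_queries)
```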

However, considering that the questions customers ask are highly diverse and the knowledge can come from non-standard data sources, implementation is far from trivial. That is why it is tempting to rely on a "looks-good-to-me" evaluation metric, which is subjective, inconsistent, and leads to misleading conclusions. Instead, a structured approach to RAG evaluation must be adopted, ensuring rigorous testing and validation before the system is deployed.

To be honest, if the team does not want to spend time on developing a solid evaluation framework, I do not want to work on a project. I have seen it many times and have learned from my mistakes: without proper evaluation you enter a vicious circle of never-ending improvements without a clear understanding of the usefulness of the system. No, thank you.

In this article I am proposing a framework which is:

down to earth

considers the specifics of your data

and is time boxed so you can propose a concrete way forward for your team and have other stakeholders on board.

By following this five-week structured approach—three weeks for experimentation, two weeks for dataset creation, and one week for judge calibration—we ensure that we build a reliable and scalable evaluation framework. This is the way to move beyond the flawed "looks-good-to-me" metric and towards a robust, production-ready RAG system.

Step 1: Experimenting and Finding the Sweet Spot (3 Weeks)

Before conducting a formal evaluation, technical teams should experiment with various configurations to establish a baseline level of acceptable performance. The outputs from this phase are not intended to be production-ready. The primary goal is to generate results that can be reviewed by the evaluation team, allowing them to assess response quality. Since you will be "training" your evaluator on these responses, it's crucial to present a diverse range of questions.

During this stage it is critical to avoid presenting extremely poor ...