
LLMs as Ground Truth

In an era where every dollar spent on artificial intelligence feels like a gamble, a new technical argument suggests we may have been overpaying for a service that traditional tools can handle just as well. NO BS AI challenges the prevailing assumption that large language models must do the heavy lifting in production, proposing instead a hybrid architecture that slashes costs by ninety percent without sacrificing accuracy. This is not just a cost-cutting hack; it is a fundamental rethinking of how we deploy machine learning in real-world customer service.

The Cost of Brittleness

The piece begins by dismantling the allure of the "perfect prompt." While elaborate instructions can coax a large language model into high-quality classification, the editors note that this approach is inherently fragile. "They are brittle - as demonstrated in the article, classification results can often be unstable, influenced by parameters like temperature," the report warns. Beyond instability, the financial burden is steep. The article argues that "detailed prompts like this have their problems... They are costly - their length and level of detail mean they contain a significant number of tokens, which increases expense."


This observation hits a nerve for any organization scaling AI operations. The reliance on massive, token-heavy prompts creates a feedback loop where accuracy demands more money, and more money invites more complexity. NO BS AI suggests that the industry has conflated "smart" with "expensive," overlooking a more efficient path.

The Hybrid Solution

The core of the argument is a two-step process that treats the large language model not as the worker, but as the teacher. The editors describe a pipeline where a refined prompt is used once to generate high-quality training data, effectively turning the expensive model into an annotator. "Today, a well-designed prompt can transform an LLM into an excellent annotator for our purposes," the piece states. This one-time investment creates a dataset of "ground truth" that can then train a significantly cheaper, traditional machine learning classifier like XGBoost.

The results presented are striking. By feeding embeddings from a standard model into a classifier trained on this AI-generated data, the system achieved robust performance on real customer inquiries. The editors report, "We obtained the following results... on a test set normalized to 1000 examples," showing that the hybrid model could handle the vast majority of cases instantly and cheaply. This shifts the paradigm from running a massive model on every single query to using it only where absolutely necessary.
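To make the production path concrete, here is a toy sketch of the inference flow implied above: every incoming query is embedded and classified by the cheap model, so the large model never sits in the hot path. The `embed` function and the classifier are stubbed stand-ins for illustration (the article uses OpenAI's ada-002 and XGBoost); none of these names come from the original.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in embedder; the article uses OpenAI's ada-002.
    return np.array([float("refund" in text), len(text) / 100.0])

class StubClassifier:
    # Stand-in for the trained XGBoost model from the training step.
    def predict(self, X):
        return (X[:, 0] > 0.5).astype(int)  # 1 = 'Class1', 0 = 'Other'

clf = StubClassifier()

def classify(query: str) -> str:
    # Cheap path: embed the query and let the small classifier decide.
    x = embed(query).reshape(1, -1)
    return "Class1" if clf.predict(x)[0] == 1 else "Other"

print(classify("I want a refund"))  # answered instantly, no LLM call
```

The point of the design is that the per-query cost is one embedding call plus one classifier prediction, both orders of magnitude cheaper than a long prompt.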

"By leveraging the 'overfitted' prompts as a source of high-quality training data, we successfully trained a traditional machine learning classifier that performs effectively in production."

The Safety Net Strategy

The most compelling part of the coverage is how it handles the inevitable errors of the cheaper model. The editors acknowledge that no classifier is perfect, particularly with "False Negatives - texts which really belong to 'Class1' but they are classified as 'Other'." Rather than trying to force the cheap model to be perfect, the strategy uses statistical probability to create a safety net. The system flags the top five percent of uncertain cases and routes them to the large language model for a final check.

This targeted approach is where the economics truly shift. The piece explains, "Most likely, the offenders will get classified correctly as 'Other' at the final correctness check," but the real savings come from not checking the easy ones. "We cannot double-check them with our LLM prompt, because we would have to check almost 1000 examples classified as 'Other' - that's definitely too costly." By only querying the large model for the ambiguous cases and the high-confidence positives, the system maintains high quality while drastically reducing token usage.
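The routing logic described above can be sketched in a few lines: rank predictions by how close their probability is to the decision boundary, flag the most uncertain slice, and also re-check the (few) predicted positives. The five percent figure comes from the article; the function name and helper structure are assumptions for illustration.

```python
import numpy as np

def route(probs: np.ndarray, uncertain_frac: float = 0.05):
    """probs: P(class='Class1') from the cheap classifier, one per text.
    Returns (indices to send to the LLM, indices answered cheaply)."""
    uncertainty = np.abs(probs - 0.5)           # closest to 0.5 = least sure
    n_flag = max(1, int(len(probs) * uncertain_frac))
    flagged = np.argsort(uncertainty)[:n_flag]  # most uncertain slice
    positives = np.where(probs >= 0.5)[0]       # double-check predicted hits
    to_llm = np.union1d(flagged, positives)
    cheap = np.setdiff1d(np.arange(len(probs)), to_llm)
    return to_llm, cheap

probs = np.array([0.01, 0.02, 0.49, 0.95, 0.03])
to_llm, cheap = route(probs)
print(to_llm, cheap)
```

Only the ambiguous case (0.49) and the confident positive (0.95) reach the LLM; the clear negatives are never re-checked, which is where the bulk of the savings comes from.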

Critics might note that this approach assumes the initial "ground truth" generated by the large model is flawless. If the prompt contains subtle biases or errors, the traditional classifier will simply learn to replicate them at scale, potentially automating mistakes rather than fixing them. However, the editors counter this by emphasizing the iterative nature of the process, suggesting that the initial prompt can be refined until the training data is robust.

Bottom Line

The strongest part of this argument is its pragmatic refusal to accept that high cost is the price of high intelligence; it proves that a small amount of expensive compute can unlock massive savings in production. Its biggest vulnerability lies in the initial setup, which requires significant engineering expertise to craft the perfect "teacher" prompt. For organizations willing to invest in that upfront design, the path to a ninety percent cost reduction is not a distant dream, but a deployable reality.

Sources

LLMs as Ground Truth

by Various · NO BS AI

In this post I will show:

How to save around 90% of LLM cost of your customer service agent in production.

How to combine LLMs with old-school ML to acquire an accurate and cost-efficient hybrid system.

In our previous article, we described the "overfitting" of LLMs via prompting: https://nobsai.substack.com/p/the-necessity-of-overfitting-llm. By crafting a very precise, elaborate prompt, we were able to carefully detect the true intent of a customer question and assign it to the correct class.

However, detailed prompts like this have their problems:

They are brittle - as demonstrated in the article, classification results can often be unstable, influenced by parameters like temperature.

They are costly - their length and level of detail mean they contain a significant number of tokens, which increases expense.

In this article, I demonstrate a solution to the second problem. In the course of our real-life work, we replaced the expensive LLM with a significantly more affordable model. By leveraging the "overfitted" prompts as a source of high-quality training data, we successfully trained a traditional machine learning classifier that performs effectively in production. This approach enables the system to operate at minimal cost.

The concept is illustrated in the image below:

Before the era of LLMs, a significant amount of time (and money) was usually spent on annotating data. Today, a well-designed prompt can transform an LLM into an excellent annotator for our purposes. While it may incur some cost—since the prompts need to be sufficiently detailed and extensive to ensure high-quality "classification"—this investment is a one-time effort aimed at generating training data.

The pipeline:

Refine your prompt to ensure it is detailed enough to capture the necessary nuances. Incorporate the secrets of the business, so that it correctly identifies the intents in text (as done in https://nobsai.substack.com/p/the-necessity-of-overfitting-llm)

Once the prompt is finalized and delivers satisfactory classification quality, use it to process your data and generate the "ground truth." In my case, I required results for approximately 2,000 examples per class.

Embed the data - I used the ada-002 embedder from OpenAI, without any fine-tuning, and it proved good enough for this case. Much better results will likely come from fine-tuning the embedder - even if it's a smallish model from Hugging Face.

Feed the embeddings, together with their class labels assigned by an LLM, to a classifier like XGBoost.
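The four steps above can be sketched end to end. In this minimal version the LLM annotator and the embedder are replaced with toy stand-ins (the article uses a detailed intent-classification prompt and ada-002), and scikit-learn's GradientBoostingClassifier stands in for XGBoost as a widely available equivalent; all names and data are illustrative, not from the original.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

def llm_annotate(text: str) -> int:
    # Stand-in for the one-time, prompt-driven LLM labelling pass.
    return 1 if "refund" in text else 0

def embed(text: str) -> np.ndarray:
    # Stand-in for ada-002: one fixed-dimension vector per text.
    base = np.full(8, float("refund" in text))
    return base + rng.normal(scale=0.1, size=8)

# Toy corpus standing in for the ~2,000 examples per class.
texts = [f"please refund order {i}" for i in range(50)] + \
        [f"where is my parcel {i}" for i in range(50)]

X = np.stack([embed(t) for t in texts])          # step 3: embeddings
y = np.array([llm_annotate(t) for t in texts])   # step 2: LLM "ground truth"

clf = GradientBoostingClassifier().fit(X, y)     # step 4: cheap classifier
print(clf.score(X, y))
```

After this one-time training pass, the expensive model is out of the loop: production inference needs only an embedding call and a tree-model prediction.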

The result?

I have trained the classifiers for 3 classes. Each class reflects a single type ...