Most technical guides treat Graph Retrieval-Augmented Generation (Graph RAG) as a tool for pristine, well-structured databases, but NO BS AI challenges this comfortable assumption head-on. The piece argues that the technology's real-world value shows up not in clean tutorials but in its ability to survive the "messy, real-life data" of actual customer support logs. This is a crucial pivot for engineers and executives who assume their data must be sanitized before it can be useful.
The Reality of Noisy Data
The editors at NO BS AI start by dismantling the idealized view of Graph RAG. "Typically, it is applied to clean data, such as factual documents describing products, procedures, rules, and guidelines," they note, before immediately contrasting this with the chaotic reality of customer service. The core problem identified is that vanilla systems treat every data point as a valid fact, leading to graphs cluttered with irrelevant or dangerous information.
The piece highlights a specific failure mode where the system extracts "Client IDs" and "Client personal data (which somehow escaped anonymization)" as if they were technical specifications. This is not just a data quality issue; it is a compliance nightmare. The argument is compelling because it moves beyond abstract theory to concrete risk: "Storing such information is dangerous," the editors warn, pointing out that systems often mistake temporary, client-specific details for permanent device attributes.
"We don't care about old order IDs from a few months ago, and storing personal data is dangerous."
This observation lands hard for any organization dealing with privacy regulations. By treating a client's specific discount as a permanent rule, the system creates a "risky" knowledge base that could lead to "repeating some special actions as a standard." Critics might argue that better data entry at the source is the real solution, but the editors counter that companies simply do not have all their data stored in clean documents, forcing a need for smarter processing rather than just cleaner input.
Engineering Resilience Through Constraints
To solve this, the piece proposes a shift from passive extraction to active restriction. The strategy involves explicitly telling the system what not to look for. "If you're building your own prompt, build it so that it clearly defines the entity types which should be extracted and which should be ignored," the article advises. This is a practical, often overlooked step that moves the burden of accuracy from the data itself to the prompt engineering.
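To make that concrete, here is a minimal sketch of such a restrictive prompt; the type names and wording are illustrative, not drawn from the article:

```python
# Minimal sketch of an extraction prompt that names both the entity types to
# extract and the types to ignore. All type names here are illustrative.
EXTRACTION_PROMPT = """\
Extract entities and relationships from the support ticket below.

Extract ONLY these entity types: DEVICE, COMPONENT, ERROR_CODE, SYMPTOM, SOLUTION.
IGNORE these entity types entirely: PERSON, CLIENT_ID, ORDER_ID, EMAIL, MONEY,
and any other personally identifying or transaction-specific detail.

Ticket:
{ticket_text}
"""

# Usage: EXTRACTION_PROMPT.format(ticket_text=raw_ticket)
```

Spelling out the ignore list does double duty: it keeps the graph focused and it documents, in one place, exactly which categories the organization has decided are out of scope.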
The editors suggest leveraging standard Named Entity Recognition (NER) types to filter out the noise. In the context of Microsoft's Graph RAG library, this means curating the `entity_types` list to exclude irrelevant categories like `PERSON` when the goal is purely technical diagnosis. "The entity types which are well understood by LLMs are NER entity types, such as PERSON, ORGANIZATION, LOCATION, MONEY, etc.," the piece explains. By removing these from the configuration, the system is forced to focus on what actually matters: the device and the problem.
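In configuration terms, that curation might look like the sketch below, written as a Python dict mirroring the shape of graphrag's `entity_extraction` settings (key names follow the library's documented defaults; verify them against your installed version):

```python
# Curated entity types for a support-log index. PERSON, MONEY, LOCATION and
# similar NER types are deliberately absent so the extractor focuses on the
# device and the problem. The type names themselves are illustrative.
entity_extraction = {
    "entity_types": ["device", "component", "error_code", "symptom", "resolution"],
    "max_gleanings": 1,  # number of follow-up extraction passes
}
```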
"A large relationship weight indicates a more valuable or certain relationship. A smaller value signifies a relationship which is not as worthy."
The most sophisticated part of the argument involves using relationship weights to distinguish between permanent facts and fleeting situations. The editors illustrate this with a clear example: a confirmed device dimension should carry a weight of 9, while a rumor or a temporary discount should carry a weight of 1. This allows the system to "evaluate and prioritize facts based on their reliability or importance" without needing to delete the data entirely.
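As a sketch, the extractor's output might carry those weights like this; the schema and values are hypothetical, modeled on the article's example:

```python
# Illustrative extraction output: the LLM assigns a weight to each
# relationship as it extracts, grading reliability implicitly.
relationships = [
    # Confirmed, measured device attribute: high weight.
    {"source": "Router X200", "target": "250 mm",
     "relation": "has_width", "weight": 9},
    # One-off courtesy discount mentioned in a single ticket: low weight.
    {"source": "Router X200", "target": "15% discount",
     "relation": "offered_with", "weight": 1},
]
```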
However, the piece admits that instructing an LLM to simply "discard" low-value relationships is often ineffective because the prompts become too large and complex. "Including such an instruction in the prompt is ineffective because the prompts are already large," the editors note, explaining that the model tends to overlook specific instructions buried in a sea of text. Instead, the weighted approach fits "seamlessly into the LLM workflow," allowing the system to grade facts implicitly while it processes them.
"After the graph is created, low-value relationships can easily be filtered out based on their weights, potentially with some human oversight."
This nuanced approach acknowledges the limitations of current AI models while offering a pragmatic workaround. It suggests that the future of reliable Graph RAG isn't about perfect data, but about teaching the model to be skeptical of its own findings.
Bottom Line
The strongest part of this argument is its refusal to accept the "clean data" prerequisite as a barrier to entry, offering a concrete methodology for handling the messy reality of customer support logs. Its biggest vulnerability is the reliance on sophisticated few-shot prompting, which requires significant engineering expertise to tune correctly. For organizations ready to deploy Graph RAG, the lesson is clear: the system's intelligence must be defined by what it ignores, not just what it extracts.