The Engineering Behind the Answer
You ask a simple question: Is the patio heated? What should be a quick answer forces you to scroll through reviews, photos, and attribute fields. Alex Xu documents how Yelp solved this problem by building an AI assistant that retrieves evidence and delivers direct answers with citations. The piece is a rare window into production retrieval-augmented generation (RAG) architecture at scale.
From Demo to Production
The prototype was simple. Batch dumps into Redis. Each business treated as a static bundle. Xu writes, "In production, this approach collapses because content changes continuously and the corpus grows without bound." A stale answer about operating hours is worse than no answer. Yelp's engineering team faced a fundamental tension: real-time ingestion is expensive, but batch processing creates stale responses.
Xu writes, "Treating every data source as real-time makes ingestion expensive to operate while treating everything as a weekly batch results in stale answers." Yelp's solution was hybrid. Reviews and business attributes stream in within ten minutes. Menus and website text update weekly. The system matches freshness to the velocity of the data itself.
"A stale answer regarding operating hours is worse than no answer at all"
Separation of Concerns
Not all questions deserve the same treatment. Xu writes, "Not all questions should be answered the same way. Some require searching through noisy text while others require a single precise fact." Yelp split their storage. Unstructured content — reviews, photos — flows through search indices. Structured facts — hours, amenities — live in Cassandra with an Entity-Attribute-Value layout. This separation prevents hallucinated facts. Adding a new attribute like EV charging requires no migration.
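An EAV layout is easy to picture in miniature. The sketch below is illustrative, not Yelp's actual schema: each fact is a (business, attribute, value) row, so a new attribute like EV charging is just another row rather than a schema migration.

```python
# Illustrative EAV rows, not Yelp's actual tables. In Cassandra CQL the
# table might look like:
#
#   CREATE TABLE business_attributes (
#       business_id text,
#       attribute   text,
#       value       text,
#       PRIMARY KEY (business_id, attribute)
#   );

ROWS = [
    ("biz-42", "hours_mon", "9:00-21:00"),
    ("biz-42", "heated_patio", "true"),
    ("biz-42", "ev_charging", "true"),  # new attribute: a new row, no migration
]

def lookup(business_id: str, attribute: str) -> str | None:
    """Fetch one precise fact by key; no free-text search involved."""
    for biz, attr, value in ROWS:
        if biz == business_id and attr == attribute:
            return value
    return None

print(lookup("biz-42", "heated_patio"))  # -> true
```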
Photos present their own challenge. Caption-only retrieval fails when captions are missing. Embedding-only retrieval misses literal constraints. Xu writes, "Yelp bridged this gap by implementing hybrid retrieval." The system ranks photos using both caption matches and image embedding similarity. A user asking about a heated patio gets results whether "heaters" appears in text or shows as a heat lamp in the image.
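Here is one way such a blend could look, as a hedged sketch: the blending weight `alpha` and the scoring functions are assumptions, but the structure mirrors the described hybrid of caption matching plus embedding similarity.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def hybrid_photo_score(query_terms: list[str], query_emb: np.ndarray,
                       photo: dict, alpha: float = 0.5) -> float:
    """Blend caption keyword overlap with image-embedding similarity.

    A missing caption zeroes the text score, but the embedding score
    still ranks the photo; alpha is an illustrative weight.
    """
    caption = (photo.get("caption") or "").lower()
    text_score = sum(t in caption for t in query_terms) / max(len(query_terms), 1)
    emb_score = cosine(query_emb, photo["embedding"])
    return alpha * text_score + (1 - alpha) * emb_score

# A photo with no caption can still surface via embedding similarity.
rng = np.random.default_rng(0)
photo = {"caption": None, "embedding": rng.normal(size=8)}
print(hybrid_photo_score(["heated", "patio"], rng.normal(size=8), photo))
```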
Critics might note that this architecture assumes Yelp's engineering resources are replicable. Smaller teams cannot afford separate stores, hybrid retrieval, and streaming pipelines. The gap between prototype and production remains one most teams cannot cross.
The Inference Pipeline
Prototypes rely on one large model. The backend stuffs everything into one massive prompt. Xu writes, "While this works for a demo, it collapses under real traffic." Yelp deconstructed the monolith into specialized models. Retrieval finds evidence. A content source selector routes questions to the right store. A keyword generator translates user queries into search terms. Input guardrails block adversarial requests.
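A toy decomposition makes the shape of the pipeline concrete. The stage names below follow the article, but every function body is a stand-in stub, not Yelp's implementation.

```python
# A toy staged pipeline; each stub stands in for a model or service.

def passes_guardrails(q: str) -> bool:
    """Input guardrails: block adversarial requests before any other work."""
    return "ignore previous instructions" not in q.lower()

def classify_inquiry(q: str) -> str:
    """Question analysis: a precise fact, or a noisy-text search?"""
    return "fact" if q.lower().startswith(("is ", "does ", "when ")) else "opinion"

def select_content_source(inquiry: str) -> str:
    """Content source selector: route the question to the right store."""
    return "structured_store" if inquiry == "fact" else "search_index"

def generate_keywords(q: str) -> list[str]:
    """Keyword generator: translate the user query into search terms."""
    return [w for w in q.lower().strip("?").split() if len(w) > 3]

def answer(question: str) -> str:
    if not passes_guardrails(question):
        return "This request was blocked."
    inquiry = classify_inquiry(question)
    source = select_content_source(inquiry)
    keywords = generate_keywords(question)
    evidence = f"hits from {source} for {keywords}"     # retrieval placeholder
    return f"Generated answer grounded in: {evidence}"  # large model in prod

print(answer("Is the patio heated?"))
```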
Xu writes, "Yelp adopted a hybrid approach: Fine-tuning for question analysis... Prompting for final generation." Small fine-tuned models handle Trust & Safety, Inquiry Type, and Source Selection. The large model — GPT-4.1 — handles final generation where nuance matters. This tiered approach improved inference speed by nearly twenty percent.
Serving Efficiency
Latency dropped from over ten seconds to under three. Xu writes, "Yelp optimized serving to reduce latency from over 10 seconds in prototypes to under 3 seconds in production." Streaming renders text token-by-token. Parallelism runs independent tasks concurrently. Early stopping cancels downstream work when guardrails flag a request. Xu writes, "This prevents wasting compute and retrieval resources on blocked queries."
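The concurrency-plus-early-stopping pattern can be shown with plain asyncio. The sketch below is an assumption about structure, not Yelp's code: the guardrail check and retrieval run in parallel, and a blocked request cancels retrieval before it wastes work.

```python
import asyncio

async def guardrail_check(q: str) -> bool:
    await asyncio.sleep(0.05)  # stand-in for a small classifier call
    return "jailbreak" not in q.lower()

async def retrieve(q: str) -> str:
    await asyncio.sleep(0.3)   # stand-in for an index lookup
    return f"evidence for {q!r}"

async def answer(q: str) -> str:
    # Parallelism: launch the guardrail and retrieval concurrently.
    guard = asyncio.create_task(guardrail_check(q))
    retrieval = asyncio.create_task(retrieve(q))
    if not await guard:
        retrieval.cancel()     # early stopping: no compute on blocked queries
        return "blocked"
    return await retrieval

print(asyncio.run(answer("Is the patio heated?")))
```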
The median latency breakdown: question analysis takes 1.4 seconds, retrieval takes 0.03 seconds, time to first byte is 0.9 seconds, and full answer generation takes 3.5 seconds. Streaming reconciles these figures with the sub-three-second claim: users start reading at the first token, well before the full answer finishes.
Evaluation at Scale
Prototype evaluation is informal. Developers try questions and tweak prompts until results feel right. Xu writes, "In production, failures show up as confident hallucinations or technically correct but unhelpful replies." Yelp defined six quality dimensions. An LLM-as-judge system scores each dimension against a strict rubric. Correctness is easy to automate. Tone and style are not. Xu writes, "Rather than forcing an unreliable automated judge early, Yelp tackled this by co-designing principles with their marketing team and enforcing them via curated few-shot examples in the prompt."
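A minimal LLM-as-judge loop looks something like the sketch below. The `judge_llm` callable and the rubric wording are assumptions, and dimension names beyond correctness, tone, and style are illustrative; the article confirms only that six dimensions are scored against a strict rubric.

```python
# A stub LLM-as-judge loop; judge_llm is any callable taking a prompt
# string and returning text. Most dimension names here are illustrative.
DIMENSIONS = ["correctness", "groundedness", "completeness",
              "relevance", "tone", "style"]

RUBRIC = ("Score the answer from 1 to 5 on {dimension}, using the strict "
          "rubric for that dimension. Return only the integer.")

def judge(question: str, answer: str, judge_llm) -> dict[str, int]:
    """Score one answer on every quality dimension, one call per dimension."""
    scores = {}
    for dim in DIMENSIONS:
        prompt = f"{RUBRIC.format(dimension=dim)}\n\nQ: {question}\nA: {answer}"
        scores[dim] = int(judge_llm(prompt))
    return scores

# Smoke test with a stub judge that always answers "4".
print(judge("Is the patio heated?", "Yes, per three reviews.", lambda p: "4"))
```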
Critics might observe that LLM-as-judge systems inherit the biases and blind spots of their judge models. What counts as "correct" depends on the grader's training. Automated evaluation remains an open problem.
Bottom Line
Yelp's assistant demonstrates that production RAG requires architectural discipline: separate stores for structured and unstructured data, tiered models for analysis and generation, and evaluation systems that acknowledge their own limits. The engineering is sound. The resource requirements are steep. Most teams will struggle to replicate this without Yelp's infrastructure budget.