The Engineering Behind the Answer
You ask a simple question: Is the patio heated? What should be a quick answer forces you to scroll through reviews, photos, and attribute fields. Alex Xu documents how Yelp solved this problem by building an AI assistant that retrieves evidence and delivers direct answers with citations. The piece is a rare window into production retrieval-augmented generation (RAG) architecture at scale.
From Demo to Production
The prototype was simple. Batch dumps into Redis. Each business treated as a static bundle. Xu writes, "In production, this approach collapses because content changes continuously and the corpus grows without bound." A stale answer about operating hours is worse than no answer. Yelp's engineering team faced a fundamental tension: real-time ingestion is expensive, but batch processing creates stale responses.
Xu writes, "Treating every data source as real-time makes ingestion expensive to operate while treating everything as a weekly batch results in stale answers." Yelp's solution was hybrid. Reviews and business attributes stream in within ten minutes. Menus and website text update weekly. The system matches freshness to the velocity of the data itself.
"A stale answer regarding operating hours is worse than no answer at all"
Separation of Concerns
Not all questions deserve the same treatment. Xu writes, "Not all questions should be answered the same way. Some require searching through noisy text while others require a single precise fact." Yelp split their storage. Unstructured content — reviews, photos — flows through search indices. Structured facts — hours, amenities — live in Cassandra with an Entity-Attribute-Value layout. This separation prevents hallucinated facts. Adding a new attribute like EV charging requires no migration.
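An EAV layout is easy to picture in miniature. The sketch below is illustrative, not Yelp's actual schema: each fact is a (business, attribute, value) row, so a new attribute like EV charging is just another row rather than a schema migration.

```python
# Illustrative EAV rows, not Yelp's actual tables. In Cassandra CQL the
# table might look like:
#
#   CREATE TABLE business_attributes (
#       business_id text,
#       attribute   text,
#       value       text,
#       PRIMARY KEY (business_id, attribute)
#   );

ROWS = [
    ("biz-42", "hours_mon", "9:00-21:00"),
    ("biz-42", "heated_patio", "true"),
    ("biz-42", "ev_charging", "true"),  # new attribute: a new row, no migration
]

def lookup(business_id: str, attribute: str) -> str | None:
    """Fetch one precise fact by key; no free-text search involved."""
    for biz, attr, value in ROWS:
        if biz == business_id and attr == attribute:
            return value
    return None

print(lookup("biz-42", "heated_patio"))  # -> true
```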
Photos present their own challenge. Caption-only retrieval fails when captions are missing. Embedding-only retrieval misses literal constraints. Xu writes, "Yelp bridged this gap by implementing hybrid retrieval." The system ranks photos using both caption matches and image embedding similarity. A user asking about a heated patio gets results whether "heaters" appears in text or shows as a heat lamp in the image.
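Here is one way such a blend could look, as a hedged sketch: the blending weight `alpha` and the scoring functions are assumptions, but the structure mirrors the described hybrid of caption matching plus embedding similarity.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def hybrid_photo_score(query_terms: list[str], query_emb: np.ndarray,
                       photo: dict, alpha: float = 0.5) -> float:
    """Blend caption keyword overlap with image-embedding similarity.

    A missing caption zeroes the text score, but the embedding score
    still ranks the photo; alpha is an illustrative weight.
    """
    caption = (photo.get("caption") or "").lower()
    text_score = sum(t in caption for t in query_terms) / max(len(query_terms), 1)
    emb_score = cosine(query_emb, photo["embedding"])
    return alpha * text_score + (1 - alpha) * emb_score

# A photo with no caption can still surface via embedding similarity.
rng = np.random.default_rng(0)
photo = {"caption": None, "embedding": rng.normal(size=8)}
print(hybrid_photo_score(["heated", "patio"], rng.normal(size=8), photo))
```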
Critics might note that this architecture assumes Yelp's engineering resources are replicable. Smaller teams cannot afford separate stores, hybrid retrieval, and streaming pipelines. The gap between prototype and production remains one most teams cannot cross.
The Inference Pipeline
Prototypes rely on one large model. The backend stuffs everything into one massive prompt. Xu writes, "While this works for a demo, it collapses under real traffic." Yelp deconstructed the monolith into specialized models. Retrieval finds evidence. A content source selector routes questions to the right store. A keyword generator translates user queries into search terms. Input guardrails block adversarial requests.
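A toy decomposition makes the shape of the pipeline concrete. The stage names below follow the article, but every function body is a stand-in stub, not Yelp's implementation.

```python
# A toy staged pipeline; each stub stands in for a model or service.

def passes_guardrails(q: str) -> bool:
    """Input guardrails: block adversarial requests before any other work."""
    return "ignore previous instructions" not in q.lower()

def classify_inquiry(q: str) -> str:
    """Question analysis: a precise fact, or a noisy-text search?"""
    return "fact" if q.lower().startswith(("is ", "does ", "when ")) else "opinion"

def select_content_source(inquiry: str) -> str:
    """Content source selector: route the question to the right store."""
    return "structured_store" if inquiry == "fact" else "search_index"

def generate_keywords(q: str) -> list[str]:
    """Keyword generator: translate the user query into search terms."""
    return [w for w in q.lower().strip("?").split() if len(w) > 3]

def answer(question: str) -> str:
    if not passes_guardrails(question):
        return "This request was blocked."
    inquiry = classify_inquiry(question)
    source = select_content_source(inquiry)
    keywords = generate_keywords(question)
    evidence = f"hits from {source} for {keywords}"     # retrieval placeholder
    return f"Generated answer grounded in: {evidence}"  # large model in prod

print(answer("Is the patio heated?"))
```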
Xu writes, "Yelp adopted a hybrid approach: Fine-tuning for question analysis... Prompting for final generation." Small fine-tuned models handle Trust & Safety, Inquiry Type, and Source Selection. The large model — GPT-4.1 — handles final generation where nuance matters. This tiered approach improved inference speed by nearly twenty percent.
Serving Efficiency
Latency dropped from over ten seconds to under three. Xu writes, "Yelp optimized serving to reduce latency from over 10 seconds in prototypes to under 3 seconds in production." Streaming renders text token-by-token. Parallelism runs independent tasks concurrently. Early stopping cancels downstream work when guardrails flag a request. Xu writes, "This prevents wasting compute and retrieval resources on blocked queries."
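The concurrency-plus-early-stopping pattern can be shown with plain asyncio. The sketch below is an assumption about structure, not Yelp's code: the guardrail check and retrieval run in parallel, and a blocked request cancels retrieval before it wastes work.

```python
import asyncio

async def guardrail_check(q: str) -> bool:
    await asyncio.sleep(0.05)  # stand-in for a small classifier call
    return "jailbreak" not in q.lower()

async def retrieve(q: str) -> str:
    await asyncio.sleep(0.3)   # stand-in for an index lookup
    return f"evidence for {q!r}"

async def answer(q: str) -> str:
    # Parallelism: launch the guardrail and retrieval concurrently.
    guard = asyncio.create_task(guardrail_check(q))
    retrieval = asyncio.create_task(retrieve(q))
    if not await guard:
        retrieval.cancel()     # early stopping: no compute on blocked queries
        return "blocked"
    return await retrieval

print(asyncio.run(answer("Is the patio heated?")))
```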
The median latency breakdown: question analysis takes 1.4 seconds, retrieval takes 0.03 seconds, time to first byte is 0.9 seconds, and full answer generation takes 3.5 seconds. Streaming reconciles these figures with the sub-three-second claim: users start reading at the first token, well before the full answer finishes.
Evaluation at Scale
Prototype evaluation is informal. Developers try questions and tweak prompts until results feel right. Xu writes, "In production, failures show up as confident hallucinations or technically correct but unhelpful replies." Yelp defined six quality dimensions. An LLM-as-judge system scores each dimension against a strict rubric. Correctness is easy to automate. Tone and style are not. Xu writes, "Rather than forcing an unreliable automated judge early, Yelp tackled this by co-designing principles with their marketing team and enforcing them via curated few-shot examples in the prompt."
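A minimal LLM-as-judge loop looks something like the sketch below. The `judge_llm` callable and the rubric wording are assumptions, and dimension names beyond correctness, tone, and style are illustrative; the article confirms only that six dimensions are scored against a strict rubric.

```python
# A stub LLM-as-judge loop; judge_llm is any callable taking a prompt
# string and returning text. Most dimension names here are illustrative.
DIMENSIONS = ["correctness", "groundedness", "completeness",
              "relevance", "tone", "style"]

RUBRIC = ("Score the answer from 1 to 5 on {dimension}, using the strict "
          "rubric for that dimension. Return only the integer.")

def judge(question: str, answer: str, judge_llm) -> dict[str, int]:
    """Score one answer on every quality dimension, one call per dimension."""
    scores = {}
    for dim in DIMENSIONS:
        prompt = f"{RUBRIC.format(dimension=dim)}\n\nQ: {question}\nA: {answer}"
        scores[dim] = int(judge_llm(prompt))
    return scores

# Smoke test with a stub judge that always answers "4".
print(judge("Is the patio heated?", "Yes, per three reviews.", lambda p: "4"))
```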
Critics might observe that LLM-as-judge systems inherit the biases and blind spots of their judge models. What counts as "correct" depends on the grader's training. Automated evaluation remains an open problem.
Bottom Line
Yelp's assistant demonstrates that production RAG requires architectural discipline: separate stores for structured and unstructured data, tiered models for analysis and generation, and evaluation systems that acknowledge their own limits. The engineering is sound. The resource requirements are steep. Most teams will struggle to replicate this without Yelp's infrastructure budget.