Engineering | April 22, 2026 | 8 min read

Shipping RAG to production without the hallucinations

Retrieval-augmented generation is easy to demo and hard to operate. Here is the architecture and evaluation discipline we use to make it reliable.

Key takeaways

01 Retrieval quality matters more than which model you pick.
02 Evaluate against golden datasets before anything reaches users.
03 Monitor grounding, cost, and latency continuously in production.

A retrieval-augmented generation demo takes an afternoon. A RAG system you can put in front of customers takes a different kind of discipline — because the failure mode isn't a crash, it's a confident, wrong answer.

Retrieval is the product

Most teams obsess over the model and treat retrieval as plumbing. We do the opposite. How documents are chunked, embedded, ranked, and filtered sets the ceiling on every answer. Get retrieval right and a smaller model will outperform a larger one working from noise.
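To make the four retrieval stages concrete, here is a minimal sketch rather than a production pipeline. Every name in it (chunk_words, embed, cosine, retrieve) and every default (chunk size, overlap, top_k, min_score) is a hypothetical choice for illustration. The token-count "embedding" is a toy stand-in for a real embedding model, used only so the example runs with the standard library.

```python
import math
import re
from collections import Counter


def chunk_words(text, size=40, overlap=10):
    """Chunk: split text into overlapping word windows (sizes are assumptions)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def embed(text):
    """Embed: toy stand-in for an embedding model, using lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=3, min_score=0.1):
    """Filter out weak matches, then rank what remains and keep the best top_k."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    kept = [(s, c) for s, c in scored if s >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:top_k]
```

The min_score filter is the part that matters for grounding: when nothing clears the bar, retrieve returns an empty list, which gives the generation step a clear signal to refuse instead of answering from noise.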

Evaluate before you ship

We build a golden dataset of real questions and known-good answers, then score every change against it: faithfulness to sources, answer relevance, and refusal behaviour when the context is thin. Nothing reaches a user on vibes.
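A golden-dataset gate of this kind might look like the sketch below. It is an illustration under stated assumptions, not a description of any particular evaluation framework: GoldenCase, evaluate, passes_gate, and the REFUSAL string are hypothetical names, and the token-overlap scores are deliberately crude proxies for faithfulness and answer relevance, which in practice are usually scored by a stronger method.

```python
import re
from dataclasses import dataclass

# Hypothetical refusal string the system under test is assumed to emit.
REFUSAL = "I don't know."


@dataclass
class GoldenCase:
    question: str
    context: str          # retrieved text the answer must be grounded in
    expected: str         # known-good answer
    should_refuse: bool = False


def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def faithfulness(answer, context):
    """Crude proxy: share of answer tokens that also appear in the context."""
    a = tokens(answer)
    return len(a & tokens(context)) / len(a) if a else 0.0


def relevance(answer, expected):
    """Crude proxy: share of expected-answer tokens covered by the answer."""
    e = tokens(expected)
    return len(e & tokens(answer)) / len(e) if e else 0.0


def mean(values):
    return sum(values) / len(values) if values else 0.0


def evaluate(answer_fn, cases):
    """Run every golden case through answer_fn and average the three scores."""
    faith, rel, refusal_ok = [], [], []
    for case in cases:
        answer = answer_fn(case.question, case.context)
        refused = answer.strip() == REFUSAL
        refusal_ok.append(refused == case.should_refuse)
        if not case.should_refuse:
            # Only answerable cases count toward faithfulness and relevance.
            faith.append(faithfulness(answer, case.context))
            rel.append(relevance(answer, case.expected))
    return {
        "faithfulness": mean(faith),
        "relevance": mean(rel),
        "refusal_accuracy": mean(refusal_ok),
    }


def passes_gate(report, thresholds):
    """A change ships only if every metric meets its minimum."""
    return all(report[name] >= minimum for name, minimum in thresholds.items())
```

Wiring passes_gate into CI, with thresholds chosen from the current baseline, is one way to make "score every change" a blocking check instead of a habit.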

In production the question isn't ‘is it smart?’ — it's ‘is it grounded, and does it know when to say it doesn't know?’

Once live, we monitor grounding, latency, and cost continuously, and feed real queries back into the evaluation set. RAG isn't a launch — it's a loop.
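One way to picture that loop is a rolling monitor that tracks the three signals and queues weakly grounded queries for review. The sketch below is illustrative only: RagMonitor, RequestRecord, and all thresholds are hypothetical, and the grounding score is assumed to come from some upstream scorer that is not shown here.

```python
from collections import deque
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestRecord:
    query: str
    grounding: float    # 0..1, assumed to come from an upstream scorer
    latency_ms: float
    cost_usd: float


class RagMonitor:
    """Rolling-window monitor; all thresholds are placeholder assumptions."""

    def __init__(self, window=100, min_grounding=0.8,
                 max_latency_ms=2000.0, max_cost_usd=0.02):
        self.records = deque(maxlen=window)  # oldest records drop off automatically
        self.min_grounding = min_grounding
        self.max_latency_ms = max_latency_ms
        self.max_cost_usd = max_cost_usd
        self.review_queue = []  # candidate queries for the golden dataset

    def record(self, rec):
        self.records.append(rec)
        if rec.grounding < self.min_grounding:
            # Weakly grounded answers are the ones worth turning into eval cases.
            self.review_queue.append(rec.query)

    def alerts(self):
        """Return the names of metrics whose window average breaches a threshold."""
        if not self.records:
            return []
        breached = []
        if mean(r.grounding for r in self.records) < self.min_grounding:
            breached.append("grounding")
        if mean(r.latency_ms for r in self.records) > self.max_latency_ms:
            breached.append("latency")
        if mean(r.cost_usd for r in self.records) > self.max_cost_usd:
            breached.append("cost")
        return breached
```

Queries in review_queue still need a human to write the known-good answer before they join the golden dataset; the monitor only decides which ones are worth that effort.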