Grounding RAG: why retrieval isn't enough

Retrieval gets the right passages in front of the model. Validation is what keeps the answer honest. Here's how we use canonical segmentation, reranking and LLM-as-judge in production.

Most teams discover the same thing once a retrieval-augmented system reaches real users: getting the right documents in front of the model is only half the problem. The other half is making sure the answer the model produces is actually supported by what it retrieved.

The failure mode nobody demos

In a polished demo, RAG looks solved. Ask a question, watch the system pull a passage, read a fluent answer. The trouble starts at the edges — ambiguous queries, partial matches, and passages that look relevant but don’t actually contain the answer. A model that always sounds confident will happily paper over the gap.

Three layers that keep it honest

We treat a production retrieval system as three separable concerns:

Canonical segmentation. Before anything is embedded, documents are split into self-contained, meaningful units. We use GLiNER, a named-entity recognition model, to identify the entities and boundaries that matter, so a “chunk” is a coherent idea rather than an arbitrary window of tokens.
Reranking. First-stage retrieval favours recall. A cross-encoder reranker then re-scores the candidates for genuine relevance, so the model sees the best few passages rather than the most superficially similar ones.
LLM-as-judge validation. The generated answer is checked against its cited passages by a separate model instructed to look for unsupported claims. If the answer isn’t grounded, it’s flagged rather than shipped.
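
To make the last two layers concrete, here is a minimal sketch of a rerank step followed by a grounding gate. It is an illustration under stated assumptions, not the production code: the `Passage` type, `overlap_score`, `unsupported_claims` and `answer_or_flag` are hypothetical names, and the token-overlap functions are toy stand-ins for a real cross-encoder and a real judge model. Segmentation is left out because it depends on the output of the entity model.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Passage:
    doc_id: str  # hypothetical source identifier
    text: str


def _tokens(text: str) -> set:
    """Lowercased alphanumeric tokens, used by the toy scorers below."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: share of query tokens found in the passage."""
    q = _tokens(query)
    return len(q & _tokens(passage)) / len(q) if q else 0.0


def rerank(query: str, candidates: List[Passage],
           score: Callable[[str, str], float] = overlap_score,
           top_k: int = 2) -> List[Passage]:
    """Re-score the recall-oriented candidates and keep only the best few."""
    ranked = sorted(candidates, key=lambda p: score(query, p.text), reverse=True)
    return ranked[:top_k]


def unsupported_claims(answer: str, cited: List[Passage],
                       min_overlap: float = 0.6) -> List[str]:
    """Toy stand-in for the judge: return answer sentences no cited passage backs."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    flagged = []
    for sentence in sentences:
        best = max((overlap_score(sentence, p.text) for p in cited), default=0.0)
        if best < min_overlap:
            flagged.append(sentence)
    return flagged


def answer_or_flag(answer: str, cited: List[Passage],
                   judge: Callable[[str, List[Passage]], List[str]] = unsupported_claims
                   ) -> Dict[str, object]:
    """Gate the answer: it only counts as grounded if the judge flags nothing."""
    flagged = judge(answer, cited)
    return {
        "answer": answer,
        "grounded": not flagged,
        "unsupported": flagged,
        "sources": [p.doc_id for p in cited],
    }


# Usage with made-up passages.
candidates = [
    Passage("policy-7", "Refunds are issued within 14 days of a returned order."),
    Passage("policy-2", "Orders ship within two business days."),
    Passage("faq-1", "Contact support to start a return."),
]
top = rerank("How many days until refunds are issued?", candidates, top_k=1)
result = answer_or_flag(
    "Refunds are issued within 14 days. Shipping is free worldwide.", top
)
```

The point of the design is that `score` and `judge` are injected: swapping the toy functions for a cross-encoder and an LLM call changes the quality of the decisions, not the shape of the pipeline. In the example, the sentence about shipping is not backed by the cited passage, so the answer comes back with `grounded` set to `False` instead of being shipped.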

Why this matters for buyers

In regulated and high-stakes settings, “usually right” is not a specification. The value of these layers isn’t a higher benchmark score — it’s that every answer is traceable to a source, and the system can tell you when it isn’t sure.
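
One way to picture "traceable" is as a record that ties each claim to the exact source span that supports it. The sketch below is an assumption about what such a record could look like, not a description of a specific product schema; `Citation`, `Claim`, `AnswerRecord` and the three status values are hypothetical names chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Citation:
    doc_id: str  # hypothetical source identifier
    quote: str   # the exact supporting span from the source


@dataclass
class Claim:
    text: str
    citation: Optional[Citation] = None  # None means no support was found


@dataclass
class AnswerRecord:
    question: str
    claims: List[Claim] = field(default_factory=list)

    @property
    def status(self) -> str:
        """'abstained' with no claims, 'grounded' if every claim is cited."""
        if not self.claims:
            return "abstained"
        if all(c.citation is not None for c in self.claims):
            return "grounded"
        return "needs_review"

    def to_audit_json(self) -> str:
        """Serialise the record, including its status, for an audit log."""
        payload = asdict(self)
        payload["status"] = self.status
        return json.dumps(payload, sort_keys=True)


# Usage with made-up content.
record = AnswerRecord(
    question="What is the refund window?",
    claims=[
        Claim(
            "Refunds are issued within 14 days.",
            Citation("policy-7", "Refunds are issued within 14 days of a returned order."),
        ),
        Claim("Shipping is free worldwide."),
    ],
)
audit_line = record.to_audit_json()
```

Here the second claim has no citation, so the record is marked `needs_review` rather than `grounded`, and the audit line shows exactly which sentence lacked a source.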

That’s the difference between a demo and something you can put in front of a court, a clinician, or a compliance team.