RAG that holds up: citations, chunking, and grounding
Retrieval-augmented generation (RAG) sounds simple: chunk documents, embed them, retrieve similar chunks, let a language model answer. In practice, most failures are boring: bad splits, weak retrieval, and answers that sound right but are not anchored to what the corpus actually says.
Why “chat with your docs” often fails
Chunking that cuts tables mid-row, splits definitions across two pieces, or merges unrelated topics sends garbage into the index. Retrieval then returns chunks that are “similar” in embedding space but wrong for the question. The model does what it always does: it fills gaps. Without strict rules, users get confident nonsense.
Chunking quality
Good chunks respect natural boundaries: sections, headings, or logical paragraphs. You want each chunk to carry enough context to stand alone (what is this paragraph about?) but not so much that noise drowns the match. There is no universal token count—only whether a human could answer from that slice alone.
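The boundary rule above can be sketched in a few lines. This is a minimal, illustrative chunker, assuming markdown-style headings mark section boundaries; the function name `chunk_by_heading` and the `max_chars` budget are my own placeholders, not part of any particular library.

```python
import re

def chunk_by_heading(text: str, max_chars: int = 1200) -> list[str]:
    """Split text at headings; keep each heading with its body so the
    chunk can answer "what is this about?" on its own."""
    chunks: list[str] = []
    heading = ""
    buf: list[str] = []

    def flush() -> None:
        body = "\n".join(buf).strip()
        if body:
            # Prefix the section heading so the chunk stands alone.
            chunks.append((heading + "\n" + body).strip())
        buf.clear()

    for line in text.splitlines():
        if re.match(r"^#{1,6}\s", line):   # new section boundary
            flush()
            heading = line.strip()
        else:
            buf.append(line)
            # Oversized section: split at a paragraph break, never mid-row
            # or mid-sentence.
            if sum(len(l) for l in buf) > max_chars and not line.strip():
                flush()
    flush()
    return chunks
```

The point is not this exact splitter but the invariant it enforces: every chunk carries its own heading, and splits land on blank lines, not inside a table or definition.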
Retrieval relevance
Top-k similarity is a heuristic, not a guarantee. Low scores, near-duplicate chunks, or stale documents all produce plausible but wrong context. Production-shaped systems set floors (discard weak matches), deduplicate, and often cap how much text enters the prompt so the model cannot wander.
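Those three rules, floor, dedup, and cap, fit in one post-retrieval filter. A sketch, assuming cosine scores in [0, 1]; the `Hit` shape and the specific thresholds are assumptions you would tune per corpus.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    text: str
    score: float  # cosine similarity in [0, 1]

def select_context(hits: list[Hit],
                   min_score: float = 0.35,
                   max_chars: int = 4000) -> list[Hit]:
    """Keep only strong, unique hits within a fixed context budget."""
    kept, seen, used = [], set(), 0
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if hit.score < min_score:
            break                      # floor: weak matches are noise
        key = " ".join(hit.text.split()).lower()
        if key in seen:
            continue                   # drop whitespace-level duplicates
        if used + len(hit.text) > max_chars:
            break                      # cap: keep the prompt focused
        seen.add(key)
        kept.append(hit)
        used += len(hit.text)
    return kept
```

A real system would use a fuzzier duplicate key (shingles, MinHash), but even this exact-match version stops the most common duplicate from entering the prompt twice.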
Citations as a requirement
If the user cannot see which document—and ideally which section or page—the answer came from, you cannot debug trust. Citations are not polish; they are a contract: “this sentence is supported by this chunk.” When citations are missing or vague, treat the answer as ungrounded for serious use.
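One way to make that contract concrete is to let the answer type itself carry its supporting sources, so an uncited answer cannot be rendered as a confident one. The field names here (`source`, `section`) are illustrative, not a standard.

```python
from dataclasses import dataclass, field

@dataclass
class Citation:
    source: str   # e.g. "handbook.pdf"
    section: str  # section heading or page label

@dataclass
class Answer:
    text: str
    citations: list[Citation] = field(default_factory=list)

    def is_grounded(self) -> bool:
        # No citations means ungrounded: unusable for serious decisions.
        return len(self.citations) > 0

def render(answer: Answer) -> str:
    """Show citations by default; refuse to present an uncited answer."""
    if not answer.is_grounded():
        return "No supported answer found in the corpus."
    refs = "; ".join(f"{c.source} ({c.section})" for c in answer.citations)
    return f"{answer.text}\n\nSources: {refs}"
```

Because `render` checks `is_grounded`, the "citations in fine print" failure is structurally impossible: an answer without sources degrades to an explicit refusal.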
Grounding in plain terms
Grounding means the model’s claims are tied to retrieved text, not to general knowledge about the topic. You enforce that with prompt discipline, small context windows, and refusal behavior when retrieval is empty or weak. Flashy demos skip this; deployable systems do not.
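Refusal behavior is cheapest as a guard in front of the model: if retrieval comes back empty or weak, the model is never called at all. A sketch under assumptions: hits are dicts with `score` and `text` keys, `generate` is whatever prompt-to-text callable your stack provides, and the 0.4 threshold is a placeholder.

```python
def answer_or_refuse(question: str, hits: list[dict], generate,
                     min_top_score: float = 0.4) -> str:
    """hits: [{"score": float, "text": str}, ...]; generate: prompt -> str."""
    # Refusal path: no context, or nothing strong enough to trust.
    if not hits or max(h["score"] for h in hits) < min_top_score:
        return "I can't answer that from the indexed documents."
    context = "\n\n".join(h["text"] for h in hits)
    # Prompt discipline: restrict the model to the retrieved text.
    prompt = (
        "Answer ONLY from the context below. If the context does not "
        "contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```

The guard is deterministic and testable on its own, which matters: you can verify the refusal path without ever invoking a model.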
Common failure modes to expect
- Answers that mix two policies because both chunks landed in context.
- Outdated chunks winning because embeddings still look close.
- Questions that need aggregation across many chunks but the pipeline only returns one.
- Users who trust tone more than citations—so you show citations by default, not in fine print.
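The second failure mode above, an outdated chunk winning on similarity alone, has a simple mitigation: when near-identical chunks compete, break the tie on document freshness rather than score. A sketch, assuming each chunk records its source document and an `updated` date; the crude prefix-based duplicate key is an assumption, not a recommendation.

```python
from datetime import date

def prefer_fresh(chunks: list[dict]) -> list[dict]:
    """chunks: [{"text": str, "doc": str, "updated": date}, ...]
    Among near-identical texts, keep only the most recently updated."""
    best: dict[str, dict] = {}
    for c in chunks:
        # Crude near-duplicate key: normalized whitespace, first 200 chars.
        key = " ".join(c["text"].split()).lower()[:200]
        if key not in best or c["updated"] > best[key]["updated"]:
            best[key] = c
    return list(best.values())
```

This does not fix the aggregation failure (questions needing many chunks) but it cheaply removes the case where last year’s policy outranks this year’s because the embeddings barely moved.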
A small RAG stack with explicit citations, health checks, and clear limits is worth more than a larger demo that hides where answers come from. On Portfolio, I keep a document QA project scoped that way: ingest and retrieval are visible, and answers point back to named sources—the kind of habit that scales when the corpus is a real handbook, not three PDFs.
This is applied judgment, not a research paper. The bar is: would you stake a compliance or support decision on this answer without reading the source? If not, tighten chunking, retrieval, and citation rules until you could.