The first RAG demo I built took an evening. I scraped a docs site, embedded the pages with text-embedding-3-small, dropped the vectors into a Pinecone index, and wired a chat endpoint that retrieved the top 5 chunks and stuffed them into a prompt. It answered the test questions perfectly. I sent a link to my team. They started typing real questions. Within ten minutes the answers were wrong, made up, or confidently "based on the documentation" about a feature the documentation no longer mentioned, and the model never said so.
That moment is universal. RAG is the most over-promised architecture in AI right now precisely because the hello-world version is so easy. The hard part starts when real users with real questions hit your real corpus, and you discover that "embed everything and retrieve top-k" is roughly 30% of a working product. The other 70% is a series of unglamorous choices about how you split documents, how you rank candidates, how you compose the prompt, and how the model is allowed to refuse.
This is what changes when you go beyond the tutorial.
Chunking Is The Decision That Carries The Whole System
If you embed a 50-page PDF as a single vector, you've averaged the meaning of 50 pages into one point in space. A user's specific question about page 42 will not retrieve that document — the vector is too generic. Chunking is how you give the retrieval layer something to grab.
Three rules I now apply by default:
- Chunk by structure first, length second. Markdown headers, HTML sections, function definitions, ticket bodies — there's almost always a natural unit. Use it. Splitting purely by character count cuts ideas in half and degrades both retrieval and generation.
- Overlap. Always. The most important sentence in a doc is often near a boundary. If your chunks start fresh every 800 characters, that sentence ends up half in chunk A and half in chunk B, and retrieval pulls back only one half. A 10–20% overlap costs almost nothing and saves the corner cases.
- Keep chunks scannable by a model. 300–800 tokens is a useful range. Smaller and the chunk lacks context; larger and you're back to averaging.
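A structure-aware chunker for Markdown documentation can look like this: split on H2/H3 headings, then enforce a target window with overlap. Note that this version measures length in characters, not tokens, so the default of 700 characters sits well below the token range above; swap in a tokenizer if you want token-accurate windows.
export function chunkMarkdown(doc: string, target = 700, overlap = 120): string[] {
  // target and overlap are measured in characters, not tokens.
  if (overlap >= target) throw new Error('overlap must be smaller than target');
  const sections = doc.split(/(?=^#{2,3}\s)/m); // split on H2/H3 boundaries
  const chunks: string[] = [];
  for (const section of sections) {
    if (section.length <= target) {
      chunks.push(section.trim());
      continue;
    }
    // Slide a window over long sections, stepping by target - overlap.
    for (let i = 0; i < section.length; i += target - overlap) {
      chunks.push(section.slice(i, i + target).trim());
      // Stop once the window reaches the end of the section; otherwise the
      // final chunk would only repeat text the previous chunk already holds.
      if (i + target >= section.length) break;
    }
  }
  return chunks.filter(Boolean); // drop empty sections
}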
It's not LangChain's RecursiveCharacterTextSplitter or LlamaIndex's SemanticSplitterNodeParser, but it's small enough to read and you can extend it for your shape. Whatever you use, store enough metadata next to each chunk to render a useful citation in the UI — document id, title, the section heading, the URL.
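One way to carry that citation metadata alongside each chunk, as a rough sketch. The `ChunkRecord` and `SourceDoc` shapes and the `chunkWithMetadata` helper are hypothetical names for illustration, not part of any library; the heading is read from the first line of each H2/H3 section, and lengths are again in characters.

```typescript
// Hypothetical shape for a stored chunk; field names are illustrative.
interface ChunkRecord {
  docId: string;
  title: string;
  url: string;
  heading: string; // H2/H3 heading of the section, or '' for text before the first heading
  text: string;
}

interface SourceDoc {
  docId: string;
  title: string;
  url: string;
  body: string; // Markdown source
}

function chunkWithMetadata(doc: SourceDoc, target = 700, overlap = 120): ChunkRecord[] {
  if (overlap >= target) throw new Error('overlap must be smaller than target');
  const records: ChunkRecord[] = [];
  const step = target - overlap;

  // Split on H2/H3 boundaries so every section starts with its own heading.
  for (const section of doc.body.split(/(?=^#{2,3}\s)/m)) {
    const match = /^#{2,3}[ \t]+(.+)/.exec(section);
    const heading = match?.[1]?.trim() ?? '';

    // Window long sections; short sections produce a single record.
    for (let i = 0; i < section.length; i += step) {
      const text = section.slice(i, i + target).trim();
      if (text) {
        records.push({ docId: doc.docId, title: doc.title, url: doc.url, heading, text });
      }
      if (i + target >= section.length) break;
    }
  }
  return records;
}
```

Because every window of a long section keeps the section's heading in its metadata, the UI can still render "Title > Section" for a chunk that starts mid-section.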
Hybrid Retrieval Beats Vector-Only In Almost Every Real Corpus
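Pure vector retrieval handles "how do I cancel a subscription?" beautifully and "error code 404X-99" very poorly. Embedding models are trained to capture meaning rather than exact strings, so rare identifiers, error codes, and SKUs carry little signal in the vector. Real users mix both kinds of queries in the same session.
In a production RAG, run vector search and keyword search in parallel and merge the results with Reciprocal Rank Fusion (RRF), which scores each chunk by summing 1 / (k + rank) over the rankings it appears in. If your corpus lives in Postgres, you don't need a second service: pgvector handles the embeddings, built-in full-text search (tsvector with ts_rank) handles the keyword side, and the merge is a few lines. One caveat: ts_rank is a term-frequency ranking, not true BM25, which in Postgres requires an extension.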
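// vectorTopK, keywordTopK, and rrf are helpers you supply; each search
// returns a best-first list of chunks for the query.
async function retrieveCandidates(query: string, k = 25) {
  const [vector, keyword] = await Promise.all([
    vectorTopK(query, k),
    keywordTopK(query, k),
  ]);
  // k: 60 is the RRF smoothing constant, not the number of results.
  return rrf([vector, keyword], { k: 60 });
}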
If you've already invested in Pinecone, Weaviate, Qdrant, or Turbopuffer, each now offers some form of hybrid retrieval, so you may not need to fuse manually. Check whether your store fuses server-side or expects you to merge two result sets yourself. Either way, know what's happening underneath: two rankings, one merge.
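The "two rankings, one merge" step can be sketched in a few lines. This is a minimal Reciprocal Rank Fusion, assuming each ranking is ordered best-first, each item has a stable `id`, and no id repeats within a single ranking; the `Ranked` interface is a hypothetical name for illustration.

```typescript
interface Ranked {
  id: string;
}

// Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d)),
// where rank starts at 1. Items missing from a ranking contribute nothing.
function rrf<T extends Ranked>(rankings: T[][], { k = 60 }: { k?: number } = {}): T[] {
  const scores = new Map<string, { item: T; score: number }>();

  for (const ranking of rankings) {
    ranking.forEach((item, index) => {
      const entry = scores.get(item.id) ?? { item, score: 0 };
      entry.score += 1 / (k + index + 1); // index is 0-based, rank is 1-based
      scores.set(item.id, entry);
    });
  }

  // Highest fused score first.
  return Array.from(scores.values())
    .sort((a, b) => b.score - a.score)
    .map((entry) => entry.item);
}
```

RRF only looks at positions, never at raw scores, which is why it can merge a cosine-similarity list and a keyword-rank list without normalising either one.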
Rerankers Are The Cheapest Quality Win You're Not Using
After hybrid retrieval, you have ~25 candidate chunks. If you stuff all of them into the prompt, three things happen: the call gets expensive, the model loses track of the relevant chunk in the middle of the context window (the well-documented "lost in the middle" effect), and the answer quality drops.
A reranker is a small, specialised model whose only job is to score how relevant each chunk is to a specific query. Cohere's Rerank, Voyage AI's rerank-2, and open-source cross-encoders like bge-reranker-v2-m3 all do the same thing: take a query plus a list of passages, return scores. They are fast, cheap, and dramatically better at relevance than vector similarity.
import { CohereClient } from 'cohere-ai';

const cohere = new CohereClient({ token: process.env.COHERE_API_KEY! });

async function rerank(query: string, candidates: { id: string; text: string }[]) {
  const { results } = await cohere.v2.rerank({
    model: 'rerank-v3.5',
    query,
    documents: candidates.map((c) => c.text),
    topN: 5,
  });
  return results.map((r) => candidates[r.index]);
}
The pipeline becomes: retrieve 25 cheaply, rerank to 5 precisely, send 3–5 to the generation model. The latency cost of the rerank is usually 100–300 ms; the quality lift is the difference between "this product works" and "this product is a toy".
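Put together, the retrieve-then-rerank flow might look like the sketch below. The `Retriever` and `Reranker` types and the `selectContext` function are hypothetical names; both stages are passed in as functions so the sketch does not depend on a particular vector store or reranking vendor, and the reranker is assumed to return its results best-first.

```typescript
interface Chunk {
  id: string;
  text: string;
}

// Both stages are injected so the pipeline stays independent of any SDK.
type Retriever = (query: string, k: number) => Promise<Chunk[]>;
type Reranker = (query: string, candidates: Chunk[]) => Promise<Chunk[]>;

interface SelectOptions {
  candidates?: number; // how many chunks to pull from hybrid retrieval
  keep?: number; // how many reranked chunks reach the generation model
}

async function selectContext(
  query: string,
  retrieve: Retriever,
  rerank: Reranker,
  { candidates = 25, keep = 5 }: SelectOptions = {},
): Promise<Chunk[]> {
  const pool = await retrieve(query, candidates);
  if (pool.length === 0) return []; // nothing to rerank; the prompt should refuse
  const ranked = await rerank(query, pool); // assumed best-first
  return ranked.slice(0, keep);
}
```

Keeping `candidates` and `keep` as separate knobs makes it easy to test, for example, whether sending 3 chunks instead of 5 changes answer quality on your own questions.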
Grounded Prompts Refuse, Politely
The reason the demo answered confidently when it shouldn't have is that the prompt didn't tell it to admit ignorance. By default an LLM is a helpful pattern-matcher; given context that doesn't contain the answer, it will helpfully invent one. The fix is in the prompt, and in how you bind the answer to citations.
// chunks: the reranked passages ({ id, text }); query: the user's question.
const system = [
  'You are a documentation assistant.',
  'Answer ONLY using the passages inside <context>...</context>.',
  'Every factual statement must cite at least one passage by its id, e.g. [doc:42].',
  'If the answer is not in the context, reply with exactly "I could not find this in the docs." and nothing else.',
  'Do not use general knowledge. Do not infer from outside the context.',
].join('\n');

const user = `<context>
${chunks.map((c) => `<passage id="${c.id}">${c.text}</passage>`).join('\n')}
</context>
Question: ${query}`;
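Two details earn their keep here. The XML-tagged passages give the model a clear handle to cite and a clear boundary between your instructions and retrieved text. They are not a defence against indirect prompt injection on their own: a malicious chunk can still carry instructions, or a literal </passage> tag if you interpolate chunk text unescaped, so treat retrieved content as untrusted. The explicit refusal phrase ("I could not find this in the docs.") is something you can string-match in your application code to render a different state, such as a "search the docs anyway" button or a contact-support fallback, instead of trying to interpret a free-form non-answer.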
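A small sketch of both details, under stated assumptions: `escapeForTag`, `buildContext`, and `isRefusal` are hypothetical helper names, and escaping angle brackets only stops a chunk from closing the `<passage>` tag early. It reduces one class of break-out; it does not make injected instructions harmless.

```typescript
const REFUSAL = 'I could not find this in the docs.';

// Escape characters that would let chunk text open or close a tag.
function escapeForTag(text: string): string {
  return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

function buildContext(chunks: { id: string; text: string }[]): string {
  const passages = chunks.map((c) => {
    const id = escapeForTag(c.id).replace(/"/g, '&quot;'); // ids sit inside an attribute
    return `<passage id="${id}">${escapeForTag(c.text)}</passage>`;
  });
  return `<context>\n${passages.join('\n')}\n</context>`;
}

// Exact match after trimming whitespace and optional surrounding quotes.
// Deliberately strict: a partial match could hide a real answer.
function isRefusal(answer: string): boolean {
  return answer.trim().replace(/^"|"$/g, '') === REFUSAL;
}
```

In the request handler, `isRefusal(answer)` would decide whether to render the answer or the fallback state.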
For the generation model itself, gpt-4o, claude-sonnet-4-5, and the Gemini 2.x line are all strong on grounded answering. Smaller models — gpt-4o-mini or local Llama variants — work for narrow corpora but break down on harder questions; pay attention to your worst case, not your average.
Evaluation Beats Vibes
The last thing that separates a working RAG from a demo: a small, real evaluation set you run before every change. 30–50 questions, each with the correct answer and the document id that should be cited. Re-run it whenever you change the chunker, swap an embedding model, tweak the reranker, or rewrite the system prompt.
You don't need a fancy framework to start; a Markdown table and a script are enough. Track three numbers: did retrieval find the right document at all (recall), was it in the top 3 (hit rate at 3, sometimes reported as recall@3), and did the model produce a correct answer supported by the cited passages when given the right context (correctness and faithfulness). If a change hurts one of those, you'll know immediately, instead of finding out from a customer two weeks later.
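The two retrieval numbers can be computed with a script along these lines. This is a sketch: the `EvalRow` shape and `scoreRetrieval` function are hypothetical, it assumes one expected document per question, and it treats "in the top 3" as a hit rate (the share of questions whose expected document appears in the first three results). Grading the generated answer is a separate step and is not covered here.

```typescript
interface EvalRow {
  question: string;
  expectedDocId: string;
  retrievedDocIds: string[]; // ranked output of the retrieval step, best first
}

interface RetrievalScores {
  recall: number; // share of questions where the expected doc was retrieved at all
  hitRateAtN: number; // share where it appeared within the first topN results
}

function scoreRetrieval(rows: EvalRow[], topN = 3): RetrievalScores {
  if (rows.length === 0) throw new Error('evaluation set is empty');
  let found = 0;
  let foundInTop = 0;

  for (const row of rows) {
    const rank = row.retrievedDocIds.indexOf(row.expectedDocId); // -1 when missing
    if (rank >= 0) found += 1;
    if (rank >= 0 && rank < topN) foundInTop += 1;
  }

  return { recall: found / rows.length, hitRateAtN: foundInTop / rows.length };
}
```

Storing the previous run's scores next to the eval set makes a regression visible as a plain diff in code review.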
Frameworks like Ragas, LangSmith, and Braintrust automate the same idea once you scale past a homemade script.
Where RAG Stops Being The Right Tool
A short note that's missing from most RAG content. RAG is the right answer when your corpus is large, changing, and structured around documents people read. It is the wrong answer when your data is highly relational (a graph of orders, customers, and SKUs), when the user's question requires aggregation across many records, or when freshness is sub-second and your indexer can't keep up.
In those cases you want tool calls that hit your real APIs — getCustomerOrders, searchInventory — and let the model orchestrate them. The output is no longer "summarise this paragraph" but "here is the data you asked for". Mixing the two patterns in one product is fine and increasingly common; just be honest about which pattern fits which query.
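As an illustration of the tool-call pattern, here is a sketch of a dispatcher the model's tool calls could be routed through. Everything in it is hypothetical: the `ORDERS` array stands in for a real orders API, and the `getCustomerOrders` handler, its arguments, and its return shape are made up for the example.

```typescript
interface Order {
  id: string;
  customerId: string;
  totalCents: number;
}

// In-memory stand-in for a real orders API; purely illustrative data.
const ORDERS: Order[] = [
  { id: 'o1', customerId: 'c1', totalCents: 1200 },
  { id: 'o2', customerId: 'c1', totalCents: 800 },
  { id: 'o3', customerId: 'c2', totalCents: 500 },
];

type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

const tools = new Map<string, ToolHandler>([
  [
    'getCustomerOrders',
    async (args) => {
      const customerId = String(args['customerId']);
      const orders = ORDERS.filter((o) => o.customerId === customerId);
      // Aggregation runs in code over every matching record,
      // not over whichever chunks a retriever happened to return.
      const totalCents = orders.reduce((sum, o) => sum + o.totalCents, 0);
      return { orders, totalCents };
    },
  ],
]);

// Route a tool call requested by the model to its handler.
async function dispatchToolCall(name: string, args: Record<string, unknown>): Promise<unknown> {
  const handler = tools.get(name);
  if (!handler) return { error: `unknown tool: ${name}` };
  return handler(args);
}
```

The result is exact structured data the model can report directly, which is the property retrieval over text chunks cannot give you for aggregation questions.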
A One-Sentence Mental Model
A working RAG retrieves broadly with hybrid search, narrows precisely with a reranker, and answers from a grounded prompt that refuses when the context is missing — chunking sets the ceiling, reranking unlocks most of the quality, and a 50-question eval set keeps you from deploying regressions you can't see.






