Building a RAG pipeline from scratch

Overview

Retrieval-Augmented Generation (RAG) sounds simple: embed documents, store the vectors, and retrieve the closest ones at query time. The devil is in the chunking strategy, and most tutorials skip right over it.

The chunk overlap problem

Pinecone’s quickstart uses chunk_size=500, chunk_overlap=50. That overlap number is almost always wrong for structured technical content. With too little overlap, context is severed at paragraph boundaries. With too much, you pay for duplicate embeddings with no recall improvement.
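
To make the trade-off concrete, here is a minimal sketch of fixed-size chunking with overlap, plus a helper that measures how much text gets embedded twice. The function names chunk_text and duplicate_overhead are made up for illustration and are not taken from any library; only the chunk_size and chunk_overlap parameter names come from the settings quoted above.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into character windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


def duplicate_overhead(text: str, chunks: list[str]) -> float:
    """Fraction of extra characters embedded because of overlap (0.1 == 10%)."""
    if not text:
        return 0.0
    return sum(len(c) for c in chunks) / len(text) - 1
```

On a 1,000-character document, the 500/50 setting yields three chunks and roughly 10% extra characters to embed. Raising the overlap increases that cost directly, and the character-based split still ignores where paragraphs actually end.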

What actually worked

Semantic chunking using sentence-transformers to split on topic boundaries, not character count
Metadata enrichment: attach document title, section header, and page number to each chunk
Hybrid search: BM25 + dense vector retrieval re-ranked with a cross-encoder
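
The first two items can be sketched together. This is an illustration under stated assumptions, not our production code: toy_embed is a bag-of-words stand-in so the example runs with the standard library alone, and in a real pipeline you would pass an embedder backed by a sentence-transformers model instead. The Chunk dataclass, the function names, and the 0.2 threshold are all hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import Callable

Vector = dict[str, float]


def toy_embed(sentence: str) -> Vector:
    """Stand-in embedder (word counts). Replace with a real sentence encoder."""
    return dict(Counter(re.findall(r"[a-z0-9]+", sentence.lower())))


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Chunk:
    text: str
    title: str    # document title
    section: str  # section header
    page: int     # page number


def semantic_groups(
    sentences: list[str],
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.2,  # assumed value; tune on your own data
) -> list[list[str]]:
    """Start a new group whenever adjacent sentences stop looking similar."""
    groups: list[list[str]] = []
    prev_vec = None
    for sentence in sentences:
        vec = embed(sentence)
        if prev_vec is not None and cosine(prev_vec, vec) >= threshold:
            groups[-1].append(sentence)
        else:
            groups.append([sentence])
        prev_vec = vec
    return groups


def enrich(groups: list[list[str]], title: str, section: str, page: int) -> list[Chunk]:
    """Attach title, section header, and page number to every chunk."""
    return [Chunk(" ".join(g), title, section, page) for g in groups]
```

Comparing only adjacent sentences keeps the split a single pass. The metadata travels with each chunk, so it can be stored alongside the vector and used for filtering or shown with the retrieved passage.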

The hybrid approach cut irrelevant retrievals by ~40% compared to dense-only on our test set.
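
For reference, the shape of the hybrid step looks roughly like this: take the top hits from the BM25 retriever and the dense retriever, merge and deduplicate them, then score each (query, passage) pair and sort. The sketch assumes both retrievers already return ranked lists of document IDs. toy_pair_score is a word-overlap stand-in for the cross-encoder so the example runs without a model download, and every name and default here is hypothetical.

```python
from typing import Callable, Mapping, Sequence


def toy_pair_score(query: str, passage: str) -> float:
    """Stand-in for a cross-encoder: share of query words found in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


def hybrid_rerank(
    query: str,
    sparse_hits: Sequence[str],   # doc IDs ranked by BM25
    dense_hits: Sequence[str],    # doc IDs ranked by vector similarity
    corpus: Mapping[str, str],    # doc ID -> chunk text
    pair_score: Callable[[str, str], float] = toy_pair_score,
    candidates_per_retriever: int = 20,  # assumed default
    top_k: int = 5,                      # assumed default
) -> list[tuple[str, float]]:
    """Union the candidates from both retrievers, then re-rank by pair score."""
    n = candidates_per_retriever
    # dict.fromkeys drops duplicates while keeping first-seen order.
    candidate_ids = list(dict.fromkeys([*sparse_hits[:n], *dense_hits[:n]]))
    scored = [(doc_id, pair_score(query, corpus[doc_id])) for doc_id in candidate_ids]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

The re-ranker only sees the merged candidate pool, so the per-retriever cutoff controls how much work the expensive pairwise scoring has to do.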