Custom RAG System with Pre-Trained Embeddings, LoRA Fine-Tuning, and Reranking
Kristin Henderson
Fall 2025
Overview
This project implements a complete Retrieval-Augmented Generation (RAG) system built from the ground up, using Paul Graham's essays as the document corpus. Rather than rely on pre-built embedding models or off-the-shelf retrieval tools, the project trains or fine-tunes every major component from scratch.
System Components
**Embedding Models.** Two embedding models were trained using a two-stage pipeline: masked language model (MLM) pre-training followed by contrastive fine-tuning with hard negatives. The first is a general-purpose model trained on Wikipedia text; the second is domain-specific, trained on Paul Graham's essays. Both use DistilBERT as the base architecture, with a 256-dimensional projection layer and L2-normalized outputs. Hard negative mining (pairing each anchor with the sentence 12 positions away as a negative) helped the models learn finer-grained semantic distinctions.
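The contrastive stage can be sketched as an InfoNCE-style loss over L2-normalized embeddings, where each anchor is scored against its positive, all in-batch positives, and its mined hard negative. This is a minimal NumPy sketch of the loss computation only (the temperature value and exact loss formulation are assumptions, not taken from the project's code):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce_loss(anchors, positives, hard_negatives, temperature=0.05):
    """Contrastive loss with in-batch negatives plus one mined hard negative per anchor.

    All inputs are (batch, dim) arrays of embeddings from the projection layer.
    """
    a, p, n = (l2_normalize(x) for x in (anchors, positives, hard_negatives))
    pos_sims = a @ p.T                                # (B, B); diagonal holds true pairs
    hard_sims = np.sum(a * n, axis=1, keepdims=True)  # (B, 1); each row's hard negative
    logits = np.concatenate([pos_sims, hard_sims], axis=1) / temperature
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(a))
    # Cross-entropy with the diagonal (true pair) as the correct class.
    return float(-log_probs[idx, idx].mean())
```

In training, the anchor/positive/negative embeddings would come from the DistilBERT encoder plus the 256-dimensional projection head; here they are plain arrays so the loss logic stands alone.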
**Language Model Fine-Tuning.** Gemma-3-1b-it was post-trained on a question-answer dataset using LoRA, a parameter-efficient fine-tuning method that updates less than 0.2% of the model's parameters. Three training conditions were compared: a small hand-curated set (50 pairs), a larger synthetic set generated by Gemma-3-4b-it (1,308 pairs), and a combined dataset with oversampling. The synthetic-only model achieved the highest semantic alignment (BERT-F1: 0.70).
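LoRA's parameter efficiency comes from replacing full weight updates with a low-rank product: the frozen pretrained weight `W` is augmented by `(alpha/r) * B @ A`, and only the small `A` and `B` matrices are trained. A minimal NumPy sketch of one adapted layer (the layer sizes, rank, and scaling are illustrative assumptions, not Gemma's actual dimensions):

```python
import numpy as np

# Hypothetical layer sizes for illustration; the real model's dimensions differ.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 512, 512, 8, 16

W = rng.normal(scale=0.02, size=(d_out, d_in))  # pretrained weight, kept frozen
A = rng.normal(scale=0.01, size=(r, d_in))      # trainable down-projection
B = np.zeros((d_out, r))                        # trainable up-projection, zero-initialized

def lora_forward(x):
    # Frozen base path plus a rank-r update scaled by alpha / r.
    # During training, gradients flow only into A and B.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(2, d_in))
trainable, frozen = A.size + B.size, W.size
print(f"trainable params per adapted layer: {trainable} vs {frozen} frozen")
```

Zero-initializing `B` makes the adapter an exact no-op at step 0, so fine-tuning starts from the pretrained model's behavior; the sub-0.2% figure holds because only a few such small adapter pairs are added across a 1B-parameter model.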
**Reranker.** A cross-encoder reranker was fine-tuned to re-score query-chunk pairs after initial retrieval. The pipeline first retrieves the top 20 candidate chunks by embedding similarity, then applies the reranker to select the final top 5 for generation.
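The two-stage retrieve-then-rerank flow can be sketched as follows. Here `rerank_score` is a hypothetical stand-in for the fine-tuned cross-encoder (which in the real system scores the query text paired with the chunk text), and the dense stage is a plain dot product over normalized embeddings rather than a FAISS index:

```python
import numpy as np

def retrieve_then_rerank(query_vec, chunk_vecs, rerank_score, k1=20, k2=5):
    """Two-stage retrieval: dense shortlist of k1, then rerank down to k2."""
    # Stage 1: score every chunk by embedding similarity (dot product of
    # L2-normalized vectors = cosine similarity) and keep the top k1.
    sims = chunk_vecs @ query_vec
    shortlist = np.argsort(-sims)[:k1]
    # Stage 2: rescore only the shortlisted chunks with the (more expensive)
    # cross-encoder-style scorer and keep the top k2 for the generation prompt.
    scores = np.array([rerank_score(i) for i in shortlist])
    return shortlist[np.argsort(-scores)[:k2]]
```

Restricting the expensive reranker to 20 candidates is the standard trade-off: the cheap bi-encoder handles recall over the whole corpus, while the cross-encoder handles precision over a short list.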
Results
Post-training the embedding models on question-chunk pairs produced substantial gains. After fine-tuning, the domain-specific PG model achieved the best retrieval performance (nDCG@10: 0.41, Recall@5: 0.44). On a held-out benchmark of 55 questions evaluated with Tonic Validate and GPT-4.1-mini, the full RAG system achieved an Answer Consistency score of 0.73, indicating that generated answers are well-grounded in the retrieved context.
| Model | MRR | nDCG@10 | Recall@5 |
|---|---|---|---|
| Wiki contrastive | 1.00 | 0.03 | 0.03 |
| Wiki post-trained | 0.95 | 0.28 | 0.29 |
| PG contrastive | 0.90 | 0.25 | 0.27 |
| PG post-trained | 0.90 | 0.41 | 0.44 |
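For reference, the three metrics in the table can be computed as follows for a single query with binary relevance (a minimal sketch; the actual evaluation harness may average these over the full question set and weight relevance differently):

```python
import numpy as np

def mrr(ranked, relevant):
    # Reciprocal rank of the first relevant chunk; 0 if none retrieved.
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked, relevant, k):
    # Fraction of all relevant chunks that appear in the top k.
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    # Discounted cumulative gain in the top k, normalized by the ideal ranking.
    dcg = sum(1.0 / np.log2(r + 1)
              for r, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / np.log2(r + 1)
                for r in range(1, min(len(relevant), k) + 1))
    return dcg / ideal
```

The contrast between a high MRR and a low nDCG@10 (as in the Wiki contrastive row) is then easy to read: the first retrieved chunk is often relevant, but few of the remaining relevant chunks make it into the top 10.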
Skills
Python · Transformers · LoRA fine-tuning · Contrastive learning · RAG · FAISS · Hugging Face · Google Colab