Custom RAG System

Custom RAG System with Pre-Trained Embeddings, LoRA Fine-Tuning, and Reranking

Overview

This project implements a complete Retrieval-Augmented Generation (RAG) system built from the ground up, using Paul Graham's essays as the document corpus. Rather than rely on pre-built embedding models or off-the-shelf retrieval tools, the project trains or fine-tunes every major component.

View on GitHub

System Components

Embedding Models

Two embedding models were trained using a two-stage pipeline: masked language model (MLM) pre-training followed by contrastive fine-tuning with hard negatives. The first is a general-purpose model trained on Wikipedia text; the second is domain-specific, trained on Paul Graham's essays. Both use DistilBERT as the base architecture with a 256-dimensional projection layer and L2-normalized outputs. Hard negative mining (using sentences 12 positions away as negatives) helped the models learn finer-grained semantic distinctions.
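The contrastive stage can be sketched as an InfoNCE objective over L2-normalized embeddings, with in-batch negatives plus one explicit hard negative per anchor taken 12 positions away. This is a minimal NumPy sketch with random vectors standing in for DistilBERT outputs; the batch size, temperature, and positive-construction here are illustrative assumptions, not the project's actual training configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce_loss(anchors, positives, hard_negatives, temperature=0.05):
    """InfoNCE with in-batch negatives plus one hard negative per anchor.

    All inputs are (batch, dim) and L2-normalized, so dot products
    are cosine similarities.
    """
    pos_sims = anchors @ positives.T                                     # (B, B)
    hard_sims = np.sum(anchors * hard_negatives, axis=1, keepdims=True)  # (B, 1)
    logits = np.concatenate([pos_sims, hard_sims], axis=1) / temperature
    # Softmax cross-entropy; the correct class is the diagonal (own positive).
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    batch = np.arange(len(anchors))
    return -log_probs[batch, batch].mean()

# Toy corpus of 256-d "sentence embeddings"; the hard negative for
# sentence i is the sentence 12 positions away, as in the training setup.
OFFSET = 12
corpus = l2_normalize(rng.normal(size=(64, 256)))
batch_idx = np.arange(8)
anchors = corpus[batch_idx]
positives = l2_normalize(anchors + 0.1 * rng.normal(size=anchors.shape))
hard_negatives = corpus[(batch_idx + OFFSET) % len(corpus)]

loss = info_nce_loss(anchors, positives, hard_negatives)
print(f"contrastive loss: {loss:.3f}")
```

Nearby sentences make effective hard negatives because they share topic and vocabulary with the anchor while differing in meaning, which forces the encoder to learn finer distinctions than random negatives would.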

Language Model Fine-Tuning

Gemma-3-1b-it was post-trained on a question-answering dataset using LoRA, a parameter-efficient fine-tuning method that updates less than 0.2% of model parameters. Three training conditions were compared: a small hand-curated set (50 pairs), a larger synthetic set generated by Gemma-3-4b-it (1,308 pairs), and a combined dataset with oversampling. The synthetic-only model achieved the highest semantic alignment (BERT-F1: 0.70).
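The mechanism behind LoRA's parameter efficiency is a frozen weight matrix W plus a trainable low-rank update (alpha/r)·BA. The sketch below shows a single adapted layer in NumPy; the layer sizes, rank, and alpha are hypothetical, and the printed fraction is for this one layer only. Across a full ~1B-parameter model with adapters on only a few projection matrices, the overall trainable fraction falls well below 1%.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """y = x W^T + (alpha/r) * x A^T B^T -- frozen W plus low-rank update."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 1152, 1152, 8          # hypothetical layer sizes and rank
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-init

x = rng.normal(size=(4, d_in))
y = lora_forward(x, W, A, B)
# With B initialized to zero, the adapted layer exactly matches the
# frozen layer at the start of training.
assert np.allclose(y, x @ W.T)

trainable = A.size + B.size
total = W.size + trainable
print(f"trainable fraction (this layer): {trainable / total:.2%}")
```

Zero-initializing B is the standard LoRA trick: training starts from the pretrained model's behavior and only gradually learns a task-specific delta.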

Reranker

A cross-encoder reranker was fine-tuned to re-score query-chunk pairs after initial retrieval. The pipeline first retrieves the top-20 candidate chunks using embedding similarity, then applies the reranker to select the final top-5 for generation.
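The two-stage retrieve-then-rerank flow can be sketched as follows. Cosine similarity (a dot product over normalized vectors) narrows 200 toy chunks to 20 candidates, and a scoring function stands in for the fine-tuned cross-encoder; here that stand-in is random, purely to show the control flow.

```python
import numpy as np

rng = np.random.default_rng(1)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve_then_rerank(query_emb, chunk_embs, rerank_fn,
                         k_retrieve=20, k_final=5):
    """Stage 1: embedding similarity over all chunks, keep top-k_retrieve.
    Stage 2: re-score the candidates and keep the top-k_final."""
    sims = chunk_embs @ query_emb              # cosine sim (vectors normalized)
    candidates = np.argsort(-sims)[:k_retrieve]
    scores = rerank_fn(candidates)             # cross-encoder stand-in
    return candidates[np.argsort(-scores)[:k_final]]

chunks = l2_normalize(rng.normal(size=(200, 256)))
query = l2_normalize(rng.normal(size=256))

# Stand-in for the cross-encoder: a random score per candidate index.
fake_rerank = lambda idx: rng.normal(size=len(idx))

top5 = retrieve_then_rerank(query, chunks, fake_rerank)
print(top5)
```

The split exists because a cross-encoder, which reads the query and chunk jointly, is far more accurate but too slow to score every chunk; the cheap bi-encoder pass keeps its workload to 20 pairs per query.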

Results

Post-training the embedding models on question-chunk pairs produced substantial gains. The domain-specific PG model after fine-tuning achieved the best retrieval performance (nDCG@10: 0.41, Recall@5: 0.44). On a held-out benchmark of 55 questions evaluated with Tonic Validate and GPT-4.1-mini, the full RAG system achieved an Answer Consistency score of 0.73, indicating that generated answers are well-grounded in the retrieved context.

Model MRR nDCG@10 Recall@5
Wiki contrastive 1.00 0.03 0.03
Wiki post-trained 0.95 0.28 0.29
PG contrastive 0.90 0.25 0.27
PG post-trained 0.90 0.41 0.44
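The three metrics in the table can be computed as follows. This is a minimal NumPy sketch over two toy queries with hand-picked ranked chunk IDs and relevance sets; the numbers it produces are illustrative, not the benchmark scores above.

```python
import numpy as np

def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant item per query."""
    rr = [next((1.0 / (i + 1) for i, d in enumerate(ranked) if d in rel), 0.0)
          for ranked, rel in zip(ranked_lists, relevant_sets)]
    return float(np.mean(rr))

def recall_at_k(ranked_lists, relevant_sets, k=5):
    """Fraction of each query's relevant chunks found in the top k."""
    return float(np.mean([len(set(r[:k]) & rel) / len(rel)
                          for r, rel in zip(ranked_lists, relevant_sets)]))

def ndcg_at_k(ranked_lists, relevant_sets, k=10):
    """Binary-relevance nDCG: DCG of the ranking over DCG of the ideal one."""
    scores = []
    for ranked, rel in zip(ranked_lists, relevant_sets):
        dcg = sum(1.0 / np.log2(i + 2) for i, d in enumerate(ranked[:k]) if d in rel)
        idcg = sum(1.0 / np.log2(i + 2) for i in range(min(len(rel), k)))
        scores.append(dcg / idcg)
    return float(np.mean(scores))

# Two toy queries: ranked chunk IDs and the relevant IDs for each.
ranked = [[3, 7, 1, 9, 4, 0, 2, 5, 6, 8],
          [5, 2, 8, 1, 0, 3, 9, 4, 6, 7]]
relevant = [{7, 4}, {0}]

print(f"MRR:      {mrr(ranked, relevant):.2f}")
print(f"nDCG@10:  {ndcg_at_k(ranked, relevant):.2f}")
print(f"Recall@5: {recall_at_k(ranked, relevant):.2f}")
```

This also explains the Wiki contrastive row: MRR only rewards the first relevant hit, so a model can place one relevant chunk high (MRR 1.00) while missing nearly everything else (Recall@5 0.03).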

Skills

Python · Transformers · LoRA fine-tuning · Contrastive learning · RAG · FAISS · Hugging Face · Google Colab