v2: Major RAG accuracy improvements (BGE-M3 embeddings, hybrid retrieval, reranking, HyDE)

#1

Summary

This PR targets retrieval accuracy with six changes, each motivated by published results:

1. 🔴 BGE-M3 Multilingual Embeddings (CRITICAL)

Before: sentence-transformers/all-MiniLM-L6-v2 (English-only, 384-dim, blind to Coptic script)
After: BAAI/bge-m3 (100+ languages, 1024-dim, 8192-token context)

This is the single biggest accuracy bottleneck. all-MiniLM has zero Coptic vocabulary and cannot meaningfully embed Coptic text. BGE-M3 is built on XLM-RoBERTa trained on multilingual data including many scripts.

Reference: BGE M3-Embedding (arxiv 2402.03216): MIRACL avg nDCG@10 of 72.4 vs ~30 for all-MiniLM on non-English

2. 🔴 Hybrid BM25 + Dense Retrieval

Before: Dense-only retrieval (semantic similarity only)
After: Reciprocal Rank Fusion of BM25 (40%) + Dense (60%)

BM25 catches exact Coptic word-form matches (ⲥⲱⲧⲙ, ⲛⲟⲩⲧⲉ) that even BGE-M3 may miss. Dense retrieval catches cross-lingual semantic similarity ("God" → ⲛⲟⲩⲧⲉ).

Reference: Blended RAG (arxiv 2404.07220): +5.8% NDCG@10 on NQ; Anveshana Sanskrit CLIR (arxiv 2505.19494): BM25 NDCG@10=62.46 vs multilingual dense=10.74 for unseen scripts
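
A minimal sketch of weighted Reciprocal Rank Fusion, for readers who want to see how the two rankings could be combined. The function name `weighted_rrf`, the constant `k=60`, and the choice to apply the 40/60 split as multipliers on each retriever's reciprocal-rank term are assumptions for illustration; the PR text does not say exactly how `rag/chain.py` applies the weights.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum_i w_i / (k + rank_i(d)).

    rankings: dict mapping retriever name -> list of doc ids, best first.
    weights:  dict mapping retriever name -> weight (assumed multipliers).
    k:        smoothing constant; 60 is a commonly used default.
    """
    scores = defaultdict(float)
    for name, ranked_ids in rankings.items():
        w = weights[name]
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += w / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical doc ids returned by each retriever, best first.
rankings = {
    "bm25": ["e3", "e1", "e7"],
    "dense": ["e1", "e5", "e3"],
}
fused = weighted_rrf(rankings, {"bm25": 0.4, "dense": 0.6})
print(fused)
```

Because RRF works on ranks rather than raw scores, BM25 scores and cosine similarities never need to be normalised onto a common scale.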

3. 🟠 Cross-Encoder Reranking

Before: Raw top-k chunks passed to LLM
After: BAAI/bge-reranker-v2-m3 multilingual cross-encoder reranks candidates, top-4 passed to LLM

Reference: RAG Hyperparameter Optimization (arxiv 2505.08445): +6% precision, +10% recall
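
The rerank step can be sketched as "score every (query, candidate) pair, keep the best four". In the sketch below, `rerank` and `overlap_score` are hypothetical names, and `overlap_score` is a toy word-overlap stand-in so the example runs without downloading a model; in the real pipeline the scorer would be the BAAI/bge-reranker-v2-m3 cross-encoder.

```python
import re


def overlap_score(query, text):
    """Toy stand-in for a cross-encoder: fraction of query words found in text."""
    q = set(re.findall(r"\w+", query.lower()))
    t = set(re.findall(r"\w+", text.lower()))
    return len(q & t) / len(q) if q else 0.0


def rerank(query, candidates, score_fn, top_n=4):
    """Score each candidate against the query and keep the top_n best."""
    scored = [(score_fn(query, text), text) for text in candidates]
    # Stable sort: candidates with equal scores keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


candidates = [
    "verb paradigms in Sahidic",
    "noute is the Coptic word for god",
    "the word order of Coptic clauses",
    "a word list",
    "god in Greek loanwords",
    "printing history",
]
top = rerank("coptic word for god", candidates, overlap_score)
print(top)
```

Keeping the scorer as a parameter makes it straightforward to toggle reranking off or swap models from the sidebar settings.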

4. 🟠 HyDE (Hypothetical Document Embeddings)

Before: User's raw query used for retrieval
After: LLM generates a hypothetical dictionary/grammar entry → that entry's embedding retrieves real entries

Especially powerful for Coptic: "what does ϫⲓⲥⲉ mean?" → LLM generates English dictionary entry → retrieves actual CCL entry.

Reference: HyDE (arxiv 2212.10496): +21% nDCG@10 over BM25 zero-shot
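
A toy sketch of the HyDE flow, under loud assumptions: `fake_llm` is a hypothetical stand-in that returns a canned draft entry (the gloss is illustrative only), and `embed` is a bag-of-words counter rather than BGE-M3, so the example runs without any model. It only shows the control flow: embed the generated entry instead of the raw question.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; the real pipeline would call the embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hyde_retrieve(query, generate, corpus, k=1):
    """Retrieve with the embedding of a generated hypothetical entry, not the query."""
    hypothetical = generate(query)
    query_vec = embed(hypothetical)
    ranked = sorted(corpus, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True)
    return ranked[:k]


def fake_llm(query):
    # Hypothetical stand-in for the LLM call; returns a canned draft entry.
    return "jise: verb, to lift up, raise, exalt"


corpus = [
    "entry: to raise, lift, exalt (verb)",
    "entry: house, dwelling (noun)",
    "entry: to hear, listen (verb)",
]
hits = hyde_retrieve("what does ϫⲓⲥⲉ mean?", fake_llm, corpus)
print(hits)
```

In this toy setup the raw question shares no words with any entry, while the generated draft overlaps heavily with the right one, which is the gap HyDE is meant to close.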

5. 🟡 Sentence-Level Chunking for PDFs

Before: 600-character slicing with 100-char overlap (splits mid-sentence, mid-entry)
After: Sentence-level chunking, 256 tokens, 0 overlap

Reference: Systematic Chunking Analysis (arxiv 2601.14123): overlap provides zero measurable benefit (|ΔBERTScore| ≤ 0.004)
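
A minimal sketch of sentence-level chunking with a token budget and no overlap. Two assumptions are baked in: sentences are split with a simple punctuation regex, and whitespace-separated words stand in for tokens. The actual code in `rag/ingest.py` may use a proper sentence splitter and the model tokenizer; `chunk_sentences` is a hypothetical name.

```python
import re


def chunk_sentences(text, max_tokens=256):
    """Pack whole sentences into chunks of at most max_tokens words, no overlap.

    A single sentence longer than max_tokens becomes its own oversized chunk
    rather than being split mid-sentence.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())  # whitespace words as a stand-in for tokens
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = chunk_sentences("One two three. Four five. Six seven eight nine.", max_tokens=5)
print(demo)
```

Chunk boundaries always fall between sentences, so a dictionary entry is less likely to be cut in half than with fixed 600-character slices.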

6. 🟡 Improved System Prompt

Added "Answer Quality Rules" section requiring the LLM to ground answers only in retrieved context, refuse to fabricate entries, and synthesize multi-chunk information coherently.

UI Changes

  • New "Retrieval Settings" section in sidebar with toggles for hybrid retrieval, HyDE, and reranking
  • Pipeline badges (BM25+Dense, HyDE, Reranked) shown under AI responses so users know which techniques were active
  • Updated requirements.txt with rank-bm25>=0.2.2

⚠️ Important Note: Re-ingestion Required

After merging, you MUST re-ingest all documents (CCL XML, grammar PDFs) because the embedding model changed from all-MiniLM (384-dim) to BGE-M3 (1024-dim). Existing ChromaDB embeddings are incompatible.
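
A small sketch of the incompatibility check, assuming the dimension of the stored vectors is already known. The names `EMBEDDING_DIMS` and `needs_reingest` are hypothetical, and how the stored dimension is read back from ChromaDB is deliberately left out.

```python
# Output dimensions of the two models named in this PR.
EMBEDDING_DIMS = {
    "sentence-transformers/all-MiniLM-L6-v2": 384,
    "BAAI/bge-m3": 1024,
}


def needs_reingest(stored_dim, model_name):
    """Return True when stored vectors cannot be compared with the model's output."""
    return stored_dim != EMBEDDING_DIMS[model_name]


# A collection built with all-MiniLM (384-dim) queried with BGE-M3 (1024-dim).
print(needs_reingest(384, "BAAI/bge-m3"))
```

Note that matching dimensions alone would not make vectors from two different models comparable; the check only catches the mismatch this PR introduces.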

Files Changed

  • rag/chain.py: Core RAG pipeline with all 6 improvements
  • rag/ingest.py: BGE-M3 embeddings + sentence-level chunking
  • rag/status.py: Provider-aware status checks
  • ui/sidebar.py: New retrieval settings toggles
  • ui/chat_page.py: Pipeline info badges, pass new settings
  • requirements.txt: Added rank-bm25
georgtawadrous changed pull request status to open
georgtawadrous changed pull request status to merged
