v2: Major RAG accuracy improvements (BGE-M3 embeddings, hybrid retrieval, reranking, HyDE)
Summary
This PR significantly improves retrieval accuracy through 6 research-backed changes:
1. 🔴 BGE-M3 Multilingual Embeddings (CRITICAL)
Before: sentence-transformers/all-MiniLM-L6-v2 (English-only, 384-dim, blind to Coptic script)
After: BAAI/bge-m3 (100+ languages, 1024-dim, 8192-token context)
The embedding model was the single biggest accuracy bottleneck. all-MiniLM-L6-v2 has no Coptic vocabulary and cannot meaningfully embed Coptic text. BGE-M3 is built on XLM-RoBERTa, which was trained on multilingual data covering many scripts.
Reference: BGE M3-Embedding (arXiv 2402.03216): MIRACL avg nDCG@10 of 72.4 vs ~30 for all-MiniLM on non-English
2. 🔴 Hybrid BM25 + Dense Retrieval
Before: Dense-only retrieval (semantic similarity only)
After: Reciprocal Rank Fusion of BM25 (40%) + Dense (60%)
BM25 catches exact Coptic word-form matches (e.g. ⲛⲟⲩⲧⲉ) that even BGE-M3 may miss. Dense retrieval catches cross-lingual semantic similarity (e.g. the English query "God" matching ⲛⲟⲩⲧⲉ).
Reference: Blended RAG (arXiv 2404.07220): +5.8% NDCG@10 on NQ; Anveshana Sanskrit CLIR (arXiv 2505.19494): BM25 NDCG@10 = 62.46 vs multilingual dense = 10.74 for unseen scripts
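The 40/60 fusion described above can be illustrated with a weighted form of Reciprocal Rank Fusion. This is a minimal sketch, not the code in rag/chain.py: the function name `weighted_rrf`, the input shape, and the smoothing constant `k=60` (the conventional RRF default) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def weighted_rrf(
    rankings: Dict[str, List[str]],
    weights: Dict[str, float],
    k: int = 60,  # assumed smoothing constant; the PR does not state one
) -> List[str]:
    """Fuse several ranked lists of document IDs into a single ranking.

    rankings maps a retriever name to its doc IDs, best first.
    weights maps the same names to their fusion weight.
    """
    scores: Dict[str, float] = defaultdict(float)
    for name, ranked_ids in rankings.items():
        weight = weights[name]
        for rank, doc_id in enumerate(ranked_ids, start=1):
            # Each retriever contributes weight / (k + rank) for every doc it returned.
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties are broken by doc ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


fused = weighted_rrf(
    rankings={"bm25": ["a", "b", "c"], "dense": ["c", "a", "d"]},
    weights={"bm25": 0.4, "dense": 0.6},
)
```

Because the fusion works on ranks rather than raw scores, BM25 scores and cosine similarities never need to be normalized against each other.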
3. 🟠 Cross-Encoder Reranking
Before: Raw top-k chunks passed to LLM
After: BAAI/bge-reranker-v2-m3 multilingual cross-encoder reranks the candidates; the top 4 are passed to the LLM
Reference: RAG Hyperparameter Optimization (arXiv 2505.08445): +6% precision, +10% recall
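The reranking step can be sketched as "score every (query, chunk) pair, keep the best four". In the sketch below the scorer is injected as a callable so the example stays self-contained; `rerank`, `score_pairs`, and the toy `overlap_scorer` are hypothetical names, not the PR's actual code.

```python
from typing import Callable, List, Sequence, Tuple

PairScorer = Callable[[List[Tuple[str, str]]], Sequence[float]]


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: PairScorer,
    top_n: int = 4,
) -> List[str]:
    """Return the top_n candidates ordered by relevance score, best first."""
    if not candidates:
        return []
    pairs = [(query, doc) for doc in candidates]
    scores = score_pairs(pairs)
    # sorted() is stable, so tied candidates keep their retrieval order.
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]


def overlap_scorer(pairs: List[Tuple[str, str]]) -> List[float]:
    """Toy stand-in for a cross-encoder: counts words shared by query and doc."""
    return [
        float(len(set(q.lower().split()) & set(d.lower().split())))
        for q, d in pairs
    ]


docs = ["noute god deity", "house building", "god", "to hear", "god of heaven", "the god"]
top = rerank("god", docs, overlap_scorer)
```

In a real pipeline, `score_pairs` would presumably be backed by the cross-encoder itself, for example the `predict` method of a sentence-transformers `CrossEncoder` loaded with BAAI/bge-reranker-v2-m3; that wiring is an assumption about how the reranker is invoked.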
4. 🟠 HyDE (Hypothetical Document Embeddings)
Before: User's raw query used for retrieval
After: LLM generates a hypothetical dictionary/grammar entry → that entry's embedding retrieves real entries
Especially powerful for Coptic: "what does [Coptic word] mean?" → LLM generates an English dictionary entry → retrieves the actual CCL entry.
Reference: HyDE (arXiv 2212.10496): +21% nDCG@10 over BM25 zero-shot
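The HyDE flow above amounts to three calls: generate a hypothetical entry, embed it, and search with that vector instead of the question's. A minimal sketch follows, with the LLM, the embedder, and the vector search injected as callables; the names `hyde_retrieve`, `generate`, `embed`, and `search`, as well as the prompt wording, are illustrative assumptions.

```python
from typing import Callable, List, Sequence, Tuple

Hit = Tuple[str, float]


def hyde_retrieve(
    question: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], List[Hit]],
    top_k: int = 10,
) -> List[Hit]:
    """Retrieve real entries using the embedding of an LLM-written hypothetical entry."""
    prompt = (
        "Write a short dictionary or grammar entry that would answer "
        f"this question:\n{question}"
    )
    hypothetical_entry = generate(prompt)
    # The hypothetical entry, not the raw question, is what gets embedded.
    query_vector = embed(hypothetical_entry)
    return search(query_vector, top_k)


# Stub components so the sketch runs without an LLM or a vector store.
calls = {}

def fake_generate(prompt: str) -> str:
    calls["prompt"] = prompt
    return "noute: God, deity"

def fake_embed(text: str) -> List[float]:
    calls["embedded"] = text
    return [float(len(text))]

def fake_search(vector: Sequence[float], k: int) -> List[Hit]:
    return [("entry-1", vector[0])][:k]

hits = hyde_retrieve("what does this word mean?", fake_generate, fake_embed, fake_search, top_k=1)
```

The hypothetical entry may contain factual errors; that is acceptable here because it is used only as a search key and is never shown to the user.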
5. 🟡 Sentence-Level Chunking for PDFs
Before: 600-character slicing with 100-char overlap (splits mid-sentence, mid-entry)
After: Sentence-level chunking, 256 tokens, 0 overlap
Reference: Systematic Chunking Analysis (arXiv 2601.14123): overlap provides zero measurable benefit (|ΔBERTScore| ≤ 0.004)
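One way to read "sentence-level chunking, 256 tokens, 0 overlap" is to pack whole sentences greedily until the budget is reached. The sketch below makes two simplifying assumptions that may differ from rag/ingest.py: sentences are split with a basic punctuation regex, and tokens are approximated by whitespace-separated words rather than by the embedding model's tokenizer.

```python
import re
from typing import List

# Split after ., ! or ? followed by whitespace (a rough heuristic).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_by_sentence(text: str, max_tokens: int = 256) -> List[str]:
    """Pack whole sentences into chunks of at most max_tokens, with no overlap."""
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        n_tokens = len(sentence.split())  # word count as a token approximation
        # Close the current chunk if this sentence would exceed the budget.
        if current and current_len + n_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        # A single sentence longer than max_tokens becomes its own oversized chunk.
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five six. Seven eight nine ten."
chunks = chunk_by_sentence(sample, max_tokens=6)
```

Since no sentence is ever split or repeated, joining the chunks with spaces reproduces the input text, which is what zero overlap means in practice.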
6. 🟡 Improved System Prompt
Added "Answer Quality Rules" section requiring the LLM to ground answers only in retrieved context, refuse to fabricate entries, and synthesize multi-chunk information coherently.
UI Changes
- New "Retrieval Settings" section in sidebar with toggles for hybrid retrieval, HyDE, and reranking
- Pipeline badges (BM25+Dense, HyDE, Reranked) shown under AI responses so users know which techniques were active
- Updated `requirements.txt` with `rank-bm25>=0.2.2`
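The relationship between the sidebar toggles and the badges under each response can be sketched as a settings object plus a pure function. The class name `RetrievalSettings`, its field names, and the all-enabled defaults are hypothetical; only the three badge labels come from the description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RetrievalSettings:
    """Hypothetical container for the three sidebar toggles."""
    use_hybrid: bool = True    # assumed default
    use_hyde: bool = True      # assumed default
    use_reranker: bool = True  # assumed default


def pipeline_badges(settings: RetrievalSettings) -> List[str]:
    """Return the badge labels for the techniques that were active."""
    badges: List[str] = []
    if settings.use_hybrid:
        badges.append("BM25+Dense")
    if settings.use_hyde:
        badges.append("HyDE")
    if settings.use_reranker:
        badges.append("Reranked")
    return badges


badges = pipeline_badges(RetrievalSettings(use_hyde=False))
```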
⚠️ Important Note: Re-ingestion Required
After merging, you MUST re-ingest all documents (CCL XML, grammar PDFs) because the embedding model changed from all-MiniLM (384-dim) to BGE-M3 (1024-dim). Existing ChromaDB embeddings are incompatible.
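A cheap guard against querying stale vectors is to compare the dimension of one stored embedding with the dimension the current model produces. This is a sketch under assumptions: `needs_reingest` is a hypothetical helper, and reading the sample vector from the vector store is deliberately left out.

```python
from typing import Optional, Sequence

OLD_DIM = 384   # all-MiniLM-L6-v2
NEW_DIM = 1024  # BAAI/bge-m3


def needs_reingest(
    stored_vector: Optional[Sequence[float]],
    model_dim: int = NEW_DIM,
) -> bool:
    """Return True if the stored vectors cannot be queried with the current model."""
    if stored_vector is None:
        # Empty collection: nothing has been ingested yet.
        return True
    return len(stored_vector) != model_dim


stale = needs_reingest([0.0] * OLD_DIM)
fresh = needs_reingest([0.0] * NEW_DIM)
```

A matching dimension is necessary but not sufficient: two different models can share a dimension, so storing the model name alongside the collection would be a stronger check.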
Files Changed
- `rag/chain.py`: Core RAG pipeline with all 6 improvements
- `rag/ingest.py`: BGE-M3 embeddings + sentence-level chunking
- `rag/status.py`: Provider-aware status checks
- `ui/sidebar.py`: New retrieval settings toggles
- `ui/chat_page.py`: Pipeline info badges, pass new settings
- `requirements.txt`: Added rank-bm25