Space: georgtawadrous/thoth_app

v2: Major RAG accuracy improvements — BGE-M3 embeddings, hybrid retrieval, reranking, HyDE

by georgtawadrous - opened Apr 25

base: refs/heads/main ← from: refs/pr/1

Owner Apr 25

Summary

This PR significantly improves retrieval accuracy through 6 research-backed changes:

1. 🔴 BGE-M3 Multilingual Embeddings (CRITICAL)

Before: sentence-transformers/all-MiniLM-L6-v2 — English-only, 384-dim, blind to Coptic script
After: BAAI/bge-m3 — 100+ languages, 1024-dim, 8192-token context

The old embedding model was the single biggest accuracy bottleneck: all-MiniLM has no Coptic vocabulary and cannot meaningfully embed Coptic text. BGE-M3 is built on XLM-RoBERTa, which was trained on multilingual data covering many scripts.

Reference: BGE M3-Embedding (arxiv 2402.03216) — MIRACL avg nDCG@10: 72.4 vs ~30 for all-MiniLM on non-English

2. 🔴 Hybrid BM25 + Dense Retrieval

Before: Dense-only retrieval (semantic similarity only)
After: Reciprocal Rank Fusion of BM25 (40%) + Dense (60%)

BM25 catches exact Coptic word-form matches (ⲥⲱⲧⲙ, ⲛⲟⲩⲧⲉ) that even BGE-M3 may miss. Dense catches cross-lingual semantic similarity ("God" → ⲛⲟⲩⲧⲉ).

Reference: Blended RAG (arxiv 2404.07220) — +5.8% NDCG@10 on NQ; Anveshana Sanskrit CLIR (arxiv 2505.19494) — BM25 NDCG@10=62.46 vs multilingual dense=10.74 for unseen scripts
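
As a rough illustration of the fusion step, here is a minimal sketch of weighted Reciprocal Rank Fusion over two ranked lists of document IDs. The function name `weighted_rrf`, the document IDs, and the constant `k=60` are assumptions made for this example; the code in rag/chain.py may fuse results differently (for instance through a library retriever).

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists of document IDs with weighted Reciprocal Rank Fusion.

    rankings: dict mapping a retriever name to its doc IDs, best first.
    weights:  dict mapping the same names to their fusion weight.
    k:        smoothing constant (assumed here; 60 is a commonly used value).
    """
    scores = defaultdict(float)
    for name, ranked_ids in rankings.items():
        weight = weights[name]
        for rank, doc_id in enumerate(ranked_ids, start=1):
            # Each retriever contributes weight / (k + rank) for every doc it returned.
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties are broken by doc ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical entry IDs returned by each retriever.
bm25_hits = ["e3", "e1", "e7"]
dense_hits = ["e1", "e5", "e3"]

fused = weighted_rrf(
    {"bm25": bm25_hits, "dense": dense_hits},
    {"bm25": 0.4, "dense": 0.6},
)
print(fused)
```

Documents found by both retrievers accumulate score from each list, so an entry ranked moderately well by both can outrank one that only a single retriever returned.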

3. 🟠 Cross-Encoder Reranking

Before: Raw top-k chunks passed to LLM
After: BAAI/bge-reranker-v2-m3 multilingual cross-encoder reranks candidates, top-4 passed to LLM

Reference: RAG Hyperparameter Optimization (arxiv 2505.08445) — +6% precision, +10% recall
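
The reranking step can be sketched as "score every (query, candidate) pair, keep the best four". The snippet below uses a toy word-overlap scorer as a stand-in so it runs without downloading a model; `rerank` and `toy_overlap_scorer` are hypothetical names, and in a real pipeline the `score_fn` argument would presumably be the cross-encoder's batch scoring call.

```python
def rerank(query, candidates, score_fn, top_n=4):
    """Return the top_n candidates ordered by score_fn(query, candidate) pairs."""
    pairs = [(query, candidate) for candidate in candidates]
    scores = score_fn(pairs)  # one relevance score per pair
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [candidate for candidate, _ in ranked[:top_n]]


def toy_overlap_scorer(pairs):
    """Stand-in scorer (NOT a cross-encoder): fraction of query words found in the passage."""
    scores = []
    for query, passage in pairs:
        query_words = set(query.lower().split())
        passage_words = set(passage.lower().split())
        scores.append(len(query_words & passage_words) / max(len(query_words), 1))
    return scores


# Hypothetical candidate chunks coming out of the retrieval stage.
candidates = [
    "ⲥⲱⲧⲙ verb to hear",
    "ⲛⲟⲩⲧⲉ noun god deity",
    "grammar of the noun",
    "unrelated entry",
    "god",
]
top_chunks = rerank("noun god", candidates, toy_overlap_scorer, top_n=4)
print(top_chunks)
```

Keeping the scorer as a parameter makes it easy to switch reranking off (or swap models) from a settings toggle without touching the retrieval code.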

4. 🟠 HyDE (Hypothetical Document Embeddings)

Before: User's raw query used for retrieval
After: LLM generates a hypothetical dictionary/grammar entry → that entry's embedding retrieves real entries

Especially powerful for Coptic: "what does ϫⲓⲥⲉ mean?" → LLM generates English dictionary entry → retrieves actual CCL entry.

Reference: HyDE (arxiv 2212.10496) — +21% nDCG@10 over BM25 zero-shot
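
A minimal sketch of the HyDE flow, assuming pluggable `generate_fn` (the LLM) and `embed_fn` (the embedding model). Both are replaced here by toy stand-ins so the example runs on its own; the glosses in the tiny index are illustrative only and are not taken from the CCL data.

```python
import math
import re

VOCAB = ["exalt", "lift", "hear", "god"]  # toy vocabulary for the stand-in embedder


def toy_embed(text):
    """Stand-in embedder: bag-of-words counts over VOCAB (not a real model)."""
    words = re.findall(r"\w+", text.lower())
    return [float(words.count(term)) for term in VOCAB]


def toy_generate(question):
    """Stand-in for the LLM call that drafts a hypothetical dictionary entry."""
    return "verb: to lift up, exalt"


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def hyde_retrieve(question, generate_fn, embed_fn, index, top_k=1):
    """Embed a generated hypothetical entry instead of the raw question."""
    hypothetical_entry = generate_fn(question)
    query_vector = embed_fn(hypothetical_entry)
    scored = [(cosine(query_vector, vector), doc) for doc, vector in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


docs = ["ϫⲓⲥⲉ: exalt, lift up", "ⲥⲱⲧⲙ: hear, listen", "ⲛⲟⲩⲧⲉ: god"]
index = [(doc, toy_embed(doc)) for doc in docs]

hits = hyde_retrieve("what does ϫⲓⲥⲉ mean?", toy_generate, toy_embed, index)
print(hits)
```

In this toy setup the raw question shares no vocabulary with the entries, while the generated entry does, which is the intuition behind embedding the hypothetical document rather than the query.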

5. 🟡 Sentence-Level Chunking for PDFs

Before: 600-character slicing with 100-char overlap (splits mid-sentence, mid-entry)
After: Sentence-level chunking, 256 tokens, 0 overlap

Reference: Systematic Chunking Analysis (arxiv 2601.14123) — overlap provides zero measurable benefit (|ΔBERTScore| ≤ 0.004)
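
For illustration, sentence-level chunking with a token budget and no overlap could look roughly like the sketch below. It approximates tokens by whitespace-separated words and splits sentences with a simple regex; both are simplifying assumptions, and `sentence_chunks` is a hypothetical name rather than the function in rag/ingest.py.

```python
import re


def sentence_chunks(text, max_tokens=256):
    """Pack whole sentences into chunks of at most max_tokens words, with no overlap.

    A single sentence longer than max_tokens is kept intact as its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_tokens = len(sentence.split())  # whitespace words as a token estimate
        if current and current_len + n_tokens > max_tokens:
            # Adding this sentence would exceed the budget: close the chunk first.
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five six. Seven eight nine."
print(sentence_chunks(sample, max_tokens=6))
```

Because chunk boundaries always fall between sentences, no dictionary entry or grammar rule is cut mid-sentence, which is the failure mode of fixed 600-character slicing.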

6. 🟡 Improved System Prompt

Added "Answer Quality Rules" section requiring the LLM to ground answers only in retrieved context, refuse to fabricate entries, and synthesize multi-chunk information coherently.

UI Changes

New "Retrieval Settings" section in sidebar with toggles for hybrid retrieval, HyDE, and reranking
Pipeline badges (BM25+Dense, HyDE, Reranked) shown under AI responses so users know which techniques were active
Updated requirements.txt with rank-bm25>=0.2.2
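
One way the sidebar toggles could map to the badges shown under a response is sketched below. The `RetrievalSettings` dataclass, its field names, and `pipeline_badges` are hypothetical; only the three badge labels come from this PR.

```python
from dataclasses import dataclass


@dataclass
class RetrievalSettings:
    """Hypothetical container for the three sidebar toggles."""
    use_hybrid: bool
    use_hyde: bool
    use_reranker: bool


def pipeline_badges(settings):
    """Return the badge labels for the techniques that were active."""
    badges = []
    if settings.use_hybrid:
        badges.append("BM25+Dense")
    if settings.use_hyde:
        badges.append("HyDE")
    if settings.use_reranker:
        badges.append("Reranked")
    return badges


print(pipeline_badges(RetrievalSettings(use_hybrid=True, use_hyde=False, use_reranker=True)))
```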

⚠️ Important Note: Re-ingestion Required

After merging, you MUST re-ingest all documents (CCL XML, grammar PDFs) because the embedding model changed from all-MiniLM (384-dim) to BGE-M3 (1024-dim). Existing ChromaDB embeddings are incompatible.
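
To make the incompatibility concrete, a guard like the following could fail fast when old 384-dim vectors are still present. This is a sketch under the assumption that stored embeddings can be read back as plain lists of floats; `assert_embedding_dim` is a hypothetical helper, not part of this PR.

```python
def assert_embedding_dim(stored_vectors, expected_dim=1024):
    """Raise ValueError if any stored vector's length differs from expected_dim."""
    mismatched = sorted({len(v) for v in stored_vectors if len(v) != expected_dim})
    if mismatched:
        raise ValueError(
            f"Found stored embeddings with dimension(s) {mismatched}, "
            f"expected {expected_dim}; re-ingest the documents."
        )


# Vectors produced by the old model (384-dim) trip the check.
old_vectors = [[0.0] * 384, [0.0] * 384]
try:
    assert_embedding_dim(old_vectors)
    stale_index_detected = False
except ValueError as err:
    stale_index_detected = True
    print(err)
```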

Files Changed

rag/chain.py — Core RAG pipeline with all 6 improvements
rag/ingest.py — BGE-M3 embeddings + sentence-level chunking
rag/status.py — Provider-aware status checks
ui/sidebar.py — New retrieval settings toggles
ui/chat_page.py — Pipeline info badges, pass new settings
requirements.txt — Added rank-bm25

v2: BGE-M3 embeddings, hybrid BM25+Dense retrieval, HyDE, cross-encoder reranking (fb888b3e)

v2: Sentence-level chunking, BGE-M3 embeddings, richer CCL parsing (777c2391)

v2: Provider-aware status checks (cf7ea488)

v2: Sidebar with retrieval settings (hybrid, HyDE, reranking toggles) (42a16c44)

v2: Chat page with pipeline badges and new RAG settings (a5354a6d)

v2: Add rank-bm25 dependency for hybrid retrieval (f3031792)

georgtawadrous changed pull request status to open Apr 25

georgtawadrous changed pull request status to merged Apr 25

Fix Greek hallucination: auto-convert Greek codepoints to Coptic in LLM output (ac7b7429)
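
The commit itself is not shown here, so the following is only a sketch of what such a conversion might look like: a translation table from the 24 lowercase Greek letters (plus final sigma) to the corresponding letters in the Unicode Coptic block. Uppercase and accented Greek are left out for brevity, and `greek_to_coptic` is a hypothetical name.

```python
# Greek small letters alpha..omega (U+03B1..U+03C9), skipping final sigma (U+03C2).
GREEK_SMALL = [chr(cp) for cp in range(0x03B1, 0x03CA) if cp != 0x03C2]

# Coptic small letters sit at odd codepoints from U+2C81; index 5 (SOU) has no
# counterpart in the 24-letter Greek alphabet, so it is skipped.
COPTIC_SMALL = [chr(0x2C81 + 2 * i) for i in range(25) if i != 5]

GREEK_TO_COPTIC = {ord(g): c for g, c in zip(GREEK_SMALL, COPTIC_SMALL)}
GREEK_TO_COPTIC[0x03C2] = "\u2CA5"  # final sigma also maps to Coptic SIMA


def greek_to_coptic(text):
    """Replace lowercase Greek letters with their Coptic-block counterparts.

    Characters outside the table are untouched, including Coptic letters such
    as ϫ that are encoded in the Greek and Coptic block.
    """
    return text.translate(GREEK_TO_COPTIC)


print(greek_to_coptic("νουτε"))  # Greek look-alike spelling of ⲛⲟⲩⲧⲉ
```

A blanket conversion like this would also rewrite any Greek text that is meant to stay Greek, so applying it only to spans expected to be Coptic is probably the safer design.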
