Paper: *REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression* (arXiv:2510.13999)
GGUF quantizations of 0xSero/gemma-4-21b-a4b-it-REAP, a variant of google/gemma-4-26b-a4b-it with 20% of its experts pruned via the REAP (Router-weighted Expert Activation Pruning) method.
| File | Quant | Size | BPW | Description |
|---|---|---|---|---|
| gemma-4-21b-a4b-it-REAP-BF16.gguf | BF16 | ~43 GB | 16.0 | Full precision, for re-quantization |
| gemma-4-21b-a4b-it-REAP-Q8_0.gguf | Q8_0 | ~22 GB | 8.0 | Near-lossless, large file |
| gemma-4-21b-a4b-it-REAP-Q6_K.gguf | Q6_K | ~17 GB | 6.56 | Near-lossless, recommended for high quality |
| gemma-4-21b-a4b-it-REAP-Q5_K_M.gguf | Q5_K_M | ~16 GB | 5.68 | High quality, larger size |
| gemma-4-21b-a4b-it-REAP-Q5_K_S.gguf | Q5_K_S | ~15 GB | 5.52 | High quality, slightly smaller |
| gemma-4-21b-a4b-it-REAP-Q4_K_M.gguf | Q4_K_M | ~14 GB | 4.89 | Recommended — best quality/size balance |
| gemma-4-21b-a4b-it-REAP-Q4_K_S.gguf | Q4_K_S | ~13 GB | 4.63 | 4-bit small |
| gemma-4-21b-a4b-it-REAP-Q3_K_L.gguf | Q3_K_L | ~11 GB | 4.27 | 3-bit large |
| gemma-4-21b-a4b-it-REAP-Q3_K_M.gguf | Q3_K_M | ~11 GB | 3.91 | 3-bit medium |
| gemma-4-21b-a4b-it-REAP-Q3_K_S.gguf | Q3_K_S | ~10 GB | 3.66 | 3-bit small |
| gemma-4-21b-a4b-it-REAP-Q2_K.gguf | Q2_K | ~9 GB | 2.96 | Smallest size, lowest quality |
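The Size column follows directly from the BPW column: file size ≈ parameters × bits-per-weight / 8. A quick sanity check against the table (the 21.34B parameter count comes from the spec table below; real files run slightly larger, especially at low BPW, because metadata and some tensors stay at higher precision):

```python
# Rough GGUF file-size estimate from parameter count and bits-per-weight (BPW).
PARAMS = 21.34e9  # total parameters, from the spec table

def gguf_size_gb(bpw: float) -> float:
    """Estimated file size in decimal GB: params * bits / 8 bits-per-byte / 1e9."""
    return PARAMS * bpw / 8 / 1e9

for quant, bpw in [("BF16", 16.0), ("Q8_0", 8.0), ("Q4_K_M", 4.89), ("Q2_K", 2.96)]:
    print(f"{quant:7s} ~{gguf_size_gb(bpw):.1f} GB")
```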
| Property | Value |
|---|---|
| Architecture | Gemma 4 (hybrid sliding/full attention MoE) |
| Parameters | 21.34B total / ~4B active per token |
| Experts | 103 total / 8 active per token |
| Context Length | 262,144 tokens |
| Original dtype | BF16 |
| Quantization tool | llama.cpp |
| License | Gemma |
```bash
# 1. Convert BF16 SafeTensors → GGUF
python convert_hf_to_gguf.py 0xSero/gemma-4-21b-a4b-it-REAP \
  --outfile gemma-4-21b-a4b-it-REAP-BF16.gguf \
  --outtype bf16

# 2. Quantize (example: Q4_K_M)
llama-quantize gemma-4-21b-a4b-it-REAP-BF16.gguf \
  gemma-4-21b-a4b-it-REAP-Q4_K_M.gguf Q4_K_M
```
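The same two-step recipe produces every row of the quant table. A small helper that builds the `llama-quantize` invocations for several levels at once (it only prints the commands; uncomment the `subprocess.run` line to execute them, which assumes `llama-quantize` from a built llama.cpp is on your PATH):

```python
import subprocess  # only needed if you actually execute the commands

# Build llama-quantize commands for several target quant levels.
# The target list is illustrative; any level from the table works the same way.
SRC = "gemma-4-21b-a4b-it-REAP-BF16.gguf"
TARGETS = ["Q6_K", "Q5_K_M", "Q4_K_M", "Q3_K_M"]

def quantize_cmd(quant: str) -> list[str]:
    """Command list: llama-quantize <bf16.gguf> <out.gguf> <QUANT>."""
    out = SRC.replace("BF16", quant)
    return ["llama-quantize", SRC, out, quant]

for q in TARGETS:
    print(" ".join(quantize_cmd(q)))
    # subprocess.run(quantize_cmd(q), check=True)  # uncomment to run for real
```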
```bash
llama-cli \
  -m gemma-4-21b-a4b-it-REAP-Q4_K_M.gguf \
  -ngl 99 -c 4096 \
  -p "Your prompt here"
```
```bash
llama-server \
  -m gemma-4-21b-a4b-it-REAP-Q4_K_M.gguf \
  -ngl 99 -c 4096 \
  --port 8080
```
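`llama-server` exposes an OpenAI-compatible HTTP API on the configured port. A minimal stdlib-only Python client against it, as a sketch (the prompt and `max_tokens` value are illustrative; run it only while the server above is up):

```python
import json
import urllib.request

# Chat against the llama-server started above via its OpenAI-compatible endpoint.
URL = "http://localhost:8080/v1/chat/completions"

payload = {
    "messages": [{"role": "user", "content": "Summarize expert pruning in one sentence."}],
    "max_tokens": 128,
}

def chat(url: str = URL) -> str:
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# With the server running: print(chat())
```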
Download the .gguf file and load it directly in your preferred local inference UI.
| Config | VRAM / RAM |
|---|---|
| Full GPU (Q4_K_M, recommended) | 16+ GB VRAM |
| Hybrid CPU+GPU (Q4_K_M) | 8 GB VRAM + 12 GB RAM |
| CPU only (Q4_K_M) | 18+ GB RAM |
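The hybrid row can be rough-checked with back-of-envelope arithmetic: weights split approximately linearly with the number of layers offloaded via `-ngl`. A sketch (the 48-layer count is an assumption for illustration only; real usage adds KV-cache and compute buffers on top, which is why the table quotes more than the raw split):

```python
# Back-of-envelope memory split for hybrid CPU+GPU offload with llama.cpp's -ngl.
MODEL_GB = 14.0   # Q4_K_M file size from the quant table above
N_LAYERS = 48     # assumed layer count, for illustration only

def split(ngl: int) -> tuple[float, float]:
    """Approximate (VRAM, RAM) in GB when offloading `ngl` of N_LAYERS layers."""
    frac = min(ngl, N_LAYERS) / N_LAYERS
    return MODEL_GB * frac, MODEL_GB * (1 - frac)

vram, ram = split(24)  # offload half the layers
print(f"~{vram:.1f} GB VRAM + ~{ram:.1f} GB RAM")  # plus cache/buffer overhead
```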
0xSero/gemma-4-21b-a4b-it-REAP applies REAP expert pruning (arXiv:2510.13999) to remove 20% of MoE experts (25 of 128 per layer) from Gemma 4 26B-A4B-it, while preserving routing behavior. Active parameters per token remain unchanged at ~4B. The result is an ~18% smaller model with near-identical generation quality across coding, math, and reasoning benchmarks.
Gemma — see Google's Gemma Terms of Use.