REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression
Paper: 2510.13999
A 40% expert-pruned variant of Qwen3.5-122B-A10B, produced with REAP (Routing-Enhanced Activation Pruning).
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-122B-A10B |
| Architecture | Qwen3.5 MoE (GDN + Full Attention) |
| Original Experts | 256 per layer |
| Remaining Experts | 154 per layer (~40% pruned) |
| Active Parameters | ~10B per token |
| Pruning Method | REAP with targeted refusal preservation |
| Preserve Threshold | 80% (super-expert protection) |
| Calibration | reap-calibration-data-v1 (23k benchmark-free samples) |
| Maintainer | 0xSero |
| Organization | Sybil Solutions |
| Project | REAP PR17 |
```bash
vllm serve 0xSero/Qwen3.5-122B-A10B-REAP-40 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --max-model-len 8192 \
  --trust-remote-code \
  --language-model-only \
  --dtype bfloat16
```
Important: use the `--language-model-only` flag; this is a text-only checkpoint pruned from the multimodal base model.
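Once the server is running, it exposes vLLM's standard OpenAI-compatible API (by default at `http://localhost:8000/v1/chat/completions`). A minimal request body is sketched below; the prompt and sampling settings are illustrative, not part of this model card.

```python
import json

# Illustrative request body for the OpenAI-compatible chat endpoint
# that `vllm serve` exposes by default. Adjust sampling settings to taste.
payload = {
    "model": "0xSero/Qwen3.5-122B-A10B-REAP-40",
    "messages": [
        {"role": "user", "content": "Summarize expert pruning in one sentence."}
    ],
    "max_tokens": 64,
    "temperature": 0.7,
}

# Serialize for POSTing, e.g. with curl or any HTTP client.
body = json.dumps(payload, indent=2)
print(body)
```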
REAP (Routing-Enhanced Activation Pruning) removes the least-activated experts from MoE models while preserving critical capabilities. It uses router activation patterns from a calibration dataset to identify dispensable experts, with special protection for safety-critical behaviors.
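The scoring idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the saliency formula (mean of router gate weight times expert-output norm over routed tokens), the top-k routing loop, and all function names are assumptions made for exposition.

```python
import numpy as np

def expert_saliency(gate_probs, out_norms, top_k=2):
    """Score each expert by the average (gate weight x output norm)
    over the calibration tokens routed to it (illustrative REAP-style metric)."""
    n_tokens, n_experts = gate_probs.shape
    scores = np.zeros(n_experts)
    counts = np.zeros(n_experts)
    # Each token activates only its top-k experts, as in standard MoE routing.
    topk = np.argsort(-gate_probs, axis=1)[:, :top_k]
    for t in range(n_tokens):
        for j in topk[t]:
            scores[j] += gate_probs[t, j] * out_norms[t, j]
            counts[j] += 1
    return scores / np.maximum(counts, 1)  # never-routed experts score 0

def prune_mask(scores, frac=0.40):
    """Keep the highest-saliency experts, dropping the lowest `frac` fraction."""
    n_drop = int(len(scores) * frac)
    drop = np.argsort(scores)[:n_drop]
    mask = np.ones(len(scores), dtype=bool)
    mask[drop] = False
    return mask
```

With 256 experts per layer and `frac=0.40`, `prune_mask` keeps 154 experts, matching the table above. The "super-expert protection" threshold from the card would additionally exempt the top-scoring experts from removal regardless of the target fraction.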
Same license as the base model (Qwen).