m51Lab-MiniMax-M2.7-REAP-139B-A10B

First publicly available REAP-40% pruned variant of MiniMax-M2.7, released by m51Lab on 2026-04-15.


MiniMax-M2.7 is a 229B-parameter Mixture-of-Experts LLM released 2026-04-12. This variant uses REAP (Router-weighted Expert Activation Pruning) to prune 40 % of experts per MoE block, reducing total parameters to ~139 B while preserving ~10 B active parameters per token. The result runs comfortably on systems the full model cannot reach (notably 96 GB Apple Silicon via GGUF).

Architecture

Property Value
Base model MiniMaxAI/MiniMax-M2.7
Transformer layers 62
Hidden size 3 072
Intermediate (expert) 1 536
MoE experts per block 154 (256 − 40 %)
Top-k routing 8
Active parameters / token ~10 B
Total parameters ~139 B
Max position embeddings 196 608
Vocabulary size 200 064
License Modified MIT (inherited)

Pruning parameters

  • Method: REAP (Lasby et al. 2025, arXiv:2510.13999)
  • Pruning rate: 40 % of experts per MoE block (256 → 154)
  • Seed: 42
  • Router renormalization: enabled
  • Calibration sequence length: 2 048 tokens
  • Effective samples: 6 144 packed across the three datasets below
  • Distance measure: cosine
  • Singleton super/outlier experts: disabled

Calibration dataset mix

Dataset Samples Purpose
theblackcat102/evol-codealpaca-v1 2 048 General coding
open-r1/Mixture-of-Thoughts[math] 2 048 Math / science / code reasoning
Salesforce/xlam-function-calling-60k 2 048 Single-turn tool calling

This mix mirrors Cerebras's public MiniMax-M2 / M2.1 / M2.5 REAP releases.

Evaluation

HumanEval pass@1 (on completed): 83.3 % (90 / 108)

For problems where the model completed its <think> reasoning within a 32 K-token generation budget, this variant (REAP-40 % pruned + Q4_K_M) solved 90 of 108 correctly — a strong quality signal for a 4-bit quantized, structurally pruned MoE.

Strict pass@1 (all 164 problems, cap-outs counted as fails): 54.9 %

56 of 164 problems exhausted the 32 K reasoning budget mid-<think> and are counted as fails under strict academic scoring. This is the production-deployment score if you constrain generation to 32 K tokens; allocate ≥64 K tokens to approach the 83 % ceiling.

Methodology: 2 × H100 80 GB, llama.cpp /v1/chat/completions, native <think> enabled, temperature=0.2, top_p=0.95, max_tokens=32000. No post-processing beyond HumanEval's canonical grading.

For continuity with prior quant comparisons: an earlier evaluation using raw /v1/completions + chat-prose stripping (non-canonical for reasoning models, bypasses <think>) reported 65.2 % (107 / 164). The numbers above use the canonical chat-completion path.

Smoke test (pre-publish, 5 diverse prompts)

# Prompt type Verdict
1 Trivial arithmetic PASS
2 Python Fibonacci PASS
3 Norwegian response PASS
4 MoE semantic explanation PASS
5 JSON tool-call echo PASS

5 / 5 PASS. Confirms out-of-box inference quality.

Known minor imperfection

During integrity audit of the 62-layer bias-correction tensor fix, one layer (layer 0) had expert keep-indices that differed slightly from the REAP-retained set (86 of 154 positions). The magnitude of the resulting bias mismatch is bounded by the layer-0 bias natural variance (max |Δ| = 0.75 on values in [8.06, 8.88]), so the impact on routing is negligible — confirmed by the 5/5 smoke test above. All other 61 layers are bit-perfect. Full analysis in the reproducibility log.

Inference

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B",
    torch_dtype="bfloat16",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B",
    trust_remote_code=True,
)

Recommended generation parameters: temperature=1.0, top_p=0.95, top_k=40.

For consumer hardware (96 GB Apple Silicon, multi-GPU rigs), use the GGUF quantizations.

Reproducibility

  • REAP pin: CerebrasResearch/reap@2b114e71 with patches for MiniMaxM2ForCausalLM registration (src/reap/model_util.py + src/reap/observer.py).
  • llama.cpp: post-PR #16831 (MiniMaxM2 arch merged). Built with CUDA, sm_90.
  • transformers: pinned to 4.55.0. Do not upgrade to 5.x (import reorganization breaks REAP).
  • Stage timings (8×H200 SXM):
    • Dequant FP8 → BF16: 14 min
    • REAP forward+save (Stage 1): 9 h (4 097 samples @ 41 s/sample effective)
    • GGUF convert: 20 min
    • imatrix calibration: 3 h 03 min (488 chunks × 2 048 tokens)
    • Quantization per variant: 15-45 min (parallel-3)

Citation

@article{lasby2025reap,
  title   = {REAP the Experts: Why Pruning Prevails for One-Shot MoE compression},
  author  = {Lasby, Mike and Hussein, Ahmed and Spyra, Jayden and Chkroun, Ivan
             and Suleiman, Oriol Sans and Ioannou, Nikoli and Hyder, Ammar Ali
             and Jacobs, Sam and Chaturvedi, Sachin and Mishra, Shreyanshu
             and Aboutalebi, Hossein and Rugol, Vasileios},
  journal = {arXiv preprint arXiv:2510.13999},
  year    = {2025}
}

@misc{minimax_m2_7,
  title  = {MiniMax-M2.7},
  author = {MiniMax AI},
  year   = {2026},
  url    = {https://huggingface.co/MiniMaxAI/MiniMax-M2.7}
}

Acknowledgements

  • Cerebras Research for the REAP repository and prior MiniMax M2/M2.1/M2.5 REAP releases that informed this work.
  • MiniMax AI for the base MiniMax-M2.7 model.
  • ubergarm and Unsloth for MiniMax-M2.7 GGUF conventions and per-tensor recipes that informed our MoE-aware quant variant.

License

Inherits the Modified MIT License from MiniMaxAI/MiniMax-M2.7.


Published by m51Lab — open-source LLM contributions from the M51 AI OS group.

Downloads last month
99
Safetensors
Model size
139B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B

Finetuned
(26)
this model
Quantizations
6 models

Paper for dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B