m51Lab-MiniMax-M2.7-REAP-139B-A10B
First publicly available REAP-40% pruned variant of MiniMax-M2.7, released by m51Lab on 2026-04-15.
MiniMax-M2.7 is a 229B-parameter Mixture-of-Experts LLM released 2026-04-12. This variant uses REAP (Router-weighted Expert Activation Pruning) to prune 40 % of experts per MoE block, reducing total parameters to ~139 B while preserving ~10 B active parameters per token. The result runs comfortably on systems the full model cannot reach (notably 96 GB Apple Silicon via GGUF).
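As a rough illustration of why the pruned model fits on hardware the full model cannot reach, the sketch below estimates the size of the weights alone. The 16 bits per weight for BF16 is exact; the 4.8 bits per weight used for a 4-bit GGUF is an assumption for illustration only, since the real average depends on the per-tensor quantization recipe. KV cache and runtime overhead are not included.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 139e9  # "~139 B" total parameters, as stated in this card

# BF16 stores exactly 16 bits per weight.
bf16_gb = weight_memory_gb(TOTAL_PARAMS, 16)

# ASSUMPTION: roughly 4.8 bits per weight on average for a 4-bit K-quant GGUF.
# The true figure varies with the quantization recipe.
q4_gb = weight_memory_gb(TOTAL_PARAMS, 4.8)

print(f"BF16 weights       : {bf16_gb:.0f} GB")
print(f"~4.8 bpw GGUF (est): {q4_gb:.0f} GB")
```

Under these assumptions the BF16 weights come to about 278 GB and the 4-bit estimate to about 83 GB, which leaves limited headroom for context on a 96 GB machine.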
Architecture
| Property | Value |
|---|---|
| Base model | MiniMaxAI/MiniMax-M2.7 |
| Transformer layers | 62 |
| Hidden size | 3 072 |
| Intermediate (expert) | 1 536 |
| MoE experts per block | 154 (256 with 40 % pruned) |
| Top-k routing | 8 |
| Active parameters / token | ~10 B |
| Total parameters | ~139 B |
| Max position embeddings | 196 608 |
| Vocabulary size | 200 064 |
| License | Modified MIT (inherited) |
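The figures in the table can be cross-checked with a little arithmetic. The sketch below is a back-of-the-envelope estimate, not an exact parameter count: it assumes each expert is a gated MLP with three weight matrices (gate, up, down), each of size hidden × intermediate, and it ignores attention, embeddings, and router weights.

```python
HIDDEN = 3072
EXPERT_INTERMEDIATE = 1536
LAYERS = 62
BASE_EXPERTS = 256
PRUNE_RATE = 0.40
TOP_K = 8

# ASSUMPTION: three projection matrices per expert (gate, up, down),
# each HIDDEN x EXPERT_INTERMEDIATE. Not confirmed by this card.
params_per_expert = 3 * HIDDEN * EXPERT_INTERMEDIATE

# 40 % of 256 is 102.4, so 154 experts remain after rounding.
kept_experts = round(BASE_EXPERTS * (1 - PRUNE_RATE))


def expert_params(experts_per_layer: int) -> int:
    """Total expert parameters across all layers."""
    return experts_per_layer * LAYERS * params_per_expert


base_b = expert_params(BASE_EXPERTS) / 1e9
pruned_b = expert_params(kept_experts) / 1e9
active_b = expert_params(TOP_K) / 1e9

print(f"experts kept per block : {kept_experts}")
print(f"expert params (base)   : {base_b:.1f} B")
print(f"expert params (pruned) : {pruned_b:.1f} B")
print(f"expert params (active) : {active_b:.1f} B")
```

Under this assumption the experts account for about 224.7 B of the base model's 229 B and about 135.2 B of the pruned model's ~139 B. Routed experts contribute about 7.0 B per token, so the remaining few billion parameters in each total would be the non-expert weights.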
Pruning parameters
- Method: REAP (Lasby et al. 2025, arXiv:2510.13999)
- Pruning rate: 40 % of experts per MoE block (256 → 154)
- Seed: 42
- Router renormalization: enabled
- Calibration sequence length: 2 048 tokens
- Effective samples: 6 144 packed across the three datasets below
- Distance measure: cosine
- Singleton super/outlier experts: disabled
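For readers unfamiliar with the method, the following is a simplified sketch of the REAP scoring idea as described in the cited paper: each expert is scored by the average, over the tokens routed to it, of its router gate value times the norm of its output, and the lowest-scoring experts are dropped. This is an illustration in plain Python with made-up numbers, not the code in the CerebrasResearch/reap repository; the function names and data layout are hypothetical.

```python
def reap_saliency(gate_weights, output_norms, num_experts):
    """Mean of gate_value * output_norm over the tokens routed to each expert.

    gate_weights[t] and output_norms[t] map expert id -> value for the
    experts selected by the router at token t.
    """
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for gates, norms in zip(gate_weights, output_norms):
        for expert, gate in gates.items():
            totals[expert] += gate * norms[expert]
            counts[expert] += 1
    # Experts that were never routed to get a score of 0.
    return [totals[e] / counts[e] if counts[e] else 0.0 for e in range(num_experts)]


def experts_to_keep(saliency, prune_rate):
    """Indices of the highest-scoring experts, in ascending index order."""
    n_keep = round(len(saliency) * (1 - prune_rate))
    ranked = sorted(range(len(saliency)), key=lambda e: saliency[e], reverse=True)
    return sorted(ranked[:n_keep])


# Toy example: 4 tokens, 5 experts, top-2 routing, 40 % pruning.
gates = [{0: 0.7, 1: 0.3}, {0: 0.6, 2: 0.4}, {3: 0.9, 4: 0.1}, {1: 0.5, 3: 0.5}]
norms = [{0: 2.0, 1: 1.0}, {0: 1.0, 2: 3.0}, {3: 2.0, 4: 0.5}, {1: 0.4, 3: 1.0}]

scores = reap_saliency(gates, norms, num_experts=5)
keep = experts_to_keep(scores, prune_rate=0.40)
print(keep)
```

In the toy data, experts 1 and 4 have the lowest scores and are pruned. For this model the same selection would run per MoE block over 256 experts, keeping 154.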
Calibration dataset mix
| Dataset | Samples | Purpose |
|---|---|---|
| theblackcat102/evol-codealpaca-v1 | 2 048 | General coding |
| open-r1/Mixture-of-Thoughts[math] | 2 048 | Math / science / code reasoning |
| Salesforce/xlam-function-calling-60k | 2 048 | Single-turn tool calling |
This mix mirrors Cerebras's public MiniMax-M2 / M2.1 / M2.5 REAP releases.
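The card does not spell out how samples were packed. One common reading, shown here purely as an assumption, is that tokenized samples are concatenated into a single stream and cut into fixed-length sequences. The helper name and the choice to drop the trailing remainder are hypothetical; real pipelines often also insert separator tokens between samples.

```python
def pack_sequences(tokenized_samples, seq_len):
    """Concatenate token lists and cut them into full seq_len-token chunks.

    Any leftover tokens that do not fill a whole chunk are dropped.
    """
    stream = [tok for sample in tokenized_samples for tok in sample]
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


# Toy example: three "samples" of different lengths packed into 4-token sequences.
samples = [[1, 2, 3, 4, 5], [6, 7, 8], [9, 10, 11, 12, 13]]
chunks = pack_sequences(samples, seq_len=4)
print(chunks)

# Sample budget from the table: three datasets at 2 048 samples each.
total_samples = 3 * 2048
print(total_samples)
```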
Evaluation
HumanEval pass@1 (on completed): 83.3 % (90 / 108)
For problems where the model completed its <think> reasoning within the 32 K-token generation budget, this variant (REAP-40 % pruned + Q4_K_M) solved 90 of 108 correctly, a strong result for a 4-bit quantized, structurally pruned MoE.
Strict pass@1 (all 164 problems, cap-outs counted as fails): 54.9 %
56 of 164 problems exhausted the 32 K reasoning budget mid-<think> and are counted as fails under strict academic scoring. This is the score to expect in deployment if generation is capped at 32 K tokens; a budget of ≥64 K tokens should bring results closer to the 83 % figure.
Methodology: 2 × H100 80 GB, llama.cpp /v1/chat/completions, native <think> enabled, temperature=0.2, top_p=0.95, max_tokens=32000. No post-processing beyond HumanEval's canonical grading.
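A request matching the sampling settings above could be built as in the sketch below. The base URL (llama.cpp's server commonly listens on port 8080) and the helper names are assumptions, and the cap-out check relies on the OpenAI-style finish_reason field of the response. Nothing here is taken from the actual evaluation harness.

```python
import json
import urllib.request

# Sampling settings quoted in the methodology above.
EVAL_SETTINGS = {"temperature": 0.2, "top_p": 0.95, "max_tokens": 32000}


def build_request(prompt, base_url="http://localhost:8080"):
    """Build a chat-completion request; base_url is a hypothetical local server."""
    payload = {"messages": [{"role": "user", "content": prompt}], **EVAL_SETTINGS}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def hit_token_budget(response):
    """True if generation stopped because max_tokens was reached (a cap-out)."""
    return response["choices"][0].get("finish_reason") == "length"


def ask(prompt):
    """Send the request and return the parsed JSON response (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Checking hit_token_budget on each response is one way to separate completed problems from those that ran out of budget mid-reasoning.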
For continuity with prior quant comparisons: an earlier evaluation using raw /v1/completions + chat-prose stripping (non-canonical for reasoning models, bypasses <think>) reported 65.2 % (107 / 164). The numbers above use the canonical chat-completion path.
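The three percentages quoted in this section follow directly from the reported counts, as this short check shows. Only the counts come from the card; the variable names are illustrative.

```python
TOTAL_PROBLEMS = 164   # full HumanEval set
CAP_OUTS = 56          # problems that exhausted the 32 K budget
SOLVED = 90            # problems solved on the chat-completion path
LEGACY_SOLVED = 107    # earlier raw /v1/completions evaluation

completed = TOTAL_PROBLEMS - CAP_OUTS

pass_on_completed = 100 * SOLVED / completed
strict_pass = 100 * SOLVED / TOTAL_PROBLEMS
legacy_pass = 100 * LEGACY_SOLVED / TOTAL_PROBLEMS

print(f"completed problems  : {completed}")
print(f"pass@1 on completed : {pass_on_completed:.1f} %")
print(f"strict pass@1       : {strict_pass:.1f} %")
print(f"legacy pass@1       : {legacy_pass:.1f} %")
```

The two headline numbers share the same numerator (90 solved problems) and differ only in whether the 56 cap-outs are included in the denominator.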
Smoke test (pre-publish, 5 diverse prompts)
| # | Prompt type | Verdict |
|---|---|---|
| 1 | Trivial arithmetic | PASS |
| 2 | Python Fibonacci | PASS |
| 3 | Norwegian response | PASS |
| 4 | MoE semantic explanation | PASS |
| 5 | JSON tool-call echo | PASS |
5 / 5 PASS. This indicates that out-of-box inference works as expected on basic prompts.
Known minor imperfection
During the integrity audit of the 62-layer bias-correction tensor fix, one layer (layer 0) had expert keep-indices that differed slightly from the REAP-retained set (86 of 154 positions). The resulting bias mismatch is bounded by the natural variance of the layer-0 bias (max |Δ| = 0.75 on values in [8.06, 8.88]), so the impact on routing is expected to be negligible, which is consistent with the 5/5 smoke test above. The other 61 layers are bit-perfect. Full analysis is in the reproducibility log.
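The kind of check described above can be sketched as follows: given the per-expert bias of the unpruned layer and two lists of keep-indices, count the positions where the lists disagree and measure the largest resulting bias difference. The function name and toy values are hypothetical, and the actual audit may have been implemented differently.

```python
def keep_index_mismatch(bias, used_idx, expected_idx):
    """Compare two keep-index lists position by position.

    Returns (number of differing positions, max |bias[used] - bias[expected]|).
    """
    if len(used_idx) != len(expected_idx):
        raise ValueError("index lists must have the same length")
    differing = [p for p, (u, e) in enumerate(zip(used_idx, expected_idx)) if u != e]
    max_delta = max(
        (abs(bias[used_idx[p]] - bias[expected_idx[p]]) for p in differing),
        default=0.0,
    )
    return len(differing), max_delta


# Toy example: 5 original experts, 3 kept, one position disagrees.
bias = [8.10, 8.50, 8.85, 8.20, 8.06]
n_diff, max_delta = keep_index_mismatch(bias, used_idx=[0, 1, 3], expected_idx=[0, 2, 3])
print(n_diff, round(max_delta, 2))
```

Because the difference is taken between two values from the same bias vector, it can never exceed the spread of that vector, which is why the mismatch is bounded by the range of the layer-0 bias.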
Inference
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B",
    torch_dtype="bfloat16",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B",
    trust_remote_code=True,
)
Recommended generation parameters: temperature=1.0, top_p=0.95, top_k=40.
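A minimal sketch of applying these parameters with Transformers is shown below. It assumes a model and tokenizer already loaded as above; the helper name and the max_new_tokens default are illustrative choices, not recommendations from this card. Sampling parameters only take effect when do_sample=True.

```python
# Recommended sampling parameters from this card; do_sample enables sampling.
GENERATION_KWARGS = {
    "do_sample": True,
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
}


def generate_reply(model, tokenizer, prompt, max_new_tokens=512):
    """Generate one chat reply; max_new_tokens=512 is an arbitrary example value."""
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, **GENERATION_KWARGS)
    # Decode only the newly generated tokens, not the prompt.
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Since the model reasons inside <think> before answering, a much larger max_new_tokens than the example default is likely needed for non-trivial prompts.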
For consumer hardware (96 GB Apple Silicon, multi-GPU rigs), use the GGUF quantizations.
Reproducibility
- REAP pin: CerebrasResearch/reap@2b114e71 with patches for MiniMaxM2ForCausalLM registration (src/reap/model_util.py + src/reap/observer.py).
- llama.cpp: post-PR #16831 (MiniMaxM2 arch merged). Built with CUDA, sm_90.
- transformers: pinned to 4.55.0. Do not upgrade to 5.x (import reorganization breaks REAP).
- Stage timings (8×H200 SXM):
  - Dequant FP8 → BF16: 14 min
  - REAP forward+save (Stage 1): 9 h (4 097 samples @ 41 s/sample effective)
  - GGUF convert: 20 min
  - imatrix calibration: 3 h 03 min (488 chunks × 2 048 tokens)
  - Quantization per variant: 15-45 min (parallel-3)
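To guard against accidentally running the pruning stage with an incompatible transformers release, a small pre-flight check along these lines may help. It is a sketch: the function names and messages are hypothetical, and it only encodes the two facts stated above (the 4.55.0 pin and the warning about 5.x).

```python
from importlib import metadata

PINNED_VERSION = "4.55.0"


def is_pre_5x(version: str) -> bool:
    """True if the major version is below 5."""
    return int(version.split(".")[0]) < 5


def check_installed(package: str = "transformers") -> str:
    """Describe how the installed package version relates to the pin."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package} is not installed"
    if installed == PINNED_VERSION:
        return f"{package} {installed} matches the pin"
    if is_pre_5x(installed):
        return f"{package} {installed} differs from the pin {PINNED_VERSION} (untested)"
    return f"{package} {installed} is 5.x or later; REAP imports are expected to break"


print(check_installed())
```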
Citation
@article{lasby2025reap,
title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE compression},
author = {Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish
and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
journal = {arXiv preprint arXiv:2510.13999},
year = {2025}
}
@misc{minimax_m2_7,
title = {MiniMax-M2.7},
author = {MiniMax AI},
year = {2026},
url = {https://huggingface.co/MiniMaxAI/MiniMax-M2.7}
}
Acknowledgements
- Cerebras Research for the REAP repository and prior MiniMax M2/M2.1/M2.5 REAP releases that informed this work.
- MiniMax AI for the base MiniMax-M2.7 model.
- ubergarm and Unsloth for MiniMax-M2.7 GGUF conventions and per-tensor recipes that informed our MoE-aware quant variant.
License
Inherits the Modified MIT License from MiniMaxAI/MiniMax-M2.7.
Published by m51Lab — open-source LLM contributions from the M51 AI OS group.