Libraries

How to use dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B", trust_remote_code=True)
# Load BF16 weights sharded across the available devices, as in the Inference section
model = AutoModelForCausalLM.from_pretrained(
    "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B",
    torch_dtype="bfloat16",
    device_map="auto",
    trust_remote_code=True,
)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

SGLang

How to use dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B with Docker Model Runner:
```
docker model run hf.co/dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B
```
Quantized variants of this model are available for use in llama.cpp, Ollama, LM Studio, or any compatible app.

m51Lab-MiniMax-M2.7-REAP-139B-A10B

First publicly available REAP-40% pruned variant of MiniMax-M2.7, released by m51Lab on 2026-04-15.

MiniMax-M2.7 is a 229B-parameter Mixture-of-Experts LLM released on 2026-04-12. This variant uses REAP (Router-weighted Expert Activation Pruning) to remove 40 % of the experts in each MoE block (256 → 154), reducing total parameters to ~139 B while keeping ~10 B active parameters per token. The pruned model fits on hardware that cannot hold the full model, notably 96 GB Apple Silicon machines via the GGUF quantizations.

Architecture

| Property | Value |
|---|---|
| Base model | `MiniMaxAI/MiniMax-M2.7` |
| Transformer layers | 62 |
| Hidden size | 3 072 |
| Intermediate (expert) | 1 536 |
| MoE experts per block | 154 (256 − 40 %) |
| Top-k routing | 8 |
| Active parameters / token | ~10 B |
| Total parameters | ~139 B |
| Max position embeddings | 196 608 |
| Vocabulary size | 200 064 |
| License | Modified MIT (inherited) |
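The parameter counts in the table can be roughly cross-checked from the listed dimensions. The sketch below assumes each expert is a gated MLP with three hidden × intermediate weight matrices (an assumption; the table does not state the expert layout) and ignores attention, embeddings, and router weights, so it yields expert-only totals that should land somewhat below the headline figures.

```python
# Rough expert-parameter arithmetic from the architecture table.
# ASSUMPTION (not stated in the table): each expert is a gated MLP with
# three weight matrices of shape hidden x intermediate.

LAYERS = 62
HIDDEN = 3072
INTERMEDIATE = 1536
EXPERTS_FULL = 256
PRUNE_RATE = 0.40
TOP_K = 8
MATRICES_PER_EXPERT = 3  # assumed gate, up, and down projections


def experts_kept(total: int, prune_rate: float) -> int:
    """Number of experts left per block after pruning, rounded to nearest."""
    return round(total * (1.0 - prune_rate))


def expert_params(experts_per_block: int) -> int:
    """Expert-only parameter count summed over all layers."""
    per_expert = MATRICES_PER_EXPERT * HIDDEN * INTERMEDIATE
    return LAYERS * experts_per_block * per_expert


kept = experts_kept(EXPERTS_FULL, PRUNE_RATE)
print(f"experts kept per block: {kept}")
print(f"expert params, full:    {expert_params(EXPERTS_FULL) / 1e9:.1f} B")
print(f"expert params, pruned:  {expert_params(kept) / 1e9:.1f} B")
print(f"expert params, active:  {expert_params(TOP_K) / 1e9:.1f} B")
```

Under these assumptions the expert weights come to about 224.7 B before pruning and 135.2 B after, which leaves roughly 4 B for all other weights in both cases and is consistent with the 229 B and ~139 B totals.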

Pruning parameters

Method: REAP (Lasby et al. 2025, arXiv:2510.13999)
Pruning rate: 40 % of experts per MoE block (256 → 154)
Seed: 42
Router renormalization: enabled
Calibration sequence length: 2 048 tokens
Effective samples: 6 144 packed across the three datasets below
Distance measure: cosine
Singleton super/outlier experts: disabled
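As a rough illustration of the selection step, the sketch below scores each expert by the average of router gate weight times expert-output norm over the tokens routed to it, then keeps the highest-scoring experts. It follows the saliency criterion as described in the REAP paper, but it is a toy pure-Python version: the function names, data layout, and synthetic inputs are hypothetical, and it is not the CerebrasResearch/reap implementation. The `renormalize` helper reflects one plausible reading of the "router renormalization" setting, which is also an assumption.

```python
import random
from collections import defaultdict


def renormalize(gates):
    """Rescale the selected experts' gate weights so they sum to 1."""
    total = sum(gates)
    return [g / total for g in gates]


def reap_saliency(routed_tokens, num_experts):
    """Mean of gate_weight * output_norm over the tokens routed to each expert.

    routed_tokens: iterable of per-token lists of (expert_id, gate, out_norm).
    Experts that never receive a token get a saliency of 0.0.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for token in routed_tokens:
        for expert_id, gate, out_norm in token:
            sums[expert_id] += gate * out_norm
            counts[expert_id] += 1
    return [sums[e] / counts[e] if counts[e] else 0.0 for e in range(num_experts)]


def select_experts(saliency, prune_rate):
    """Indices of the experts to keep, in ascending order."""
    n_keep = round(len(saliency) * (1.0 - prune_rate))
    ranked = sorted(range(len(saliency)), key=lambda e: saliency[e], reverse=True)
    return sorted(ranked[:n_keep])


# Synthetic stand-in for calibration statistics: 256 experts, top-8 routing.
rng = random.Random(42)
NUM_EXPERTS, TOP_K = 256, 8
tokens = []
for _ in range(2048):
    chosen = rng.sample(range(NUM_EXPERTS), TOP_K)
    gates = renormalize([rng.random() for _ in chosen])
    tokens.append([(e, g, rng.uniform(0.5, 2.0)) for e, g in zip(chosen, gates)])

kept = select_experts(reap_saliency(tokens, NUM_EXPERTS), prune_rate=0.40)
print(len(kept))
```

In a real run the gate weights and output norms would come from forward passes over the calibration data rather than from a random generator.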

Calibration dataset mix

| Dataset | Samples | Purpose |
|---|---|---|
| `theblackcat102/evol-codealpaca-v1` | 2 048 | General coding |
| `open-r1/Mixture-of-Thoughts[math]` | 2 048 | Math / science / code reasoning |
| `Salesforce/xlam-function-calling-60k` | 2 048 | Single-turn tool calling |

This mix mirrors Cerebras's public MiniMax-M2 / M2.1 / M2.5 REAP releases.

Evaluation

HumanEval pass@1 (on completed): 83.3 % (90 / 108)

Of the 108 problems where the model finished its <think> reasoning within the 32 K-token generation budget, this variant (REAP-40 % pruned, Q4_K_M quantized) solved 90.

Strict pass@1 (all 164 problems, cap-outs counted as fails): 54.9 %

56 of the 164 problems exhausted the 32 K-token budget mid-<think> and count as failures under strict scoring. This is the score to expect in deployment if generation is capped at 32 K tokens; a budget of ≥64 K tokens should move results toward the 83.3 % completed-only figure.

Methodology: 2 × H100 80 GB, llama.cpp /v1/chat/completions, native <think> enabled, temperature=0.2, top_p=0.95, max_tokens=32000. No post-processing beyond HumanEval's canonical grading.
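The request shape implied by this methodology can be sketched with the standard library alone. The base URL is a hypothetical local address (llama.cpp's server listens on port 8080 by default), and treating `finish_reason == "length"` as a cap-out is an assumption about how truncation could be detected, not a description of the harness that produced the numbers above.

```python
import json
import urllib.request

# Hypothetical server address; adjust to wherever the llama.cpp server runs.
BASE_URL = "http://localhost:8080"


def build_payload(prompt: str) -> dict:
    """Chat-completion request body using the sampling settings listed above."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
        "top_p": 0.95,
        "max_tokens": 32000,
    }


def complete(prompt: str, base_url: str = BASE_URL) -> tuple:
    """Send one prompt and return (reply_text, capped_out)."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        choice = json.load(response)["choices"][0]
    # ASSUMPTION: a finish_reason of "length" marks a run that hit max_tokens.
    capped_out = choice["finish_reason"] == "length"
    return choice["message"]["content"], capped_out


# Usage, with a server running:
#   text, capped = complete("Write a Python function that reverses a string.")
```

A problem flagged as capped out here would count as a failure under the strict score and be excluded from the completed-only score.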

For continuity with prior quant comparisons: an earlier evaluation using raw /v1/completions + chat-prose stripping (non-canonical for reasoning models, bypasses <think>) reported 65.2 % (107 / 164). The numbers above use the canonical chat-completion path.
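The three percentages quoted in this section follow directly from the reported counts. A minimal check (the variable names are illustrative):

```python
def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100.0 * passed / total, 1)


TOTAL_PROBLEMS = 164  # full HumanEval
CAPPED_OUT = 56       # exhausted the 32 K budget while still reasoning
PASSED = 90           # solved among the completed problems

completed = TOTAL_PROBLEMS - CAPPED_OUT
print(completed)                          # 108
print(pass_rate(PASSED, completed))       # 83.3
print(pass_rate(PASSED, TOTAL_PROBLEMS))  # 54.9
print(pass_rate(107, TOTAL_PROBLEMS))     # 65.2 (earlier /v1/completions run)
```

The completed-only and strict scores share the same numerator, so the gap between them comes entirely from the 56 capped-out problems.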

Smoke test (pre-publish, 5 diverse prompts)

| # | Prompt type | Verdict |
|---|---|---|
| 1 | Trivial arithmetic | PASS |
| 2 | Python Fibonacci | PASS |
| 3 | Norwegian response | PASS |
| 4 | MoE semantic explanation | PASS |
| 5 | JSON tool-call echo | PASS |

5 / 5 PASS. This confirms that out-of-the-box inference works across these prompt types.

Known minor imperfection

During the integrity audit of the 62-layer bias-correction tensor fix, one layer (layer 0) had expert keep-indices that differed slightly from the REAP-retained set (86 of 154 positions). The resulting bias mismatch is bounded by the natural spread of the layer-0 bias values: max |Δ| = 0.75, against values spanning [8.06, 8.88] (a range of 0.82). The impact on routing should therefore be negligible, which is consistent with the 5/5 smoke test above. The other 61 layers are bit-perfect. The full analysis is in the reproducibility log.

Inference

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B",
    torch_dtype="bfloat16",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "dervig/m51Lab-MiniMax-M2.7-REAP-139B-A10B",
    trust_remote_code=True,
)

Recommended generation parameters: temperature=1.0, top_p=0.95, top_k=40.
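A minimal sketch of passing these settings to `generate`, assuming the `model` and `tokenizer` objects loaded above. The helper name `generate_reply` and the `max_new_tokens=512` default are illustrative choices, not values from this card.

```python
# Sampling settings recommended above; do_sample=True is required for
# temperature, top_p, and top_k to take effect in transformers.
GENERATION_KWARGS = {
    "do_sample": True,
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
}


def generate_reply(model, tokenizer, prompt: str, max_new_tokens: int = 512) -> str:
    """Generate one chat reply with the recommended sampling settings."""
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    outputs = model.generate(
        **inputs, max_new_tokens=max_new_tokens, **GENERATION_KWARGS
    )
    # Decode only the newly generated tokens, not the prompt.
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Because the model reasons inside <think> before answering, a small `max_new_tokens` may cut the reply off before the final answer appears.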

For consumer hardware (96 GB Apple Silicon, multi-GPU rigs), use the GGUF quantizations.
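A back-of-the-envelope estimate shows why quantization matters here. The sketch below converts a parameter count and an average bits-per-weight figure into weight storage. The 16-bit BF16 figure is exact; the ~4.85 bits per weight used for Q4_K_M is an assumed typical average, and the real size depends on the per-tensor recipe of each GGUF file.

```python
def weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 139e9  # ~139 B, from the architecture table

# BF16 stores 16 bits per weight.
print(f"BF16: {weight_size_gb(TOTAL_PARAMS, 16):.0f} GB")

# ASSUMPTION: ~4.85 bits per weight as a typical average for Q4_K_M.
print(f"Q4_K_M (assumed 4.85 bpw): {weight_size_gb(TOTAL_PARAMS, 4.85):.0f} GB")
```

These figures cover weights only; the KV cache and runtime buffers need additional memory on top.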

Reproducibility

- REAP pin: CerebrasResearch/reap@2b114e71 with patches for MiniMaxM2ForCausalLM registration (src/reap/model_util.py + src/reap/observer.py).
- llama.cpp: post-PR #16831 (MiniMaxM2 arch merged). Built with CUDA, sm_90.
- transformers: pinned to 4.55.0. Do not upgrade to 5.x (import reorganization breaks REAP).
- Stage timings (8×H200 SXM):
  - Dequant FP8 → BF16: 14 min
  - REAP forward+save (Stage 1): 9 h (4 097 samples @ 41 s/sample effective)
  - GGUF convert: 20 min
  - imatrix calibration: 3 h 03 min (488 chunks × 2 048 tokens)
  - Quantization per variant: 15-45 min (parallel-3)
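Since the transformers pin matters for reproducing the REAP stage, a quick environment check before starting a run can save time. This is a minimal sketch using only the standard library; the helper names are illustrative.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED_TRANSFORMERS = "4.55.0"  # pin stated in the reproducibility notes


def matches_pin(installed: str, required: str = REQUIRED_TRANSFORMERS) -> bool:
    """True only when the installed version string equals the pinned one."""
    return installed.strip() == required


def check_transformers_pin() -> str:
    """Describe whether the local transformers install matches the pin."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return "transformers is not installed"
    if matches_pin(installed):
        return f"transformers {installed} matches the pin"
    return f"transformers {installed} differs from pinned {REQUIRED_TRANSFORMERS}"


print(check_transformers_pin())
```

An exact string comparison is deliberately strict: any other release, including a later 4.x, is reported as a mismatch.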

Citation

@article{lasby2025reap,
  title   = {REAP the Experts: Why Pruning Prevails for One-Shot MoE compression},
  author  = {Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean
             and Ioannou, Yani and Thangarasa, Vithursan},
  journal = {arXiv preprint arXiv:2510.13999},
  year    = {2025}
}

@misc{minimax_m2_7,
  title  = {MiniMax-M2.7},
  author = {MiniMax AI},
  year   = {2026},
  url    = {https://huggingface.co/MiniMaxAI/MiniMax-M2.7}
}

Acknowledgements

Cerebras Research for the REAP repository and prior MiniMax M2/M2.1/M2.5 REAP releases that informed this work.
MiniMax AI for the base MiniMax-M2.7 model.
ubergarm and Unsloth for MiniMax-M2.7 GGUF conventions and per-tensor recipes that informed our MoE-aware quant variant.

License

Inherits the Modified MIT License from MiniMaxAI/MiniMax-M2.7.

Published by m51Lab — open-source LLM contributions from the M51 AI OS group.
