Qwen3.6-VL-REAP-26B-A3B
REAP-pruned version of Qwen/Qwen3.6-35B-A3B. 25% of routed experts removed per layer (256 → 192), reducing total parameters from ~35B to ~27B while keeping all 40 MoE layers, the shared expert, and the full vision-language encoder intact. BF16 precision, no quantization.
35B → 27B | 192 experts/layer | ~3B active per token | VL preserved
Why This Exists
This is the intermediate checkpoint in the REAP+W4A16 pipeline. We release it separately so others can:
- Quantize it differently — GPTQ, AWQ, GGUF, FP8, whatever fits your deployment.
- Fine-tune on it — the pruned model is a valid BF16 checkpoint, ready for LoRA/QLoRA.
- Benchmark the pruning in isolation — separate REAP quality impact from quantization impact.
For the ready-to-deploy quantized version, see atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16.
Model Specifications
| Property | Original | This Model (REAP Pruned) |
|---|---|---|
| Total Parameters | ~35B | ~27B |
| Active Parameters | ~3B | ~3B |
| Experts per Layer | 256 | 192 |
| Routed per Token | 8 | 8 |
| Shared Expert | 1/layer (preserved) | 1/layer (preserved) |
| Layers | 40 (30 GDN + 10 full attn) | 40 (30 GDN + 10 full attn) |
| Vision Encoder | Yes | Yes (unmodified) |
| Precision | BF16 | BF16 |
| Disk Size | ~67 GB | ~50 GB |
| Context | 262K | 262K |
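The disk-size row follows directly from the parameter count: BF16 stores two bytes per parameter. A quick sanity check, using the rounded ~26B figure from the model name (an approximation, not the exact count):

```python
def bf16_disk_gb(n_params: float) -> float:
    """Approximate on-disk size of a BF16 checkpoint: 2 bytes per parameter."""
    return n_params * 2 / 1e9

# ~26B parameters -> roughly the ~50 GB listed above.
print(round(bf16_disk_gb(26e9)))  # 52
```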
Calibration Dataset
Expert pruning quality depends heavily on what data the model sees during saliency measurement. We built a composite calibration set of 24,576 samples at seqlen 16,384 tilted toward Qwen3.6's headline capability: agentic coding and tool use.
| Source | Samples | Why |
|---|---|---|
| SWE-bench/SWE-smith-trajectories (tool split) | 6,144 | Agentic multi-turn. Full SWE-bench trajectories with tool calls, file edits, and test runs. This is the closest proxy to real-world agentic coding — the primary use case we're optimizing for. |
| Salesforce/xlam-function-calling-60k | 6,144 | Single-turn tool calling. Structured function definitions + invocations. Ensures the experts responsible for tool-use formatting survive pruning. |
| theblackcat102/evol-codealpaca-v1 | 4,096 | General coding. Evolved instruction-following across languages and difficulty levels. Breadth coverage so we don't over-specialize on agentic patterns. |
| open-r1/Mixture-of-Thoughts (code) | 2,730 | Code reasoning. Long chain-of-thought traces for programming problems. Preserves the model's ability to reason step-by-step through code. |
| open-r1/Mixture-of-Thoughts (math) | 2,730 | Math reasoning. Ensures pruning doesn't disproportionately kill the experts activated during mathematical reasoning — a known risk with code-only calibration. |
| open-r1/Mixture-of-Thoughts (science) | 2,732 | Science reasoning. Same rationale as math — broader domain coverage keeps the model general-purpose even though the primary target is coding. |
Design rationale: 50% of samples (12,288) are agentic coding and tool use; the other 50% are general coding plus reasoning across code, math, and science. This split is intentional: at 25% expert pruning, we're only removing the tail of the saliency distribution. The diverse calibration ensures we're measuring truly "least-useful" experts, not just "least useful for code." Prior work at 50% pruning with pile-10k calibration showed degradation on tool-use tasks specifically because the calibration didn't exercise tool-calling experts enough.
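The mixture bookkeeping can be checked mechanically. A minimal sketch (per-source counts copied from the table above; the agentic/tool grouping is ours):

```python
# Per-source sample counts from the calibration table.
mixture = {
    "SWE-smith-trajectories": 6_144,
    "xlam-function-calling-60k": 6_144,
    "evol-codealpaca-v1": 4_096,
    "Mixture-of-Thoughts/code": 2_730,
    "Mixture-of-Thoughts/math": 2_730,
    "Mixture-of-Thoughts/science": 2_732,
}

total = sum(mixture.values())
agentic_tool_use = mixture["SWE-smith-trajectories"] + mixture["xlam-function-calling-60k"]

assert total == 24_576
assert agentic_tool_use == 12_288  # exactly half the set
print(agentic_tool_use / total)    # 0.5
```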
Pruning Details
- Method: REAP (Router-weighted Expert Activation Pruning)
- Compression: 25% expert removal (256 → 192 routed experts per layer)
- Shared expert: Never pruned — always retained as-is
- Saliency metric: `softmax(router_logits) × ‖expert_output‖₂`, averaged per expert across ~19.6K calibration sequences
- Router renormalization: Disabled; gate weights are kept as-sliced. The softmax naturally rebalances over the remaining experts. (Early experiments with rescaling `gate.weight` by √(192/256) significantly degraded routing discrimination.)
- Seed: 42
Per-layer expert selections and saliency scores are in `reap_metadata.json`.
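A toy numeric sketch of the saliency metric in pure Python (made-up logits and norms; the real implementation lives in `reap_prune.py`):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Two calibration tokens, four experts (toy numbers).
router_logits = [[2.0, 1.0, 0.5, -1.0],
                 [1.5, 2.5, 0.0, -0.5]]
output_norms  = [[3.0, 1.0, 2.0, 0.5],   # per-expert output L2 norms
                 [2.5, 3.5, 1.0, 0.8]]

# Saliency per expert: mean over tokens of softmax(router_logits) * norm.
n_tokens, n_experts = len(router_logits), 4
saliency = [0.0] * n_experts
for logits, norms in zip(router_logits, output_norms):
    probs = softmax(logits)
    for e in range(n_experts):
        saliency[e] += probs[e] * norms[e] / n_tokens

# Pruning removes the lowest-saliency experts; here expert 3 would go first.
drop = min(range(n_experts), key=lambda e: saliency[e])
print(drop)  # 3
```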
What's Different from atbender's Qwen3.5 Work
- 25% prune (not 50%) — our Qwen3.5 work at 50% showed measurable degradation on tool-use. 25% is much safer.
- Full pipeline from scratch — not downstream of OpenMOSE or any pre-pruned checkpoint. We own every step.
- Agentic-coding calibration — purpose-built composite dataset (see above), not generic pile-10k.
- Vision encoder preserved — the Qwen3.5 pruned models had VL but it was incidental. Here it's a design goal.
Important: dtype Must Be bfloat16
The GDN (Gated Delta Network) linear attention layers overflow float16 (max finite value 65504). Always use `dtype=torch.bfloat16`; float16 produces NaN outputs.
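A stdlib-only illustration of the range problem (`struct`'s `e` format is IEEE half precision; 70000 is just an arbitrary value above the fp16 range, not an actual GDN activation):

```python
import struct

FP16_MAX = 65504.0  # largest finite float16 value

# 65504 survives a round-trip through half precision...
assert struct.unpack("e", struct.pack("e", FP16_MAX))[0] == FP16_MAX

# ...but anything above it cannot be represented at all.
try:
    struct.pack("e", 70000.0)
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)  # True; bfloat16, by contrast, ranges up to ~3.4e38
```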
Prerequisites
Qwen3.6 uses the `qwen3_5_moe` architecture, which landed in `transformers` main after the last tagged release. Install from source:
```bash
pip install "git+https://github.com/huggingface/transformers.git@main"
pip install "torch>=2.7" --index-url https://download.pytorch.org/whl/cu128  # or cu126/cu121
pip install accelerate torchvision

# Optional but recommended (10x faster GDN linear attention):
pip install flash-linear-attention causal-conv1d einops
```
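Before downloading 50 GB of weights, it's worth verifying the source install actually registers the architecture. A sketch (the `qwen3_5_moe` registry key is taken from this card and is an assumption about the upstream naming):

```python
def has_qwen3_5_moe() -> bool:
    """True if the installed transformers registers the qwen3_5_moe model type."""
    try:
        from transformers.models.auto.configuration_auto import CONFIG_MAPPING_NAMES
    except ImportError:
        return False  # transformers not installed
    return "qwen3_5_moe" in CONFIG_MAPPING_NAMES

print(has_qwen3_5_moe())
```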
Usage
Text only (load as CausalLM)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "atbender/Qwen3.6-VL-REAP-26B-A3B",
    dtype=torch.bfloat16,  # MUST be bfloat16
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("atbender/Qwen3.6-VL-REAP-26B-A3B", trust_remote_code=True)

messages = [{"role": "user", "content": "Write a Python function to sort a list of dicts by key."}]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True,
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=1024, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
Vision-language (load as ConditionalGeneration)
The model's architecture is `Qwen3_5MoeForConditionalGeneration`. Use `AutoModelForImageTextToText` to load the full VL wrapper (language model + vision encoder). The BF16 vision tower is intact from the base model.
```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image
import torch

processor = AutoProcessor.from_pretrained("atbender/Qwen3.6-VL-REAP-26B-A3B", trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    "atbender/Qwen3.6-VL-REAP-26B-A3B",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

image = Image.open("path/to/image.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe this image."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(processor.tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
vLLM
Tested with vLLM ≥ 0.19.1 (vllm/vllm-openai:latest). The image registers Qwen3_5MoeForConditionalGeneration natively.
Serve
```bash
docker run --gpus all --rm -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  atbender/Qwen3.6-VL-REAP-26B-A3B \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --dtype bfloat16 \
  --trust-remote-code \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```
This is the BF16 checkpoint (~50 GB of weights). At 32K context you need ~60 GB VRAM; consider `--tensor-parallel-size 2` on dual 48 GB cards. For single-GPU consumer deployment (24 GB), use the W4A16 sibling atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16 (~15 GiB VRAM on load).
Client usage
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
r = client.chat.completions.create(
    model="atbender/Qwen3.6-VL-REAP-26B-A3B",
    messages=[{"role": "user", "content": "What is 17 * 23?"}],
    max_tokens=1024,
)
print(r.choices[0].message.reasoning)  # <think> block
print(r.choices[0].message.content)    # final answer
```
Vision + tool-calling examples: see the W4A16 sibling card — client usage is identical (same OpenAI-compatible API, same endpoints).
Hardware
- Built on: Single NVIDIA RTX Pro 6000 (Blackwell, 96 GB VRAM). Pruning takes ~13–14h for the full 24K-sample composite calibration at seqlen 16,384.
- Runs on: BF16 model weights are ~50 GB. With KV cache for 32K context you need ~60 GB VRAM minimum — so an H100 80GB, A100 80GB, Pro 6000, or 2× 48 GB cards with tensor parallel. For single-GPU consumer use, quantize to W4A16 (see atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16).
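A back-of-the-envelope version of the VRAM estimate above. The KV-cache term only covers the 10 full-attention layers (the 30 GDN layers keep a fixed-size recurrent state instead); the head count and head dimension are placeholders, not the model's actual config:

```python
def vram_estimate_gb(
    n_params: float = 26e9,      # total parameters (from the model name)
    full_attn_layers: int = 10,  # only these layers grow a KV cache
    kv_heads: int = 8,           # placeholder, not the real config
    head_dim: int = 128,         # placeholder, not the real config
    seq_len: int = 32_768,
    bytes_per_elem: int = 2,     # bf16
) -> float:
    weights = n_params * bytes_per_elem
    # K and V, per layer, per KV head, per position.
    kv_cache = 2 * full_attn_layers * kv_heads * head_dim * seq_len * bytes_per_elem
    return (weights + kv_cache) / 1e9

print(round(vram_estimate_gb()))  # weights dominate; mid-50s GB before runtime overhead
```

The hybrid GDN/attention stack is why the KV cache stays small relative to a 40-layer full-attention model of the same size.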
Limitations
- No post-prune fine-tuning — quality loss is unrecovered. Target: ≥95% retention on coding/tool tasks vs baseline.
- Calibration was English + code-heavy. Long-tail languages may degrade.
- Vision encoder is structurally preserved but not re-validated on VL benchmarks.
Recipe
```bash
python reap_prune.py \
  --model-id Qwen/Qwen3.6-35B-A3B \
  --target-experts 192 \
  --dataset composite \
  --seqlen 16384 \
  --seed 42
```
Full script: `reap_prune.py`
Credits
- Cerebras Research — REAP method
- Qwen Team (Alibaba) — Base model
- OpenMOSE & 0xSero — Prior art on the REAP recipe for Qwen MoE models
Citation
```bibtex
@article{lasby2025reap,
  title={REAP: Router-weighted Expert Activation Pruning for Scalable Mixture-of-Experts Compression},
  author={Lasby, Mike and others},
  year={2025},
  url={https://github.com/CerebrasResearch/reap}
}

@misc{autoround2024,
  title={AutoRound: Advanced Weight Quantization},
  author={Intel Corporation},
  year={2024},
  howpublished={\url{https://github.com/intel/auto-round}}
}
```