Qwen3.6-VL-REAP-26B-A3B

REAP-pruned version of Qwen/Qwen3.6-35B-A3B. 25% of routed experts removed per layer (256 → 192), reducing total parameters from ~35B to ~27B while keeping all 40 MoE layers, the shared expert, and the full vision-language encoder intact. BF16 precision, no quantization.

35B → 27B | 192 experts/layer | ~3B active per token | VL preserved

Why This Exists

This is the intermediate checkpoint in the REAP+W4A16 pipeline. We release it separately so others can:

  • Quantize it differently — GPTQ, AWQ, GGUF, FP8, whatever fits your deployment.
  • Fine-tune on it — the pruned model is a valid BF16 checkpoint, ready for LoRA/QLoRA.
  • Benchmark the pruning in isolation — separate REAP quality impact from quantization impact.

For the ready-to-deploy quantized version, see atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16.

Model Specifications

| Property | Original | This Model (REAP Pruned) |
|---|---|---|
| Total Parameters | ~35B | ~27B |
| Active Parameters | ~3B | ~3B |
| Experts per Layer | 256 | 192 |
| Routed per Token | 8 | 8 |
| Shared Expert | 1/layer (preserved) | 1/layer (preserved) |
| Layers | 40 (30 GDN + 10 full attn) | 40 (30 GDN + 10 full attn) |
| Vision Encoder | Yes | Yes (unmodified) |
| Precision | BF16 | BF16 |
| Disk Size | ~67 GB | ~50 GB |
| Context | 262K | 262K |
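
If you want to verify the table against the downloaded checkpoint without pulling the full weights, a minimal config check works. Note that the field names (num_experts, num_experts_per_tok, num_hidden_layers) and the nested text_config are assumptions about the qwen3_5_moe config schema, not confirmed by this card; print the config and adapt if they differ.

from transformers import AutoConfig

# Fetches only config.json, not the ~50 GB of weights.
cfg = AutoConfig.from_pretrained("atbender/Qwen3.6-VL-REAP-26B-A3B", trust_remote_code=True)
text_cfg = getattr(cfg, "text_config", cfg)   # VL configs often nest the language-model config

print("experts per layer:", getattr(text_cfg, "num_experts", "n/a"))          # expect 192
print("routed per token:", getattr(text_cfg, "num_experts_per_tok", "n/a"))   # expect 8
print("hidden layers:", getattr(text_cfg, "num_hidden_layers", "n/a"))        # expect 40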

Calibration Dataset

Expert pruning quality depends heavily on what data the model sees during saliency measurement. We built a composite calibration set of 24,576 samples at seqlen 16,384 tilted toward Qwen3.6's headline capability: agentic coding and tool use.

| Source | Samples | Why |
|---|---|---|
| SWE-bench/SWE-smith-trajectories (tool split) | 6,144 | Agentic multi-turn. Full SWE-bench trajectories with tool calls, file edits, and test runs. This is the closest proxy to real-world agentic coding — the primary use case we're optimizing for. |
| Salesforce/xlam-function-calling-60k | 6,144 | Single-turn tool calling. Structured function definitions + invocations. Ensures the experts responsible for tool-use formatting survive pruning. |
| theblackcat102/evol-codealpaca-v1 | 4,096 | General coding. Evolved instruction-following across languages and difficulty levels. Breadth coverage so we don't over-specialize on agentic patterns. |
| open-r1/Mixture-of-Thoughts (code) | 2,730 | Code reasoning. Long chain-of-thought traces for programming problems. Preserves the model's ability to reason step-by-step through code. |
| open-r1/Mixture-of-Thoughts (math) | 2,730 | Math reasoning. Ensures pruning doesn't disproportionately kill the experts activated during mathematical reasoning — a known risk with code-only calibration. |
| open-r1/Mixture-of-Thoughts (science) | 2,732 | Science reasoning. Same rationale as math — broader domain coverage keeps the model general-purpose even though the primary target is coding. |

Design rationale: 50% of the samples (12,288) target agentic coding and tool use directly; the other 50% covers general coding plus chain-of-thought reasoning over code, math, and science. This split is intentional — at 25% expert pruning we're only removing the tail of the saliency distribution, and the diverse calibration ensures we're measuring truly "least-useful" experts, not just "least useful for code." Prior work at 50% pruning with pile-10k calibration showed degradation specifically on tool-use tasks because the calibration didn't exercise the tool-calling experts enough.
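
For reference, a minimal sketch of how this mix could be assembled with the datasets library. Sample counts and seed follow the table and the Pruning Details section; the split/config names (the SWE-smith tool split, the Mixture-of-Thoughts subsets) and the downstream rendering into 16,384-token chat sequences are assumptions, not the exact script we ran.

from datasets import load_dataset

SEED = 42
recipe = [
    ("SWE-bench/SWE-smith-trajectories", None, 6144),       # agentic multi-turn (tool split assumed)
    ("Salesforce/xlam-function-calling-60k", None, 6144),   # single-turn tool calling
    ("theblackcat102/evol-codealpaca-v1", None, 4096),      # general coding
    ("open-r1/Mixture-of-Thoughts", "code", 2730),          # code reasoning
    ("open-r1/Mixture-of-Thoughts", "math", 2730),          # math reasoning
    ("open-r1/Mixture-of-Thoughts", "science", 2732),       # science reasoning
]

parts = {}
for repo, config, n in recipe:
    ds = load_dataset(repo, config, split="train")
    parts[(repo, config)] = ds.shuffle(seed=SEED).select(range(n))

print(sum(len(d) for d in parts.values()))  # 24,576 samples, then rendered with the chat
                                            # template and packed to seqlen 16,384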

Pruning Details

  • Method: REAP (Router-weighted Expert Activation Pruning)
  • Compression: 25% expert removal (256 → 192 routed experts per layer)
  • Shared expert: Never pruned — always retained as-is
  • Saliency metric: softmax(router_logits) × ‖expert_output‖₂ averaged per expert across ~19.6K calibration sequences (see the sketch below)
  • Router renormalization: Disabled — gate weights kept as-sliced. The softmax naturally rebalances over the remaining experts. (Early experiments with rescaling gate.weight by √(192/256) significantly degraded routing discrimination.)
  • Seed: 42

Per-layer expert selections and saliency scores are in reap_metadata.json.
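
The saliency computation above reduces to a small accumulation loop per layer. A minimal sketch follows; tensor shapes and the mechanism for capturing router logits and per-expert outputs (e.g. forward hooks) are assumptions, and the released reap_prune.py remains the reference.

import torch

NUM_EXPERTS, KEEP = 256, 192

def accumulate_saliency(saliency_sum, token_count, router_logits, expert_outputs):
    # router_logits:  [tokens, NUM_EXPERTS]          raw gate logits for one MoE layer
    # expert_outputs: [tokens, NUM_EXPERTS, hidden]  per-expert outputs (zeros where a
    #                                                token was not routed to that expert)
    gate = torch.softmax(router_logits.float(), dim=-1)   # router weight per expert
    norms = expert_outputs.float().norm(dim=-1)           # ||expert_output||_2
    saliency_sum += (gate * norms).sum(dim=0)             # accumulate per expert
    return saliency_sum, token_count + router_logits.shape[0]

# Per layer, after streaming every calibration batch through accumulate_saliency():
saliency_sum, token_count = torch.zeros(NUM_EXPERTS), 0
# ... calibration pass populates saliency_sum / token_count via hooks ...
mean_saliency = saliency_sum / max(token_count, 1)
keep_idx = torch.topk(mean_saliency, KEEP).indices.sort().values
# Experts outside keep_idx are dropped; the gate weight rows are sliced to keep_idx
# as-is, with no renormalization (softmax over the survivors rebalances on its own).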

What's Different from atbender's Qwen3.5 Work

  • 25% prune (not 50%) — our Qwen3.5 work at 50% showed measurable degradation on tool-use. 25% is much safer.
  • Full pipeline from scratch — not downstream of OpenMOSE or any pre-pruned checkpoint. We own every step.
  • Agentic-coding calibration — purpose-built composite dataset (see above), not generic pile-10k.
  • Vision encoder preserved — the Qwen3.5 pruned models kept their VL stack only incidentally; here, preserving it is an explicit design goal.

Important: dtype Must Be bfloat16

The GDN (Gated Delta Network) linear attention layers overflow float16 (max 65504). Always use dtype=torch.bfloat16. float16 produces NaN outputs.
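
A quick illustration of the failure mode (this only shows the numeric behaviour, not the GDN layer itself):

import torch

x = torch.tensor([70000.0])                       # above the float16 max of 65504
print(x.to(torch.float16))                        # tensor([inf], dtype=torch.float16)
print(x.to(torch.float16) - x.to(torch.float16))  # inf - inf -> nan, which then propagates
print(x.to(torch.bfloat16))                       # finite: bf16 shares float32's exponent range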

Prerequisites

Qwen3.6 uses the qwen3_5_moe architecture, which landed in transformers main after the last tagged release. Install from source:

pip install "git+https://github.com/huggingface/transformers.git@main"
pip install "torch>=2.7" --index-url https://download.pytorch.org/whl/cu128  # or cu126/cu121
pip install accelerate torchvision

# Optional but recommended (10x faster GDN linear attention):
pip install flash-linear-attention causal-conv1d einops
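
To confirm the source install actually ships the architecture before downloading ~50 GB of weights, a quick check (the model_type string is the one quoted above):

from transformers.models.auto.configuration_auto import CONFIG_MAPPING

# Should print True on a recent transformers main; False means the install is too old.
print("qwen3_5_moe" in CONFIG_MAPPING.keys())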

Usage

Text only (load as CausalLM)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "atbender/Qwen3.6-VL-REAP-26B-A3B",
    dtype=torch.bfloat16,              # MUST be bfloat16
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("atbender/Qwen3.6-VL-REAP-26B-A3B", trust_remote_code=True)

messages = [{"role": "user", "content": "Write a Python function to sort a list of dicts by key."}]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True,
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

Vision-language (load as ConditionalGeneration)

The model's architecture is Qwen3_5MoeForConditionalGeneration. Use AutoModelForImageTextToText to load the full VL wrapper (language model + vision encoder). The BF16 vision tower is intact from the base model.

from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image
import torch

processor = AutoProcessor.from_pretrained("atbender/Qwen3.6-VL-REAP-26B-A3B", trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    "atbender/Qwen3.6-VL-REAP-26B-A3B",
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

image = Image.open("path/to/image.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe this image."},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(processor.tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

vLLM

Tested with vLLM ≥ 0.19.1 (vllm/vllm-openai:latest). The image registers Qwen3_5MoeForConditionalGeneration natively.

Serve

docker run --gpus all --rm -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:latest \
    atbender/Qwen3.6-VL-REAP-26B-A3B \
        --tensor-parallel-size 1 \
        --max-model-len 32768 \
        --dtype bfloat16 \
        --trust-remote-code \
        --reasoning-parser qwen3 \
        --enable-auto-tool-choice \
        --tool-call-parser qwen3_coder

This is the BF16 checkpoint (50 GB weights). At 32K context you need ~60 GB VRAM; consider --tensor-parallel-size 2 on dual 48 GB cards. For single-GPU consumer deployment (24 GB), use the W4A16 sibling atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16 (15 GiB VRAM on load).

Client usage

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
r = client.chat.completions.create(
    model="atbender/Qwen3.6-VL-REAP-26B-A3B",
    messages=[{"role": "user", "content": "What is 17 * 23?"}],
    max_tokens=1024,
)
print(r.choices[0].message.reasoning_content)   # <think> block
print(r.choices[0].message.content)             # final answer

Vision + tool-calling examples: see the W4A16 sibling card — client usage is identical (same OpenAI-compatible API, same endpoints).
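
For quick reference, a minimal tool-calling sketch against the server started above. It uses the standard OpenAI tools schema; the get_weather function is a made-up illustration, not something shipped with this card.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
r = client.chat.completions.create(
    model="atbender/Qwen3.6-VL-REAP-26B-A3B",
    messages=[{"role": "user", "content": "What's the weather in Lisbon right now?"}],
    tools=tools,
)
print(r.choices[0].message.tool_calls)   # parsed into structured calls by --tool-call-parser qwen3_coder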

Hardware

  • Built on: Single NVIDIA RTX Pro 6000 (Blackwell, 96 GB VRAM). Pruning takes ~13–14h for the full 24K-sample composite calibration at seqlen 16,384.
  • Runs on: BF16 model weights are ~50 GB. With KV cache for 32K context you need ~60 GB VRAM minimum — so an H100 80GB, A100 80GB, Pro 6000, or 2× 48 GB cards with tensor parallel. For single-GPU consumer use, quantize to W4A16 (see atbender/Qwen3.6-VL-REAP-26B-A3B-W4A16).

Limitations

  • No post-prune fine-tuning — the quality loss from pruning is not recovered. Working target: ≥95% retention on coding/tool-use tasks vs. the unpruned baseline.
  • Calibration was English + code-heavy. Long-tail languages may degrade.
  • Vision encoder is structurally preserved but not re-validated on VL benchmarks.

Recipe

python reap_prune.py \
    --model-id Qwen/Qwen3.6-35B-A3B \
    --target-experts 192 \
    --dataset composite \
    --seqlen 16384 \
    --seed 42

Full script: reap_prune.py

Credits

Citation

@article{lasby2025reap,
  title={REAP: Router-weighted Expert Activation Pruning for Scalable Mixture-of-Experts Compression},
  author={Lasby, Mike and others},
  year={2025},
  url={https://github.com/CerebrasResearch/reap}
}

@misc{autoround2024,
  title={AutoRound: Advanced Weight Quantization},
  author={Intel Corporation},
  year={2024},
  howpublished={\url{https://github.com/intel/auto-round}}
}