Qwen3.6-35B-A3B_INT4 โ PTQ Quantized (W4A16)
Overview
This repository provides a Post-Training Quantized (PTQ) version of:
Base Model: Qwen/Qwen3.6-35B-A3B
Quantized By: TheHouseOfTheDude
This is a true PTQ quantization:
- No calibration dataset
- One-shot quantization
- Fast and deterministic pipeline
Quantization Details
- Scheme: W4A16
- Weights: INT4 (per-channel symmetric)
- Activations: FP16 / BF16
- Method: llmcompressor.oneshot
- Targets: Linear layers only
Ignored layers:
- lm_head
- visual modules
- linear_attn
- mtp
- mlp.gate
- mlp.shared_expert_gate
KLD Results
Mean KLD: 0.071393
low divergence โ strong fidelity to original model. (0.01 Better than a IQ3_S)
Implementation Notes
- Uses AutoModelForImageTextToText for correct vLLM weight paths
- Includes Transformers v5 key remapping fix
- No calibration dataset used
Usage (vLLM)
pip install -U vllm
vllm serve TheHouseOfTheDude/Qwen3.6-35B-A3B_INT4 \
--quantization compressed-tensors \
--tensor-parallel-size 4 \
--dtype bfloat16
Notes
- Requires compressed-tensors runtime
- Not compatible with vanilla Transformers loading
- Optimized for production inference
Credits
- Qwen/Qwen3.6-35B-A3B
- TheHouseOfTheDude
Model tree for TheHouseOfTheDude/Qwen3.6-35B-A3B_INT4
Base model
Qwen/Qwen3.6-35B-A3B