PolarQuant Unified (Weights Q5 + KV Cache Q3)
Collection: Hadamard + Lloyd-Max Q5 weights + torchao INT4 + Q3 bit-packed KV cache. Zero speed penalty. Consumer GPU ready.
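The Q3 KV cache stores each quantized value in 3 bits, so 8 values fit in exactly 3 bytes. A minimal pure-Python sketch of one possible bit-packing scheme (the actual kernel layout is an assumption; production kernels pack for coalesced GPU memory access):

```python
def pack_q3(values):
    """Pack 3-bit integers (0..7) into bytes, little-endian bit order."""
    buf, nbits, out = 0, 0, bytearray()
    for v in values:
        assert 0 <= v < 8, "Q3 values must fit in 3 bits"
        buf |= v << nbits
        nbits += 3
        while nbits >= 8:
            out.append(buf & 0xFF)
            buf >>= 8
            nbits -= 8
    if nbits:
        out.append(buf & 0xFF)  # flush the final partial byte
    return bytes(out)

def unpack_q3(data, count):
    """Inverse of pack_q3: recover `count` 3-bit values."""
    buf, nbits, out = 0, 0, []
    it = iter(data)
    for _ in range(count):
        while nbits < 3:
            buf |= next(it) << nbits
            nbits += 8
        out.append(buf & 0b111)
        buf >>= 3
        nbits -= 3
    return out

vals = [5, 1, 7, 0, 3, 6, 2, 4]
packed = pack_q3(vals)
print(len(packed))                    # 3 bytes for 8 values (24 bits)
print(unpack_q3(packed, 8) == vals)   # True: lossless round-trip
```

Compared with storing one value per byte, this is the 8/3 ≈ 2.67x density that makes a Q3 cache attractive on consumer VRAM.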
9B hybrid model (Qwen3.5 architecture) quantized to INT4 with GPTQ calibration. Loads natively in vLLM with Marlin kernel. 113 tok/s on RTX 3090.
We found the optimal config: group_size=64 + FOEM = 67.07% HumanEval (vs 66.87% BF16).

Download PolarQuant v7 (gs64+FOEM): same Marlin kernel, 8.7 GB.
| Method | HumanEval | Size | Kernel |
|---|---|---|---|
| PolarQuant v7 (gs64+FOEM) | 67.07% | 8.7 GB | Marlin |
| BF16 Base | 66.87% | 19.3 GB | N/A |
| FOEM INT4 gs128 (Arien0) | 62.80% | 8.6 GB | Marlin |
| This model (GPTQ gs128) | 60.98% | 8.6 GB | Marlin |
| Naive INT4 (old) | 55.49% | 6.5 GB | Marlin |
| Metric | GPTQ INT4 | BF16 Original | Delta |
|---|---|---|---|
| HumanEval | 60.98% | 66.87% | -5.9pp (calibrated) |
| Speed | 113 tok/s | ~40 tok/s | 2.8x faster |
| Size | 8.6 GB | 18 GB | 2.1x smaller |
| WikiText-2 PPL | 6.56 | 6.37 | +0.19 |
Previously, naive INT4 scored 55.49%; GPTQ calibration improved it by +5.5pp.
```bash
pip install vllm

vllm serve caiovicentino1/Qwopus3.5-9B-v3-PolarQuant-Q5 \
  --language-model-only \
  --enforce-eager
```
No plugins, no custom code. Just vLLM.
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="caiovicentino1/Qwopus3.5-9B-v3-PolarQuant-Q5",
    trust_remote_code=True,
    enforce_eager=True,
)

output = llm.generate(
    ["Write a Python function for binary search."],
    SamplingParams(max_tokens=256, temperature=0.7),
)
print(output[0].outputs[0].text)
```
| # | Method | HumanEval | Notes |
|---|---|---|---|
| 1 | Naive INT4 (RTN) | 55.49% | Round-to-nearest, no calibration |
| 2 | This model (GPTQ gs128) | 60.98% | Calibrated, desc_act=True |
| 3 | FOEM gs128 | 61.59% | +FOEM error correction |
| 4 | FOEM gs128 (Arien0) | 62.80% | Different calibration data |
| 5 | BF16 Base | 66.87% | Original unquantized |
| 6 | PolarQuant v7 gs64+FOEM | 67.07% | BEATS BF16! |
Confirmed by @Arien0:
| Metric | Value |
|---|---|
| Throughput | 113 tok/s |
| Kernel | Marlin (gptq_marlin) |
| VRAM | ~8 GB |
| Property | Value |
|---|---|
| Base Model | Jackrong/Qwopus3.5-9B-v3 |
| Architecture | Qwen3.5 hybrid (linear attention + full attention) |
| Parameters | 9B |
| Layers | 32 (24 linear attention + 8 full attention) |
| Hidden Size | 4096 |
| Property | Value |
|---|---|
| Method | GPTQ (calibrated) |
| Tool | GPTQModel v6.0.3 |
| Bits | 4 |
| Group Size | 128 |
| Symmetric | Yes |
| desc_act | True (activation order) |
| Calibration | 512 samples from neuralmagic/LLM_compression_calibration |
| Format | GPTQ (native vLLM Marlin kernel) |
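With group_size=128, each run of 128 weights along the input dimension shares one scale, so an outlier only degrades its own group rather than the whole tensor. A pure-Python sketch of group-wise symmetric quantization (illustrative; GPTQModel additionally applies calibrated error compensation and desc_act reordering, which this omits):

```python
def quantize_grouped(weights, group_size=4, bits=4):
    """Group-wise symmetric quantization: one scale per group of weights."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4
    scales, codes = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0  # avoid div-by-zero
        scales.append(scale)
        codes.extend(max(-qmax, min(qmax, round(w / scale))) for w in group)
    return scales, codes

# One large outlier: with per-group scales it only coarsens its own group.
w = [0.1, -0.2, 0.15, 0.05, 8.0, 0.1, -0.1, 0.2]
scales, codes = quantize_grouped(w, group_size=4)
print(len(scales))            # 2 groups -> 2 scales
print(scales[0] < scales[1])  # True: the outlier inflates only group 2
```

Smaller groups (gs64 vs gs128) mean more scales and slightly more storage, but finer adaptation to local weight magnitude, which is consistent with the gs64 config winning the ablation above.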
Tip: want better quality? Use PolarQuant v7 with gs64+FOEM for 67.07% HumanEval.
| Flag | Why |
|---|---|
| `--language-model-only` | Skips vision encoder (4304 dim not Marlin-compatible) |
| `--enforce-eager` | Recommended for stability |
```bibtex
@article{vicentino2026polarquant,
  title={PolarQuant: Hadamard-Rotated Post-Training Quantization},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2026}
}
```
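The Hadamard rotation referenced in the paper title spreads outlier energy across coordinates before quantization: an orthonormal Hadamard transform preserves the vector exactly (it is its own inverse) but flattens the amplitude distribution, which suits a uniform quantization grid. A pure-Python sketch of the fast Walsh-Hadamard transform (normalization and toy dimensions are illustrative assumptions, not the paper's exact procedure):

```python
import math

def hadamard_transform(x):
    """In-place fast Walsh-Hadamard transform; len(x) must be a power of two.
    Scaled by 1/sqrt(n) so the transform is orthonormal and self-inverse."""
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    inv = 1.0 / math.sqrt(n)
    for i in range(n):
        x[i] *= inv
    return x

# A single outlier gets spread evenly across all coordinates...
v = [8.0, 0.0, 0.0, 0.0]
rotated = hadamard_transform(v[:])
print(rotated)  # [4.0, 4.0, 4.0, 4.0]
# ...and the transform is self-inverse, so no information is lost.
print(hadamard_transform(rotated[:]))  # [8.0, 0.0, 0.0, 0.0]
```

After rotation, the per-group maxima that set the quantization scales are much smaller, so the uniform INT grid wastes fewer levels on rare outliers.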
Base model: Qwen/Qwen3.5-9B-Base