PolarQuant Unified (Weights Q5 + KV Cache Q3)
Collection: Hadamard + Lloyd-Max Q5 weights + torchao INT4 + Q3 bit-packed KV cache. Zero speed penalty. Consumer GPU ready.
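The Q3 KV cache stores each quantized value in 3 bits, so 8 values fit in exactly 3 bytes. A minimal pure-Python sketch of one possible bit-packing scheme (the actual kernel layout is an assumption; production kernels pack for coalesced GPU memory access):

```python
def pack_q3(values):
    """Pack 3-bit integers (0..7) into bytes, little-endian bit order."""
    buf, nbits, out = 0, 0, bytearray()
    for v in values:
        assert 0 <= v < 8, "Q3 values must fit in 3 bits"
        buf |= v << nbits
        nbits += 3
        while nbits >= 8:
            out.append(buf & 0xFF)
            buf >>= 8
            nbits -= 8
    if nbits:
        out.append(buf & 0xFF)  # flush the final partial byte
    return bytes(out)

def unpack_q3(data, count):
    """Inverse of pack_q3: recover `count` 3-bit values."""
    buf, nbits, out = 0, 0, []
    it = iter(data)
    for _ in range(count):
        while nbits < 3:
            buf |= next(it) << nbits
            nbits += 8
        out.append(buf & 0b111)
        buf >>= 3
        nbits -= 3
    return out

vals = [5, 1, 7, 0, 3, 6, 2, 4]
packed = pack_q3(vals)
print(len(packed))                    # 3 bytes for 8 values (24 bits)
print(unpack_q3(packed, 8) == vals)   # True: lossless round-trip
```

Compared with storing one value per byte, this is the 8/3 ≈ 2.67x density that makes a Q3 cache attractive on consumer VRAM.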
9B hybrid model (Qwen3.5 architecture) quantized to INT4 with GPTQ calibration. Loads natively in vLLM with Marlin kernel. 113 tok/s on RTX 3090.
We found the optimal config: group_size=64 + FOEM = 67.07% HumanEval (vs 66.87% BF16).

Download PolarQuant v7 (gs64+FOEM): same Marlin kernel, 8.7 GB.
| Method | HumanEval | Size | Kernel |
|---|---|---|---|
| PolarQuant v7 (gs64+FOEM) | 67.07% | 8.7 GB | Marlin |
| BF16 Base | 66.87% | 19.3 GB | N/A |
| FOEM INT4 gs128 (Arien0) | 62.80% | 8.6 GB | Marlin |
| This model (GPTQ gs128) | 60.98% | 8.6 GB | Marlin |
| Naive INT4 (old) | 55.49% | 6.5 GB | Marlin |
| Metric | GPTQ INT4 | BF16 Original | Delta |
|---|---|---|---|
| HumanEval | 60.98% | 66.87% | -5.9pp (calibrated) |
| Speed | 113 tok/s | ~40 tok/s | 2.8x faster |
| Size | 8.6 GB | 18 GB | 2.1x smaller |
| WikiText-2 PPL | 6.56 | 6.37 | +0.19 |
Previously, naive INT4 scored 55.49%; GPTQ calibration improved it by +5.5pp.
```bash
pip install vllm

vllm serve caiovicentino1/Qwopus3.5-9B-v3-PolarQuant-Q5 \
  --language-model-only \
  --enforce-eager
```
No plugins, no custom code. Just vLLM.
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="caiovicentino1/Qwopus3.5-9B-v3-PolarQuant-Q5",
    trust_remote_code=True,
    enforce_eager=True,
)

output = llm.generate(
    ["Write a Python function for binary search."],
    SamplingParams(max_tokens=256, temperature=0.7),
)
print(output[0].outputs[0].text)
```
| # | Method | HumanEval | Notes |
|---|---|---|---|
| 1 | Naive INT4 (RTN) | 55.49% | Round-to-nearest, no calibration |
| 2 | This model (GPTQ gs128) | 60.98% | Calibrated, desc_act=True |
| 3 | FOEM gs128 | 61.59% | +FOEM error correction |
| 4 | FOEM gs128 (Arien0) | 62.80% | Different calibration data |
| 5 | BF16 Base | 66.87% | Original unquantized |
| 6 | PolarQuant v7 gs64+FOEM | 67.07% | BEATS BF16! |
Confirmed by @Arien0:
| Metric | Value |
|---|---|
| Throughput | 113 tok/s |
| Kernel | Marlin (gptq_marlin) |
| VRAM | ~8 GB |
| Property | Value |
|---|---|
| Base Model | Jackrong/Qwopus3.5-9B-v3 |
| Architecture | Qwen3.5 hybrid (linear attention + full attention) |
| Parameters | 9B |
| Layers | 32 (24 linear attention + 8 full attention) |
| Hidden Size | 4096 |
| Property | Value |
|---|---|
| Method | GPTQ (calibrated) |
| Tool | GPTQModel v6.0.3 |
| Bits | 4 |
| Group Size | 128 |
| Symmetric | Yes |
| desc_act | True (activation order) |
| Calibration | 512 samples from neuralmagic/LLM_compression_calibration |
| Format | GPTQ (native vLLM Marlin kernel) |
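With group_size=128, each run of 128 weights along the input dimension shares one scale, so an outlier only degrades its own group rather than the whole tensor. A pure-Python sketch of group-wise symmetric quantization (illustrative; GPTQModel additionally applies calibrated error compensation and desc_act reordering, which this omits):

```python
def quantize_grouped(weights, group_size=4, bits=4):
    """Group-wise symmetric quantization: one scale per group of weights."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4
    scales, codes = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0  # avoid div-by-zero
        scales.append(scale)
        codes.extend(max(-qmax, min(qmax, round(w / scale))) for w in group)
    return scales, codes

# One large outlier: with per-group scales it only coarsens its own group.
w = [0.1, -0.2, 0.15, 0.05, 8.0, 0.1, -0.1, 0.2]
scales, codes = quantize_grouped(w, group_size=4)
print(len(scales))            # 2 groups -> 2 scales
print(scales[0] < scales[1])  # True: the outlier inflates only group 2
```

Smaller groups (gs64 vs gs128) mean more scales and slightly more storage, but finer adaptation to local weight magnitude, which is consistent with the gs64 config winning the ablation above.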
Tip: want better quality? Use PolarQuant v7 with gs64+FOEM for 67.07% HumanEval.
| Flag | Why |
|---|---|
| `--language-model-only` | Skips vision encoder (4304 dim not Marlin-compatible) |
| `--enforce-eager` | Recommended for stability |
```bibtex
@article{vicentino2026polarquant,
  title={PolarQuant: Hadamard-Rotated Post-Training Quantization},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2026}
}
```
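The Hadamard rotation referenced in the paper title spreads outlier energy across coordinates before quantization: an orthonormal Hadamard transform preserves the vector exactly (it is its own inverse) but flattens the amplitude distribution, which suits a uniform quantization grid. A pure-Python sketch of the fast Walsh-Hadamard transform (normalization and toy dimensions are illustrative assumptions, not the paper's exact procedure):

```python
import math

def hadamard_transform(x):
    """In-place fast Walsh-Hadamard transform; len(x) must be a power of two.
    Scaled by 1/sqrt(n) so the transform is orthonormal and self-inverse."""
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    inv = 1.0 / math.sqrt(n)
    for i in range(n):
        x[i] *= inv
    return x

# A single outlier gets spread evenly across all coordinates...
v = [8.0, 0.0, 0.0, 0.0]
rotated = hadamard_transform(v[:])
print(rotated)  # [4.0, 4.0, 4.0, 4.0]
# ...and the transform is self-inverse, so no information is lost.
print(hadamard_transform(rotated[:]))  # [8.0, 0.0, 0.0, 0.0]
```

After rotation, the per-group maxima that set the quantization scales are much smaller, so the uniform INT grid wastes fewer levels on rare outliers.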
Base model: Qwen/Qwen3.5-9B-Base