🧊 Qwopus3.5-9B-v3 — GPTQ Calibrated INT4

A 9B hybrid model (Qwen3.5 architecture) quantized to INT4 with GPTQ calibration. Loads natively in vLLM with the Marlin kernel; 113 tok/s on an RTX 3090.

πŸ† NEW: PolarQuant v7 β€” INT4 that BEATS BF16

We found the best-performing config: `group_size=64` + FOEM reaches 67.07% HumanEval (vs. 66.87% for BF16).

👉 Download PolarQuant v7 (gs64+FOEM) — same Marlin kernel, 8.7 GB

| Method | HumanEval | Size | Kernel |
|---|---|---|---|
| PolarQuant v7 (gs64+FOEM) | 67.07% | 8.7 GB | Marlin |
| BF16 Base | 66.87% | 19.3 GB | — |
| FOEM INT4 gs128 (Arien0) | 62.80% | 8.6 GB | Marlin |
| This model (GPTQ gs128) | 60.98% | 8.6 GB | Marlin |
| Naive INT4 (old) | 55.49% | 6.5 GB | Marlin |

📊 This Model's Benchmarks


| Metric | GPTQ INT4 | BF16 Original | Δ vs BF16 |
|---|---|---|---|
| HumanEval | 60.98% | 66.87% | -5.9pp (calibrated) |
| Speed | 113 tok/s | ~40 tok/s | 2.8x faster |
| Size | 8.6 GB | 18 GB | 2.1x smaller |
| WikiText-2 PPL | 6.56 | 6.37 | +0.19 |

The earlier naive INT4 quant scored 55.49%; GPTQ calibration improved HumanEval by +5.5pp.
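For context on the WikiText-2 row: perplexity is the exponential of the mean per-token cross-entropy, so the +0.19 PPL gap corresponds to a very small per-token loss increase. A minimal illustration (the `perplexity` helper is ours, not part of any evaluation harness):

```python
import math

def perplexity(nlls):
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    return math.exp(sum(nlls) / len(nlls))

# The reported PPLs imply these per-token cross-entropy losses (in nats):
loss_bf16 = math.log(6.37)  # ~1.85 nats/token
loss_int4 = math.log(6.56)  # ~1.88 nats/token
print(loss_int4 - loss_bf16)  # ~0.03 nats/token degradation
```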


🚀 Quick Start

```bash
pip install vllm

vllm serve caiovicentino1/Qwopus3.5-9B-v3-PolarQuant-Q5 \
  --language-model-only \
  --enforce-eager
```

No plugins, no custom code. Just vLLM.

Python API

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="caiovicentino1/Qwopus3.5-9B-v3-PolarQuant-Q5",
    trust_remote_code=True,
    enforce_eager=True,
)

# generate() takes a list of prompts and returns one RequestOutput per prompt
outputs = llm.generate(
    ["Write a Python function for binary search."],
    SamplingParams(max_tokens=256, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```

📈 HumanEval Evolution

| # | Method | HumanEval | Notes |
|---|---|---|---|
| 1 | Naive INT4 (RTN) | 55.49% | Round-to-nearest, no calibration |
| 2 | This model (GPTQ gs128) | 60.98% | Calibrated, desc_act=True |
| 3 | FOEM gs128 | 61.59% | + FOEM error correction |
| 4 | FOEM gs128 (Arien0) | 62.80% | Different calibration data |
| 5 | BF16 Base | 66.87% | Original unquantized |
| 6 | PolarQuant v7 gs64+FOEM | 67.07% | BEATS BF16! |
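The scores above are pass@1 style HumanEval numbers. As background (not specific to this card), the standard unbiased pass@k estimator from the original HumanEval paper averages 1 - C(n-c, k)/C(n, k) over problems, where n generations are sampled per problem and c of them pass the tests. A self-contained sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn from n generations (c of which are correct) passes."""
    if n - c < k:
        return 1.0  # too few failures to fill k slots: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the fraction of correct samples:
print(pass_at_k(n=10, c=6, k=1))  # 0.6
```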

Speed (RTX 3090, 24 GB)

Confirmed by @Arien0:

| Metric | Value |
|---|---|
| Throughput | 113 tok/s |
| Kernel | Marlin (`gptq_marlin`) |
| VRAM | ~8 GB |

🔧 Architecture

| Property | Value |
|---|---|
| Base Model | Jackrong/Qwopus3.5-9B-v3 |
| Architecture | Qwen3.5 — hybrid (linear attention + full attention) |
| Parameters | 9B |
| Layers | 32 (24 linear attention + 8 full attention) |
| Hidden Size | 4096 |

🔬 Quantization Details

| Property | Value |
|---|---|
| Method | GPTQ (calibrated) |
| Tool | GPTQModel v6.0.3 |
| Bits | 4 |
| Group Size | 128 |
| Symmetric | Yes |
| desc_act | True (activation order) |
| Calibration | 512 samples from neuralmagic/LLM_compression_calibration |
| Format | GPTQ (native vLLM Marlin kernel) |
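To illustrate what 4-bit symmetric group-wise quantization means (bits=4, group_size=128, symmetric, as in the table above), here is a round-to-nearest sketch in NumPy. This shows only the storage format; GPTQ additionally uses the calibration data and Hessian-based error compensation when choosing the rounded values, which this sketch deliberately omits. All function names here are ours, for illustration only:

```python
import numpy as np

def quantize_rtn_int4(w: np.ndarray, group_size: int = 128):
    """Symmetric INT4 round-to-nearest, one scale per group of weights."""
    w = w.reshape(-1, group_size)
    # Symmetric: scale maps the per-group max magnitude onto the INT4 range.
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4096,)).astype(np.float32)
q, scale = quantize_rtn_int4(w)
w_hat = dequantize(q, scale).reshape(-1)
# Round-trip error is bounded by half a quantization step per group.
print(float(np.abs(w - w_hat).max()))
```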

💡 Want better quality? Use PolarQuant v7 with gs64+FOEM for 67.07% HumanEval.


βš™οΈ Key Flags

| Flag | Why |
|---|---|
| `--language-model-only` | Skips the vision encoder (its 4304-dim weights are not Marlin-compatible) |
| `--enforce-eager` | Recommended for stability |

📖 Citation

```bibtex
@article{vicentino2026polarquant,
  title={PolarQuant: Hadamard-Rotated Post-Training Quantization},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2026}
}
```
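The "Hadamard-Rotated" part of the title refers to a family of techniques in which weights are multiplied by an orthogonal Hadamard matrix before quantization, spreading outlier mass across coordinates so per-group scales shrink. The exact PolarQuant procedure is in the paper; the basic rotation idea can be sketched as follows (the `hadamard_matrix` helper and the toy outlier are ours, for illustration):

```python
import numpy as np

def hadamard_matrix(n: int) -> np.ndarray:
    """Sylvester construction of an orthonormal n x n Hadamard matrix
    (n must be a power of two)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)  # orthonormal: h @ h.T == I

rng = np.random.default_rng(0)
w = rng.normal(size=(64,))
w[3] = 25.0  # inject an outlier that would dominate a per-group scale

h = hadamard_matrix(64)
w_rot = h @ w  # rotation is invertible: w == h.T @ w_rot

# The outlier's mass is now spread over all 64 coordinates,
# so the max magnitude (and hence the quantization step) shrinks.
print(float(np.abs(w).max()), float(np.abs(w_rot).max()))
```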

πŸ™ Acknowledgements

  • Arien0 for independent benchmarking and HumanEval testing
  • GPTQModel team for FOEM implementation
  • vLLM team for Marlin kernel support
