caiovicentino1 committed on
Commit 90efc7c · verified · 1 parent: 67ac667

update: model card with GPTQ benchmarks, HumanEval 60.98%, charts

Files changed (1): README.md (+116 −61)
README.md CHANGED
@@ -1,102 +1,157 @@
  ---
  license: apache-2.0
  base_model: Jackrong/Qwopus3.5-9B-v3
- language:
- - en
- - zh
- - ko
- - ja
  tags:
  - polarquant
- - quantized
- - compressed-tensors
  - int4
- - marlin
  - vllm
  pipeline_tag: text-generation
- arxiv: "2603.29078"
- library_name: transformers
  ---

- # Qwopus3.5-9B-v3 — PolarQuant INT4
-
- **Native vLLM. Marlin kernel. Zero plugin.**
-
- PolarQuant Q5 preprocessing produces **better INT4 weights** than direct quantization stored in CompressedTensors format for native vLLM inference.
-
- ## Quick Start — vLLM (one command)
-
  ```bash
  pip install vllm
- vllm serve caiovicentino1/Qwopus3.5-9B-v3-PolarQuant-Q5 --language-model-only --enforce-eager
  ```

- That's it. No plugin, no `pip install polarquant`, no custom code.
-
- **Tested results:**
-
- | GPU | tok/s |
- |-----|-------|
- | A100 80GB | **168 tok/s** (9B) |
- | RTX PRO 6000 96GB | **44 tok/s** (9B) / **18 tok/s** (27B) |
-
- ## Quick Start — HuggingFace Transformers
-
- ```bash
- pip install polarquant
- ```
-
- ```python
- import polarengine_vllm  # auto-registers with transformers
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- model = AutoModelForCausalLM.from_pretrained("caiovicentino1/Qwopus3.5-9B-v3-PolarQuant-Q5", device_map="auto", trust_remote_code=True)
- tokenizer = AutoTokenizer.from_pretrained("caiovicentino1/Qwopus3.5-9B-v3-PolarQuant-Q5", trust_remote_code=True)
-
- inputs = tokenizer("Hello!", return_tensors="pt").to("cuda")
- out = model.generate(**inputs, max_new_tokens=100)
- print(tokenizer.decode(out[0], skip_special_tokens=True))
- ```
-
- ## Consumer GPU Compatibility
-
- | GPU | VRAM | Works? | Expected tok/s |
- |-----|------|--------|----------------|
- | RTX 4060 | 8 GB | YES | ~20 |
- | RTX 3060/4070 | 12 GB | YES | ~30 |
- | RTX 4080 | 16 GB | YES | ~35 |
- | RTX 4090 | 24 GB | YES | ~40 |
- | A100 | 80 GB | YES | ~168 |
-
- ## Why PolarQuant INT4 is Better
-
- Standard INT4 (GPTQ/AWQ) quantizes weights directly — outliers cause errors.
-
- PolarQuant adds a **preprocessing step**:
-
- 1. **Hadamard rotation** — distributes weight energy uniformly (eliminates outliers)
- 2. **Lloyd-Max Q5** — MSE-optimal quantization for the resulting Gaussian distribution
- 3. **Dequant → INT4** — the cleaned weights produce better INT4 than direct quantization
-
- | Method | PPL (lower = better) |
- |--------|----------------------|
- | BF16 baseline | 6.37 |
- | **PolarQuant → INT4** | **6.56** |
- | Direct INT4 | 6.68 |
-
- **Same speed as GPTQ/AWQ, better quality.**
-
- ## Important Flags
-
  | Flag | Why |
  |------|-----|
- | `--language-model-only` | Qwen3.5 is multimodal — this skips the vision encoder (we only quantized text) |
- | `--enforce-eager` | Required on Blackwell GPUs (cc 12.0). Optional on A100/H100 (faster without it) |

  ## Links

- - Paper: [arxiv.org/abs/2603.29078](https://arxiv.org/abs/2603.29078)
- - GitHub: [github.com/caiovicentino/polarengine-vllm](https://github.com/caiovicentino/polarengine-vllm)
- - PyPI: `pip install polarquant`
- - Base model: [Jackrong/Qwopus3.5-9B-v3](https://huggingface.co/Jackrong/Qwopus3.5-9B-v3)
  ---
  license: apache-2.0
  base_model: Jackrong/Qwopus3.5-9B-v3
  tags:
  - polarquant
+ - gptq
  - int4
+ - qwen3.5
  - vllm
+ - marlin
  pipeline_tag: text-generation
+ model-index:
+ - name: Qwopus3.5-9B-v3-PolarQuant-Q5
+   results:
+   - task:
+       type: text-generation
+     dataset:
+       name: HumanEval
+       type: openai_humaneval
+     metrics:
+     - name: pass@1
+       type: pass@1
+       value: 60.98
+   - task:
+       type: text-generation
+     dataset:
+       name: WikiText-2
+       type: wikitext
+     metrics:
+     - name: Perplexity
+       type: perplexity
+       value: 6.56
  ---

+ # Qwopus3.5-9B-v3 — GPTQ Calibrated INT4
+
+ > **9B hybrid model (Qwen3.5 architecture) quantized to INT4** with GPTQ calibration. Loads natively in vLLM with the Marlin kernel. 113 tok/s on an RTX 3090.

+ ![Benchmarks](benchmarks.png)

+ | Metric | GPTQ INT4 | BF16 Original | Δ vs BF16 |
+ |--------|-----------|---------------|-----------|
+ | **HumanEval** | **60.98%** | 66.87% | −5.9pp (with calibration) |
+ | **Speed** | **113 tok/s** | ~40 tok/s | **2.8x faster** |
+ | **Size** | **8.6 GB** | 18 GB | **2.1x smaller** |
+
+ Naive INT4 previously scored 55.49%; GPTQ calibration improved pass@1 by **+5.5pp**.
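The deltas and ratios above follow directly from the raw numbers in the card (the ~40 tok/s BF16 baseline is the card's own figure); a quick recomputation:

```python
# Recompute the headline deltas/ratios from the raw numbers in the card.
bf16_pass, gptq_pass, naive_pass = 66.87, 60.98, 55.49

print(round(bf16_pass - gptq_pass, 2))   # 5.89 pp quality gap vs. BF16
print(round(gptq_pass - naive_pass, 2))  # 5.49 pp gained by GPTQ calibration
print(round(113 / 40, 1))                # 2.8x faster than BF16 (~40 tok/s)
print(round(18 / 8.6, 1))                # 2.1x smaller (18 GB -> 8.6 GB)
```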
+
+ ---

+ ## Quick Start

  ```bash
  pip install vllm
+
+ vllm serve caiovicentino1/Qwopus3.5-9B-v3-PolarQuant-Q5 \
+   --language-model-only \
+   --enforce-eager
  ```

+ No plugins, no custom code. Just vLLM.
+ ### Python API

+ ```python
+ from vllm import LLM, SamplingParams
+
+ llm = LLM(
+     model="caiovicentino1/Qwopus3.5-9B-v3-PolarQuant-Q5",
+     trust_remote_code=True,
+     enforce_eager=True,
+ )
+
+ output = llm.generate(
+     ["Write a Python function for binary search."],
+     SamplingParams(max_tokens=256, temperature=0.7),
+ )
+ print(output[0].outputs[0].text)
+ ```
+ ---

+ ## Benchmarks

+ ### HumanEval (Pass@1)

+ | Model | Pass@1 | Method |
+ |-------|--------|--------|
+ | BF16 Original | 66.87% | No quantization |
+ | FOEM INT4 | 62.80% | Fine-tuned Error Minimization |
+ | **GPTQ INT4 (ours)** | **60.98%** | GPTQ calibrated, desc_act=True |
+ | Naive INT4 (old) | 55.49% | Round-to-nearest, no calibration |
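Pass@1 is the standard HumanEval metric. A minimal sketch of the usual unbiased pass@k estimator; note the 100-of-164 decomposition below is inferred from the reported score, not stated in this card:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    for n samples per problem of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single sample per problem, pass@1 reduces to the plain solve rate.
# 60.98% is consistent with solving 100 of HumanEval's 164 problems:
print(round(100 / 164 * 100, 2))            # 60.98
# With multiple samples: 10 drawn, 4 correct -> pass@1 = 0.4.
print(round(pass_at_k(n=10, c=4, k=1), 2))  # 0.4
```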

+ ### Speed (RTX 3090, 24 GB)

+ Confirmed by [@Arien0](https://huggingface.co/caiovicentino1/Qwopus3.5-9B-v3-PolarQuant-Q5/discussions/1):

+ | Metric | Value |
+ |--------|-------|
+ | **Throughput** | **113 tok/s** |
+ | Kernel | Marlin (gptq_marlin) |
+ | VRAM | ~8 GB |

+ ---

+ ## Architecture

+ | Property | Value |
+ |----------|-------|
+ | **Base Model** | [Jackrong/Qwopus3.5-9B-v3](https://huggingface.co/Jackrong/Qwopus3.5-9B-v3) |
+ | **Architecture** | Qwen3.5 hybrid (linear attention + full attention) |
+ | **Parameters** | 9B |
+ | **Layers** | 32 (24 linear attention + 8 full attention) |
+ | **Hidden Size** | 4096 |

+ ---
+ ## Quantization Details

+ | Property | Value |
+ |----------|-------|
+ | **Method** | GPTQ (calibrated) |
+ | **Tool** | [GPTQModel v6.0.3](https://github.com/ModelCloud/GPTQModel) |
+ | **Bits** | 4 |
+ | **Group Size** | 128 |
+ | **Symmetric** | Yes |
+ | **desc_act** | True (activation order) |
+ | **Calibration** | 512 samples from [neuralmagic/LLM_compression_calibration](https://huggingface.co/datasets/neuralmagic/LLM_compression_calibration) |
+ | **Format** | GPTQ (native vLLM Marlin kernel) |
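The symmetric, group-wise INT4 scheme in the table (group size 128, one scale per group) can be sketched as a toy in pure Python. This is round-to-nearest only; real GPTQ additionally optimizes rounding against calibration activations, and desc_act=True reorders columns by activation magnitude, both omitted here.

```python
# Toy sketch of symmetric, group-wise INT4 quantization (NOT GPTQ itself:
# no calibration, no activation ordering; round-to-nearest only).
def quantize_group(group: list[float]) -> tuple[list[int], float]:
    """Map one weight group to signed INT4 [-8, 7] with a single scale."""
    scale = max(abs(w) for w in group) / 7 or 1.0  # symmetric: scale from max |w|
    q = [max(-8, min(7, round(w / scale))) for w in group]
    return q, scale

def quantize(weights: list[float], group_size: int = 128):
    """Split weights into groups of `group_size` and quantize each."""
    return [quantize_group(weights[i:i + group_size])
            for i in range(0, len(weights), group_size)]

def dequantize(groups) -> list[float]:
    """Reconstruct approximate weights from (ints, scale) pairs."""
    return [q * scale for qs, scale in groups for q in qs]

qs = quantize([7.0, -3.0, 1.0, 0.25], group_size=4)
print(qs)              # [([7, -3, 1, 0], 1.0)]
print(dequantize(qs))  # [7.0, -3.0, 1.0, 0.0] -- 0.25 rounds away: quantization error
```

Per-group scales are what make small group sizes (128 here) track local weight magnitudes; the Marlin kernel fuses the dequantize step into the matmul at inference time.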

+ ---

+ ## Key Flags

  | Flag | Why |
  |------|-----|
+ | `--language-model-only` | Skips the vision encoder (its 4304-dim weights are not Marlin-compatible) |
+ | `--enforce-eager` | Recommended for stability |
+
+ ---

  ## Links

+ - **Paper**: [PolarQuant: Hadamard-Rotated Post-Training Quantization](https://arxiv.org/abs/2603.29078)
+ - **GitHub**: [polarengine-vllm](https://github.com/caiovicentino/polarengine-vllm)
+ - **Expert Offloading**: [vllm-expert-offload](https://github.com/caiovicentino/vllm-expert-offload), an LFRU cache for consumer GPUs
+
+ ## Citation
+
+ ```bibtex
+ @article{vicentino2026polarquant,
+   title={PolarQuant: Hadamard-Rotated Post-Training Quantization},
+   author={Vicentino, Caio},
+   journal={arXiv preprint arXiv:2603.29078},
+   year={2026}
+ }
+ ```