update: model card with GPTQ benchmarks, HumanEval 60.98%, charts
README.md
---
license: apache-2.0
base_model: Jackrong/Qwopus3.5-9B-v3
tags:
- polarquant
- gptq
- int4
- qwen3.5
- vllm
- marlin
pipeline_tag: text-generation
model-index:
- name: Qwopus3.5-9B-v3-PolarQuant-Q5
  results:
  - task:
      type: text-generation
    dataset:
      name: HumanEval
      type: openai_humaneval
    metrics:
    - name: pass@1
      type: pass@1
      value: 60.98
  - task:
      type: text-generation
    dataset:
      name: WikiText-2
      type: wikitext
    metrics:
    - name: Perplexity
      type: perplexity
      value: 6.56
---

# Qwopus3.5-9B-v3 — GPTQ Calibrated INT4

> **9B hybrid model (Qwen3.5 architecture) quantized to INT4** with GPTQ calibration. Loads natively in vLLM with the Marlin kernel; 113 tok/s on an RTX 3090.



| Metric | GPTQ INT4 | BF16 Original | Change |
|--------|-----------|---------------|--------|
| **HumanEval** | **60.98%** | 66.87% | -5.9 pp (calibrated) |
| **Speed** | **113 tok/s** | ~40 tok/s | **2.8x faster** |
| **Size** | **8.6 GB** | 18 GB | **2.1x smaller** |

The previous naive INT4 release scored 55.49%; GPTQ calibration improves on it by **+5.5 pp**.

---

## Quick Start

```bash
pip install vllm

vllm serve caiovicentino1/Qwopus3.5-9B-v3-PolarQuant-Q5 \
  --language-model-only \
  --enforce-eager
```

No plugins, no custom code. Just vLLM.
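
Once the server is up it speaks vLLM's standard OpenAI-compatible HTTP API (port 8000 by default), so any OpenAI-style client works. A minimal sketch with `requests`; the prompt is illustrative:

```python
import requests

# vLLM's OpenAI-compatible server exposes /v1/completions on port 8000.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "caiovicentino1/Qwopus3.5-9B-v3-PolarQuant-Q5",
        "prompt": "Write a haiku about INT4 quantization.",
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["text"])
```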

### Python API

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="caiovicentino1/Qwopus3.5-9B-v3-PolarQuant-Q5",
    trust_remote_code=True,
    enforce_eager=True,
)

output = llm.generate(
    ["Write a Python function for binary search."],
    SamplingParams(max_tokens=256, temperature=0.7),
)
print(output[0].outputs[0].text)
```

---

## Benchmarks

### HumanEval (Pass@1)

| Model | Pass@1 | Method |
|-------|--------|--------|
| BF16 Original | 66.87% | No quantization |
| FOEM INT4 | 62.80% | Fine-tuned Error Minimization |
| **GPTQ INT4 (ours)** | **60.98%** | GPTQ calibrated, desc_act=True |
| Naive INT4 (old) | 55.49% | Round-to-nearest, no calibration |
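
Pass@1 on HumanEval is conventionally computed with the unbiased estimator from Chen et al. (2021); the card does not name the evaluation harness, so the snippet below is a sketch of the metric itself, not the exact pipeline used here:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: 1 - C(n - c, k) / C(n, k), where n is the
    # number of samples per problem and c the number that pass.
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# With a single sample per problem, pass@1 reduces to the fraction of
# problems whose one completion passes all unit tests.
per_problem_passes = [1, 0, 1, 1]  # hypothetical pass/fail outcomes
score = sum(pass_at_k(1, c, 1) for c in per_problem_passes) / 4
print(f"pass@1 = {score:.2%}")  # pass@1 = 75.00%
```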

### Speed (RTX 3090, 24 GB)

Confirmed by [@Arien0](https://huggingface.co/caiovicentino1/Qwopus3.5-9B-v3-PolarQuant-Q5/discussions/1):

| Metric | Value |
|--------|-------|
| **Throughput** | **113 tok/s** |
| Kernel | Marlin (gptq_marlin) |
| VRAM | ~8 GB |
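
To sanity-check throughput on your own hardware, a single-request measurement along these lines is enough (an illustrative sketch, not the benchmark script behind the number above):

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="caiovicentino1/Qwopus3.5-9B-v3-PolarQuant-Q5",
    trust_remote_code=True,
    enforce_eager=True,
)
# Fixed-length decode (ignore_eos) so tok/s is comparable across runs.
params = SamplingParams(max_tokens=256, temperature=0.0, ignore_eos=True)

start = time.perf_counter()
out = llm.generate(["Explain the Marlin kernel in one paragraph."], params)
elapsed = time.perf_counter() - start

n_tokens = len(out[0].outputs[0].token_ids)
print(f"{n_tokens / elapsed:.1f} tok/s (single request)")
```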

---

## Architecture

| Property | Value |
|----------|-------|
| **Base Model** | [Jackrong/Qwopus3.5-9B-v3](https://huggingface.co/Jackrong/Qwopus3.5-9B-v3) |
| **Architecture** | Qwen3.5 — hybrid (linear attention + full attention) |
| **Parameters** | 9B |
| **Layers** | 32 (24 linear attention + 8 full attention) |
| **Hidden Size** | 4096 |
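
The table can be cross-checked against the repo's `config.json`. The standard `transformers` attributes below should be present; the field that encodes the linear/full-attention layout is model-specific, so inspect the config for it (an assumption, not a documented name):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "caiovicentino1/Qwopus3.5-9B-v3-PolarQuant-Q5",
    trust_remote_code=True,
)
print(cfg.hidden_size)        # expected: 4096
print(cfg.num_hidden_layers)  # expected: 32
```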

---

## Quantization Details

| Property | Value |
|----------|-------|
| **Method** | GPTQ (calibrated) |
| **Tool** | [GPTQModel v6.0.3](https://github.com/ModelCloud/GPTQModel) |
| **Bits** | 4 |
| **Group Size** | 128 |
| **Symmetric** | Yes |
| **desc_act** | True (activation order) |
| **Calibration** | 512 samples from [neuralmagic/LLM_compression_calibration](https://huggingface.co/datasets/neuralmagic/LLM_compression_calibration) |
| **Format** | GPTQ (native vLLM Marlin kernel) |
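
A reproduction sketch with the GPTQModel API under the settings above. The exact quantization script for this checkpoint is not published, so treat the dataset column name (`text`), the sample selection, and the list-of-strings calibration format as assumptions:

```python
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# 512 calibration samples, matching the table above.
calib = load_dataset(
    "neuralmagic/LLM_compression_calibration", split="train"
).shuffle(seed=0).select(range(512))["text"]

config = QuantizeConfig(bits=4, group_size=128, sym=True, desc_act=True)

model = GPTQModel.load("Jackrong/Qwopus3.5-9B-v3", config)
model.quantize(calib)  # layer-by-layer GPTQ over the calibration set
model.save("Qwopus3.5-9B-v3-GPTQ-INT4")
```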

---

## Key Flags

| Flag | Why |
|------|-----|
| `--language-model-only` | Skips the vision encoder (its 4304 hidden dim is not Marlin-compatible) |
| `--enforce-eager` | Disables CUDA graph capture; recommended for stability |

---

## Links

- **Paper**: [PolarQuant — Hadamard-Rotated Post-Training Quantization](https://arxiv.org/abs/2603.29078)
- **GitHub**: [polarengine-vllm](https://github.com/caiovicentino/polarengine-vllm)
- **Expert Offloading**: [vllm-expert-offload](https://github.com/caiovicentino/vllm-expert-offload) — LFRU cache for consumer GPUs

## Citation

```bibtex
@article{vicentino2026polarquant,
  title={PolarQuant: Hadamard-Rotated Post-Training Quantization},
  author={Vicentino, Caio},
  journal={arXiv preprint arXiv:2603.29078},
  year={2026}
}
```