# GLM-4.5-Air-Derestricted ExLlamaV3 Quantizations
ExLlamaV3 quantizations of GLM-4.5-Air-Derestricted with tensor-level (L3) optimization and boosted attention layers.
Using the provided measurement.json file and the base quants, additional optimized quantizations can be produced in seconds at any reasonable bpw. All work was done with ExLlamaV3 v0.0.18.
## Optimized
VRAM-targeted quants using exl3's measure.py → optimize.py → recompile.py pipeline with attention boost.
| Branch | Size | bpw | Target (VRAM @ context) |
|---|---|---|---|
| 3.15bpw-h6-opt | 41 GB | 3.15 | 48GB @ 128k |
| 4.37bpw-h6-opt | 56 GB | 4.37 | 64GB @ 128k |
| 5.00bpw-h6-opt | 64 GB | 5.00 | 72GB @ 128k |
| 6.33bpw-h6-opt | 80 GB | 6.33 | 96GB @ 128k |
Note: The 6.33bpw quant hit the optimization ceiling: targeting 6.94bpw produced a 6.18bpw result pre-boost (6.33 after attention boost), indicating that all beneficial tensor swaps had been exhausted.
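As a rough sanity check on the sizes in the table, quantized file size scales roughly linearly with bpw: about params × bpw / 8 bytes. The sketch below is a back-of-the-envelope estimate only; the ~106B parameter count assumed for GLM-4.5-Air and the formula itself are illustrative assumptions, and real files differ slightly due to embeddings, the h6 output head, and metadata.

```python
# Rough quant-size estimate: size_bytes ~= n_params * bpw / 8.
# The 106e9 parameter count is an assumption for illustration;
# actual files include extra overhead (embeddings, h6 head, metadata).

def est_size_gb(n_params: float, bpw: float) -> float:
    """Estimate quantized file size in decimal GB."""
    return n_params * bpw / 8 / 1e9

for bpw in (3.15, 4.37, 5.00, 6.33):
    print(f"{bpw:.2f} bpw -> ~{est_size_gb(106e9, bpw):.0f} GB")
```

The estimates land within a few GB of the table, which is expected since the optimizer trades bits between tensors rather than applying a uniform width.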