# GLM-4.5-Air-Derestricted ExLlamaV3 Quantizations

ExLlamaV3 quantizations of GLM-4.5-Air-Derestricted with tensor-level (L3) optimization and boosted attention layers.

Using the provided measurement.json file and base quants, additional optimized quantizations can be made in seconds at any reasonable bpw. All work done with ExLlamaV3 v0.0.18.
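
For example, the measurement file and a base quant can be fetched from this repo with `huggingface_hub` before re-optimizing locally. A minimal sketch, assuming `measurement.json` sits at the repo root on the default branch (adjust the filename/revision if it lives elsewhere); the branch name and local path are illustrative:

```python
# Sketch: download measurement.json plus one base quant branch from this repo.
from huggingface_hub import hf_hub_download, snapshot_download

repo = "amanwalksdownthestreet/GLM-4.5-Air-Derestricted-exl3"

# measurement.json drives the optimizer; assumed to be at the repo root.
measurement = hf_hub_download(repo_id=repo, filename="measurement.json")

# A base quant to use as the starting point for the optimized recompile.
base_dir = snapshot_download(
    repo_id=repo,
    revision="4.0bpw-h6",                 # branch name from the Base table below
    local_dir="GLM-4.5-Air-4.0bpw-h6",    # arbitrary local path
)

print(measurement)
print(base_dir)
```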

## Optimized

VRAM-targeted quants using exl3's measure.py → optimize.py → recompile.py pipeline with attention boost.

| Branch | Size | bpw | Target (VRAM @ context) |
|---|---|---|---|
| 3.15bpw-h6-opt | 41 GB | 3.15 | 48 GB @ 128k |
| 4.37bpw-h6-opt | 56 GB | 4.37 | 64 GB @ 128k |
| 5.00bpw-h6-opt | 64 GB | 5.00 | 72 GB @ 128k |
| 6.33bpw-h6-opt | 80 GB | 6.33 | 96 GB @ 128k |

Note: The 6.33bpw quant hit the optimization ceiling. Targeting 6.94bpw produced a 6.18bpw pre-boost output (6.33bpw after the attention boost), indicating that all beneficial tensor swaps had been exhausted.
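
As a rough sanity check on the Size column, on-disk size scales roughly linearly with bpw: bytes ≈ total parameters × bpw / 8. A minimal sketch; the ~106B total-parameter figure for GLM-4.5-Air is an assumption, and the h6 head plus unquantized tensors make the real files deviate by a few GB:

```python
# Rough size estimate: bytes ~= total_params * bpw / 8.
# ~106e9 total parameters is an assumed figure for GLM-4.5-Air; embeddings,
# the 6-bit head, and other unquantized tensors shift the real sizes slightly.
TOTAL_PARAMS = 106e9

def estimated_size_gb(bpw: float, params: float = TOTAL_PARAMS) -> float:
    """Approximate on-disk size in decimal GB for a given bits per weight."""
    return params * bpw / 8 / 1e9

for bpw in (3.15, 4.37, 5.00, 6.33):
    print(f"{bpw:.2f} bpw -> ~{estimated_size_gb(bpw):.0f} GB")
```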

## Base

| Branch | Size | bpw |
|---|---|---|
| 2.0bpw-h6 | 27 GB | 2.0 |
| 3.0bpw-h6 | 39 GB | 3.0 |
| 4.0bpw-h6 | 52 GB | 4.0 |
| 5.0bpw-h6 | 64 GB | 5.0 |
| 6.0bpw-h6 | 76 GB | 6.0 |
| 7.0bpw-h6 | 88 GB | 7.0 |
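
To pick a branch programmatically, the Target column of the optimized table can be treated as a simple lookup. A sketch that just restates that table; the selection helper and example budgets are illustrative, not part of the repo:

```python
# Sketch: choose the largest optimized branch whose 128k-context VRAM target
# fits a given budget. The dictionary restates the optimized table above.
OPTIMIZED_TARGETS_GB = {
    "3.15bpw-h6-opt": 48,
    "4.37bpw-h6-opt": 64,
    "5.00bpw-h6-opt": 72,
    "6.33bpw-h6-opt": 96,
}

def pick_optimized_branch(vram_gb: float) -> str | None:
    """Return the largest optimized branch that fits the VRAM budget, or None."""
    fitting = {b: t for b, t in OPTIMIZED_TARGETS_GB.items() if t <= vram_gb}
    return max(fitting, key=fitting.get) if fitting else None

print(pick_optimized_branch(72))  # -> "5.00bpw-h6-opt"
print(pick_optimized_branch(40))  # -> None (fall back to a smaller base quant)
```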