# GLM-4.5-Air-Derestricted ExLlamaV3 Quantizations
ExLlamaV3 quantizations of GLM-4.5-Air-Derestricted with tensor-level (L3) optimization and boosted attention layers.
Using the provided measurement.json file and the base quants, additional optimized quantizations can be produced in seconds at any reasonable bpw. All work was done with ExLlamaV3 v0.0.18.
## Optimized
VRAM-targeted quants using exl3's measure.py → optimize.py → recompile.py pipeline with attention boost.
| Branch | Size | bpw | Target (VRAM @ context) |
|---|---|---|---|
| 3.15bpw-h6-opt | 41 GB | 3.15 | 48GB @ 128k |
| 4.37bpw-h6-opt | 56 GB | 4.37 | 64GB @ 128k |
| 5.00bpw-h6-opt | 64 GB | 5.00 | 72GB @ 128k |
| 6.33bpw-h6-opt | 80 GB | 6.33 | 96GB @ 128k |
Note: The 6.33bpw quant hit the optimization ceiling: targeting 6.94bpw produced a 6.18bpw result pre-boost (6.33 after attention boost), indicating that all beneficial tensor swaps had been exhausted.
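As a rough sanity check on the sizes in the table, quantized file size scales roughly linearly with bpw: about params × bpw / 8 bytes. The sketch below is a back-of-the-envelope estimate only; the ~106B parameter count assumed for GLM-4.5-Air and the formula itself are illustrative assumptions, and real files differ slightly due to embeddings, the h6 output head, and metadata.

```python
# Rough quant-size estimate: size_bytes ~= n_params * bpw / 8.
# The 106e9 parameter count is an assumption for illustration;
# actual files include extra overhead (embeddings, h6 head, metadata).

def est_size_gb(n_params: float, bpw: float) -> float:
    """Estimate quantized file size in decimal GB."""
    return n_params * bpw / 8 / 1e9

for bpw in (3.15, 4.37, 5.00, 6.33):
    print(f"{bpw:.2f} bpw -> ~{est_size_gb(106e9, bpw):.0f} GB")
```

The estimates land within a few GB of the table, which is expected since the optimizer trades bits between tensors rather than applying a uniform width.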