# gemma-4-e4b-it-OptiQ-4bit

*Mixed-precision quantization with OptiQ: sensitivity-driven quantization for Apple Silicon.*
This is a mixed-precision quantized version of google/gemma-4-e4b-it in MLX format. OptiQ measures each layer's sensitivity via KL divergence and assigns optimal per-layer bit-widths, preserving model quality where it matters most.
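The idea can be sketched in a few lines: quantize each layer at the low bit-width, measure how far its output distribution drifts from full precision (KL divergence), and promote the most sensitive layers to the higher bit-width until the budget is spent. This is only an illustration of the principle, not OptiQ's actual implementation; the toy layers, inputs, and budget below are all made up.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax, used to turn layer outputs into distributions.
    e = np.exp(x - x.max())
    return e / e.sum()

def kl_divergence(p, q, eps=1e-9):
    # KL(p || q) between two probability vectors, clipped to avoid log(0).
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def fake_quantize(w, bits):
    # Symmetric round-to-nearest quantization of a weight vector.
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=16)                                  # shared toy input
layers = {f"layer{i}": rng.normal(size=16) for i in range(6)}

# Sensitivity: divergence between full-precision and 4-bit layer outputs.
sensitivity = {
    name: kl_divergence(softmax(w * x), softmax(fake_quantize(w, 4) * x))
    for name, w in layers.items()
}

# Promote the most sensitive layers to 8-bit until the budget is exhausted.
budget_8bit = 2                                          # hypothetical budget
ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
bits = {name: (8 if name in ranked[:budget_8bit] else 4) for name in layers}
print(bits)
```

In the real tool the budget is expressed as a target average BPW rather than a layer count, and sensitivity is measured on calibration data rather than random vectors.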
## Quantization Details
| Property | Value |
|---|---|
| Target BPW | 4.5 |
| Achieved BPW | 4.50 |
| Candidate bits | 4, 8 |
| Layers at 4-bit | 444 |
| Layers at 8-bit | 149 |
| Total quantized layers | 593 |
| Group size | 64 |
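Note that the achieved BPW is a parameter-weighted average, not a per-layer average: a naive average over 444 4-bit and 149 8-bit layers would give (444·4 + 149·8) / 593 ≈ 5.0 bits, so hitting 4.50 implies the 8-bit layers are smaller on average. A minimal sketch of the weighted calculation, with hypothetical layer sizes (the card does not list per-layer parameter counts, and the real figure also accounts for group-wise scale/bias overhead at group size 64):

```python
def weighted_bpw(layer_bits, layer_params):
    """Average bits per weight, weighted by each layer's parameter count."""
    total = sum(layer_params)
    return sum(b * p for b, p in zip(layer_bits, layer_params)) / total

# Hypothetical split: if the 4-bit layers hold 7/8 of the parameters,
# the overall average lands at 4.5 BPW.
print(weighted_bpw([4, 8], [7, 1]))
```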
## Benchmark Results
GSM8K (200 samples, 3-shot chain-of-thought):
| Model | GSM8K Accuracy |
|---|---|
| OptiQ mixed (4.5 BPW) | 55.5% |
| Uniform 4-bit | 23.5% |
OptiQ more than doubles the accuracy of uniform 4-bit quantization (+32.0 percentage points, 2.4x improvement).
## Usage

This model works with standard `mlx-lm`:
```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-4-e4b-it-OptiQ-4bit")

prompt = "Explain quantum computing in simple terms:"
response = generate(model, tokenizer, prompt=prompt, max_tokens=200)
print(response)
```
## TurboQuant KV Cache (Optional)

For better long-context performance, install `mlx-optiq`:

```bash
pip install mlx-optiq
```
## Article

For more details on the methodology and results, see: *Not All Layers Are Equal: Mixed-Precision Quantization for Weights and KV Cache on Apple Silicon*.
## Credits
- Quantization method: mlx-optiq by Thin Signal
- Base model: google/gemma-4-e4b-it by Google
- Runtime: MLX by Apple