# gemma-4-e4b-it-OptiQ-4bit

*Mixed-precision quantized with OptiQ — sensitivity-driven quantization for Apple Silicon*

This is a mixed-precision quantized version of google/gemma-4-e4b-it in MLX format. OptiQ measures each layer's sensitivity via KL divergence and assigns optimal per-layer bit-widths, preserving model quality where it matters most.
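The sensitivity-driven allocation described above can be sketched as a simple greedy loop. This is an illustrative toy, not OptiQ's actual implementation: it assumes equal-sized layers and takes a precomputed vector of per-layer KL-divergence sensitivities as input.

```python
def assign_bits(sensitivity, target_bpw, low=4, high=8):
    """Greedy mixed-precision allocation sketch.

    Upgrade the most KL-sensitive layers from `low` to `high` bits
    until the next upgrade would push the layer-averaged bit-width
    past `target_bpw`. Assumes equal-sized layers (a simplification;
    a real allocator would weight by parameter count).
    """
    n = len(sensitivity)
    bits = [low] * n
    # Visit layers in order of decreasing sensitivity.
    order = sorted(range(n), key=lambda i: sensitivity[i], reverse=True)
    for i in order:
        # Average bit-width if this layer were upgraded.
        if (sum(bits) - low + high) / n > target_bpw:
            break
        bits[i] = high
    return bits


# Example: four layers, a 5.0-BPW budget -> only the most
# sensitive layer fits at 8-bit.
print(assign_bits([0.9, 0.1, 0.5, 0.2], target_bpw=5.0))  # → [8, 4, 4, 4]
```

Under this scheme, the 4.5-BPW budget is what determines how many layers end up at 8-bit versus 4-bit.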

## Quantization Details

| Property | Value |
|---|---|
| Target BPW | 4.5 |
| Achieved BPW | 4.50 |
| Candidate bits | 4, 8 |
| Layers at 4-bit | 444 |
| Layers at 8-bit | 149 |
| Total quantized layers | 593 |
| Group size | 64 |

## Benchmark Results

GSM8K (200 samples, 3-shot chain-of-thought):

| Model | GSM8K Accuracy |
|---|---|
| OptiQ mixed (4.5 BPW) | 55.5% |
| Uniform 4-bit | 23.5% |

OptiQ more than doubles the accuracy of uniform 4-bit quantization (+32.0 percentage points, 2.4x improvement).
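The reported deltas follow directly from the two accuracy figures; a quick arithmetic check:

```python
optiq_acc, uniform_acc = 55.5, 23.5  # GSM8K accuracy, in percent

# Absolute gain in percentage points and relative improvement factor.
gain_pp = optiq_acc - uniform_acc
factor = optiq_acc / uniform_acc

print(f"+{gain_pp:.1f} pp, {factor:.1f}x")  # → +32.0 pp, 2.4x
```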

## Usage

This model works with standard `mlx-lm`:

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-4-e4b-it-OptiQ-4bit")

prompt = "Explain quantum computing in simple terms:"
response = generate(model, tokenizer, prompt=prompt, max_tokens=200)
print(response)
```

## TurboQuant KV Cache (Optional)

For better long-context performance with the TurboQuant quantized KV cache, install `mlx-optiq`:

```shell
pip install mlx-optiq
```

## Article

For more details on the methodology and results, see the article *Not All Layers Are Equal: Mixed-Precision Quantization for Weights and KV Cache on Apple Silicon*.

