# gemma-4-e4b-it-OptiQ-4bit

*Mixed-precision quantized with OptiQ — sensitivity-driven quantization for Apple Silicon*

This is a mixed-precision quantized version of google/gemma-4-e4b-it in MLX format. OptiQ measures each layer's sensitivity via KL divergence and assigns optimal per-layer bit-widths, preserving model quality where it matters most.
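The sensitivity-driven allocation described above can be sketched as a simple greedy loop. This is an illustrative toy, not OptiQ's actual implementation: it assumes equal-sized layers and takes a precomputed vector of per-layer KL-divergence sensitivities as input.

```python
def assign_bits(sensitivity, target_bpw, low=4, high=8):
    """Greedy mixed-precision allocation sketch.

    Upgrade the most KL-sensitive layers from `low` to `high` bits
    until the next upgrade would push the layer-averaged bit-width
    past `target_bpw`. Assumes equal-sized layers (a simplification;
    a real allocator would weight by parameter count).
    """
    n = len(sensitivity)
    bits = [low] * n
    # Visit layers in order of decreasing sensitivity.
    order = sorted(range(n), key=lambda i: sensitivity[i], reverse=True)
    for i in order:
        # Average bit-width if this layer were upgraded.
        if (sum(bits) - low + high) / n > target_bpw:
            break
        bits[i] = high
    return bits


# Example: four layers, a 5.0-BPW budget -> only the most
# sensitive layer fits at 8-bit.
print(assign_bits([0.9, 0.1, 0.5, 0.2], target_bpw=5.0))  # → [8, 4, 4, 4]
```

Under this scheme, the 4.5-BPW budget is what determines how many layers end up at 8-bit versus 4-bit.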

## Quantization Details

| Property | Value |
|---|---|
| Target BPW | 4.5 |
| Achieved BPW | 4.50 |
| Candidate bits | 4, 8 |
| Layers at 4-bit | 444 |
| Layers at 8-bit | 149 |
| Total quantized layers | 593 |
| Group size | 64 |

## Benchmark Results

GSM8K (200 samples, 3-shot chain-of-thought):

| Model | GSM8K Accuracy |
|---|---|
| OptiQ mixed (4.5 BPW) | 55.5% |
| Uniform 4-bit | 23.5% |

OptiQ more than doubles the accuracy of uniform 4-bit quantization (+32.0 percentage points, 2.4x improvement).
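The reported deltas follow directly from the two accuracy figures; a quick arithmetic check:

```python
optiq_acc, uniform_acc = 55.5, 23.5  # GSM8K accuracy, in percent

# Absolute gain in percentage points and relative improvement factor.
gain_pp = optiq_acc - uniform_acc
factor = optiq_acc / uniform_acc

print(f"+{gain_pp:.1f} pp, {factor:.1f}x")  # → +32.0 pp, 2.4x
```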

## Usage

This model works with standard `mlx-lm`:

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-4-e4b-it-OptiQ-4bit")

prompt = "Explain quantum computing in simple terms:"
response = generate(model, tokenizer, prompt=prompt, max_tokens=200)
print(response)
```

## TurboQuant KV Cache (Optional)

For better long-context performance with the TurboQuant quantized KV cache, install `mlx-optiq`:

```shell
pip install mlx-optiq
```

## Article

For more details on the methodology and results, see the article *Not All Layers Are Equal: Mixed-Precision Quantization for Weights and KV Cache on Apple Silicon*.

