# gemma-4-A4B-98e-v3-it-GGUF
GGUF quantizations of ManniX-ITA/gemma-4-A4B-98e-v3-it — Gemma 4 26B pruned to 98 experts per layer (from 128).
Zero GPQA degradation despite dropping 30 experts per layer (23.4% of MoE capacity).
All quants made using imatrix with calibration data v5. Includes ContribDynamic (CD) quants with per-layer dynamic quantization based on expert contribution analysis.
## GPQA Diamond (198 questions, Q6_K, full CoT reasoning)
| Model | Experts/Layer | GPQA Diamond (flex) | Delta |
|---|---|---|---|
| Gemma 4 26B-A4B-it (original) | 128 | 75.25% | baseline |
| 109e v3 | 109 | 71.72% | -3.53 pp |
| 98e v3 (this model) | 98 | 75.25% | +0.00 pp |
## Available Quantizations

### Standard Quants
| Quantization | File | Size |
|---|---|---|
| Q8_0 | gemma-4-A4B-98e-v3-it-Q8_0.gguf | 19.71 GB |
| Q6_K_L | gemma-4-A4B-98e-v3-it-Q6_K_L.gguf | 16.75 GB |
| Q6_K | gemma-4-A4B-98e-v3-it-Q6_K.gguf | 16.58 GB |
| Q5_K_L | gemma-4-A4B-98e-v3-it-Q5_K_L.gguf | 14.20 GB |
| Q5_K_M | gemma-4-A4B-98e-v3-it-Q5_K_M.gguf | 14.04 GB |
| Q5_K_S | gemma-4-A4B-98e-v3-it-Q5_K_S.gguf | 13.21 GB |
| Q4_K_L | gemma-4-A4B-98e-v3-it-Q4_K_L.gguf | 12.50 GB |
| Q4_K_M | gemma-4-A4B-98e-v3-it-Q4_K_M.gguf | 12.33 GB |
| Q4_1 | gemma-4-A4B-98e-v3-it-Q4_1.gguf | 11.75 GB |
| Q4_K_S | gemma-4-A4B-98e-v3-it-Q4_K_S.gguf | 11.37 GB |
| Q4_0 | gemma-4-A4B-98e-v3-it-Q4_0.gguf | 10.63 GB |
| IQ4_NL | gemma-4-A4B-98e-v3-it-IQ4_NL.gguf | 10.63 GB |
| IQ4_XS | gemma-4-A4B-98e-v3-it-IQ4_XS.gguf | 10.25 GB |
| Q3_K_XL | gemma-4-A4B-98e-v3-it-Q3_K_XL.gguf | 9.96 GB |
| Q3_K_L | gemma-4-A4B-98e-v3-it-Q3_K_L.gguf | 10.19 GB |
| Q3_K_M | gemma-4-A4B-98e-v3-it-Q3_K_M.gguf | 9.79 GB |
| IQ3_M | gemma-4-A4B-98e-v3-it-IQ3_M.gguf | 9.15 GB |
| Q3_K_S | gemma-4-A4B-98e-v3-it-Q3_K_S.gguf | 9.01 GB |
| IQ3_XS | gemma-4-A4B-98e-v3-it-IQ3_XS.gguf | 8.58 GB |
| IQ3_XXS | gemma-4-A4B-98e-v3-it-IQ3_XXS.gguf | 8.33 GB |
| IQ2_M | gemma-4-A4B-98e-v3-it-IQ2_M.gguf | 7.66 GB |
| IQ2_S | gemma-4-A4B-98e-v3-it-IQ2_S.gguf | 7.29 GB |
| IQ2_XS | gemma-4-A4B-98e-v3-it-IQ2_XS.gguf | 7.24 GB |
| IQ2_XXS | gemma-4-A4B-98e-v3-it-IQ2_XXS.gguf | 6.86 GB |
### ContribDynamic (CD) Quants
Per-layer dynamic quantization based on expert contribution analysis. Important layers get higher precision.
| Quantization | File | Size |
|---|---|---|
| CD-Q6_K | gemma-4-A4B-98e-v3-it-CD-Q6_K.gguf | 14.39 GB |
| CD-Q5_K_M | gemma-4-A4B-98e-v3-it-CD-Q5_K_M.gguf | 12.32 GB |
| CD-Q4_K_M | gemma-4-A4B-98e-v3-it-CD-Q4_K_M.gguf | 10.12 GB |
| CD-Q3_K_M | gemma-4-A4B-98e-v3-it-CD-Q3_K_M.gguf | 9.48 GB |
| CD-Q2_K | gemma-4-A4B-98e-v3-it-CD-Q2_K.gguf | 7.97 GB |
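The CD recipe itself is not published in this card, but the idea of contribution-weighted precision can be sketched. The function below is a hypothetical illustration (not the actual CD algorithm): given per-layer contribution scores, the top fraction of layers gets a higher-bit quant type and the bottom fraction a lower-bit one.

```python
# Hypothetical sketch of contribution-based per-layer quant assignment.
# NOT the actual ContribDynamic recipe; thresholds and types are illustrative.

def assign_quant(layer_contrib, base="Q4_K", hi="Q6_K", lo="Q3_K",
                 hi_frac=0.25, lo_frac=0.25):
    """layer_contrib: one contribution score per layer.
    Top hi_frac of layers get the hi type, bottom lo_frac the lo type,
    everything else the base type."""
    n = len(layer_contrib)
    order = sorted(range(n), key=lambda i: layer_contrib[i])  # ascending
    types = [base] * n
    for i in order[: int(n * lo_frac)]:        # least important layers
        types[i] = lo
    for i in order[n - int(n * hi_frac):]:     # most important layers
        types[i] = hi
    return types

# Toy example with 4 layers:
print(assign_quant([0.1, 0.9, 0.5, 0.7]))  # → ['Q3_K', 'Q6_K', 'Q4_K', 'Q4_K']
```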
### Skipped Quantizations (failed sanity check)

The following quantizations were attempted but failed the sanity check (three capital-city questions answered incorrectly or incoherently) and are intentionally not published:

- Q2_K_L, Q2_K
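The exact questions and grading used above are not published; the sketch below only illustrates the shape of such a check, with assumed questions and a simple substring match.

```python
# Illustrative sanity check (assumed questions; not the card's actual harness):
# a quant passes only if all three answers mention the expected capital.

CHECKS = [
    ("What is the capital of France?", "paris"),
    ("What is the capital of Japan?", "tokyo"),
    ("What is the capital of Italy?", "rome"),
]

def passes_sanity(generate):
    """generate: callable mapping a prompt string to the model's output string."""
    return all(expected in generate(q).lower() for q, expected in CHECKS)

# Toy stand-in for a model backend:
print(passes_sanity(lambda q: "The answer is Paris, Tokyo or Rome."))  # → True
print(passes_sanity(lambda q: "I do not know."))                       # → False
```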
## Recommended Usage

### llama.cpp

```shell
llama-server -m gemma-4-A4B-98e-v3-it-Q4_K_M.gguf -c 32768 -ngl 99 --no-warmup \
  --reasoning-format deepseek --reasoning-budget 8192
```
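Once the server is up, it can be queried through its OpenAI-compatible endpoint. A minimal request sketch, assuming llama-server's default port 8080 (the model name field is ignored by a single-model server):

```python
import json

# Minimal OpenAI-compatible chat request for the llama-server instance above.
payload = {
    "model": "gemma-4-A4B-98e-v3-it-Q4_K_M",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0.6,
    "top_p": 0.95,
}
body = json.dumps(payload).encode()

# To actually send it (requires the server to be running):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8080/v1/chat/completions",
#     data=body, headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
```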
### Ollama

For chat-only usage:

```shell
ollama run hf.co/ManniX-ITA/gemma-4-A4B-98e-v3-it-GGUF:CD-Q6_K
```
This repo ships sidecar files that Ollama reads on pull:
- `template`: Gemma 4 native tool-call format `<|tool_call>call:NAME{...}<tool_call|>` with `<|"|>...<|"|>` string delimiters.
- `params`: `repeat_penalty 1.15` (stops duplicate-block loops), stop sequences `<turn|>` / `<|tool_response>`, `temperature 0.6`, `top_p 0.95`, `num_ctx 131072`.
For tool-use / function-calling via the Ollama SDK or the OpenAI-compatible `/v1/chat/completions` endpoint, Ollama's built-in Gemma 4 `RENDERER` / `PARSER` directives are required; they are not applied automatically to HF pulls. Wrap the pulled model once:
```shell
ollama pull hf.co/ManniX-ITA/gemma-4-A4B-98e-v3-it-GGUF:CD-Q6_K

cat <<'EOF' | ollama create gemma4-98e-it -f -
FROM hf.co/ManniX-ITA/gemma-4-A4B-98e-v3-it-GGUF:CD-Q6_K
RENDERER gemma4
PARSER gemma4
EOF

ollama run gemma4-98e-it
```
After wrapping, `ollama show gemma4-98e-it` reports capabilities `[completion, tools, thinking]`, and tool calls are parsed into the structured `tool_calls` response field instead of appearing as raw tokens in `content`.
Sampler note: without `repeat_penalty >= 1.1`, Gemma 4 tool use occasionally loops and emits hundreds of duplicate `<tool_call>` blocks per response. The `params` file sets 1.15; keep it.
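A request against the wrapped model can declare tools in the standard OpenAI schema. The sketch below builds such a payload; the tool name and schema are purely illustrative:

```python
import json

# Illustrative tools payload for the OpenAI-compatible endpoint after wrapping.
# The get_weather tool is a made-up example, not part of this repo.
payload = {
    "model": "gemma4-98e-it",
    "messages": [{"role": "user", "content": "What is the weather in Rome?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}
# With the RENDERER/PARSER directives applied, the assistant message in the
# response carries a structured `tool_calls` list rather than raw tokens.
body = json.dumps(payload)
```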
## Method

Expert dropping via per-layer contribution analysis using teacher-forced importance mapping (fp32 accumulation). Script: `expert_drop.py` in the model repo.
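The repo's `expert_drop.py` is not reproduced here; as a rough illustration of the selection step only, once per-expert contribution scores have been accumulated for a layer, pruning reduces to keeping the top-k experts:

```python
# Minimal sketch of the selection step in contribution-based expert dropping
# (NOT the repo's expert_drop.py; score accumulation is assumed done already).

def keep_top_experts(contrib, k):
    """contrib: per-expert contribution scores for one layer (e.g. accumulated
    in fp32 over a calibration set). Returns the sorted indices of the k
    highest-scoring experts; all others would be dropped."""
    ranked = sorted(range(len(contrib)), key=lambda i: contrib[i], reverse=True)
    return sorted(ranked[:k])

# Toy example: 8 experts, keep 6 (the real model keeps 98 of 128 per layer).
scores = [0.9, 0.1, 0.7, 0.05, 0.8, 0.6, 0.3, 0.2]
print(keep_top_experts(scores, 6))  # → [0, 2, 4, 5, 6, 7]
```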
## License
Gemma license, inherited from the base model.
Base model: google/gemma-4-26B-A4B-it