gemma-4-A4B-98e-v3-it-GGUF

GGUF quantizations of ManniX-ITA/gemma-4-A4B-98e-v3-it — Gemma 4 26B pruned to 98 experts per layer (from 128).

Zero GPQA degradation despite dropping 30 experts per layer (23.4% of MoE capacity).

All quants made using imatrix with calibration data v5. Includes ContribDynamic (CD) quants with per-layer dynamic quantization based on expert contribution analysis.

GPQA Diamond (198 questions, Q6_K, full CoT reasoning)

| Model | Experts/Layer | GPQA Diamond (flex) | Delta |
|---|---|---|---|
| Gemma 4 26B-A4B-it (original) | 128 | 75.25% | baseline |
| 109e v3 | 109 | 71.72% | -3.53 pp |
| 98e v3 (this model) | 98 | 75.25% | +0.00 pp |

Available Quantizations

Standard Quants

| Quantization | File | Size |
|---|---|---|
| Q8_0 | gemma-4-A4B-98e-v3-it-Q8_0.gguf | 19.71 GB |
| Q6_K_L | gemma-4-A4B-98e-v3-it-Q6_K_L.gguf | 16.75 GB |
| Q6_K | gemma-4-A4B-98e-v3-it-Q6_K.gguf | 16.58 GB |
| Q5_K_L | gemma-4-A4B-98e-v3-it-Q5_K_L.gguf | 14.20 GB |
| Q5_K_M | gemma-4-A4B-98e-v3-it-Q5_K_M.gguf | 14.04 GB |
| Q5_K_S | gemma-4-A4B-98e-v3-it-Q5_K_S.gguf | 13.21 GB |
| Q4_K_L | gemma-4-A4B-98e-v3-it-Q4_K_L.gguf | 12.50 GB |
| Q4_K_M | gemma-4-A4B-98e-v3-it-Q4_K_M.gguf | 12.33 GB |
| Q4_1 | gemma-4-A4B-98e-v3-it-Q4_1.gguf | 11.75 GB |
| Q4_K_S | gemma-4-A4B-98e-v3-it-Q4_K_S.gguf | 11.37 GB |
| Q4_0 | gemma-4-A4B-98e-v3-it-Q4_0.gguf | 10.63 GB |
| IQ4_NL | gemma-4-A4B-98e-v3-it-IQ4_NL.gguf | 10.63 GB |
| IQ4_XS | gemma-4-A4B-98e-v3-it-IQ4_XS.gguf | 10.25 GB |
| Q3_K_L | gemma-4-A4B-98e-v3-it-Q3_K_L.gguf | 10.19 GB |
| Q3_K_XL | gemma-4-A4B-98e-v3-it-Q3_K_XL.gguf | 9.96 GB |
| Q3_K_M | gemma-4-A4B-98e-v3-it-Q3_K_M.gguf | 9.79 GB |
| IQ3_M | gemma-4-A4B-98e-v3-it-IQ3_M.gguf | 9.15 GB |
| Q3_K_S | gemma-4-A4B-98e-v3-it-Q3_K_S.gguf | 9.01 GB |
| IQ3_XS | gemma-4-A4B-98e-v3-it-IQ3_XS.gguf | 8.58 GB |
| IQ3_XXS | gemma-4-A4B-98e-v3-it-IQ3_XXS.gguf | 8.33 GB |
| IQ2_M | gemma-4-A4B-98e-v3-it-IQ2_M.gguf | 7.66 GB |
| IQ2_S | gemma-4-A4B-98e-v3-it-IQ2_S.gguf | 7.29 GB |
| IQ2_XS | gemma-4-A4B-98e-v3-it-IQ2_XS.gguf | 7.24 GB |
| IQ2_XXS | gemma-4-A4B-98e-v3-it-IQ2_XXS.gguf | 6.86 GB |

ContribDynamic (CD) Quants

Per-layer dynamic quantization based on expert contribution analysis. Important layers get higher precision.
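The idea can be pictured with a small sketch (illustrative only — the thresholds, type names, and selection rule here are assumptions, not the actual CD analysis):

```python
# Illustrative sketch of ContribDynamic-style type selection:
# layers whose experts contribute most get a higher-precision quant type,
# the least-contributing layers get a lower one, the rest use the base type.
def pick_quant_types(layer_scores, base="Q4_K", hi="Q6_K", lo="Q3_K",
                     top_frac=0.25, bottom_frac=0.25):
    """Map per-layer contribution scores to per-layer quant types."""
    order = sorted(range(len(layer_scores)), key=lambda i: layer_scores[i])
    n = len(order)
    n_lo = int(n * bottom_frac)
    n_hi = int(n * top_frac)
    types = [base] * n
    for i in order[:n_lo]:          # lowest-contribution layers
        types[i] = lo
    for i in order[n - n_hi:]:      # highest-contribution layers
        types[i] = hi
    return types

scores = [0.9, 0.2, 0.5, 0.7, 0.1, 0.4, 0.8, 0.3]
print(pick_quant_types(scores))
# → ['Q6_K', 'Q3_K', 'Q4_K', 'Q4_K', 'Q3_K', 'Q4_K', 'Q6_K', 'Q4_K']
```

The size numbers in the table below reflect this trade: CD-Q6_K lands well under plain Q6_K because only the important layers keep full Q6_K precision.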

| Quantization | File | Size |
|---|---|---|
| CD-Q6_K | gemma-4-A4B-98e-v3-it-CD-Q6_K.gguf | 14.39 GB |
| CD-Q5_K_M | gemma-4-A4B-98e-v3-it-CD-Q5_K_M.gguf | 12.32 GB |
| CD-Q4_K_M | gemma-4-A4B-98e-v3-it-CD-Q4_K_M.gguf | 10.12 GB |
| CD-Q3_K_M | gemma-4-A4B-98e-v3-it-CD-Q3_K_M.gguf | 9.48 GB |
| CD-Q2_K | gemma-4-A4B-98e-v3-it-CD-Q2_K.gguf | 7.97 GB |

Skipped Quantizations (failed sanity check)

The following quantizations were attempted but failed the sanity check (three capital-city questions, answered incorrectly or incoherently). These quants are intentionally not published:

  • Q2_K_L, Q2_K
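The check amounts to something like the sketch below (hypothetical — the actual questions and pass criteria are not published; `generate` stands in for a call into the quantized model):

```python
# Hypothetical sketch of the capital-city sanity check: a quant is
# rejected if any of three easy factual questions fails.
CHECKS = [
    ("What is the capital of France?", "paris"),
    ("What is the capital of Japan?", "tokyo"),
    ("What is the capital of Italy?", "rome"),
]

def passes_sanity(generate):
    """generate: callable(prompt) -> model answer string."""
    return all(expected in generate(q).lower() for q, expected in CHECKS)

# A mock model that answers coherently passes; an incoherent one fails.
print(passes_sanity(lambda q: "Paris, Tokyo and Rome."))  # → False for France only if missing; here True
print(passes_sanity(lambda q: "I am a teapot."))          # → False
```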

Recommended Usage

llama.cpp

llama-server -m gemma-4-A4B-98e-v3-it-Q4_K_M.gguf -c 32768 -ngl 99 --no-warmup \
    --reasoning-format deepseek --reasoning-budget 8192
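With the server above running, any OpenAI-compatible client can talk to it via /v1/chat/completions. A minimal sketch of the request body (the model name and sampler values here are assumptions matching this repo's recommendations; the endpoint defaults to http://localhost:8080):

```python
import json

# Chat request body for llama-server's OpenAI-compatible endpoint
# (POST it to http://localhost:8080/v1/chat/completions).
payload = {
    "model": "gemma-4-A4B-98e-v3-it-Q4_K_M",
    "messages": [
        {"role": "user", "content": "Name three capital cities."},
    ],
    "temperature": 0.6,
    "top_p": 0.95,
}
body = json.dumps(payload)
print(body)
```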

Ollama

For chat-only usage:

ollama run hf.co/ManniX-ITA/gemma-4-A4B-98e-v3-it-GGUF:CD-Q6_K

This repo ships sidecar files that Ollama reads on pull:

  • template — Gemma 4 NATIVE tool-call format: <|tool_call>call:NAME{...}<tool_call|> with <|"|>...<|"|> string delimiters.
  • params — repeat_penalty 1.15 (stops duplicate-block loops), stop sequences <turn|> / <|tool_response>, temperature 0.6, top_p 0.95, num_ctx 131072.

For tool-use / function-calling via the Ollama SDK or OpenAI-compat /v1/chat/completions, Ollama's built-in Gemma 4 RENDERER / PARSER directives are required — these aren't applied automatically to HF pulls. Wrap the pulled model once:

ollama pull hf.co/ManniX-ITA/gemma-4-A4B-98e-v3-it-GGUF:CD-Q6_K
cat <<'EOF' | ollama create gemma4-98e-it -f -
FROM hf.co/ManniX-ITA/gemma-4-A4B-98e-v3-it-GGUF:CD-Q6_K
RENDERER gemma4
PARSER gemma4
EOF
ollama run gemma4-98e-it

After wrapping, ollama show gemma4-98e-it reports capabilities [completion, tools, thinking] and tool calls are parsed into the structured tool_calls response field instead of appearing as raw tokens in content.

Sampler note: without repeat_penalty >= 1.1, Gemma 4 tool-use occasionally loops and emits hundreds of duplicate <tool_call> blocks per response. The params file sets 1.15 — keep it.
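If you call the wrapped model through Ollama's /api/chat endpoint directly, the same samplers can be pinned per request via the options field (a sketch of the request body; the field names follow Ollama's documented runtime parameters):

```python
import json

# Request body for Ollama's /api/chat, pinning the sampler settings
# this repo's params file recommends.
request = {
    "model": "gemma4-98e-it",
    "messages": [{"role": "user", "content": "What's the weather in Rome?"}],
    "options": {
        "repeat_penalty": 1.15,  # prevents duplicate tool_call loops
        "temperature": 0.6,
        "top_p": 0.95,
        "num_ctx": 131072,
    },
    "stream": False,
}
print(json.dumps(request)[:60])
```

Per-request options override the params sidecar, so if a client sets its own samplers, make sure repeat_penalty stays at or above 1.1.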

Method

Expert dropping via per-layer contribution analysis using teacher-force importance mapping (fp32 accumulation). Script: expert_drop.py in the model repo.
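In outline, the dropping step looks like this (a toy sketch with random data; the contribution metric here — accumulated routing mass per expert — is an assumption, and expert_drop.py in the repo is the authoritative implementation):

```python
import numpy as np

def drop_experts(router_weights, keep=98):
    """router_weights: (tokens, experts) routing probabilities for one layer.
    Accumulate each expert's contribution in fp32, keep the top `keep`."""
    contrib = router_weights.astype(np.float32).sum(axis=0)
    top = np.argsort(contrib)[::-1][:keep]   # highest-contribution experts
    return np.sort(top)                      # kept expert indices, in order

rng = np.random.default_rng(0)
w = rng.random((1024, 128))                  # teacher-forced routing stats
kept = drop_experts(w, keep=98)
print(len(kept))  # → 98 (30 of 128 experts dropped)
```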

License

Gemma license, inherited from the base model.

Model details

Format: GGUF · Size: 20B params · Architecture: gemma4