# gemma-4-A4B-98e-v3-it-GGUF
GGUF quantizations of ManniX-ITA/gemma-4-A4B-98e-v3-it — Gemma 4 26B pruned to 98 experts per layer (from 128).
Zero GPQA degradation despite dropping 30 experts per layer (23.4% of MoE capacity).
All quants made using imatrix with calibration data v5. Includes ContribDynamic (CD) quants with per-layer dynamic quantization based on expert contribution analysis.
## GPQA Diamond (198 questions, Q6_K, full CoT reasoning)
| Model | Experts/Layer | GPQA Diamond (flex) | Delta |
|---|---|---|---|
| Gemma 4 26B-A4B-it (original) | 128 | 75.25% | baseline |
| 109e v3 | 109 | 71.72% | -3.53 pp |
| 98e v3 (this model) | 98 | 75.25% | +0.00 pp |
## Available Quantizations

### Standard Quants
| Quantization | File | Size |
|---|---|---|
| Q8_0 | gemma-4-A4B-98e-v3-it-Q8_0.gguf | 19.71 GB |
| Q6_K_L | gemma-4-A4B-98e-v3-it-Q6_K_L.gguf | 16.75 GB |
| Q6_K | gemma-4-A4B-98e-v3-it-Q6_K.gguf | 16.58 GB |
| Q5_K_L | gemma-4-A4B-98e-v3-it-Q5_K_L.gguf | 14.20 GB |
| Q5_K_M | gemma-4-A4B-98e-v3-it-Q5_K_M.gguf | 14.04 GB |
| Q5_K_S | gemma-4-A4B-98e-v3-it-Q5_K_S.gguf | 13.21 GB |
| Q4_K_L | gemma-4-A4B-98e-v3-it-Q4_K_L.gguf | 12.50 GB |
| Q4_K_M | gemma-4-A4B-98e-v3-it-Q4_K_M.gguf | 12.33 GB |
| Q4_1 | gemma-4-A4B-98e-v3-it-Q4_1.gguf | 11.75 GB |
| Q4_K_S | gemma-4-A4B-98e-v3-it-Q4_K_S.gguf | 11.37 GB |
| Q4_0 | gemma-4-A4B-98e-v3-it-Q4_0.gguf | 10.63 GB |
| IQ4_NL | gemma-4-A4B-98e-v3-it-IQ4_NL.gguf | 10.63 GB |
| IQ4_XS | gemma-4-A4B-98e-v3-it-IQ4_XS.gguf | 10.25 GB |
| Q3_K_XL | gemma-4-A4B-98e-v3-it-Q3_K_XL.gguf | 9.96 GB |
| Q3_K_L | gemma-4-A4B-98e-v3-it-Q3_K_L.gguf | 10.19 GB |
| Q3_K_M | gemma-4-A4B-98e-v3-it-Q3_K_M.gguf | 9.79 GB |
| IQ3_M | gemma-4-A4B-98e-v3-it-IQ3_M.gguf | 9.15 GB |
| Q3_K_S | gemma-4-A4B-98e-v3-it-Q3_K_S.gguf | 9.01 GB |
| IQ3_XS | gemma-4-A4B-98e-v3-it-IQ3_XS.gguf | 8.58 GB |
| IQ3_XXS | gemma-4-A4B-98e-v3-it-IQ3_XXS.gguf | 8.33 GB |
| IQ2_M | gemma-4-A4B-98e-v3-it-IQ2_M.gguf | 7.66 GB |
| IQ2_S | gemma-4-A4B-98e-v3-it-IQ2_S.gguf | 7.29 GB |
| IQ2_XS | gemma-4-A4B-98e-v3-it-IQ2_XS.gguf | 7.24 GB |
| IQ2_XXS | gemma-4-A4B-98e-v3-it-IQ2_XXS.gguf | 6.86 GB |
### ContribDynamic (CD) Quants
Per-layer dynamic quantization based on expert contribution analysis. Important layers get higher precision.
| Quantization | File | Size |
|---|---|---|
| CD-Q6_K | gemma-4-A4B-98e-v3-it-CD-Q6_K.gguf | 14.39 GB |
| CD-Q5_K_M | gemma-4-A4B-98e-v3-it-CD-Q5_K_M.gguf | 12.32 GB |
| CD-Q4_K_M | gemma-4-A4B-98e-v3-it-CD-Q4_K_M.gguf | 10.12 GB |
| CD-Q3_K_M | gemma-4-A4B-98e-v3-it-CD-Q3_K_M.gguf | 9.48 GB |
| CD-Q2_K | gemma-4-A4B-98e-v3-it-CD-Q2_K.gguf | 7.97 GB |
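The CD recipe itself is not published in this card, but the idea of contribution-weighted precision can be sketched. The function below is a hypothetical illustration (not the actual CD algorithm): given per-layer contribution scores, the top fraction of layers gets a higher-bit quant type and the bottom fraction a lower-bit one.

```python
# Hypothetical sketch of contribution-based per-layer quant assignment.
# NOT the actual ContribDynamic recipe; thresholds and types are illustrative.

def assign_quant(layer_contrib, base="Q4_K", hi="Q6_K", lo="Q3_K",
                 hi_frac=0.25, lo_frac=0.25):
    """layer_contrib: one contribution score per layer.
    Top hi_frac of layers get the hi type, bottom lo_frac the lo type,
    everything else the base type."""
    n = len(layer_contrib)
    order = sorted(range(n), key=lambda i: layer_contrib[i])  # ascending
    types = [base] * n
    for i in order[: int(n * lo_frac)]:        # least important layers
        types[i] = lo
    for i in order[n - int(n * hi_frac):]:     # most important layers
        types[i] = hi
    return types

# Toy example with 4 layers:
print(assign_quant([0.1, 0.9, 0.5, 0.7]))  # → ['Q3_K', 'Q6_K', 'Q4_K', 'Q4_K']
```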
### Skipped Quantizations (failed sanity check)

The following quantizations were attempted but failed the sanity check (three capital-city questions answered incorrectly or incoherently) and are intentionally not published:

- Q2_K_L, Q2_K
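The exact questions and grading used above are not published; the sketch below only illustrates the shape of such a check, with assumed questions and a simple substring match.

```python
# Illustrative sanity check (assumed questions; not the card's actual harness):
# a quant passes only if all three answers mention the expected capital.

CHECKS = [
    ("What is the capital of France?", "paris"),
    ("What is the capital of Japan?", "tokyo"),
    ("What is the capital of Italy?", "rome"),
]

def passes_sanity(generate):
    """generate: callable mapping a prompt string to the model's output string."""
    return all(expected in generate(q).lower() for q, expected in CHECKS)

# Toy stand-in for a model backend:
print(passes_sanity(lambda q: "The answer is Paris, Tokyo or Rome."))  # → True
print(passes_sanity(lambda q: "I do not know."))                       # → False
```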
## Recommended Usage

### llama.cpp

```shell
llama-server -m gemma-4-A4B-98e-v3-it-Q4_K_M.gguf -c 32768 -ngl 99 --no-warmup \
  --reasoning-format deepseek --reasoning-budget 8192
```
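Once the server is up, it can be queried through its OpenAI-compatible endpoint. A minimal request sketch, assuming llama-server's default port 8080 (the model name field is ignored by a single-model server):

```python
import json

# Minimal OpenAI-compatible chat request for the llama-server instance above.
payload = {
    "model": "gemma-4-A4B-98e-v3-it-Q4_K_M",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0.6,
    "top_p": 0.95,
}
body = json.dumps(payload).encode()

# To actually send it (requires the server to be running):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8080/v1/chat/completions",
#     data=body, headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
```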
### Ollama

For chat-only usage:

```shell
ollama run hf.co/ManniX-ITA/gemma-4-A4B-98e-v3-it-GGUF:CD-Q6_K
```
This repo ships sidecar files that Ollama reads on pull:
- `template`: Gemma 4 native tool-call format `<|tool_call>call:NAME{...}<tool_call|>` with `<|"|>...<|"|>` string delimiters.
- `params`: `repeat_penalty 1.15` (stops duplicate-block loops), stop sequences `<turn|>` / `<|tool_response>`, `temperature 0.6`, `top_p 0.95`, `num_ctx 131072`.
For tool-use / function-calling via the Ollama SDK or the OpenAI-compatible `/v1/chat/completions` endpoint, Ollama's built-in Gemma 4 `RENDERER` / `PARSER` directives are required; they are not applied automatically to HF pulls. Wrap the pulled model once:
```shell
ollama pull hf.co/ManniX-ITA/gemma-4-A4B-98e-v3-it-GGUF:CD-Q6_K

cat <<'EOF' | ollama create gemma4-98e-it -f -
FROM hf.co/ManniX-ITA/gemma-4-A4B-98e-v3-it-GGUF:CD-Q6_K
RENDERER gemma4
PARSER gemma4
EOF

ollama run gemma4-98e-it
```
After wrapping, `ollama show gemma4-98e-it` reports capabilities `[completion, tools, thinking]`, and tool calls are parsed into the structured `tool_calls` response field instead of appearing as raw tokens in `content`.
Sampler note: without `repeat_penalty >= 1.1`, Gemma 4 tool use occasionally loops and emits hundreds of duplicate `<tool_call>` blocks per response. The `params` file sets 1.15; keep it.
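A request against the wrapped model can declare tools in the standard OpenAI schema. The sketch below builds such a payload; the tool name and schema are purely illustrative:

```python
import json

# Illustrative tools payload for the OpenAI-compatible endpoint after wrapping.
# The get_weather tool is a made-up example, not part of this repo.
payload = {
    "model": "gemma4-98e-it",
    "messages": [{"role": "user", "content": "What is the weather in Rome?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}
# With the RENDERER/PARSER directives applied, the assistant message in the
# response carries a structured `tool_calls` list rather than raw tokens.
body = json.dumps(payload)
```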
## Method

Expert dropping via per-layer contribution analysis using teacher-forced importance mapping (fp32 accumulation). Script: `expert_drop.py` in the model repo.
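The repo's `expert_drop.py` is not reproduced here; as a rough illustration of the selection step only, once per-expert contribution scores have been accumulated for a layer, pruning reduces to keeping the top-k experts:

```python
# Minimal sketch of the selection step in contribution-based expert dropping
# (NOT the repo's expert_drop.py; score accumulation is assumed done already).

def keep_top_experts(contrib, k):
    """contrib: per-expert contribution scores for one layer (e.g. accumulated
    in fp32 over a calibration set). Returns the sorted indices of the k
    highest-scoring experts; all others would be dropped."""
    ranked = sorted(range(len(contrib)), key=lambda i: contrib[i], reverse=True)
    return sorted(ranked[:k])

# Toy example: 8 experts, keep 6 (the real model keeps 98 of 128 per layer).
scores = [0.9, 0.1, 0.7, 0.05, 0.8, 0.6, 0.3, 0.2]
print(keep_top_experts(scores, 6))  # → [0, 2, 4, 5, 6, 7]
```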
## License
Gemma license, inherited from the base model.
Base model: google/gemma-4-26B-A4B-it