# Gemma-4-26B-A4B-it-NVFP4
NVFP4 quantization of google/gemma-4-26b-a4b-it — the Mixture-of-Experts variant of Gemma 4 with 25.2B total parameters and ~3.8B active per token. Quantized using NVIDIA TensorRT Model Optimizer (modelopt) on DGX Spark hardware.
Supports text and multimodal (text + image) inference with a 256K-token context window.
## Base Model vs. NVFP4 Comparison

| | Original (BF16) | NVFP4 (this model) |
|---|---|---|
| Size on disk | ~49 GB | ~16 GB |
| Compression | 1x | 3.0x |
| Total parameters | 25.2B | 25.2B |
| Active parameters/token | ~3.8B | ~3.8B |
| Architecture | MoE: 128 experts, top-8 routing | Same |
| Context window | 262,144 tokens | 262,144 tokens |
| Modalities | Text + Image | Text + Image |
| Min GPU Memory | ~50 GB+ | ~18 GB |
| DGX Spark (128 GB) | Runs, memory-constrained | Runs comfortably |
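The ~3x compression in the table can be sanity-checked with back-of-envelope arithmetic: 16 GB spread over 25.2B parameters works out to roughly 5.5 bits per weight, consistent with 4-bit weights plus per-group scaling metadata and the BF16 layers left unquantized (see the exclusions below).

```python
# Average bits per weight implied by the table above.
params = 25.2e9                # total parameters
size_bits = 16 * 2**30 * 8     # ~16 GB on disk, in bits
bits_per_weight = size_bits / params
print(f"{bits_per_weight:.1f} bits/weight")  # ~5.5
```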
## Quantization Details

| Setting | Value |
|---|---|
| Method | NVFP4 (4-bit floating-point weights) |
| Tool | NVIDIA modelopt |
| Group size | 16 |
| Format | SafeTensors (3 shards) |
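As an illustration of what group size 16 means, here is a minimal NumPy sketch of group-wise FP4 quantization. The value grid matches the FP4 E2M1 format's representable magnitudes; the scale handling is simplified (real NVFP4 stores FP8 per-group scales plus a tensor-level scale), so treat this as a conceptual sketch, not modelopt's implementation.

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 format
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4_groups(w, group_size=16):
    """Sketch of group-wise FP4 quantization: each group of 16 weights
    shares one scale chosen so the group's largest magnitude maps to 6.0."""
    flat = w.reshape(-1, group_size)
    scales = np.abs(flat).max(axis=1, keepdims=True) / FP4_LEVELS[-1]
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero groups
    scaled = flat / scales
    # Snap each scaled weight to the nearest representable FP4 magnitude
    idx = np.abs(np.abs(scaled)[..., None] - FP4_LEVELS).argmin(axis=-1)
    deq = np.sign(scaled) * FP4_LEVELS[idx] * scales
    return deq.reshape(w.shape)

w = np.random.randn(4, 32).astype(np.float32)
w_q = quantize_nvfp4_groups(w)
print("max abs error:", float(np.abs(w - w_q).max()))
```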
## Quality-Preserving Exclusions
The following layers are kept at full bfloat16 precision to preserve output quality:
| Excluded Layer | Reason |
|---|---|
| MoE routers (all 30 layers) | Routing decisions are critical for MoE quality — quantizing routers degrades expert selection |
| Vision tower & embeddings | Image understanding quality degrades significantly under quantization |
| LM head | Final output layer directly impacts token probability accuracy |
This is standard best practice for MoE quantization: compress the bulk of the transformer (attention + FFN layers) where FP4 has minimal quality impact, but keep sensitive routing, vision, and output layers at full precision. The tradeoff is a slightly larger model (~16 GB vs theoretical ~12 GB for full FP4) but significantly better quality retention.
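The exclusion list above boils down to name-pattern filtering. The sketch below is purely illustrative: the pattern strings and helper function are hypothetical, and the actual modelopt quantization config syntax differs.

```python
import fnmatch

# Hypothetical name patterns mirroring the exclusion table above;
# real layer names and the modelopt config format may differ.
EXCLUDE_PATTERNS = [
    "*router*",        # MoE routing gates (all layers)
    "*vision_tower*",  # image encoder
    "*embed*",         # embeddings
    "*lm_head*",       # final output projection
]

def should_quantize(layer_name: str) -> bool:
    """Return True if the layer should be quantized to FP4."""
    return not any(fnmatch.fnmatch(layer_name, p) for p in EXCLUDE_PATTERNS)

print(should_quantize("model.layers.3.mlp.experts.down_proj"))  # True
print(should_quantize("model.layers.3.mlp.router"))             # False
```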
## MoE Quantization Challenge
Gemma 4 MoE stores expert weights as fused 3D tensors (shape [128, dim, dim]) rather than individual linear modules. NVIDIA modelopt only quantizes linear layers — the fused expert parameters (which are ~91% of the model) require a custom plugin to unfuse into individual linear layers before quantization.
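A minimal PyTorch sketch of the unfusing step (shapes are scaled down and names are illustrative; the real plugin must also re-fuse the quantized weights afterward and leave routing untouched):

```python
import torch
from torch import nn

def unfuse_experts(fused: torch.Tensor) -> nn.ModuleList:
    """Split a fused [num_experts, out_dim, in_dim] expert weight tensor
    into individual nn.Linear modules a linear-only quantizer can see."""
    num_experts, out_dim, in_dim = fused.shape
    experts = nn.ModuleList()
    for e in range(num_experts):
        linear = nn.Linear(in_dim, out_dim, bias=False)
        with torch.no_grad():
            linear.weight.copy_(fused[e])
        experts.append(linear)
    return experts

# Small stand-in for the real [128, dim, dim] expert tensors
fused = torch.randn(8, 64, 32)
experts = unfuse_experts(fused)
x = torch.randn(32)
# Each unfused Linear reproduces its slice of the fused matmul
assert torch.allclose(experts[3](x), fused[3] @ x, atol=1e-5)
```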
## Hardware
Quantized and serving on NVIDIA DGX Spark:
- GB10 Grace Blackwell processor
- 128 GB unified CPU+GPU memory
- CUDA 13.1
- SM 12.1
Model loads at ~16 GB, leaving substantial headroom for long-context inference with KV cache.
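To see why that headroom matters at 256K context, here is a rough KV-cache sizing estimate. The 30-layer count comes from the card above; `kv_heads=8` and `head_dim=128` are illustrative assumptions, not confirmed Gemma 4 specs.

```python
# Back-of-envelope KV-cache sizing; head counts are assumptions.
def kv_cache_gib(seq_len, layers=30, kv_heads=8, head_dim=128, bytes_per_elem=1):
    """Estimate KV-cache size in GiB (bytes_per_elem=1 for fp8)."""
    total = 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len  # 2x: keys + values
    return total / 2**30

print(f"{kv_cache_gib(262_144):.1f} GiB at fp8 for a full 256K context")  # 15.0 GiB
```

Under these assumptions, a single full-length context fits comfortably in the ~110 GB left after the 16 GB model load.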
## Serving with vLLM

### Quick Start

```shell
vllm serve CyberFitz/gemma-4-26B-A4B-it-NVFP4 \
  --quantization modelopt \
  --max-model-len 262144 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85 \
  --moe-backend marlin \
  --trust-remote-code
```
### Key Flags

| Flag | Why |
|---|---|
| `--quantization modelopt` | NVFP4 checkpoint format |
| `--moe-backend marlin` | Marlin kernel for efficient MoE expert layers |
| `--kv-cache-dtype fp8` | Saves memory for longer contexts |
| `--trust-remote-code` | Required for the Gemma 4 architecture |
### Docker (DGX Spark)

```shell
docker run -d \
  --name gemma4-nvfp4 \
  --gpus all --ipc=host --network host \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -v /path/to/model:/model \
  <your-vllm-image> \
  vllm serve /model \
    --quantization modelopt \
    --max-model-len 262144 \
    --kv-cache-dtype fp8 \
    --moe-backend marlin \
    --trust-remote-code
```
## Testing

```shell
# Text generation
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-26B-A4B-it-NVFP4",
    "messages": [{"role": "user", "content": "Explain mixture of experts in one paragraph."}],
    "max_tokens": 200
  }'
```
## Usage with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "CyberFitz/gemma-4-26B-A4B-it-NVFP4",
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained("CyberFitz/gemma-4-26B-A4B-it-NVFP4")

messages = [{"role": "user", "content": "Hello, how are you?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Benchmarks
Coming soon — GSM8K, IFEval, and throughput benchmarks comparing BF16 vs NVFP4 on DGX Spark.
## License
This model inherits the Gemma license from the base model.
## Credits
Quantized by CyberFitz on NVIDIA DGX Spark hardware.