# Gemma-4-26B-A4B-it-NVFP4

NVFP4 quantization of google/gemma-4-26b-a4b-it — the Mixture-of-Experts variant of Gemma 4 with 25.2B total parameters and ~3.8B active per token. Quantized using NVIDIA TensorRT Model Optimizer (modelopt) on DGX Spark hardware.

Supports text, image, and multimodal inference with a 256K token context window.

## Base Model vs. NVFP4 Comparison

|  | Original (BF16) | NVFP4 (this model) |
|---|---|---|
| Size on disk | ~49 GB | ~16 GB |
| Compression | 1x | 3.0x |
| Total parameters | 25.2B | 25.2B |
| Active parameters/token | ~3.8B | ~3.8B |
| Architecture | MoE: 128 experts, top-8 routing | Same |
| Context window | 262,144 tokens | 262,144 tokens |
| Modalities | Text + Image | Text + Image |
| Min GPU memory | ~50 GB+ | ~18 GB |
| DGX Spark (128 GB) | Runs, memory-constrained | Runs comfortably |

## Quantization Details

| Property | Value |
|---|---|
| Method | NVFP4 (4-bit floating-point weights) |
| Tool | NVIDIA modelopt |
| Group size | 16 |
| Format | SafeTensors (3 shards) |
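As a rough illustration of what group size 16 means here, the sketch below quantizes a single group of 16 weights to the FP4 (E2M1) value grid with one shared per-group scale. This is a simplification of NVFP4, which additionally stores the per-group scales in FP8 and operates on packed tensors:

```python
# Magnitudes representable by a 4-bit E2M1 float.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_group(weights):
    """Fake-quantize one group of 16 weights: pick a shared scale so the
    group's max magnitude maps to the largest FP4 value (6.0), then snap
    each weight to the nearest representable point."""
    assert len(weights) == 16, "group size is 16"
    scale = max(abs(w) for w in weights) / 6.0 or 1.0  # guard all-zero group
    def snap(w):
        mag = min(FP4_GRID, key=lambda g: abs(abs(w) / scale - g))
        return (mag if w >= 0 else -mag) * scale
    return [snap(w) for w in weights], scale

group = [i / 10 for i in range(-8, 8)]  # 16 example weights
dequantized, scale = quantize_group(group)
```

The quantization error of each weight is bounded by the scale, which is why small groups (16 rather than 128) retain accuracy better: one outlier only inflates the scale of its own group.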

## Quality-Preserving Exclusions

The following layers are kept at full bfloat16 precision to preserve output quality:

| Excluded Layer | Reason |
|---|---|
| MoE routers (all 30 layers) | Routing decisions are critical for MoE quality; quantizing routers degrades expert selection |
| Vision tower & embeddings | Image understanding degrades significantly under quantization |
| LM head | The final output layer directly affects token probability accuracy |

This is standard best practice for MoE quantization: compress the bulk of the transformer (attention + FFN layers) where FP4 has minimal quality impact, but keep sensitive routing, vision, and output layers at full precision. The tradeoff is a slightly larger model (~16 GB vs theoretical ~12 GB for full FP4) but significantly better quality retention.
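The exclusion logic can be pictured as a name-based filter over the model's linear layers. The patterns and layer names below are illustrative, not the actual modelopt configuration:

```python
import re

# Name patterns excluded from NVFP4 quantization, mirroring the table above.
# These patterns are for illustration; the real config uses modelopt's syntax.
EXCLUDE_PATTERNS = [r"router", r"vision", r"embed", r"lm_head"]

def should_quantize(layer_name: str) -> bool:
    """True if a linear layer with this name should be quantized to FP4."""
    return not any(re.search(p, layer_name) for p in EXCLUDE_PATTERNS)

layers = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.experts.3.up_proj",
    "model.layers.0.mlp.router",
    "vision_tower.patch_embed.proj",
    "lm_head",
]
quantized = [name for name in layers if should_quantize(name)]
# Only the attention and expert projections survive the filter.
```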

## MoE Quantization Challenge

Gemma 4 MoE stores expert weights as fused 3D tensors (shape [128, dim, dim]) rather than individual linear modules. NVIDIA modelopt only quantizes linear layers — the fused expert parameters (which are ~91% of the model) require a custom plugin to unfuse into individual linear layers before quantization.
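The unfusing step can be sketched as follows. Shapes and the naming scheme are illustrative (nested lists stand in for tensors to keep the sketch dependency-free); this is not the plugin's actual code:

```python
# Hypothetical fused expert weight of shape [num_experts, d_out, d_in].
num_experts, d_out, d_in = 4, 3, 2
fused = [[[e * 100 + r * 10 + c for c in range(d_in)]
          for r in range(d_out)] for e in range(num_experts)]

def unfuse(fused_weight):
    """Split a fused 3D expert tensor into one 2D weight per expert, so
    each slice can be wrapped in an ordinary linear module that a
    linear-only quantizer will then pick up."""
    return {f"experts.{i}.weight": w for i, w in enumerate(fused_weight)}

per_expert = unfuse(fused)  # 4 independent [d_out, d_in] weights
```

After quantization, the per-expert weights are re-fused (or served through a fused MoE kernel) so the routing path is unchanged at inference time.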

## Hardware

Quantized on, and served from, an NVIDIA DGX Spark:

- GB10 Grace Blackwell processor
- 128 GB unified CPU+GPU memory
- CUDA 13.1
- SM 12.1

Model loads at ~16 GB, leaving substantial headroom for long-context inference with KV cache.
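A back-of-envelope check of that headroom. Only the 30-layer count comes from this card; the KV-head and head-dim values below are assumptions for illustration, not published Gemma 4 specs:

```python
# FP8 KV-cache footprint per token: 2 tensors (K and V) per layer,
# each num_kv_heads * head_dim elements at 1 byte apiece.
num_layers = 30          # from the model card (router exclusion table)
num_kv_heads = 8         # assumed
head_dim = 128           # assumed
bytes_per_elem = 1       # fp8 KV cache (--kv-cache-dtype fp8)

kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
full_context_gib = kv_bytes_per_token * 262_144 / 1024**3

print(f"{kv_bytes_per_token} B/token -> {full_context_gib:.1f} GiB at 256K tokens")
```

Under these assumptions a full 256K-token context costs on the order of 15 GiB of KV cache, which fits comfortably alongside the ~16 GB of weights in 128 GB of unified memory.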

## Serving with vLLM

### Quick Start

```shell
vllm serve CyberFitz/gemma-4-26B-A4B-it-NVFP4 \
  --quantization modelopt \
  --max-model-len 262144 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85 \
  --moe-backend marlin \
  --trust-remote-code
```

### Key Flags

| Flag | Why |
|---|---|
| `--quantization modelopt` | NVFP4 checkpoint format |
| `--moe-backend marlin` | Marlin kernel for efficient MoE expert layers |
| `--kv-cache-dtype fp8` | Saves memory for longer contexts |
| `--trust-remote-code` | Required for the Gemma 4 architecture |

### Docker (DGX Spark)

```shell
docker run -d \
  --name gemma4-nvfp4 \
  --gpus all --ipc=host --network host \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -v /path/to/model:/model \
  <your-vllm-image> \
  vllm serve /model \
    --quantization modelopt \
    --max-model-len 262144 \
    --kv-cache-dtype fp8 \
    --moe-backend marlin \
    --trust-remote-code
```

### Testing

```shell
# Text generation
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-26B-A4B-it-NVFP4",
    "messages": [{"role": "user", "content": "Explain mixture of experts in one paragraph."}],
    "max_tokens": 200
  }'
```
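The same request can be issued from Python with only the standard library; the endpoint and model name mirror the curl example above:

```python
import json
import urllib.request

def chat_request(prompt: str, max_tokens: int = 200) -> urllib.request.Request:
    """Build a request for vLLM's OpenAI-compatible chat endpoint."""
    payload = {
        "model": "gemma-4-26B-A4B-it-NVFP4",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request("Explain mixture of experts in one paragraph.")
# resp = urllib.request.urlopen(req)  # requires the server to be running
```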

## Usage with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "CyberFitz/gemma-4-26B-A4B-it-NVFP4",
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained("CyberFitz/gemma-4-26B-A4B-it-NVFP4")

messages = [{"role": "user", "content": "Hello, how are you?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Benchmarks

Coming soon — GSM8K, IFEval, and throughput benchmarks comparing BF16 vs NVFP4 on DGX Spark.

## License

This model inherits the Gemma license from the base model.

## Credits

Quantized by CyberFitz on NVIDIA DGX Spark hardware.
