# Gemma-4-26B-A4B-it-NVFP4

NVFP4 quantization of google/gemma-4-26b-a4b-it — the Mixture-of-Experts variant of Gemma 4 with 25.2B total parameters and ~3.8B active per token. Quantized using NVIDIA TensorRT Model Optimizer (modelopt) on DGX Spark hardware.

Supports text, image, and multimodal inference with a 256K token context window.

## Base Model vs. NVFP4 Comparison

|  | Original (BF16) | NVFP4 (this model) |
|---|---|---|
| Size on disk | ~49 GB | ~16 GB |
| Compression | 1x | 3.0x |
| Total parameters | 25.2B | 25.2B |
| Active parameters/token | ~3.8B | ~3.8B |
| Architecture | MoE: 128 experts, top-8 routing | Same |
| Context window | 262,144 tokens | 262,144 tokens |
| Modalities | Text + Image | Text + Image |
| Min GPU memory | ~50 GB+ | ~18 GB |
| DGX Spark (128 GB) | Runs, memory-constrained | Runs comfortably |

## Quantization Details

| Property | Value |
|---|---|
| Method | NVFP4 (4-bit floating-point weights) |
| Tool | NVIDIA modelopt |
| Group size | 16 |
| Format | SafeTensors (3 shards) |
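As a rough illustration of what group size 16 means here, the sketch below quantizes a single group of 16 weights to the FP4 (E2M1) value grid with one shared per-group scale. This is a simplification of NVFP4, which additionally stores the per-group scales in FP8 and operates on packed tensors:

```python
# Magnitudes representable by a 4-bit E2M1 float.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_group(weights):
    """Fake-quantize one group of 16 weights: pick a shared scale so the
    group's max magnitude maps to the largest FP4 value (6.0), then snap
    each weight to the nearest representable point."""
    assert len(weights) == 16, "group size is 16"
    scale = max(abs(w) for w in weights) / 6.0 or 1.0  # guard all-zero group
    def snap(w):
        mag = min(FP4_GRID, key=lambda g: abs(abs(w) / scale - g))
        return (mag if w >= 0 else -mag) * scale
    return [snap(w) for w in weights], scale

group = [i / 10 for i in range(-8, 8)]  # 16 example weights
dequantized, scale = quantize_group(group)
```

The quantization error of each weight is bounded by the scale, which is why small groups (16 rather than 128) retain accuracy better: one outlier only inflates the scale of its own group.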

## Quality-Preserving Exclusions

The following layers are kept at full bfloat16 precision to preserve output quality:

| Excluded Layer | Reason |
|---|---|
| MoE routers (all 30 layers) | Routing decisions are critical for MoE quality; quantizing routers degrades expert selection |
| Vision tower & embeddings | Image understanding degrades significantly under quantization |
| LM head | The final output layer directly affects token probability accuracy |

This is standard best practice for MoE quantization: compress the bulk of the transformer (attention + FFN layers) where FP4 has minimal quality impact, but keep sensitive routing, vision, and output layers at full precision. The tradeoff is a slightly larger model (~16 GB vs theoretical ~12 GB for full FP4) but significantly better quality retention.
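The exclusion logic can be pictured as a name-based filter over the model's linear layers. The patterns and layer names below are illustrative, not the actual modelopt configuration:

```python
import re

# Name patterns excluded from NVFP4 quantization, mirroring the table above.
# These patterns are for illustration; the real config uses modelopt's syntax.
EXCLUDE_PATTERNS = [r"router", r"vision", r"embed", r"lm_head"]

def should_quantize(layer_name: str) -> bool:
    """True if a linear layer with this name should be quantized to FP4."""
    return not any(re.search(p, layer_name) for p in EXCLUDE_PATTERNS)

layers = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.experts.3.up_proj",
    "model.layers.0.mlp.router",
    "vision_tower.patch_embed.proj",
    "lm_head",
]
quantized = [name for name in layers if should_quantize(name)]
# Only the attention and expert projections survive the filter.
```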

## MoE Quantization Challenge

Gemma 4 MoE stores expert weights as fused 3D tensors (shape [128, dim, dim]) rather than individual linear modules. NVIDIA modelopt only quantizes linear layers — the fused expert parameters (which are ~91% of the model) require a custom plugin to unfuse into individual linear layers before quantization.
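The unfusing step can be sketched as follows. Shapes and the naming scheme are illustrative (nested lists stand in for tensors to keep the sketch dependency-free); this is not the plugin's actual code:

```python
# Hypothetical fused expert weight of shape [num_experts, d_out, d_in].
num_experts, d_out, d_in = 4, 3, 2
fused = [[[e * 100 + r * 10 + c for c in range(d_in)]
          for r in range(d_out)] for e in range(num_experts)]

def unfuse(fused_weight):
    """Split a fused 3D expert tensor into one 2D weight per expert, so
    each slice can be wrapped in an ordinary linear module that a
    linear-only quantizer will then pick up."""
    return {f"experts.{i}.weight": w for i, w in enumerate(fused_weight)}

per_expert = unfuse(fused)  # 4 independent [d_out, d_in] weights
```

After quantization, the per-expert weights are re-fused (or served through a fused MoE kernel) so the routing path is unchanged at inference time.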

## Hardware

Quantized on, and served from, an NVIDIA DGX Spark:

- GB10 Grace Blackwell processor
- 128 GB unified CPU+GPU memory
- CUDA 13.1
- SM 12.1

Model loads at ~16 GB, leaving substantial headroom for long-context inference with KV cache.
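A back-of-envelope check of that headroom. Only the 30-layer count comes from this card; the KV-head and head-dim values below are assumptions for illustration, not published Gemma 4 specs:

```python
# FP8 KV-cache footprint per token: 2 tensors (K and V) per layer,
# each num_kv_heads * head_dim elements at 1 byte apiece.
num_layers = 30          # from the model card (router exclusion table)
num_kv_heads = 8         # assumed
head_dim = 128           # assumed
bytes_per_elem = 1       # fp8 KV cache (--kv-cache-dtype fp8)

kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
full_context_gib = kv_bytes_per_token * 262_144 / 1024**3

print(f"{kv_bytes_per_token} B/token -> {full_context_gib:.1f} GiB at 256K tokens")
```

Under these assumptions a full 256K-token context costs on the order of 15 GiB of KV cache, which fits comfortably alongside the ~16 GB of weights in 128 GB of unified memory.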

## Serving with vLLM

### Quick Start

```shell
vllm serve CyberFitz/gemma-4-26B-A4B-it-NVFP4 \
  --quantization modelopt \
  --max-model-len 262144 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85 \
  --moe-backend marlin \
  --trust-remote-code
```

### Key Flags

| Flag | Why |
|---|---|
| `--quantization modelopt` | NVFP4 checkpoint format |
| `--moe-backend marlin` | Marlin kernel for efficient MoE expert layers |
| `--kv-cache-dtype fp8` | Saves memory for longer contexts |
| `--trust-remote-code` | Required for the Gemma 4 architecture |

### Docker (DGX Spark)

```shell
docker run -d \
  --name gemma4-nvfp4 \
  --gpus all --ipc=host --network host \
  -e VLLM_NVFP4_GEMM_BACKEND=marlin \
  -v /path/to/model:/model \
  <your-vllm-image> \
  vllm serve /model \
    --quantization modelopt \
    --max-model-len 262144 \
    --kv-cache-dtype fp8 \
    --moe-backend marlin \
    --trust-remote-code
```

### Testing

```shell
# Text generation
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-26B-A4B-it-NVFP4",
    "messages": [{"role": "user", "content": "Explain mixture of experts in one paragraph."}],
    "max_tokens": 200
  }'
```
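The same request can be issued from Python with only the standard library; the endpoint and model name mirror the curl example above:

```python
import json
import urllib.request

def chat_request(prompt: str, max_tokens: int = 200) -> urllib.request.Request:
    """Build a request for vLLM's OpenAI-compatible chat endpoint."""
    payload = {
        "model": "gemma-4-26B-A4B-it-NVFP4",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request("Explain mixture of experts in one paragraph.")
# resp = urllib.request.urlopen(req)  # requires the server to be running
```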

## Usage with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "CyberFitz/gemma-4-26B-A4B-it-NVFP4",
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained("CyberFitz/gemma-4-26B-A4B-it-NVFP4")

messages = [{"role": "user", "content": "Hello, how are you?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Benchmarks

Coming soon — GSM8K, IFEval, and throughput benchmarks comparing BF16 vs NVFP4 on DGX Spark.

## License

This model inherits the Gemma license from the base model.

## Credits

Quantized by CyberFitz on NVIDIA DGX Spark hardware.
