⚡ Gemma 4 31B IT NVFP4 Turbo

A repackaged nvidia/Gemma-4-31B-IT-NVFP4 that uses 68% less GPU memory and runs ~2.5× faster than the base model, while retaining nearly identical quality (1-3% loss). Fits on a single RTX 5090 (🎉).

It fully leverages NVIDIA Blackwell FP4 tensor cores (RTX 5090, RTX PRO 6000, B200, and other SM 12.0+ GPUs with ≥20 GB VRAM) for ~2× higher concurrent throughput than other quants like prithivMLmods/gemma-4-31B-it-NVFP4 or cyankiwi/gemma-4-31B-it-AWQ-4bit.

This variant is text-only: the vision and audio encoders and their weights have been stripped. If you need vision/audio support, open an issue or PR.

Benchmark

Benchmark chart

RTX PRO 6000, `vllm bench` @ 1K input / 200 output tokens. See `bench.sh`.

Note: We also ran the ⚡ Turbo benchmarks on an RTX 5090 and got identical numbers: at 16K context, performance is not limited by GPU memory.

| | Base model | NVIDIA quant | ⚡ Turbo (this model) |
|---|---|---|---|
| GPU memory | 58.9 GiB | 31 GiB | 18.5 GiB (-68% base, -40% nvidia) |
| GPQA Diamond | 75.71% | 75.46% | 72.73% (-2.98% base, -2.73% nvidia) |
| MMLU Pro | 85.25% | 84.94% | 83.93% (-1.32% base, -1.01% nvidia) |
| Prefill | 6352 tok/s | 11069 tok/s | 15359 tok/s (+142% base, +39% nvidia) |
| Decode (single) | 24.1 tok/s | 39.2 tok/s | 51 tok/s (+112% base, +30% nvidia) |
| Decode (batched) | 494 tok/s | 913 tok/s | 1244 tok/s (+152% base, +36% nvidia) |
| Concurrency | 2.47 req/s | 4.56 req/s | 6.22 req/s (+152% base, +36% nvidia) |
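The memory figure follows directly from the NVFP4 layout described above: 4 bits per weight plus one 1-byte FP8 scale per 16-element group. A rough back-of-envelope sketch (a hypothetical helper, not part of this release; it ignores NVFP4's second-level per-tensor scales, the BF16 embeddings, and runtime buffers, which account for the gap up to 18.5 GiB):

```python
def nvfp4_weight_bytes(n_params: float, group_size: int = 16) -> float:
    """Rough NVFP4 weight footprint: 4-bit values + one FP8 scale per group."""
    weight_bytes = n_params * 4 / 8       # 0.5 bytes per parameter
    scale_bytes = n_params / group_size   # 1 byte per 16-parameter group
    return weight_bytes + scale_bytes

# ~31B quantized parameters -> roughly 16 GiB of raw weight storage.
print(nvfp4_weight_bytes(31e9) / 2**30)
```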

Other quants of similar size use kernel paths (compressed-tensors, Marlin) that don't leverage Blackwell's FP4 tensor cores, resulting in significantly lower prefill and concurrent throughput:

| | prithivMLmods NVFP4 | cyankiwi AWQ | ⚡ Turbo (this model) |
|---|---|---|---|
| GPU memory | 19.6 GiB | 19.6 GiB | 18.5 GiB |
| Prefill | 6647 tok/s | 6626 tok/s | 15359 tok/s |
| Decode (single) | 64.3 tok/s | 64.4 tok/s | 51 tok/s |
| Decode (batched) | 757 tok/s | 757 tok/s | 1244 tok/s |
| Concurrency | 3.79 req/s | 3.78 req/s | 6.22 req/s |

Usage

Requirements:

  • A Blackwell GPU (see Compatibility)
  • transformers >= 5.5.0
  • vllm >= 0.19 with CUDA 13.0

Note: `pip install vllm` installs a CUDA 12 build, which doesn't support Blackwell FP4 tensor cores. Use one of the methods below.

Docker (recommended)

We recommend using the vllm/vllm-openai:cu130-nightly Docker image, which ships with CUDA 13.0 and Blackwell support out of the box.

```bash
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:cu130-nightly \
  --model LilaRest/gemma-4-31B-it-NVFP4-turbo \
  --quantization modelopt \
  --max-model-len 16384 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --trust-remote-code
```

If you get `model type gemma4 not recognized`, run `pip install "transformers>=5.5.0"` inside the container.

pip (CUDA 13.0 wheel)

```bash
pip install https://github.com/vllm-project/vllm/releases/download/v0.19.0/vllm-0.19.0+cu130-cp38-abi3-manylinux_2_35_x86_64.whl
pip install "transformers>=5.5.0"
```

```bash
vllm serve LilaRest/gemma-4-31B-it-NVFP4-turbo \
  --quantization modelopt \
  --max-model-len 16384 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --trust-remote-code
```

Key flags

  • --quantization modelopt — required, activates NVIDIA's optimized CUTLASS kernels
  • --kv-cache-dtype fp8 — halves KV cache memory on Blackwell
  • --max-model-len 16384 — maximum context length per request. See Compatibility for max value per GPU.
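Once the server is up, any OpenAI-compatible client can talk to it. A minimal stdlib-only sketch (the helper names are illustrative, not part of vLLM; the model name and port match the serve commands above):

```python
import json
import urllib.request

def build_chat_request(prompt: str, max_tokens: int = 200) -> dict:
    """Payload for vLLM's OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": "LilaRest/gemma-4-31B-it-NVFP4-turbo",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # cap output length per request
        "temperature": 0.7,
    }

def send(payload: dict, url: str = "http://localhost:8000/v1/chat/completions") -> dict:
    """POST the payload to a running vLLM server and return the parsed reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

For high-throughput classification workloads, lower `max_tokens` in each request rather than relying on server defaults.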

Tuning

The above benchmarks use a generic workload (1K input / 200 output tokens). You can tune vLLM flags for your specific use case:

  • High-throughput classification / short output — Reduce --max-model-len and limit output tokens (max_tokens in the API request). Less KV cache pressure means more concurrent requests. Expect 14+ req/s on RTX 5090 for classification workloads (~1K input, ~10 output tokens).
  • Long context — Increase --max-model-len (up to ~25K on RTX 5090, ~180K on PRO 6000). Trade concurrent capacity for longer sequences.
  • Latency-sensitive — Keep concurrency low. Single-request decode is ~51 tok/s with TTFT under 70ms — fast enough for interactive use.
  • Batch processing — Push --max-num-seqs higher and use --request-rate inf with --max-concurrency to saturate the GPU. Peak throughput is ~6.2 req/s on RTX PRO 6000 at 1K/200 workload.
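The trade-offs above all come down to KV cache budget. A sketch of the standard per-token KV cache formula (the layer count, KV head count, and head dim below are hypothetical placeholders, since this card does not publish the model config; `bytes_per_elem=1` corresponds to the `--kv-cache-dtype fp8` flag):

```python
def kv_cache_bytes(tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 1) -> int:
    """KV cache size: one K and one V vector per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * tokens

# Hypothetical config for illustration: 48 layers, 8 KV heads, head_dim 128.
per_token = kv_cache_bytes(1, 48, 8, 128)      # bytes per token at fp8
budget = 8 * 2**30                             # e.g. ~8 GiB left after weights
max_total_tokens = budget // per_token         # shared across concurrent seqs
```

Note that `max_total_tokens` is the pool shared by all in-flight sequences, which is why shrinking `--max-model-len` or output length directly raises achievable concurrency.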

Compatibility

Blackwell (SM 12.0+) — full FP4 tensor core support:

| GPU | VRAM | Works? | Max context | Notes |
|---|---|---|---|---|
| RTX 5090 | 32 GB | ✅ | ~25K | Primary target |
| RTX PRO 6000 | 96 GB | ✅ | ~180K | Ideal for high-concurrency or long-context workloads |
| B200 | 192 GB | ✅ | 262K (full) | Datacenter, untested |
| B100 | 192 GB | ✅ | 262K (full) | Datacenter, untested |
| RTX 5080 and lower | ≤16 GB | ❌ | n/a | Not enough VRAM |

Older GPUs (H100, A100, RTX 4090, etc.) may load the model without `--quantization modelopt`, but they lack FP4 tensor cores, so you lose the optimized kernel path and performance will be significantly worse.

Approach

Three changes were made:

  1. Quantized all self-attention weights from BF16 → FP4 (RTN, group_size=16, matching modelopt NVFP4 format)
  2. Updated architecture to Gemma4ForCausalLM and quantization config accordingly
  3. Stripped the vision and audio encoders

Everything else is untouched: MLP layers keep NVIDIA's calibrated FP4, embed_tokens stays BF16, and all norms are preserved, so all of the nvidia/Gemma-4-31B-IT-NVFP4 optimizations are retained.

Why RTN didn't hurt quality

RTN (Round-To-Nearest) is the simplest quantization method — no calibration data, fully reproducible. It worked here because:

  • FP4 with group_size=16 and per-group scaling preserves relative weight distributions well
  • Self-attention weights tend to be normally distributed near zero, where the FP4 grid has finest resolution (0, 0.5, 1.0, 1.5)
  • MLP layers (more sensitive to quantization) keep NVIDIA's calibrated FP4
  • embed_tokens stays BF16, preventing noise from propagating through all layers
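The RTN step above can be sketched in a few lines. This is an illustrative NumPy re-implementation under stated assumptions (per-group scale chosen so the group's largest magnitude lands on the grid maximum of 6.0; the actual packed NVFP4 format also stores FP8 scales and a per-tensor scale, omitted here):

```python
import numpy as np

# Positive magnitudes representable in FP4 E2M1; finest spacing is near zero.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def rtn_fp4(weights: np.ndarray, group_size: int = 16) -> np.ndarray:
    """Round-to-nearest FP4 with one scale per group; no calibration data."""
    w = weights.reshape(-1, group_size)
    # Map each group's largest magnitude onto the grid maximum (6.0).
    scale = np.abs(w).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale[scale == 0] = 1.0                      # all-zero groups stay zero
    normed = w / scale
    # Snap each magnitude to the nearest grid point, keeping the sign.
    idx = np.abs(np.abs(normed)[..., None] - FP4_GRID).argmin(axis=-1)
    deq = np.sign(normed) * FP4_GRID[idx] * scale
    return deq.reshape(weights.shape)
```

Because attention weights cluster near zero, most values land in the densely-spaced part of the grid, which is why the quality loss stays small without calibration.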

License

Apache 2.0 — same as the base model.
