# ⚡ Gemma 4 31B IT NVFP4 Turbo
A repackaged nvidia/Gemma-4-31B-IT-NVFP4 that is 68% smaller in GPU memory and ~2.5× faster than the base model, while retaining nearly identical quality (1-3% loss). Fits on a single RTX 5090 (🎉).
It fully leverages NVIDIA Blackwell FP4 tensor cores (RTX 5090, RTX PRO 6000, B200, and other SM 12.0+ GPUs with ≥20 GB VRAM) for ~2× higher concurrent throughput than other quants like prithivMLmods/gemma-4-31B-it-NVFP4 or cyankiwi/gemma-4-31B-it-AWQ-4bit.
This variant is text-only; the video/audio weights and encoders have been stripped. If you need video/audio support, open an issue or PR.
## Benchmark
Benchmarked on an RTX PRO 6000 with `vllm bench` at 1K input / 200 output tokens; see bench.sh.

Note: we also ran the ⚡ Turbo benchmark on an RTX 5090, and it performed identically, because at 16K context performance is not limited by GPU memory.
| | Base model | NVIDIA quant | ⚡ Turbo (this model) |
|---|---|---|---|
| GPU memory | 58.9 GiB | 31 GiB | 18.5 GiB (-68% base, -40% nvidia) |
| GPQA Diamond | 75.71% | 75.46% | 72.73% (-2.98% base, -2.73% nvidia) |
| MMLU Pro | 85.25% | 84.94% | 83.93% (-1.32% base, -1.01% nvidia) |
| Prefill | 6352 tok/s | 11069 tok/s | 15359 tok/s (+142% base, +39% nvidia) |
| Decode (single) | 24.1 tok/s | 39.2 tok/s | 51 tok/s (+112% base, +30% nvidia) |
| Decode (batched) | 494 tok/s | 913 tok/s | 1244 tok/s (+152% base, +36% nvidia) |
| Concurrency | 2.47 req/s | 4.56 req/s | 6.22 req/s (+152% base, +36% nvidia) |
Other quants of similar size use kernel paths (compressed-tensors, Marlin) that don't leverage Blackwell's FP4 tensor cores, resulting in significantly lower prefill and concurrent throughput:
| | prithivMLmods NVFP4 | cyankiwi AWQ | ⚡ Turbo (this model) |
|---|---|---|---|
| GPU memory | 19.6 GiB | 19.6 GiB | 18.5 GiB |
| Prefill | 6647 tok/s | 6626 tok/s | 15359 tok/s |
| Decode (single) | 64.3 tok/s | 64.4 tok/s | 51 tok/s |
| Decode (batched) | 757 tok/s | 757 tok/s | 1244 tok/s |
| Concurrency | 3.79 req/s | 3.78 req/s | 6.22 req/s |
## Usage

Requirements:

- A Blackwell GPU (see Compatibility)
- `transformers >= 5.5.0`
- `vllm >= 0.19` with CUDA 13.0

Note: `pip install vllm` installs the CUDA 12 build, which doesn't support Blackwell FP4 tensor cores. Use one of the methods below.
### Docker (recommended)
We recommend using the vllm/vllm-openai:cu130-nightly Docker image, which ships with CUDA 13.0 and Blackwell support out of the box.
```shell
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:cu130-nightly \
  --model LilaRest/gemma-4-31B-it-NVFP4-turbo \
  --quantization modelopt \
  --max-model-len 16384 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --trust-remote-code
```
If you get `model type gemma4 not recognized`, run `pip install "transformers>=5.5.0"` inside the container.
### pip (CUDA 13.0 wheel)

```shell
pip install https://github.com/vllm-project/vllm/releases/download/v0.19.0/vllm-0.19.0+cu130-cp38-abi3-manylinux_2_35_x86_64.whl
pip install "transformers>=5.5.0"
```
```shell
vllm serve LilaRest/gemma-4-31B-it-NVFP4-turbo \
  --quantization modelopt \
  --max-model-len 16384 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --trust-remote-code
```
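Once the server is up (either method), it exposes the standard OpenAI-compatible API on port 8000. A minimal sketch of a chat request using only the Python standard library (the prompt and endpoint below are illustrative):

```python
import json

# Build an OpenAI-compatible chat request for the vLLM server above.
# The payload shape follows the standard /v1/chat/completions API.
def build_chat_request(prompt: str, max_tokens: int = 200) -> dict:
    return {
        "model": "LilaRest/gemma-4-31B-it-NVFP4-turbo",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # cap output length to limit KV cache pressure
    }

payload = build_chat_request("Summarize NVFP4 in one sentence.")
body = json.dumps(payload).encode()

# To actually send it (requires the server to be running locally):
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:8000/v1/chat/completions", data=body,
#       headers={"Content-Type": "application/json"})
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```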
### Key flags

- `--quantization modelopt` — required; activates NVIDIA's optimized CUTLASS kernels
- `--kv-cache-dtype fp8` — halves KV cache memory on Blackwell
- `--max-model-len 16384` — maximum context length per request; see Compatibility for the max value per GPU
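To see why the fp8 KV cache matters, here is a back-of-envelope sizing sketch. The layer/head counts below are placeholder values, not the real model config (check the checkpoint's config.json for those):

```python
# Rough KV cache sizing: 2 tensors (K and V) per layer, per token.
# layers/kv_heads/head_dim are hypothetical example values.
def kv_cache_bytes(tokens, layers=48, kv_heads=8, head_dim=128, bytes_per_el=1):
    return 2 * layers * kv_heads * head_dim * bytes_per_el * tokens

fp8  = kv_cache_bytes(16384, bytes_per_el=1)  # --kv-cache-dtype fp8
bf16 = kv_cache_bytes(16384, bytes_per_el=2)  # default 16-bit cache
print(f"fp8: {fp8 / 2**30:.1f} GiB, bf16: {bf16 / 2**30:.1f} GiB")
# → fp8: 1.5 GiB, bf16: 3.0 GiB
```

Whatever the real per-layer numbers are, the fp8 cache is exactly half the 16-bit one, which is what frees room for more concurrent sequences.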
## Tuning

The benchmarks above use a generic workload (1K input / 200 output tokens). You can tune vLLM flags for your specific use case:

- **High-throughput classification / short output** — reduce `--max-model-len` and limit output tokens (`max_tokens` in the API request). Less KV cache pressure means more concurrent requests. Expect 14+ req/s on an RTX 5090 for classification workloads (~1K input, ~10 output tokens).
- **Long context** — increase `--max-model-len` (up to ~25K on an RTX 5090, ~180K on a PRO 6000), trading concurrent capacity for longer sequences.
- **Latency-sensitive** — keep concurrency low. Single-request decode is ~51 tok/s with TTFT under 70 ms, fast enough for interactive use.
- **Batch processing** — push `--max-num-seqs` higher and use `--request-rate inf` with `--max-concurrency` to saturate the GPU. Peak throughput is ~6.2 req/s on an RTX PRO 6000 at the 1K/200 workload.
## Compatibility
Blackwell (SM 12.0+) — full FP4 tensor core support:
| GPU | VRAM | Works? | Max context | Notes |
|---|---|---|---|---|
| RTX 5090 | 32 GB | ✅ | ~25K | Primary target |
| RTX PRO 6000 | 96 GB | ✅ | ~180K | Ideal for high-concurrency or long-context workloads. |
| B200 | 192 GB | ✅ | 262k (full) | Datacenter, untested |
| B100 | 192 GB | ✅ | 262k (full) | Datacenter, untested |
| RTX 5080 and lower | ≤16 GB | ❌ | — | Not enough VRAM |
Older GPUs (H100, A100, RTX 4090, etc.) may work without --quantization modelopt but they lack FP4 tensor cores, so you'll lose the optimized kernel path and performance will be significantly worse.
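A quick way to check where your GPU falls is its CUDA compute capability. A small helper, sketched here as a standalone function you can feed PyTorch's capability tuple:

```python
# Blackwell FP4 tensor cores require SM 12.0 or newer.
def supports_fp4_tensor_cores(capability):
    """capability: (major, minor) tuple, e.g. from torch.cuda.get_device_capability()."""
    return tuple(capability) >= (12, 0)

# With PyTorch installed, check the local GPU like this:
#   import torch
#   print(supports_fp4_tensor_cores(torch.cuda.get_device_capability()))

print(supports_fp4_tensor_cores((12, 0)))  # Blackwell (RTX 5090) → True
print(supports_fp4_tensor_cores((9, 0)))   # Hopper (H100) → False
```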
## Approach

Three changes were made:

- Quantized all self-attention weights from BF16 → FP4 (RTN, group_size=16, matching the modelopt NVFP4 format)
- Updated the architecture to `Gemma4ForCausalLM` and the quantization config accordingly
- Stripped the vision and audio encoders
Everything else is untouched — MLP layers keep NVIDIA's calibrated FP4, embed_tokens stays BF16, all norms preserved, so we retain all the nvidia/Gemma-4-31B-IT-NVFP4 optimizations.
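A minimal sketch of what the stripping step looks like on a checkpoint's state dict. The tensor-name prefixes below are assumptions for illustration; the real checkpoint's key names may differ:

```python
# Hypothetical multimodal tensor prefixes -- verify against the
# actual checkpoint's tensor names before using.
MULTIMODAL_PREFIXES = ("vision_tower.", "audio_tower.", "multi_modal_projector.")

def strip_multimodal(state_dict):
    """Drop vision/audio tensors, keeping the language model intact."""
    return {k: v for k, v in state_dict.items()
            if not k.startswith(MULTIMODAL_PREFIXES)}

ckpt = {
    "language_model.model.layers.0.self_attn.q_proj.weight": "...",
    "language_model.model.embed_tokens.weight": "...",
    "vision_tower.patch_embed.weight": "...",
    "audio_tower.conv1.weight": "...",
}
kept = strip_multimodal(ckpt)
print(sorted(kept))  # only the language_model.* keys remain
```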
### Why RTN didn't hurt quality

RTN (Round-To-Nearest) is the simplest quantization method — no calibration data, fully reproducible. It worked here because:

- FP4 with group_size=16 and per-group scaling preserves relative weight distributions well
- Self-attention weights tend to be normally distributed near zero, where the FP4 grid has its finest resolution (0, 0.5, 1.0, 1.5)
- MLP layers (more sensitive to quantization) keep NVIDIA's calibrated FP4
- `embed_tokens` stays BF16, preventing quantization noise from propagating through all layers
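As an illustration of the RTN step, here is a numpy sketch of round-to-nearest onto the FP4 (E2M1) grid with per-group absmax scaling. This is a toy reconstruction, not the actual modelopt implementation:

```python
import numpy as np

# Representable positive magnitudes of FP4 E2M1 (plus sign bit)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def rtn_nvfp4(w, group_size=16):
    """RTN FP4 with per-group absmax scaling; returns dequantized weights."""
    orig_shape = w.shape
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / FP4_GRID[-1]  # absmax → 6.0
    scale = np.where(scale == 0, 1.0, scale)                     # avoid div by zero
    mag = np.abs(g) / scale
    # round each magnitude to the nearest representable FP4 value
    idx = np.abs(mag[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(g) * FP4_GRID[idx]
    return (q * scale).reshape(orig_shape)

# Normally distributed weights near zero, as in self-attention layers
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4, 64)).astype(np.float32)
wq = rtn_nvfp4(w)
rel_err = np.abs(wq - w).mean() / np.abs(w).mean()
```

With near-zero-centered weights, most values land in the fine-grained lower half of the grid, which is why the mean relative error stays small without any calibration.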
## License
Apache 2.0 — same as the base model.
## Credits
- Google DeepMind for Gemma 4
- NVIDIA for the modelopt NVFP4 checkpoint