This model is so good, but the KV cache ruins it.

#100
by UniversalLove333 - opened

The KV cache is excessively big, the complete opposite of Qwen 3.6.
It's a massive bump over Gemma 2, and an exponential bump over Gemma 3.
Gemma 3 was way too strict and overconfident in its word choice; every regeneration came out the same.

Hi @UniversalLove333
Thank you for sharing the feedback. If you are using vLLM, you can try these two concrete workarounds.

  1. Cap --max-model-len in vLLM
    Gemma 4 31B has a 256K context window, but vLLM pre-allocates KV cache for the full length by default. If you don't need 256K, cap it like this:
vllm serve google/gemma-4-31B-it \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

This is the recommended configuration from the official vllm-project/recipes.

  2. Another workaround is to quantize the KV cache to FP8, which reduces its memory footprint. This increases the number of tokens that can be stored in the cache, improving throughput. Please take a look at these resources for more info:
    https://docs.vllm.ai/en/v0.14.0/features/quantization/quantized_kvcache/
    https://vllm.ai/blog/fp8-kvcache
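
To get a rough feel for how much each workaround saves, here is a back-of-the-envelope sketch. It assumes a plain attention stack where every layer caches keys and values for every token, so it will overestimate for models that use sliding-window layers. The function name `kv_cache_mib` and the layer/head numbers are placeholders for illustration, not Gemma 4's actual configuration; take the real values from the model's config.json.

```shell
#!/usr/bin/env bash
# Rough per-sequence KV cache size in MiB:
#   2 (K and V) * layers * kv_heads * head_dim * bytes_per_element * tokens
# Assumes every layer caches every token (no sliding-window layers).
kv_cache_mib() {
  local layers=$1 kv_heads=$2 head_dim=$3 bytes=$4 tokens=$5
  echo $(( 2 * layers * kv_heads * head_dim * bytes * tokens / 1024 / 1024 ))
}

# PLACEHOLDER dimensions (60 layers, 16 KV heads, head_dim 128), not Gemma 4's real config.
kv_cache_mib 60 16 128 2 262144   # 16-bit cache, full 256K context
kv_cache_mib 60 16 128 2 32768    # 16-bit cache, capped at 32K tokens
kv_cache_mib 60 16 128 1 32768    # FP8 cache (1 byte per element), 32K tokens
```

With these placeholder numbers, capping the context from 256K to 32K cuts the per-sequence cache by 8x, and FP8 halves it again. In vLLM the FP8 cache is typically enabled by adding `--kv-cache-dtype fp8` to the `vllm serve` command; check the linked docs for the options your version supports.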

Thanks
