Unable to run on 2x RTX Pro 6000 (DEEP_GEMM problem)
At first glance, this model seems like a perfect fit for a 192 GB VRAM setup thanks to the FP4 MoE.
I followed the vLLM recipe (https://recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Flash), but unfortunately it doesn't seem to support SM120 GPUs. vllm/utils/deep_gemm.py throws an assertion error: Unsupported architecture.
I tried turning off DeepGEMM with VLLM_USE_DEEP_GEMM=0, but that doesn't help.
Does anybody know of any way to get it working on SM120?
Cheers!
It was teased a lot that Huawei support would come first, so probably just wait?
Same here, doesn't work with the recipe.
I wouldn't mind those Huawei 300i Duo cards. I've seriously considered buying a few more than once, but every time I can't find enough information to assure me I won't just end up with 10k in paperweights.
Same error here: (EngineCore_DP0 pid=256) ERROR 04-24 15:20:17 [core.py:1110] RuntimeError: Worker failed with error 'Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/hyperconnection.hpp:56): Unsupported architecture', please check the stack trace above for the root cause
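The log line above names the failing source file inside DeepGEMM, which is easier to compare across reports once it is pulled out of the surrounding text. A minimal sketch, assuming the message always has the shape `Assertion error (<file>:<line>): <message>` seen in this thread; the helper name `parse_assertion` is made up for illustration and is not part of vLLM or DeepGEMM:

```python
import re

# Pattern inferred from the single log line quoted in this thread.
# This is an assumption, not a documented vLLM or DeepGEMM log format.
_ASSERTION_RE = re.compile(
    r"Assertion error \((?P<path>[^():]+):(?P<line>\d+)\): (?P<message>[^']+)"
)


def parse_assertion(log_line):
    """Return the source path, line number, and message, or None if absent."""
    match = _ASSERTION_RE.search(log_line)
    if match is None:
        return None
    return {
        "path": match["path"],
        "line": int(match["line"]),
        "message": match["message"].strip(),
    }


# The log line as posted above.
LOG = (
    "(EngineCore_DP0 pid=256) ERROR 04-24 15:20:17 [core.py:1110] "
    "RuntimeError: Worker failed with error 'Assertion error "
    "(/workspace/.deps/deepgemm-src/csrc/apis/hyperconnection.hpp:56): "
    "Unsupported architecture', please check the stack trace above "
    "for the root cause"
)

if __name__ == "__main__":
    print(parse_assertion(LOG))
```

For this log, the extracted path points at `hyperconnection.hpp` in the DeepGEMM C++ sources rather than at vLLM's Python wrapper.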
Seems like a problem with the mHC architecture. It handles attention signals fundamentally differently, so it's no wonder there's no support yet. Too bad, because it's always DeepSeek that gets hindered by this (MoE, MTP, etc.). I heard they wrote custom kernels for mHC or N-gram, though, so maybe they have a fix. I don't know; you may just have to wait.
It doesn't support SM120. The only supported GPUs are the B300 (Blackwell), A100, H100, and H200.
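To check where a given card falls, its CUDA compute capability can be compared against that list. A rough sketch: the supported set below simply mirrors the reply above and is an assumption, not an official DeepGEMM or vLLM support matrix, and the compute capability assigned to each GPU name (for example 10.3 for the B300) is likewise filled in here rather than taken from the thread. The helper names are illustrative.

```python
# Compute capabilities (major, minor) for the GPUs listed in the reply above.
# Assumption: this table reflects the thread, not an official support matrix.
SUPPORTED = {
    (8, 0): "A100",
    (9, 0): "H100 / H200",
    (10, 3): "B300",
}


def sm_name(capability):
    """Format a (major, minor) capability as an sm_XY architecture name."""
    major, minor = capability
    return f"sm_{major}{minor}"


def is_supported(capability):
    """True if the capability appears in the SUPPORTED table."""
    return tuple(capability) in SUPPORTED


# With PyTorch installed, the capability of GPU 0 can be read with:
#   import torch
#   capability = torch.cuda.get_device_capability(0)
if __name__ == "__main__":
    cards = [
        ("RTX PRO 6000 Blackwell", (12, 0)),
        ("RTX 4090", (8, 9)),
        ("H100", (9, 0)),
    ]
    for label, cap in cards:
        status = "listed" if is_supported(cap) else "not listed"
        print(f"{label}: {sm_name(cap)} is {status}")
```

The RTX PRO 6000 Blackwell reports capability 12.0 (SM120) and the RTX 4090 reports 8.9, so neither matches an entry in this table.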
I had to get Claude to knock out a few Triton kernels to patch the vLLM cu130 docker image. It took a few hours and I ended up with this monstrosity:
#!/bin/bash
# Run DeepSeek V4 Flash on SM120 (RTX PRO 6000 Blackwell) GPUs.
#
# SM120 issues and workarounds:
# 1. DeepGEMM's C++ kernels reject SM120 at runtime → patched deep_gemm.py
# dispatches to Triton kernels for lightning indexer (paged + non-paged)
# and keeps PyTorch fallbacks for tf32_hc_prenorm_gemm / fp8_einsum.
# 2. CUTLASS block-scaled FP8 kernel can't handle e8m0fnu scales through
# the stable C API → disabled so Triton kernel is used instead.
# 3. Triton doesn't know float8_e8m0fnu dtype → patched triton.py converts
# e8m0fnu scales to bfloat16 on-the-fly (lossless, same exponent range).
# 4. fused_inv_rope_fp8_quant produces packed INT32 UE8M0 scales on SM100+ →
# patched deepseek_v4_attention.py forces SM90-style FP32 scales so the
# PyTorch fp8_einsum fallback can dequantize them.
# 5. FlashMLA sparse_decode_fwd / sparse_prefill_fwd reject SM120 → patched
# flash_mla_interface.py dispatches to our Triton kernels.
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
PATCH_DIR="$SCRIPT_DIR/sm120-patches"
docker run --gpus all \
--ipc=host -p 8000:8000 \
-e VLLM_DISABLED_KERNELS=CutlassFp8BlockScaledMMKernel \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v "$PATCH_DIR/deep_gemm.py":/usr/local/lib/python3.12/dist-packages/vllm/utils/deep_gemm.py:ro \
-v "$PATCH_DIR/triton.py":/usr/local/lib/python3.12/dist-packages/vllm/model_executor/kernels/linear/scaled_mm/triton.py:ro \
-v "$PATCH_DIR/deepseek_v4_attention.py":/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/deepseek_v4_attention.py:ro \
-v "$PATCH_DIR/flash_mla_interface.py":/usr/local/lib/python3.12/dist-packages/vllm/third_party/flashmla/flash_mla_interface.py:ro \
-v "$PATCH_DIR/triton_kernels":/usr/local/lib/python3.12/dist-packages/sm120_triton_kernels:ro \
-v "$PATCH_DIR/chat_template.jinja":/patches/chat_template.jinja:ro \
vllm/vllm-openai:deepseekv4-cu130 deepseek-ai/DeepSeek-V4-Flash \
--trust-remote-code \
--kv-cache-dtype fp8 \
--block-size 256 \
--tensor-parallel-size 2 \
--enable-expert-parallel \
--max-model-len 200000 \
--max-num-seqs 4 \
--gpu-memory-utilization 0.92 \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--compilation-config='{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1, 2, 4, 8, 16]}' \
--enable-auto-tool-choice \
--default-chat-template-kwargs '{"thinking": true, "reasoning_effort": "high"}' \
--reasoning-parser deepseek_v4 \
--chat-template /patches/chat_template.jinja \
--attention-backend FLASHMLA_SPARSE
You're probably better off waiting for official support
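Workaround 3 in the script comments says that converting e8m0fnu scales to bfloat16 is lossless. Since e8m0fnu has only 256 bit patterns, the claim can be checked exhaustively in plain Python. The sketch below is not the Triton patch from the script; it assumes the OCP MX definition of E8M0 (value 2^(e-127), with 0xFF reserved for NaN), and the helper names are illustrative.

```python
import math
import struct


def e8m0_value(e):
    """Value encoded by an e8m0fnu byte: 2**(e - 127), or NaN for 0xFF."""
    if e == 0xFF:
        return math.nan
    return 2.0 ** (e - 127)


def e8m0_to_bf16_bits(e):
    """Map an e8m0fnu byte to the bfloat16 bit pattern with the same value."""
    if not 0 <= e <= 0xFF:
        raise ValueError("e8m0 byte out of range")
    if e == 0xFF:
        return 0x7FC0  # quiet NaN
    if e == 0:
        return 0x0040  # 2**-127 is a bfloat16 subnormal (top mantissa bit set)
    return e << 7  # sign 0, exponent e, mantissa 0


def bf16_bits_to_float(bits):
    """Decode bfloat16 bits by widening them to an IEEE float32."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


if __name__ == "__main__":
    exact = all(
        bf16_bits_to_float(e8m0_to_bf16_bits(e)) == e8m0_value(e)
        for e in range(0xFF)
    )
    print("all finite e8m0 values exact in bfloat16:", exact)
```

One detail the check surfaces: exponent byte 0 maps to a bfloat16 subnormal, so the conversion stays exact only as long as subnormals are not flushed to zero along the way.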
DeepGEMM currently has no plans to support SM120. When will that be possible?
I'm crying as a 4090 48GB user; that card is sm_89 and may never work with DeepGEMM.
Community fork of vLLM that supports SM120:
https://github.com/jasl/vllm/tree/ds4-sm120
https://github.com/vllm-project/vllm/pull/40991
Was anyone able to make it work with 2x RTX Pro 6000?
Patiently waiting with my 2x RTX Pro 6000.
Same here
https://github.com/jasl/vllm/tree/ds4-sm120-preview
Confirmed working on 2x RTX Pro 6000.
A single request peaked at 60 tokens/s with MTP enabled.
Command / startup instructions, please?
I've got two RTX Pro 6000s—just sitting here waiting for support.