Instructions to use deepseek-ai/DeepSeek-V4-Flash with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use deepseek-ai/DeepSeek-V4-Flash with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="deepseek-ai/DeepSeek-V4-Flash")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("deepseek-ai/DeepSeek-V4-Flash", dtype="auto")

Local Apps

vLLM

How to use deepseek-ai/DeepSeek-V4-Flash with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "deepseek-ai/DeepSeek-V4-Flash"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepseek-ai/DeepSeek-V4-Flash",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
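
The same request can be sent from Python. This is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send` are illustrative and not part of vLLM. It assumes the server started above is listening on `localhost:8000`.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_chat_request(
    "http://localhost:8000",
    "deepseek-ai/DeepSeek-V4-Flash",
    "What is the capital of France?",
)
# Uncomment once the server is up:
# reply = send(request)
# print(reply["choices"][0]["message"]["content"])
```

For the SGLang server, the same helper should work with the base URL changed to `http://localhost:30000`.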

Use Docker

docker model run hf.co/deepseek-ai/DeepSeek-V4-Flash

SGLang

How to use deepseek-ai/DeepSeek-V4-Flash with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "deepseek-ai/DeepSeek-V4-Flash" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepseek-ai/DeepSeek-V4-Flash",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "deepseek-ai/DeepSeek-V4-Flash" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepseek-ai/DeepSeek-V4-Flash",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use deepseek-ai/DeepSeek-V4-Flash with Docker Model Runner:
```
docker model run hf.co/deepseek-ai/DeepSeek-V4-Flash
```

Unable to run on 2x RTX Pro 6000 (DEEP_GEMM problem)

#15

by stev236 - opened Apr 24

Discussion

stev236

Apr 24 • edited Apr 24

At first glance, this model seems like a perfect fit for a setup with 192GB of VRAM, thanks to the FP4 MoE.
I followed the vLLM recipe (https://recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Flash), but unfortunately it doesn't seem to support SM120 GPUs. The module vllm/utils/deep_gemm.py throws an assertion error: Unsupported architecture.
I tried turning off DeepGEMM with VLLM_USE_DEEP_GEMM=0, but it doesn't help.

Does anybody know of any way to get it working on SM120?
Cheers!

CryptoAIM

Apr 24

It was teased a lot that Huawei support would come first, so probably just wait?

retowyss

Apr 24

Same here, doesn't work with the recipe.

I wouldn't mind those Huawei 300i Duo cards. I've seriously considered buying a few more than once, but every time I can't find enough information to assure me I won't just end up with 10k worth of paperweights.

vody-am

Apr 24

Same error here: (EngineCore_DP0 pid=256) ERROR 04-24 15:20:17 [core.py:1110] RuntimeError: Worker failed with error 'Assertion error (/workspace/.deps/deepgemm-src/csrc/apis/hyperconnection.hpp:56): Unsupported architecture', please check the stack trace above for the root cause
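
The log line above names the file that raised the assertion, which helps tell a vLLM failure from a DeepGEMM one. A small sketch for pulling that out of a log; the regular expression is an assumption fitted to this one message format, not a documented vLLM log schema.

```python
import re

# Pattern fitted to the message quoted in this thread (an assumption,
# not an official format): "Assertion error (<path>:<line>): <message>'"
ASSERTION_RE = re.compile(
    r"Assertion error \((?P<path>[^():]+):(?P<line>\d+)\): (?P<message>[^']+)"
)


def parse_assertion(log_line):
    """Return the source path, line number, and message of an assertion, or None."""
    match = ASSERTION_RE.search(log_line)
    if match is None:
        return None
    return {
        "path": match.group("path"),
        "line": int(match.group("line")),
        "message": match.group("message"),
    }


log_line = (
    "(EngineCore_DP0 pid=256) ERROR 04-24 15:20:17 [core.py:1110] RuntimeError: "
    "Worker failed with error 'Assertion error "
    "(/workspace/.deps/deepgemm-src/csrc/apis/hyperconnection.hpp:56): "
    "Unsupported architecture', please check the stack trace above for the root cause"
)
parsed = parse_assertion(log_line)
print(parsed)
```

Here the path points into the DeepGEMM sources (`deepgemm-src/csrc/apis/hyperconnection.hpp`), so the check that fails sits in DeepGEMM's C++ code rather than in vLLM's Python wrapper.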

CryptoAIM

Apr 24

Seems like a problem with the mHC architecture. It handles attention signals fundamentally differently, so it's no wonder there's no support yet. Too bad, because it's always DeepSeek that gets held back by this (MoE, MTP, etc.). Though I heard they wrote custom kernels for mHC or N-gram, so maybe they have a fix. I don't know, though; you may just have to wait.

billob01

Apr 25

> At first glance, this model seems like a perfect fit for a setup with 192GB of VRAM, thanks to the FP4 MoE.
> I followed the vLLM recipe (https://recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Flash), but unfortunately it doesn't seem to support SM120 GPUs. The module vllm/utils/deep_gemm.py throws an assertion error: Unsupported architecture.
> I tried turning off DeepGEMM with VLLM_USE_DEEP_GEMM=0, but it doesn't help.
>
> Does anybody know of any way to get it working on SM120?
> Cheers!

It doesn't support SM120. The only supported GPUs are the B300 (Blackwell), A100, H100, and H200.
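
To see which SM architecture a card reports, a minimal sketch follows. The "claimed supported" set simply takes the list in the comment above at face value; it is an assumption, not a statement of what DeepGEMM officially supports. The compute capabilities used are A100 = 8.0, H100/H200 = 9.0, and B300 = 10.3. `local_capabilities` needs PyTorch with CUDA and is not called here.

```python
# GPUs named as supported in this thread, keyed by CUDA compute capability.
# This mirrors a forum comment and is NOT an official support matrix.
CLAIMED_SUPPORTED = {
    (8, 0): "A100",
    (9, 0): "H100/H200",
    (10, 3): "B300",
}


def sm_tag(major, minor):
    """Format a compute capability as an SM tag, e.g. (12, 0) -> 'sm_120'."""
    return f"sm_{major}{minor}"


def is_claimed_supported(capability):
    """True if the capability is in the list quoted from the thread."""
    return tuple(capability) in CLAIMED_SUPPORTED


def local_capabilities():
    """Query the compute capability of every visible GPU (requires PyTorch + CUDA)."""
    import torch

    return [
        torch.cuda.get_device_capability(index)
        for index in range(torch.cuda.device_count())
    ]


# The two consumer/workstation cards discussed in this thread:
for name, capability in [("RTX Pro 6000", (12, 0)), ("RTX 4090", (8, 9))]:
    print(name, sm_tag(*capability), is_claimed_supported(capability))
```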

richardhundt

Apr 25 • edited Apr 25

I had to get Claude to knock out a few Triton kernels to patch the vLLM cu130 docker image. It took a few hours and I ended up with this monstrosity:

#!/bin/bash
# Run DeepSeek V4 Flash on SM120 (RTX PRO 6000 Blackwell) GPUs.
#
# SM120 issues and workarounds:
# 1. DeepGEMM's C++ kernels reject SM120 at runtime → patched deep_gemm.py
#    dispatches to Triton kernels for lightning indexer (paged + non-paged)
#    and keeps PyTorch fallbacks for tf32_hc_prenorm_gemm / fp8_einsum.
# 2. CUTLASS block-scaled FP8 kernel can't handle e8m0fnu scales through
#    the stable C API → disabled so Triton kernel is used instead.
# 3. Triton doesn't know float8_e8m0fnu dtype → patched triton.py converts
#    e8m0fnu scales to bfloat16 on-the-fly (lossless, same exponent range).
# 4. fused_inv_rope_fp8_quant produces packed INT32 UE8M0 scales on SM100+ →
#    patched deepseek_v4_attention.py forces SM90-style FP32 scales so the
#    PyTorch fp8_einsum fallback can dequantize them.
# 5. FlashMLA sparse_decode_fwd / sparse_prefill_fwd reject SM120 → patched
#    flash_mla_interface.py dispatches to our Triton kernels.

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
PATCH_DIR="$SCRIPT_DIR/sm120-patches"

docker run --gpus all \
  --ipc=host -p 8000:8000 \
  -e VLLM_DISABLED_KERNELS=CutlassFp8BlockScaledMMKernel \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$PATCH_DIR/deep_gemm.py":/usr/local/lib/python3.12/dist-packages/vllm/utils/deep_gemm.py:ro \
  -v "$PATCH_DIR/triton.py":/usr/local/lib/python3.12/dist-packages/vllm/model_executor/kernels/linear/scaled_mm/triton.py:ro \
  -v "$PATCH_DIR/deepseek_v4_attention.py":/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/deepseek_v4_attention.py:ro \
  -v "$PATCH_DIR/flash_mla_interface.py":/usr/local/lib/python3.12/dist-packages/vllm/third_party/flashmla/flash_mla_interface.py:ro \
  -v "$PATCH_DIR/triton_kernels":/usr/local/lib/python3.12/dist-packages/sm120_triton_kernels:ro \
  -v "$PATCH_DIR/chat_template.jinja":/patches/chat_template.jinja:ro \
  vllm/vllm-openai:deepseekv4-cu130 deepseek-ai/DeepSeek-V4-Flash \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --tensor-parallel-size 2 \
  --enable-expert-parallel \
  --max-model-len 200000 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.92 \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --compilation-config='{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes": [1, 2, 4, 8, 16]}' \
  --enable-auto-tool-choice \
  --default-chat-template-kwargs '{"thinking": true, "reasoning_effort": "high"}' \
  --reasoning-parser deepseek_v4 \
  --chat-template /patches/chat_template.jinja \
  --attention-backend FLASHMLA_SPARSE

You're probably better off waiting for official support.
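
Workaround 3 in the script comments says that converting e8m0fnu scales to bfloat16 is lossless. A quick way to check that claim without a GPU is sketched below. It assumes the usual E8M0 definition (8 exponent bits, bias 127, no sign or mantissa, `0xFF` reserved for NaN) and is not the patch from the post; the function names are illustrative.

```python
import math
import struct


def e8m0_to_float(byte):
    """Decode an E8M0 scale: value = 2 ** (byte - 127), with 0xFF meaning NaN."""
    if byte == 0xFF:
        return math.nan
    return 2.0 ** (byte - 127)


def e8m0_to_bf16_bits(byte):
    """Return the 16-bit bfloat16 pattern holding the same value."""
    if byte == 0xFF:
        return 0x7FC0  # quiet NaN
    if byte == 0:
        return 0x0040  # 2**-127 is a bfloat16 subnormal (mantissa = 64/128)
    return byte << 7  # sign 0, exponent field = byte, mantissa 0


def bf16_bits_to_float(bits):
    """bfloat16 is the top 16 bits of a float32, so pad with zeros and reinterpret."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


# Compare every finite E8M0 value with its bfloat16 round trip.
mismatches = [
    byte
    for byte in range(255)
    if bf16_bits_to_float(e8m0_to_bf16_bits(byte)) != e8m0_to_float(byte)
]
print("mismatches:", mismatches)
```

Every finite scale is a power of two that fits bfloat16's 8-bit exponent, so the conversion is exact in this model. The smallest value lands on a bfloat16 subnormal, so a kernel that flushes subnormals to zero would be a separate concern.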

Luck-Rookie

Apr 27

Currently, DeepGEMM has no plan to support SM120. When will we be able to run it?

bash99

Apr 27

> Currently, DeepGEMM has no plan to support SM120. When will we be able to run it?

I'm crying as a 4090 48GB user. That card is sm_89 and may never work with DeepGEMM.

vody-am

Apr 27

@bash99 code it in CUDA! They accept community contributions. Someone already vibe-coded SM120 all the way into vLLM lol

andynoodles

Apr 28 • edited Apr 28

A community fork of vLLM that supports SM120:

https://github.com/jasl/vllm/tree/ds4-sm120
https://github.com/vllm-project/vllm/pull/40991

mtcl

27 days ago

Was anyone able to make it work with 2x RTX Pro 6000?

Quadrapole

25 days ago

Patiently waiting with my 2x rtx 6000 pro

mtcl

25 days ago

> Patiently waiting with my 2x rtx 6000 pro

Same here

billob01

22 days ago

https://github.com/jasl/vllm/tree/ds4-sm120-preview
Confirmed working on 2x RTX Pro 6000.
A single request peaked at 60 tps with MTP enabled.

retowyss

22 days ago

Nice! @billob01 So this is the compatibility branch? 60tps would be low for a13b even without MTP. Do you have any PP numbers and how much kv-cache are you fitting in there?
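
The "60tps would be low" remark can be sanity-checked with a rough bandwidth-bound estimate: during decode, each token has to read at least the active weights once. The sketch below makes loud assumptions: 13B active parameters (from "a13b" above), 0.5 bytes per parameter (as if every active weight were FP4, which overstates the compression), and 1.792 TB/s of memory bandwidth for one RTX Pro 6000 (an assumed spec, check the data sheet). It ignores KV-cache reads, inter-GPU communication, and kernel efficiency, so it is a ceiling and not a prediction.

```python
def decode_tps_upper_bound(active_params, bytes_per_param, bandwidth_bytes_per_s):
    """Tokens/s ceiling if decode only had to stream the active weights once per token."""
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


# All three inputs are assumptions, not measurements from this thread.
estimate = decode_tps_upper_bound(
    active_params=13e9,           # "a13b": 13B active parameters
    bytes_per_param=0.5,          # FP4 everywhere (optimistic)
    bandwidth_bytes_per_s=1.792e12,  # assumed single-GPU memory bandwidth
)
print(f"single-GPU ceiling: {estimate:.0f} tokens/s")
```

Under these assumptions the ceiling is in the hundreds of tokens per second for a single stream, which is why 60 tps reads as low, although real decode throughput always falls well short of this bound.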

mtcl

21 days ago

> https://github.com/jasl/vllm/tree/ds4-sm120-preview
> Confirmed working on 2x RTX Pro 6000.
> A single request peaked at 60 tps with MTP enabled.

Command / startup instructions, please?

HarleyWang

6 days ago

I've got two RTX Pro 6000s—just sitting here waiting for support.
