sglang-flash-attn3

Pre-built, forward-only Flash Attention 3 CUDA kernels from sgl-flash-attn, packaged for the Hugging Face kernels library. Requires an NVIDIA Hopper GPU (sm_90+).

Kernel source: kernels-community/sgl-flash-attn3

Usage

Install the kernels client library:

pip install kernels

Then fetch the kernel from the Hub and call it:

from kernels import get_kernel

fa3 = get_kernel("kernels-community/sgl-flash-attn3", revision="v1")

# Ragged (variable-length) batches, e.g. prefill/extend
out = fa3.flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q, cu_seqlens_k,
    max_seqlen_q, max_seqlen_k,   # per-batch max sequence lengths (required by the FA3 varlen interface)
    causal=True,
)

# Decode against a KV cache
out = fa3.flash_attn_with_kvcache(q, k_cache, v_cache, cache_seqlens=cache_seqlens, causal=True)

fa3.is_fa3_supported()  # True on Hopper GPUs such as H100/H200
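
A minimal end-to-end sketch for the varlen path (shapes and dtypes are illustrative; head_dim=128 with bf16 is a typical FA3 configuration, and the exact return convention may vary by build):

import torch
from kernels import get_kernel

fa3 = get_kernel("kernels-community/sgl-flash-attn3", revision="v1")
assert fa3.is_fa3_supported(), "requires a Hopper GPU"

# Two sequences of lengths 5 and 3, packed into one ragged batch.
num_heads, head_dim = 8, 128
seqlens = torch.tensor([5, 3], device="cuda", dtype=torch.int32)
cu_seqlens = torch.nn.functional.pad(torch.cumsum(seqlens, 0, dtype=torch.int32), (1, 0))  # [0, 5, 8]
total_tokens = int(seqlens.sum())

q = torch.randn(total_tokens, num_heads, head_dim, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Note: some builds return (out, softmax_lse); adjust the unpacking if so.
out = fa3.flash_attn_varlen_func(
    q, k, v,
    cu_seqlens, cu_seqlens,                  # self-attention: same offsets for q and k
    int(seqlens.max()), int(seqlens.max()),  # max_seqlen_q, max_seqlen_k
    causal=True,
)
print(out.shape)  # (total_tokens, num_heads, head_dim)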

SGLang Integration

Entry point: python/sglang/srt/layers/attention/flashattention_backend.py

Original:

from sgl_kernel.flash_attn import flash_attn_varlen_func as flash_attn_varlen_func_fa3
from sgl_kernel.flash_attn import flash_attn_with_kvcache as flash_attn_with_kvcache_fa3

Replace with:

from kernels import get_kernel
_fa3_mod = get_kernel("kernels-community/sgl-flash-attn3", revision="v1")
flash_attn_varlen_func_fa3 = _fa3_mod.flash_attn_varlen_func
flash_attn_with_kvcache_fa3 = _fa3_mod.flash_attn_with_kvcache

The same pattern applies in five other files (a fallback variant is sketched after this list):

  • dual_chunk_flashattention_backend.py
  • nsa_backend.py
  • xpu_backend.py
  • vision.py
  • multimodal_gen/runtime/layers/attention/backends/flash_attn.py
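
For robustness during rollout, a fallback import is a reasonable pattern. This variant is purely illustrative and not part of either repo; it prefers the Hub kernel and falls back to the native sgl_kernel build if the fetch fails:

try:
    from kernels import get_kernel
    _fa3_mod = get_kernel("kernels-community/sgl-flash-attn3", revision="v1")
    flash_attn_varlen_func_fa3 = _fa3_mod.flash_attn_varlen_func
    flash_attn_with_kvcache_fa3 = _fa3_mod.flash_attn_with_kvcache
except Exception:
    # Fall back to the kernels bundled with sgl-kernel.
    from sgl_kernel.flash_attn import flash_attn_varlen_func as flash_attn_varlen_func_fa3
    from sgl_kernel.flash_attn import flash_attn_with_kvcache as flash_attn_with_kvcache_fa3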

Benchmarks

Setup: H100 NVL, Qwen2.5-3B-Instruct, FA3 backend. All deltas are within run-to-run noise, i.e. no measurable performance regression.

Scenario                      Native sgl_kernel FA3 (tok/s)   HF Hub FA3 (tok/s)   Δ
Short Gen (128→32)            40,934                          39,878               -2.6%
Long Gen (256→1024)           25,054                          26,239               +4.7%
Long Prefill (2048→128)       53,833                          54,283               +0.8%
Bursty (512→256, 16 rps)      6,518                           6,527                +0.1%
High Concurrency (256→256)    40,666                          40,522               -0.4%
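
For a quick local sanity check outside the serving harness, a head-to-head kernel microbenchmark can be sketched as below (illustrative shapes and iteration counts; this is not the harness that produced the numbers above):

import torch
from kernels import get_kernel
from sgl_kernel.flash_attn import flash_attn_with_kvcache as native_kvcache

fa3 = get_kernel("kernels-community/sgl-flash-attn3", revision="v1")

# Single-token decode against a 1024-token KV cache, batch of 32.
batch, heads, head_dim, cache_len = 32, 8, 128, 1024
q = torch.randn(batch, 1, heads, head_dim, device="cuda", dtype=torch.bfloat16)
k_cache = torch.randn(batch, cache_len, heads, head_dim, device="cuda", dtype=torch.bfloat16)
v_cache = torch.randn_like(k_cache)
cache_seqlens = torch.full((batch,), cache_len, device="cuda", dtype=torch.int32)

def bench(fn, iters=100):
    # Warm up, then time with CUDA events.
    for _ in range(10):
        fn(q, k_cache, v_cache, cache_seqlens=cache_seqlens, causal=True)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(q, k_cache, v_cache, cache_seqlens=cache_seqlens, causal=True)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

print("hub   :", bench(fa3.flash_attn_with_kvcache), "ms")
print("native:", bench(native_kvcache), "ms")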

License

BSD-3-Clause