Qwen3.6-27B-FP8

FP8 (W8A8) quantized version of Qwen/Qwen3.6-27B by vrfai using llm-compressor.

Also available: vrfai/Qwen3.6-27B-NVFP4 — more aggressive quantization for Blackwell GPUs only.

FP8 Quantization Details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.6-27B |
| Quantization | W8A8 FP8 (static FP8 weights, static FP8 activations) |
| Strategy | tensor (per-tensor, symmetric; memoryless minmax observer) |
| Format | compressed-tensors (native vLLM support) |
| Tool | vllm-project/llm-compressor |
| Hardware | NVIDIA Ada Lovelace / Hopper / Blackwell (SM 89+) |
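
Before downloading ~34 GB of weights, a plain PyTorch check (nothing model-specific) confirms the GPU meets the SM 89 floor:

import torch

# FP8 (E4M3) matmul kernels require compute capability 8.9 (Ada) or newer.
major, minor = torch.cuda.get_device_capability(0)
print(f"Detected SM {major}.{minor}")
assert (major, minor) >= (8, 9), "This GPU cannot run FP8 W8A8 kernels"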

What's Quantized / What's Not

Same selective strategy as the NVFP4 variant — sensitive components are preserved in BF16:

| Component | Precision | Reason |
|---|---|---|
| FFN / MLP (all 64 transformer layers) | FP8 | High parameter density; stable under quantization |
| Full-attention q/k/v/o projections (16 GQA layers) | FP8 | Standard attention; tolerant of 8-bit |
| DeltaNet / linear-attention projections (48 layers) | BF16 | Gated linear recurrence is sensitive to numerical error |
| Vision encoder (all 27 blocks + merger) | BF16 | Vision tower preserved for multimodal quality |
| lm_head | BF16 | Output logits preserved for generation stability |
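
The ignore list is recorded in the checkpoint's compressed-tensors metadata, so the split can be verified without downloading the weights. A minimal check, assuming the standard quantization_config layout that llm-compressor writes into config.json:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("vrfai/Qwen3.6-27B-FP8", trust_remote_code=True)
qconfig = config.quantization_config  # dict serialized by llm-compressor

# Modules matching these patterns were left in BF16.
for pattern in qconfig["ignore"]:
    print(pattern)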

Quantization Config (llm-compressor)

# recipe.yaml
QuantizationModifier:
  targets: [Linear]
  scheme: FP8
  # static W8A8, per-tensor symmetric
  ignore:
    - lm_head
    - re:model\.visual\.blocks\.\d+\..*
    - model.visual.merger.linear_fc1
    - model.visual.merger.linear_fc2
    - re:model\.language_model\.layers\.\d+\.linear_attn\..*
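
For reference, a recipe like this can be applied with llm-compressor's oneshot entry point. The sketch below mirrors the ignore list above; the calibration dataset and sample count are illustrative placeholders, not the settings used to produce this checkpoint:

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Same selective scheme as recipe.yaml, expressed in Python.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8",
    ignore=[
        "lm_head",
        r"re:model\.visual\.blocks\.\d+\..*",
        "model.visual.merger.linear_fc1",
        "model.visual.merger.linear_fc2",
        r"re:model\.language_model\.layers\.\d+\.linear_attn\..*",
    ],
)

# Static FP8 scales need calibration data; dataset and count are illustrative.
oneshot(
    model="Qwen/Qwen3.6-27B",
    recipe=recipe,
    dataset="open_platypus",
    num_calibration_samples=512,
    max_seq_length=2048,
    output_dir="Qwen3.6-27B-FP8",
)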

Quick Start (vLLM)

Two GPUs (the configuration tested below):

vllm serve vrfai/Qwen3.6-27B-FP8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --dtype auto \
  --trust-remote-code \
  --tensor-parallel-size 2

Single GPU (≥ 48 GB VRAM, SM 89+):

vllm serve vrfai/Qwen3.6-27B-FP8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --dtype auto \
  --trust-remote-code

Python (Transformers)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "vrfai/Qwen3.6-27B-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",  # current name for the former torch_dtype argument
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Explain quantization in one paragraph."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

OpenAI-compatible API

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="vrfai/Qwen3.6-27B-FP8",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
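
The same endpoint also supports token streaming through the standard OpenAI client:

stream = client.chat.completions.create(
    model="vrfai/Qwen3.6-27B-FP8",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)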

NVFP4 vs FP8 Comparison

| | NVFP4 | FP8 (this model) |
|---|---|---|
| Weight bits | 4 | 8 |
| Activation bits | 4 (dynamic) | 8 (static) |
| Model size | ~26 GB | ~34 GB |
| Hardware | Blackwell only (SM 120+) | Ada Lovelace / Hopper / Blackwell (SM 89+) |
| Speed | Faster | Slightly slower |
| Quality | Slightly lower | Higher |
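
As a rough sanity check on the size row: 27B parameters at one FP8 byte each is about 27 GB, and the BF16-preserved components (linear-attention projections, vision tower, lm_head) plausibly account for most of the remaining ~7 GB; the exact split depends on how parameters are distributed across the ignored modules.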

Tested Environment

| Component | Version |
|---|---|
| vLLM | 0.19.1 |
| Transformers | 5.6.2 |
| PyTorch | 2.10.0+cu128 |
| CUDA | 12.8 (nvcc 12.8.61) |
| compressed-tensors (via llm-compressor) | 0.14.0.1 |
| GPU | 2× NVIDIA RTX 5090 (tensor-parallel-size 2) |

Best Practices

| Mode | temperature | top_p | top_k | presence_penalty |
|---|---|---|---|---|
| Thinking (general) | 1.0 | 0.95 | 20 | 0.0 |
| Thinking (coding) | 0.6 | 0.95 | 20 | 0.0 |
| Non-thinking / instruct | 0.7 | 0.80 | 20 | 1.5 |
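
These values can be passed through the OpenAI client; top_k is not part of the standard OpenAI schema, so it goes through vLLM's extra_body. A sketch for the coding preset, reusing the client from the section above:

response = client.chat.completions.create(
    model="vrfai/Qwen3.6-27B-FP8",
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    temperature=0.6,           # thinking (coding) preset
    top_p=0.95,
    presence_penalty=0.0,
    extra_body={"top_k": 20},  # vLLM-specific sampling parameter
    max_tokens=2048,
)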

Thinking mode:

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # forwarded to the chat template
)
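
When generating through the vLLM server instead of local Transformers, the same switch is sent per request; vLLM forwards chat_template_kwargs from extra_body to the chat template:

response = client.chat.completions.create(
    model="vrfai/Qwen3.6-27B-FP8",
    messages=[{"role": "user", "content": "Explain quantization in one paragraph."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)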

Credits

Quantized by vrfai with vllm-project/llm-compressor. All credit for the base model goes to the Qwen team.

Below is the original model card from Qwen/Qwen3.6-27B:



This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format.

Following the February release of the Qwen3.5 series, we're pleased to share the first open-weight variant of Qwen3.6. Built on direct feedback from the community, Qwen3.6 prioritizes stability and real-world utility, offering developers a more intuitive, responsive, and genuinely productive coding experience.

Qwen3.6 Highlights

  • Agentic Coding: the model now handles frontend workflows and repository-level reasoning with greater fluency and precision.
  • Thinking Preservation: reasoning context from historical messages is retained, streamlining iterative development.

Benchmark Results

For more details, please refer to our blog post Qwen3.6-27B.

Model Overview

  • Type: Causal Language Model with Vision Encoder
  • Number of Parameters: 27B
  • Context Length: 262,144 tokens natively, extensible up to 1,010,000 tokens

Citation

@misc{qwen3.6-27b,
    title  = {{Qwen3.6-27B}: Flagship-Level Coding in a {27B} Dense Model},
    author = {{Qwen Team}},
    month  = {April},
    year   = {2026},
    url    = {https://qwen.ai/blog?id=qwen3.6-27b}
}