Qwen2.5-Omni-7B-W4A16

INT4 post-training quantization of Qwen/Qwen2.5-Omni-7B, the 7B omni model with audio input, image input, and real-time speech output. ~3.5 GB on disk. Runs on any 8 GB GPU.


At a Glance

| Property | Value |
| --- | --- |
| Base model | Qwen/Qwen2.5-Omni-7B |
| Architecture | Thinker (LLM) + audio_tower + visual + talker (speech decoder) |
| Quant method | AutoRound, iters=200 |
| Quant format | compressed-tensors (native vLLM) |
| Quantized | thinker transformer layers (Linear targets, uniform W4A16) |
| Kept BF16 | audio_tower, visual, talker, embeddings, lm_head, norms |
| Group size | 128 (AutoRound default) |
| Disk size | ~3.5 GB |
| Min GPU | 1× RTX 3080 10 GB or any 8 GB GPU |

Memory Requirements

| Configuration | BF16 | W8A16 | W4A16 |
| --- | --- | --- | --- |
| Weights | ~18 GB | ~8 GB | ~3.5 GB |
| Min GPU | 1× A100 40 GB | 1× RTX 3090 24 GB | 1× RTX 3080 10 GB |
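The weight sizes above follow from bits per weight. The sketch below is a rough estimate only: it assumes one 16-bit scale per group of 128 weights and uses a hypothetical parameter count `N` chosen for illustration, not a figure from this checkpoint.

```python
from typing import Optional


def weight_bytes(n_params: float, weight_bits: int,
                 group_size: Optional[int] = None, scale_bits: int = 16) -> float:
    """Approximate storage for n_params weights at weight_bits each.

    With group-wise quantization, every group_size weights share one scale,
    which adds scale_bits / group_size bits per weight (assumed layout).
    """
    bits_per_weight = float(weight_bits)
    if group_size is not None:
        bits_per_weight += scale_bits / group_size
    return n_params * bits_per_weight / 8


# Hypothetical parameter count, used only for illustration.
N = 7e9
for label, bits, group in [("BF16", 16, None), ("W8A16", 8, 128), ("W4A16", 4, 128)]:
    print(f"{label}: {weight_bytes(N, bits, group) / 1e9:.2f} GB")
```

This covers only the quantized Linear weights. Components kept in BF16 and runtime memory such as the KV cache are not included, so real disk and VRAM numbers will differ.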

Quick Start

Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format, so vLLM detects and loads the quantization automatically. No --quantization flag needed.

vLLM: text output only

docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/Qwen2.5-Omni-7B-W4A16 \
  --port 8080 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

Mainline vLLM returns text only: audio input works, but speech output does not.

Python client

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

response = client.chat.completions.create(
    model="88plug/Qwen2.5-Omni-7B-W4A16",
    messages=[{"role": "user", "content": "Explain the difference between W4A16 and W8A16 quantization."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
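Since audio input works on this path, a request can carry an audio clip alongside text. The helper below is a minimal sketch that assumes the server accepts the OpenAI-style `input_audio` content part; `build_audio_messages` and the file name `clip.wav` are hypothetical names introduced for illustration.

```python
import base64


def build_audio_messages(wav_bytes: bytes, prompt: str) -> list:
    """Return chat messages carrying one WAV clip plus a text prompt."""
    # The audio payload is sent inline as base64 text.
    audio_b64 = base64.b64encode(wav_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
            {"type": "text", "text": prompt},
        ],
    }]
```

To use it, read the clip with `open("clip.wav", "rb").read()` and pass the result as `messages=build_audio_messages(wav_bytes, "Transcribe this clip.")` to the same `client.chat.completions.create` call shown above.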

vLLM-Omni: full audio output

vLLM-Omni v0.20.0 enables real-time speech output.

docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-omni:v0.20.0-cu129 vllm serve \
  88plug/Qwen2.5-Omni-7B-W4A16 \
  --port 8080 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

SGLang: BF16 baseline

SGLang v0.5.8 does not support compressed-tensors natively. Run the BF16 base model for prefix-heavy and high-concurrency workloads.

docker run --gpus device=0 -p 30000:30000 \
  lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-Omni-7B \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --host 0.0.0.0 \
  --port 30000

llama.cpp: audio/vision in, text out

Speech output is not in mainline llama.cpp (issue #21956). Convert from the BF16 base, not from the compressed-tensors checkpoint.

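# Qwen/Qwen2.5-Omni-7B must be a local directory holding the downloaded BF16 checkpoint.
python convert_hf_to_gguf.py Qwen/Qwen2.5-Omni-7B \
  --outtype bf16 \
  --outfile Qwen2.5-Omni-7B-BF16.gguf

llama-quantize Qwen2.5-Omni-7B-BF16.gguf Qwen2.5-Omni-7B-Q4_K_M.gguf Q4_K_M

# Build an importance matrix from the calibration text first (imatrix.dat is an arbitrary output name).
llama-imatrix -m Qwen2.5-Omni-7B-BF16.gguf -f calibration_datav3.txt -o imatrix.dat
llama-quantize --imatrix imatrix.dat \
  Qwen2.5-Omni-7B-BF16.gguf Qwen2.5-Omni-7B-IQ4_XS.gguf IQ4_XS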

llama-server \
  --model Qwen2.5-Omni-7B-Q4_K_M.gguf \
  --n-gpu-layers 999 \
  --ctx-size 32768 \
  --port 8081

Quantization Design

Recipe

AutoRound with scheme="W4A16", iters=200, applied uniformly to all Linear targets in model.thinker. No mixed-precision or per-layer overrides.

What is quantized

All Linear modules in model.thinker (the LLM backbone, 28 transformer blocks) are quantized to INT4 weights with BF16 activations (W4A16) and group size 128.
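To make "INT4 weights with group size 128" concrete, here is a minimal sketch of group-wise quantization using plain round-to-nearest. It is not AutoRound: AutoRound additionally tunes the rounding over its calibration iterations, which this sketch does not attempt. The symmetric range of -8 to 7 is an assumption, and the function names are hypothetical.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1   # 7 for INT4
    qmin = -(2 ** (bits - 1))    # -8 for INT4
    max_abs = max(abs(w) for w in weights)
    # One scale per group; fall back to 1.0 for an all-zero group.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [v * scale for v in q]


def quantize_row(row, group_size=128, bits=4):
    """Split a weight row into groups and quantize each independently."""
    return [quantize_group(row[start:start + group_size], bits)
            for start in range(0, len(row), group_size)]
```

Each group stores its integer codes plus one scale, so the reconstruction error of any weight is bounded by half that group's scale. Smaller groups give tighter scales at the cost of storing more of them.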

What stays BF16

| Component | Precision | Reason |
| --- | --- | --- |
| thinker.* transformer layers | W4A16 INT4 | Quantized |
| thinker.audio_tower.* | BF16 | Whisper-based audio encoder, excluded |
| thinker.visual.* | BF16 | ViT vision encoder, excluded |
| talker.* | BF16 | Dual-track speech decoder, excluded |
| token2wav.* | BF16 | DiT + BigVGAN vocoder, excluded |
| Embeddings (embed_tokens) | BF16 | Excluded by recipe |
| LM head (lm_head) | BF16 | Excluded by recipe |
| Layer norms | BF16 | Excluded by recipe |

Only the LLM backbone weights are quantized. The audio, vision, speech decoder, and vocoder components are untouched.
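The selection rule in the table can be expressed as a small predicate over module names. This is an illustrative sketch, not the recipe's actual ignore list: the prefixes come from the table, while the name fragments for embeddings, LM head, and norms and the example module names are assumptions.

```python
import re

# Prefixes taken from the table; everything under them stays BF16.
EXCLUDED_PREFIXES = ("thinker.audio_tower.", "thinker.visual.", "talker.", "token2wav.")
# Name fragments for embeddings, LM head, and norms (assumed naming).
EXCLUDED_PATTERN = re.compile(r"(embed_tokens|lm_head|norm)")


def is_quantized(module_name: str, is_linear: bool) -> bool:
    """Return True if a module would be quantized under the rules above."""
    if not is_linear:
        return False
    if not module_name.startswith("thinker."):
        return False
    if module_name.startswith(EXCLUDED_PREFIXES):
        return False
    return EXCLUDED_PATTERN.search(module_name) is None
```

For example, a hypothetical `thinker.model.layers.0.self_attn.q_proj` Linear would be quantized, while anything under `thinker.audio_tower.` or `talker.` would not.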

Calibration corpus

| Dataset | Samples | Role |
| --- | --- | --- |
| HuggingFaceH4/ultrachat_200k | 512 | Instruction-following / chat |
| wikitext-103-raw-v1 | 512 | General knowledge / long-form text |
| Total | 1024 | Mixed, shuffled, seq_len=2048 |
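The mixing step can be sketched as follows, assuming each corpus has already been loaded into a Python list. The function name and the seed are hypothetical, and truncation to seq_len=2048 is left out because it depends on the tokenizer.

```python
import random


def mix_calibration(chat_samples, wiki_samples, per_source=512, seed=0):
    """Take per_source items from each corpus and shuffle them together."""
    rng = random.Random(seed)  # fixed seed (assumed) so the mix is reproducible
    mixed = rng.sample(chat_samples, per_source) + rng.sample(wiki_samples, per_source)
    rng.shuffle(mixed)
    return mixed
```

Drawing equal counts from both sources keeps chat-style and long-form text balanced in the calibration set.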

Quality Targets

| Metric | W8A16 target | W4A16 target |
| --- | --- | --- |
| KL divergence vs BF16 | < 0.005 | < 0.018 |
| MMLU recovery | ≥ 99.7% | ≥ 98% |

W4A16 applies a wider KL tolerance than W8A16 by design, because INT4 introduces more rounding error than INT8. The AutoRound optimizer and 200 calibration iterations are chosen to maximize recovery within this budget.
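Both metrics are simple to compute once logits and benchmark scores are available. The sketch below assumes KL is taken from the BF16 distribution to the quantized one at a single token position, in nats, and that recovery is the ratio of quantized to baseline score. The card does not state the direction or the averaging scheme, so treat these as one plausible reading.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) for one token position, in nats."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def recovery(quant_score, baseline_score):
    """Quantized score as a percentage of the BF16 baseline score."""
    return 100.0 * quant_score / baseline_score
```

In practice the per-position KL values would be averaged over many tokens from a held-out set before comparing against the target.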


Competitor Comparables

| Model | Source | Format | Compare angle |
| --- | --- | --- | --- |
| Qwen/Qwen2.5-Omni-7B | official | BF16 | Quality ceiling |
| Qwen/Qwen2.5-Omni-7B-AWQ | official | AWQ 4-bit | Official W4 reference |
| Intel/Qwen2.5-Omni-7B-int4-AutoRound | Intel | AutoRound Int4 | Same method, different format |
| ggml-org/Qwen2.5-Omni-7B-GGUF | ggml | GGUF | llama.cpp users |
| 88plug/Qwen2.5-Omni-7B-W8A16 | 88plug | compressed-tensors W8A16 | Higher precision alternative |

Note: This is the only compressed-tensors W4A16 quant for Qwen2.5-Omni-7B at time of release, making it the only vLLM-native INT4 option for this model.


Benchmarks

Results pending.

| Engine | Format | Batch | ctx | tok/s | TTFT p50 | TTFT p99 | VRAM |
| --- | --- | --- | --- | --- | --- | --- | --- |
| vLLM v0.21.0 | W4A16 | 1 | 32k | - | - | - | - |
| vLLM v0.21.0 | W4A16 | 8 | 32k | - | - | - | - |
| SGLang v0.5.8 | BF16 (baseline) | 1 | 32k | - | - | - | - |
| llama.cpp b9297 | Q4_K_M GGUF | 1 | 32k | - | - | - | - |
| llama.cpp b9297 | IQ4_XS GGUF | 1 | 32k | - | - | - | - |

Hardware: A6000 48 GB, CUDA 12.9, driver 570.
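For the TTFT p50 and p99 columns, one common convention is the nearest-rank percentile over per-request time-to-first-token samples. The helper below is a sketch of that convention only. The card does not say which percentile definition the benchmark will use, and `percentile` is a hypothetical name.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    # Rank is 1-based: the smallest value whose rank covers pct percent of samples.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

With a list of TTFT measurements in seconds, `percentile(ttfts, 50)` and `percentile(ttfts, 99)` give the two table columns.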


SGLang Note

SGLang v0.5.8 does not support compressed-tensors natively. Use the BF16 base model (Qwen/Qwen2.5-Omni-7B) with SGLang for prefix-heavy workloads. The W4A16 quant is vLLM-native.


Technical Details

| Parameter | Value |
| --- | --- |
| Quantization library | llmcompressor + AutoRound |
| Scheme | W4A16 (INT4 weights, BF16 activations) |
| Group size | 128 |
| Calibration iterations | 200 |
| Calibration samples | 1024 |
| Calibration seq length | 2048 |
| Target scope | model.thinker only |
| Pipeline | sequential |
| Output format | compressed-tensors |

Citation

@misc{qwen25omni,
  title  = {Qwen2.5-Omni Technical Report},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen2.5-Omni-7B}
}

About

88plug AI Lab produces production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models, built for native vLLM v0.21.0+ deployment with zero extra flags.

W8A16: INT8 weights + BF16 activations. Near-lossless on any Ampere+ GPU. Runs on GPUs that lack FP8 hardware support.

W4A16: AutoRound with iters=200 and a mixed calibration corpus. Targets ≥ 99% MMLU recovery, the quality bar that makes W4A16 viable for production.

All weights are in compressed-tensors format. vLLM detects quantization automatically from quantization_config in config.json. No --quantization flag required.
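A quick way to confirm what a checkpoint declares is to read that field directly. This is a minimal sketch: the `quant_method` key inside `quantization_config` and the value `compressed-tensors` are assumptions about how the config is laid out, and `detect_quant_method` is a hypothetical helper.

```python
import json


def detect_quant_method(config_text: str):
    """Return quantization_config.quant_method from a config.json string, or None."""
    config = json.loads(config_text)
    return config.get("quantization_config", {}).get("quant_method")
```

Calling it on the contents of a downloaded config.json returns `None` for an unquantized checkpoint, which makes it easy to tell the base model and the quantized release apart.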

Also available: Qwen2.5-Omni-7B-W8A16 (INT8, ~8 GB) · Qwen2.5-Omni-7B-W4A16 (INT4, ~3.5 GB)

Browse all releases → huggingface.co/88plug
