Qwen2.5-Omni-7B-W4A16

INT4 (W4A16) post-training quantization of Qwen/Qwen2.5-Omni-7B, the 7B omni model with audio input, image input, and real-time speech output. About 3.5 GB on disk. Intended to fit on a single GPU with 8 GB of memory or more.

At a Glance

| Property | Value |
| --- | --- |
| Base model | `Qwen/Qwen2.5-Omni-7B` |
| Architecture | Thinker (LLM) + audio_tower + visual + talker (speech decoder) + token2wav (vocoder) |
| Quant method | AutoRound, iters=200 |
| Quant format | compressed-tensors (native vLLM) |
| Quantized | thinker transformer layers (`Linear` targets, uniform W4A16) |
| Kept BF16 | audio_tower, visual, talker, token2wav, embeddings, lm_head, norms |
| Group size | 128 (AutoRound default) |
| Disk size | ~3.5 GB |
| Min GPU | 1× RTX 3080 10 GB or any 8 GB GPU |

Memory Requirements

| Configuration | BF16 | W8A16 | W4A16 |
| --- | --- | --- | --- |
| Weights | ~18 GB | ~8 GB | ~3.5 GB |
| Min GPU | 1× A100 40 GB | 1× RTX 3090 24 GB | 1× RTX 3080 10 GB |
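
As a rough sanity check on the weights row, the footprint of group-quantized weights can be estimated from parameter count, bit width, and group size. The sketch below is only an approximation under stated assumptions: about 7 billion quantized parameters, one 16-bit scale per group of 128 weights, no zero-points, and none of the components kept in BF16. It will therefore not reproduce the table exactly.

```python
def quantized_weight_gb(n_params: float, bits: int, group_size: int = 128,
                        scale_bits: int = 16) -> float:
    """Approximate size in GB (1e9 bytes) of group-quantized weights."""
    # Each weight costs `bits`, plus its share of one scale per group.
    bits_per_weight = bits + scale_bits / group_size
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9  # assumed count of quantized parameters, for illustration only

for name, bits in [("W8A16", 8), ("W4A16", 4)]:
    print(f"{name}: ~{quantized_weight_gb(N_PARAMS, bits):.1f} GB")
```

The per-group scale adds only 16/128 = 0.125 bits per weight, which is why group size 128 keeps the overhead small relative to the 4-bit payload.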

Quick Start

Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format — vLLM detects and loads quantization automatically. No --quantization flag needed.

vLLM — text output only

docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/Qwen2.5-Omni-7B-W4A16 \
  --port 8080 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

Mainline vLLM returns text only: audio input works, but speech output does not.

Python client

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

response = client.chat.completions.create(
    model="88plug/Qwen2.5-Omni-7B-W4A16",
    messages=[{"role": "user", "content": "Explain the difference between W4A16 and W8A16 quantization."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
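
Since audio input works with mainline vLLM, the same client can send an audio clip alongside a text prompt. The helper below is a minimal sketch that assumes the server accepts OpenAI-style `input_audio` content parts. The function name `audio_message` and the placeholder bytes are hypothetical and exist only for illustration.

```python
import base64


def audio_message(audio_bytes: bytes, prompt: str, fmt: str = "wav") -> dict:
    """Build one user message carrying an audio clip plus a text prompt."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "input_audio", "input_audio": {"data": encoded, "format": fmt}},
            {"type": "text", "text": prompt},
        ],
    }


# Placeholder bytes; in practice, read a real audio file in binary mode.
message = audio_message(b"RIFF....WAVE", "Transcribe this clip.")
print(message["content"][0]["type"])
```

Passing `messages=[message]` to `client.chat.completions.create(...)` in place of the text-only message should then return a text response about the clip.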

vLLM-Omni — full audio output

vLLM-Omni v0.20.0 enables real-time speech output.

docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-omni:v0.20.0-cu129 vllm serve \
  88plug/Qwen2.5-Omni-7B-W4A16 \
  --port 8080 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

SGLang — BF16 baseline

SGLang v0.5.8 does not support compressed-tensors natively. Run the BF16 base model for prefix-heavy and high-concurrency workloads.

docker run --gpus device=0 -p 30000:30000 \
  lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-Omni-7B \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --host 0.0.0.0 \
  --port 30000

llama.cpp — audio/vision in, text out

Speech output is not in mainline llama.cpp (issue #21956). Convert from the BF16 base, not from the compressed-tensors checkpoint.

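# Assumes Qwen/Qwen2.5-Omni-7B is a local directory holding the downloaded BF16 checkpoint.
python convert_hf_to_gguf.py Qwen/Qwen2.5-Omni-7B \
  --outtype bf16 \
  --outfile Qwen2.5-Omni-7B-BF16.gguf

llama-quantize Qwen2.5-Omni-7B-BF16.gguf Qwen2.5-Omni-7B-Q4_K_M.gguf Q4_K_M

# Build the importance matrix from the calibration text, then use it for IQ4_XS.
llama-imatrix -m Qwen2.5-Omni-7B-BF16.gguf -f calibration_datav3.txt -o imatrix.dat
llama-quantize --imatrix imatrix.dat \
  Qwen2.5-Omni-7B-BF16.gguf Qwen2.5-Omni-7B-IQ4_XS.gguf IQ4_XS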

llama-server \
  --model Qwen2.5-Omni-7B-Q4_K_M.gguf \
  --n-gpu-layers 999 \
  --ctx-size 32768 \
  --port 8081
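
Once `llama-server` is running, it can be queried over its OpenAI-compatible chat endpoint. The following is a minimal standard-library sketch. The helper name `build_chat_request` and the `max_tokens` value are illustrative assumptions, and the actual network call is left commented out because it needs a live server.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8081") -> urllib.request.Request:
    """Prepare a POST request for the server's /v1/chat/completions endpoint."""
    body = {"messages": [{"role": "user", "content": prompt}], "max_tokens": 256}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("What is the capital of France?")

# Sending the request requires the llama-server started above:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```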

Quantization Design

Recipe

AutoRound with scheme="W4A16", iters=200, applied uniformly to all Linear targets in model.thinker. No mixed-precision or per-layer overrides.

What is quantized

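All Linear modules in the transformer blocks of model.thinker (the LLM backbone, 28 blocks) are quantized to W4A16: INT4 weights with BF16 activations and group size 128. The audio and vision encoders nested under thinker are excluded, as listed in the table that follows.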
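
To make "group size 128" concrete, the sketch below quantizes a weight row in groups, each with its own scale. It uses plain symmetric round-to-nearest, which is a simplification: AutoRound tunes the rounding during calibration, and whether this checkpoint uses symmetric scales is an assumption here. All function names are hypothetical.

```python
def quantize_group(weights: list[float], bits: int = 4) -> tuple[list[int], float]:
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1   # 7 for INT4
    qmin = -(2 ** (bits - 1))    # -8 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes: list[int], scale: float) -> list[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def quantize_row(row: list[float], group_size: int = 128) -> list[float]:
    """Quantize a row group by group and return the reconstructed values."""
    out: list[float] = []
    for start in range(0, len(row), group_size):
        codes, scale = quantize_group(row[start:start + group_size])
        out.extend(dequantize_group(codes, scale))
    return out


# Synthetic weights in [-0.5, 0.5], two groups of 128.
row = [((i * 37) % 101 - 50) / 100 for i in range(256)]
recon = quantize_row(row)
max_err = max(abs(a - b) for a, b in zip(row, recon))
print(f"max reconstruction error: {max_err:.4f}")
```

With round-to-nearest, the error per weight is bounded by half the group's scale, so smaller groups (tighter scales) trade extra metadata for lower error.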

What stays BF16

| Component | Precision | Reason |
| --- | --- | --- |
| `thinker.*` transformer layers | W4A16 INT4 | Quantized |
| `thinker.audio_tower.*` | BF16 | Whisper-based audio encoder, excluded |
| `thinker.visual.*` | BF16 | ViT vision encoder, excluded |
| `talker.*` | BF16 | Dual-track speech decoder, excluded |
| `token2wav.*` | BF16 | DiT + BigVGAN vocoder, excluded |
| Embeddings (`embed_tokens`) | BF16 | Excluded by recipe |
| LM head (`lm_head`) | BF16 | Excluded by recipe |
| Layer norms | BF16 | Excluded by recipe |

Only the LLM backbone weights are quantized. The audio, vision, speech decoder, and vocoder components are untouched.

Calibration corpus

| Dataset | Samples | Role |
| --- | --- | --- |
| `HuggingFaceH4/ultrachat_200k` | 512 | Instruction-following / chat |
| `wikitext-103-raw-v1` | 512 | General knowledge / long-form text |
| Total | 1024 | Mixed, shuffled, seq_len=2048 |
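
The mixing step in the table can be sketched as follows. This is an illustrative helper, not the script used for this release: the function name, the seed, and the uniform sampling are assumptions, the stand-in strings replace real dataset rows, and truncation to 2048 tokens is not shown.

```python
import random


def mix_calibration(chat_samples: list[str], text_samples: list[str],
                    n_each: int = 512, seed: int = 0) -> list[str]:
    """Take n_each samples from each source and shuffle them together."""
    rng = random.Random(seed)
    mixed = rng.sample(chat_samples, n_each) + rng.sample(text_samples, n_each)
    rng.shuffle(mixed)
    return mixed


# Stand-in strings; real runs would draw from the two datasets in the table.
chat = [f"chat-{i}" for i in range(1000)]
text = [f"wiki-{i}" for i in range(1000)]
calibration = mix_calibration(chat, text)
print(len(calibration))
```

Fixing the seed keeps the calibration set reproducible across runs, which matters when comparing quantization settings.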

Quality Targets

| Metric | W8A16 target | W4A16 target |
| --- | --- | --- |
| KL divergence vs BF16 | < 0.005 | < 0.018 |
| MMLU recovery | ≥ 99.7% | ≥ 98% |

W4A16 applies a wider KL tolerance than W8A16 by design — INT4 introduces more rounding error than INT8. The AutoRound optimizer and 200 calibration iterations are chosen to maximize recovery within this budget.
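
Both targets can be computed with a few lines of arithmetic. The sketch below assumes KL is measured as KL(P_BF16 || P_quant) in nats over next-token distributions and that recovery is the ratio of quantized to baseline score. The direction, the units, and any averaging over tokens are assumptions, and the logits and scores shown are made up for illustration.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(ref_logits: list[float], quant_logits: list[float]) -> float:
    """KL(P_ref || P_quant) in nats for a single token position."""
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def recovery_pct(quant_score: float, base_score: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return 100.0 * quant_score / base_score


# Made-up logits and benchmark scores, for illustration only.
kl = kl_divergence([2.0, 1.0, 0.1], [1.9, 1.1, 0.1])
print(f"KL: {kl:.5f}")
print(f"recovery: {recovery_pct(68.6, 70.0):.1f}%")
```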

Competitor Comparables

| Model | Source | Format | Compare angle |
| --- | --- | --- | --- |
| `Qwen/Qwen2.5-Omni-7B` | official | BF16 | Quality ceiling |
| `Qwen/Qwen2.5-Omni-7B-AWQ` | official | AWQ 4-bit | Official W4 reference |
| `Intel/Qwen2.5-Omni-7B-int4-AutoRound` | Intel | AutoRound Int4 | Same method, different format |
| `ggml-org/Qwen2.5-Omni-7B-GGUF` | ggml | GGUF | llama.cpp users |
| `88plug/Qwen2.5-Omni-7B-W8A16` | 88plug | compressed-tensors W8A16 | Higher precision alternative |

Note: This is the only compressed-tensors W4A16 quant for Qwen2.5-Omni-7B at time of release, making it the only vLLM-native INT4 option for this model.

Benchmarks

Results pending.

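| Engine | Format | Batch | ctx | tok/s | TTFT p50 | TTFT p99 | VRAM |
| --- | --- | --- | --- | --- | --- | --- | --- |
| vLLM v0.21.0 | W4A16 | 1 | 32k | — | — | — | — |
| vLLM v0.21.0 | W4A16 | 8 | 32k | — | — | — | — |
| SGLang v0.5.8 | BF16 (baseline) | 1 | 32k | — | — | — | — |
| llama.cpp b9297 | Q4_K_M GGUF | 1 | 32k | — | — | — | — |
| llama.cpp b9297 | IQ4_XS GGUF | 1 | 32k | — | — | — | — |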

Hardware: A6000 48 GB, CUDA 12.9, driver 570.
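
For readers reproducing the table, the TTFT p50 and p99 columns are percentiles over per-request time-to-first-token samples. The helper below is a small sketch using the nearest-rank definition. That definition is an assumption (benchmark tools may interpolate instead), and the sample values are synthetic.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic TTFT samples in milliseconds: 1.0, 2.0, ..., 100.0
ttft_ms = [float(v) for v in range(1, 101)]
p50 = percentile(ttft_ms, 50)
p99 = percentile(ttft_ms, 99)
print(p50, p99)
```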

SGLang Note

SGLang v0.5.8 does not support compressed-tensors natively. Use the BF16 base model (Qwen/Qwen2.5-Omni-7B) with SGLang for prefix-heavy workloads. The W4A16 quant is vLLM-native.

Technical Details

| Parameter | Value |
| --- | --- |
| Quantization library | llmcompressor + AutoRound |
| Scheme | W4A16 (INT4 weights, BF16 activations) |
| Group size | 128 |
| Calibration iterations | 200 |
| Calibration samples | 1024 |
| Calibration seq length | 2048 |
| Target scope | `model.thinker` only |
| Pipeline | sequential |
| Output format | compressed-tensors |

Citation

@misc{qwen25omni,
  title  = {Qwen2.5-Omni Technical Report},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen2.5-Omni-7B}
}

About

88plug AI Lab produces production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models — built for native vLLM v0.21.0+ deployment with zero extra flags.

W8A16: INT8 weights + BF16 activations. Near-lossless on any Ampere+ GPU. Runs where FP8 hardware cannot.

W4A16: AutoRound with iters=200 and a mixed calibration corpus. Targets ≥ 98% MMLU recovery, the quality bar that makes W4A16 viable for production.

All weights are in compressed-tensors format. vLLM detects quantization automatically from quantization_config in config.json. No --quantization flag required.
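
One way to confirm this for a downloaded checkpoint is to inspect `config.json` directly. The sketch below assumes the quantization block carries a `quant_method` field whose value is `compressed-tensors`. The helper name is hypothetical and the sample JSON is a trimmed stand-in, not the real file.

```python
import json
from typing import Optional


def quant_method(config_text: str) -> Optional[str]:
    """Return the quant_method declared in a config.json string, if any."""
    config = json.loads(config_text)
    return config.get("quantization_config", {}).get("quant_method")


# Trimmed, illustrative config; a real file has many more fields.
sample = '{"quantization_config": {"quant_method": "compressed-tensors"}}'
print(quant_method(sample))
```

A result of `None` would mean the checkpoint has no quantization block and should be treated as unquantized.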

Also available: Qwen2.5-Omni-7B-W8A16 (INT8, ~8 GB) · Qwen2.5-Omni-7B-W4A16 (INT4, ~3.5 GB)

Browse all releases → huggingface.co/88plug
