Qwen2.5-Omni-7B-W4A16
INT4 post-training quantization of Qwen/Qwen2.5-Omni-7B, the 7B omni model with audio input, image input, and real-time speech output. ~3.5 GB on disk. Runs on any 8 GB GPU.
At a Glance
| Property | Value |
|---|---|
| Base model | Qwen/Qwen2.5-Omni-7B |
| Architecture | Thinker (LLM) + audio_tower + visual + talker (speech decoder) |
| Quant method | AutoRound, iters=200 |
| Quant format | compressed-tensors (native vLLM) |
| Quantized | thinker transformer layers (Linear targets, uniform W4A16) |
| Kept BF16 | audio_tower, visual, talker, embeddings, lm_head, norms |
| Group size | 128 (AutoRound default) |
| Disk size | ~3.5 GB |
| Min GPU | 1× RTX 3080 10 GB or any 8 GB GPU |
Memory Requirements
| Configuration | BF16 | W8A16 | W4A16 |
|---|---|---|---|
| Weights | ~18 GB | ~8 GB | ~3.5 GB |
| Min GPU | 1× A100 40 GB | 1× RTX 3090 24 GB | 1× RTX 3080 10 GB |
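The weight row can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is an estimate only: the 7e9 parameter count is a round number assumed for illustration, and the one-16-bit-scale-per-group overhead is an assumption about the storage layout, not something read from this checkpoint. It also ignores the components kept in BF16 and any zero-point storage, so it is not expected to reproduce the table exactly.

```python
from typing import Optional


def weight_gb(n_params: float, weight_bits: int,
              group_size: Optional[int] = None, scale_bits: int = 16) -> float:
    """Estimate weight storage in GB (10**9 bytes).

    When group_size is given, one scale of scale_bits is charged per group.
    """
    bits_per_weight = float(weight_bits)
    if group_size is not None:
        bits_per_weight += scale_bits / group_size
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9  # assumption: nominal parameter count, not measured

for name, bits, group in [("BF16", 16, None), ("W8A16", 8, 128), ("W4A16", 4, 128)]:
    print(f"{name}: ~{weight_gb(N_PARAMS, bits, group):.1f} GB")
```

With group size 128, the scale overhead adds 16/128 = 0.125 bits per weight, which is why the 4-bit estimate lands slightly above a naive `params / 2` figure.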
Quick Start
Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format, so vLLM detects and loads the quantization automatically. No --quantization flag is needed.
vLLM: text output only
docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/Qwen2.5-Omni-7B-W4A16 \
  --port 8080 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
Mainline vLLM returns text only: audio input works, but speech output does not.
Python client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
response = client.chat.completions.create(
    model="88plug/Qwen2.5-Omni-7B-W4A16",
    messages=[{"role": "user", "content": "Explain the difference between W4A16 and W8A16 quantization."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
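Since audio input works on mainline vLLM, a request can carry an audio clip alongside text. The helper below is a minimal sketch that only builds the message payload; it assumes the server accepts the OpenAI-style `input_audio` content part, which should be checked against the vLLM documentation for your version. The function name `audio_message` and the placeholder bytes are hypothetical.

```python
import base64


def audio_message(wav_bytes: bytes, prompt: str) -> dict:
    """Build one user message holding base64-encoded WAV audio plus a text prompt."""
    encoded = base64.b64encode(wav_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            # Assumption: OpenAI-style audio content part is accepted by the server.
            {"type": "input_audio", "input_audio": {"data": encoded, "format": "wav"}},
            {"type": "text", "text": prompt},
        ],
    }


# Placeholder bytes stand in for a real file read with open(path, "rb").read().
msg = audio_message(b"RIFF....WAVE", "Transcribe this clip.")
print(msg["content"][0]["type"])
```

The resulting dictionary can be passed as `messages=[msg]` to the same `client.chat.completions.create` call used in the Python client.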
vLLM-Omni: full audio output
vLLM-Omni v0.20.0 enables real-time speech output.
docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-omni:v0.20.0-cu129 vllm serve \
  88plug/Qwen2.5-Omni-7B-W4A16 \
  --port 8080 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
SGLang: BF16 baseline
SGLang v0.5.8 does not support compressed-tensors natively. Run the BF16 base model for prefix-heavy and high-concurrency workloads.
docker run --gpus device=0 -p 30000:30000 \
  lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-Omni-7B \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --host 0.0.0.0 \
  --port 30000
llama.cpp: audio/vision in, text out
Speech output is not in mainline llama.cpp (issue #21956). Convert from the BF16 base, not from the compressed-tensors checkpoint.
python convert_hf_to_gguf.py Qwen/Qwen2.5-Omni-7B \
--outfile Qwen2.5-Omni-7B-BF16.gguf
llama-quantize Qwen2.5-Omni-7B-BF16.gguf Qwen2.5-Omni-7B-Q4_K_M.gguf Q4_K_M
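# --imatrix expects an importance-matrix file, so build one from the calibration text first.
# imatrix.dat is an arbitrary output name.
llama-imatrix -m Qwen2.5-Omni-7B-BF16.gguf -f calibration_datav3.txt -o imatrix.dat
llama-quantize --imatrix imatrix.dat \
  Qwen2.5-Omni-7B-BF16.gguf Qwen2.5-Omni-7B-IQ4_XS.gguf IQ4_XS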
llama-server \
--model Qwen2.5-Omni-7B-Q4_K_M.gguf \
--n-gpu-layers 999 \
--ctx-size 32768 \
--port 8081
Quantization Design
Recipe
AutoRound with scheme="W4A16", iters=200, applied uniformly to all Linear targets in model.thinker. No mixed-precision or per-layer overrides.
What is quantized
All Linear modules in model.thinker (the LLM backbone, 28 transformer blocks) are quantized to W4A16: INT4 weights with BF16 activations and group size 128.
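To make "INT4 weights, group size 128" concrete, here is a minimal sketch of group-wise weight quantization. It uses plain symmetric round-to-nearest, which is a deliberate simplification: AutoRound tunes the rounding during calibration, and that optimization is not reproduced here. The function names and the synthetic weight row are illustrative assumptions.

```python
def quantize_group(weights, qmin=-8, qmax=7):
    """Symmetric round-to-nearest INT4 quantization of one weight group."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def quantize_row(row, group_size=128):
    """Split a weight row into groups; each group gets its own scale."""
    return [quantize_group(row[start:start + group_size])
            for start in range(0, len(row), group_size)]


def dequantize_row(groups):
    """Rebuild approximate weights as code * scale for every element."""
    return [code * scale for codes, scale in groups for code in codes]


# Synthetic weights in [-0.1, 0.1]; a real layer would supply its own values.
row = [((i * 37) % 101 - 50) / 500.0 for i in range(256)]
groups = quantize_row(row)
restored = dequantize_row(groups)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups={len(groups)} max_abs_error={max_err:.5f}")
```

Because each group of 128 weights has its own scale, an outlier only degrades precision inside its group rather than across the whole row. Activations are untouched, which is what the "A16" half of W4A16 refers to.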
What stays BF16
| Component | Precision | Reason |
|---|---|---|
| thinker.* transformer layers | W4A16 INT4 | Quantized |
| thinker.audio_tower.* | BF16 | Whisper-based audio encoder, excluded |
| thinker.visual.* | BF16 | ViT vision encoder, excluded |
| talker.* | BF16 | Dual-track speech decoder, excluded |
| token2wav.* | BF16 | DiT + BigVGAN vocoder, excluded |
| Embeddings (embed_tokens) | BF16 | Excluded by recipe |
| LM head (lm_head) | BF16 | Excluded by recipe |
| Layer norms | BF16 | Excluded by recipe |
Only the LLM backbone weights are quantized. The audio, vision, speech decoder, and vocoder components are untouched.
Calibration corpus
| Dataset | Samples | Role |
|---|---|---|
| HuggingFaceH4/ultrachat_200k | 512 | Instruction-following / chat |
| wikitext-103-raw-v1 | 512 | General knowledge / long-form text |
| Total | 1024 | Mixed, shuffled, seq_len=2048 |
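The mixing step described in the table can be sketched as follows. This is not the script used to build the checkpoint: the sampling method, the seed, and the function names are assumptions, and short stand-in strings replace the real dataset rows so the example runs without downloads.

```python
import random


def build_calibration_mix(chat_texts, wiki_texts, per_source=512, seed=0):
    """Draw per_source samples from each corpus, then shuffle them together."""
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    picked = rng.sample(chat_texts, per_source) + rng.sample(wiki_texts, per_source)
    rng.shuffle(picked)
    return picked


def truncate_tokens(token_ids, seq_len=2048):
    """Clip one tokenized sample to the calibration sequence length."""
    return token_ids[:seq_len]


# Stand-ins for rows from ultrachat_200k and wikitext-103-raw-v1.
chat = [f"chat-{i}" for i in range(600)]
wiki = [f"wiki-{i}" for i in range(600)]

mix = build_calibration_mix(chat, wiki)
print(len(mix), len(truncate_tokens(list(range(5000)))))
```

Shuffling after sampling keeps the two sources interleaved, so no calibration batch sees only chat data or only long-form text.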
Quality Targets
| Metric | W8A16 target | W4A16 target |
|---|---|---|
| KL divergence vs BF16 | < 0.005 | < 0.018 |
| MMLU recovery | ≥ 99.7% | ≥ 98% |
W4A16 applies a wider KL tolerance than W8A16 by design: INT4 introduces more rounding error than INT8. The AutoRound optimizer and 200 calibration iterations are chosen to maximize recovery within this budget.
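Both metrics in the table are simple to compute once logits and benchmark scores are available. The sketch below shows one plausible definition of each; the KL direction (reference against quantized), the use of nats, and the per-position scope are assumptions, since the card does not spell them out. The logits and scores are toy values, not measurements.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) in nats for a single token position."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def recovery_percent(quant_score, ref_score):
    """Quantized benchmark score as a percentage of the BF16 score."""
    return 100.0 * quant_score / ref_score


# Toy values for illustration only.
ref = [2.0, 1.0, 0.1, -1.0]
quant = [1.9, 1.1, 0.1, -1.0]
print(f"KL = {kl_divergence(ref, quant):.5f} nats")
print(f"recovery = {recovery_percent(49.0, 50.0):.1f}%")
```

In practice the per-position KL would be averaged over many tokens from a held-out set before comparing it to the thresholds above.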
Competitor Comparables
| Model | Source | Format | Compare angle |
|---|---|---|---|
| Qwen/Qwen2.5-Omni-7B | official | BF16 | Quality ceiling |
| Qwen/Qwen2.5-Omni-7B-AWQ | official | AWQ 4-bit | Official W4 reference |
| Intel/Qwen2.5-Omni-7B-int4-AutoRound | Intel | AutoRound Int4 | Same method, different format |
| ggml-org/Qwen2.5-Omni-7B-GGUF | ggml | GGUF | llama.cpp users |
| 88plug/Qwen2.5-Omni-7B-W8A16 | 88plug | compressed-tensors W8A16 | Higher precision alternative |
Note: This is the only compressed-tensors W4A16 quant for Qwen2.5-Omni-7B at time of release, making it the only vLLM-native INT4 option for this model.
Benchmarks
Results pending.
| Engine | Format | Batch | ctx | tok/s | TTFT p50 | TTFT p99 | VRAM |
|---|---|---|---|---|---|---|---|
| vLLM v0.21.0 | W4A16 | 1 | 32k | - | - | - | - |
| vLLM v0.21.0 | W4A16 | 8 | 32k | - | - | - | - |
| SGLang v0.5.8 | BF16 (baseline) | 1 | 32k | - | - | - | - |
| llama.cpp b9297 | Q4_K_M GGUF | 1 | 32k | - | - | - | - |
| llama.cpp b9297 | IQ4_XS GGUF | 1 | 32k | - | - | - | - |
Hardware: A6000 48 GB, CUDA 12.9, driver 570.
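For readers who want to fill in the TTFT columns themselves, the percentile step can be sketched as below. The nearest-rank definition is an assumption, since the card does not say how p50 and p99 will be computed, and the latency list is synthetic data for demonstration, not a benchmark result.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic time-to-first-token samples in milliseconds (not measured).
ttft_ms = list(range(1, 101))

print("TTFT p50:", percentile(ttft_ms, 50), "ms")
print("TTFT p99:", percentile(ttft_ms, 99), "ms")
```

Reporting p99 alongside p50 exposes tail latency, which matters more than the median once requests are batched.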
SGLang Note
SGLang v0.5.8 does not support compressed-tensors natively. Use the BF16 base model (Qwen/Qwen2.5-Omni-7B) with SGLang for prefix-heavy workloads. The W4A16 quant is vLLM-native.
Technical Details
| Parameter | Value |
|---|---|
| Quantization library | llmcompressor + AutoRound |
| Scheme | W4A16 (INT4 weights, BF16 activations) |
| Group size | 128 |
| Calibration iterations | 200 |
| Calibration samples | 1024 |
| Calibration seq length | 2048 |
| Target scope | model.thinker only |
| Pipeline | sequential |
| Output format | compressed-tensors |
Citation
@misc{qwen25omni,
  title  = {Qwen2.5-Omni Technical Report},
  author = {Qwen Team},
  year   = {2025},
  url    = {https://huggingface.co/Qwen/Qwen2.5-Omni-7B}
}
About
88plug AI Lab produces production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models, built for native vLLM v0.21.0+ deployment with zero extra flags.
W8A16: INT8 weights + BF16 activations. Near-lossless on any Ampere+ GPU. Runs where FP8 hardware cannot.
W4A16: AutoRound with iters=200 and a mixed calibration corpus. Targets ≥ 99% MMLU recovery, the quality bar that makes W4A16 viable for production.
All weights are in compressed-tensors format. vLLM detects quantization automatically from quantization_config in config.json. No --quantization flag required.
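The same check can be done by hand before deploying. The sketch below parses a config.json string and reports the declared quantization method; the `quant_method` key name and the example JSON are assumptions based on common Hugging Face quantization configs, not a copy of this repository's file.

```python
import json


def detect_quant_method(config_text: str):
    """Return the declared quantization method, or None if the config has none."""
    config = json.loads(config_text)
    quant = config.get("quantization_config")
    if quant is None:
        return None
    return quant.get("quant_method")  # assumption: key name used by the config


# Hand-written stand-in for a downloaded config.json.
example = json.dumps({
    "model_type": "qwen2_5_omni",
    "quantization_config": {"quant_method": "compressed-tensors"},
})
print(detect_quant_method(example))
```

A `None` result would indicate an unquantized checkpoint, such as the BF16 base model.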
Also available: Qwen2.5-Omni-7B-W8A16 (INT8, ~8 GB) · Qwen2.5-Omni-7B-W4A16 (INT4, ~3.5 GB)
Browse all releases: huggingface.co/88plug