Slow GPTQ inference

#2
by NePe - opened

It seems like some weights (model.layers.*.mlp.gate) are not quantized. This slows down inference a lot: it runs about 2x slower than the 32B dense variant.
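For reference, a minimal way to check which modules were left unquantized (the repo id below is an assumption; after GPTQ quantization, quantized layers are normally replaced by a QuantLinear variant, so anything still showing up as plain nn.Linear was skipped):

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

# Assumed repo id for the quantized model under discussion.
model = AutoModelForCausalLM.from_pretrained(
    "JunHowie/Qwen3-30B-A3B-GPTQ-Int4",
    device_map="auto",
)

# Print the class of every module matching mlp.gate; a plain nn.Linear
# here means the weight was not quantized.
for name, module in model.named_modules():
    if name.endswith("mlp.gate"):
        print(name, type(module).__name__, isinstance(module, nn.Linear))
```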


Can you provide the specific hardware environment and benchmark results?
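A simple way to produce a comparable number is a tokens-per-second measurement like the sketch below (standard transformers usage; the repo id is assumed, and a CUDA GPU is assumed for the synchronize calls):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "JunHowie/Qwen3-30B-A3B-GPTQ-Int4"  # assumed repo id
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

inputs = tok("Explain GPTQ quantization briefly.", return_tensors="pt").to(model.device)
torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# Report decode throughput over the newly generated tokens only.
new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")
```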

I created an issue on the GPTQModel repo; it seems the implementation is not final, and the model will have to be requantized after the stable release:
https://github.com/ModelCloud/GPTQModel/issues/1575

Okay, I will keep an eye on this issue and re-quantize the model.
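For anyone following along, the re-quantization flow with GPTQModel looks roughly like this (a sketch based on the GPTQModel README; the bit width, group size, and calibration data are assumptions, and API names may differ across versions):

```python
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# Assumed calibration set: a small slice of C4 text, as in the README.
calibration = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(256))["text"]

# Assumed settings: 4-bit weights with group size 128.
config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load("Qwen/Qwen3-30B-A3B", config)
model.quantize(calibration, batch_size=1)
model.save("Qwen3-30B-A3B-GPTQ-Int4")
```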

I have re-uploaded the Qwen3-30B-A3B-GPTQ-Int4/Int8 models.

JunHowie changed discussion status to closed
