Slow GPTQ inference

#2
by NePe - opened

It seems like some weights (model.layers.*.mlp.gate) are not quantized. This slows down inference a lot: it runs about 2x slower than the 32B dense variant.
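For reference, a minimal way to check which modules were left unquantized (the repo id below is an assumption; after GPTQ quantization, quantized layers are normally replaced by a QuantLinear variant, so anything still showing up as plain nn.Linear was skipped):

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM

# Assumed repo id for the quantized model under discussion.
model = AutoModelForCausalLM.from_pretrained(
    "JunHowie/Qwen3-30B-A3B-GPTQ-Int4",
    device_map="auto",
)

# Print the class of every module matching mlp.gate; a plain nn.Linear
# here means the weight was not quantized.
for name, module in model.named_modules():
    if name.endswith("mlp.gate"):
        print(name, type(module).__name__, isinstance(module, nn.Linear))
```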


Can you provide the specific hardware environment and benchmark results?
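A simple way to produce a comparable number is a tokens-per-second measurement like the sketch below (standard transformers usage; the repo id is assumed, and a CUDA GPU is assumed for the synchronize calls):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "JunHowie/Qwen3-30B-A3B-GPTQ-Int4"  # assumed repo id
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

inputs = tok("Explain GPTQ quantization briefly.", return_tensors="pt").to(model.device)
torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# Report decode throughput over the newly generated tokens only.
new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")
```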

I created an issue on the GPTQModel repo; it seems the implementation is not final, and the model will have to be requantized after the stable release:
https://github.com/ModelCloud/GPTQModel/issues/1575

Okay, I will keep an eye on this issue and re-quantize the model.
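For anyone following along, the re-quantization flow with GPTQModel looks roughly like this (a sketch based on the GPTQModel README; the bit width, group size, and calibration data are assumptions, and API names may differ across versions):

```python
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# Assumed calibration set: a small slice of C4 text, as in the README.
calibration = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(256))["text"]

# Assumed settings: 4-bit weights with group size 128.
config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load("Qwen/Qwen3-30B-A3B", config)
model.quantize(calibration, batch_size=1)
model.save("Qwen3-30B-A3B-GPTQ-Int4")
```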

I have re-uploaded the Qwen3-30B-A3B-GPTQ-Int4/Int8 models.

JunHowie changed discussion status to closed
