Slow GPTQ inference
#2
by NePe - opened
It seems like some weights (model.layers.*.mlp.gate) are not quantized. This slows down inference a lot: it is roughly 2x slower than the 32B dense variant.
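You can verify this with a quick scan of the checkpoint: GPTQ-quantized linears store a `qweight` buffer, while skipped modules keep a plain `weight`. A minimal sketch (the shard filename is a placeholder; a sharded checkpoint needs a loop over every shard):

```python
from safetensors import safe_open

# Placeholder path; for a sharded checkpoint, iterate over every shard
# listed in model.safetensors.index.json.
path = "model.safetensors"

with safe_open(path, framework="pt") as f:
    names = list(f.keys())

# Modules that carry GPTQ buffers (qweight/qzeros/scales) are quantized.
quantized = {n.rsplit(".", 1)[0] for n in names if n.endswith(".qweight")}

# Modules that still have a plain .weight tensor were left in full precision
# (norms and embeddings always show up here; so does mlp.gate in this case).
for n in names:
    if n.endswith(".weight"):
        module = n.rsplit(".", 1)[0]
        if module not in quantized:
            print("not quantized:", module)
```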
Can you provide the specific hardware environment and benchmark results?
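Even a rough tokens-per-second number would help. A minimal timing sketch, assuming transformers and a CUDA GPU (the repo ID is a placeholder):

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "JunHowie/Qwen3-30B-A3B-GPTQ-Int4"  # placeholder repo ID
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)

inputs = tok("Hello, my name is", return_tensors="pt").to(model.device)

# Time greedy decoding of a fixed number of new tokens.
torch.cuda.synchronize()
t0 = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```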
I created an issue on the GPTQModel repo; it seems the implementation is not final, and the model will have to be requantized after the stable release:
https://github.com/ModelCloud/GPTQModel/issues/1575
Okay, I will keep an eye on this issue and re-quantize the model.
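For reference, a minimal re-quantization sketch following the pattern in the GPTQModel README (the calibration data, paths, and batch size are illustrative, and per the issue above this should wait for a stable release):

```python
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "Qwen/Qwen3-30B-A3B"          # base model to quantize
quant_path = "Qwen3-30B-A3B-GPTQ-Int4"   # output directory

# Illustrative calibration set; any representative text corpus works.
calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(1024))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128)  # bits=8 for the Int8 variant

model = GPTQModel.load(model_id, quant_config)
model.quantize(calibration_dataset, batch_size=1)
model.save(quant_path)
```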
I have re-uploaded the Qwen3-30B-A3B-GPTQ-Int4/Int8 models.
JunHowie changed discussion status to closed