Q3_K_XL?

#2
by floory - opened

Where? ;( Been waiting for hours, is it planned?

Unsloth AI org

I think the script randomly stopped, we're investigating :0

I'm also looking forward to the release of Q3_K_XL.

still actively waiting :(

Still waiting too, also UD-IQ2_M and UD-IQ2_XXS would be nice to have, thanks.

What would be the ideal size to fit into exactly 16gb of ram?

> What would be the ideal size to fit into exactly 16gb of ram?

I'll assume you mean VRAM (GPU). Running a 27B dense model from CPU is probably not a good idea.

It depends on how much context you want. In my case I found Qwen 3.5 to be awesome for coding on an RTX 5060 w/16GB at Q3_K_M; I get about 48k tokens of context on llama.cpp with this:

```
-fa on -fitt 50 --fit-ctx 40000 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --presence-penalty 0.0 --batch-size 2048 --ubatch-size 128 --n-predict 20000
```

Of note, since this is a dense model, offloading all layers to VRAM is recommended. I get about 500 t/sec input and 20 t/sec output. Also, I'm not using the multimodal projector (images).

I haven't tried Q3_K_XL yet, but it's about 900 MB larger, so I figure that will put you under 40,000 tokens of context. That's very tight for coding, but could be good enough for other tasks.
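As a rough way to sanity-check the "model size vs. context" trade-off above, you can estimate total VRAM as the GGUF file size plus the KV cache. A minimal sketch follows; the layer/head numbers and file size below are illustrative placeholders, not the real config of any particular model, and real models with grouped-query or sliding-window attention need considerably less KV cache than this naive formula suggests (check llama.cpp's load log for actual figures):

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Naive KV cache estimate: K and V tensors per layer, fp16 elements."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Illustrative placeholder config -- NOT the real 27B model's numbers
n_layers, n_kv_heads, head_dim = 48, 8, 128
model_file_gb = 12.5  # e.g. roughly the on-disk size of a mid-size quant

for n_ctx in (16_000, 40_000, 48_000):
    kv_gb = kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim) / 1e9
    print(f"{n_ctx:>6} ctx -> model {model_file_gb:.1f} GB + KV {kv_gb:.1f} GB")
```

This is why a quant that is ~900 MB bigger directly eats into how much context fits in a fixed 16 GB budget.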
