Q3_K_XL?

#2
by floory - opened

Where? ;( Been waiting for hours, is it planned?

Unsloth AI org

I think the script randomly stopped, we're investigating :0

I'm also looking forward to the release of Q3_K_XL.

still actively waiting :(

Still waiting too, also UD-IQ2_M and UD-IQ2_XXS would be nice to have, thanks.

What would be the ideal size to fit into exactly 16gb of ram?

> What would be the ideal size to fit into exactly 16gb of ram?

I'll assume you mean VRAM (GPU). Running a 27B dense model from CPU is probably not a good idea.

It depends on how much context you want. In my case I found Qwen 3.5 to be awesome for coding on an RTX 5060 w/16GB at Q3_K_M; I get about 48k tokens of context on llama.cpp with this:

```
-fa on -fitt 50 --fit-ctx 40000 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --presence-penalty 0.0 --batch-size 2048 --ubatch-size 128 --n-predict 20000
```

Of note, since this is a dense model, offloading all layers to VRAM is recommended. I get about 500 t/sec input and 20 t/sec output. Also, I'm not using the multimodal projector (images).

I haven't tried Q3_K_XL yet, but it's about 900 MB larger, so I figure that will put you under 40,000 tokens of context. That's very tight for coding, but could be good enough for other tasks.
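As a rough way to sanity-check the "model size vs. context" trade-off above, you can estimate total VRAM as the GGUF file size plus the KV cache. A minimal sketch follows; the layer/head numbers and file size below are illustrative placeholders, not the real config of any particular model, and real models with grouped-query or sliding-window attention need considerably less KV cache than this naive formula suggests (check llama.cpp's load log for actual figures):

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Naive KV cache estimate: K and V tensors per layer, fp16 elements."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Illustrative placeholder config -- NOT the real 27B model's numbers
n_layers, n_kv_heads, head_dim = 48, 8, 128
model_file_gb = 12.5  # e.g. roughly the on-disk size of a mid-size quant

for n_ctx in (16_000, 40_000, 48_000):
    kv_gb = kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim) / 1e9
    print(f"{n_ctx:>6} ctx -> model {model_file_gb:.1f} GB + KV {kv_gb:.1f} GB")
```

This is why a quant that is ~900 MB bigger directly eats into how much context fits in a fixed 16 GB budget.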
