“The doom lies in yourself, not in your name.”

#15
by jukofyork - opened

Continuation of Wur doomed!.

For longer text chunks or stories, https://pastebin.com works great and helps prevent the thread from slowing down!

🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧🟧
🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛🟧
🟧🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧🟧
⬜🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧⬛🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧⬛⬛⬛⬛🟧⬛⬛⬛⬛🟧🟧⬛⬛⬛🟧⬛⬛⬛🟧🟧⬛⬛⬛⬛🟧⬛⬛⬛🟧🟧🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧⬛⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛⬛⬛⬛⬛🟧⬛⬛⬛⬛⬛⬛⬛⬛🟧🟧🟧⬛⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛⬛⬛🟧🟧⬜🟧🟧⬛⬛⬛⬛⬛⬛🟧🟧🟧⬛⬛⬛⬛⬛⬛🟧🟧⬜🟧⬛⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛⬛🟧🟧⬜⬜⬜🟧🟧⬛⬛⬛⬛🟧🟧⬜🟧🟧⬛⬛⬛⬛🟧🟧⬜⬜🟧🟧⬛🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜🟧🟧⬛⬛🟧🟧⬜⬜⬜🟧🟧⬛⬛🟧🟧⬜⬜⬜⬜🟧🟧🟧⬜🟧⬛⬛⬛🟧⬜
⬜🟧⬛⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜🟧🟧🟧🟧⬜⬜⬜⬜⬜🟧🟧🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧⬛⬛🟧⬜
⬜🟧⬛⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧⬛⬛🟧⬜
⬜🟧⬛⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧⬛🟧⬜
⬜🟧⬛🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧⬛🟧⬜
⬜🟧🟧🟧⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜⬜🟧🟧🟧⬜

The doom is still buried within Command-A for sure.

Only another 38 days to go:

image.png

It's actually going really well, and I'm pretty sure it will have mostly converged within another couple of days:

image.png

🤞

A step 601 preview - all with temperature = 0:

https://pastebin.com/GASKaHTk

https://pastebin.com/CRT81QLb

  • It's still messing up some end of lines, but I can live with that if it works... If it turns out to be a problem, it can likely be fixed later using the new class 0 random data.
  • The Grimdark story was noticeably (much!) better compared to the inverse.
  • The Battlestar Galactica story showed that even though Q8_0, F16 and BF16 all diverge slightly from F32, it's not clearly making them any worse (I actually liked the Q8_0 story best!).
Size Name
287M command-a-03-2025-lora-Q8_0.ggu
541M command-a-03-2025-lora-F16.gguf
541M command-a-03-2025-lora-BF16.gguf
1.1G command-a-03-2025-lora-F32.gguf
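
As a back-of-envelope check, the sizes above roughly track bits per weight. The sketch below assumes the usual ggml layouts (F32 = 32 bits, F16/BF16 = 16 bits, Q8_0 = blocks of 32 int8 values plus one 16-bit scale, so 8.5 bits per weight) and ignores metadata and any tensors kept at a different precision. The helper name is made up for illustration.

```python
# Bits per weight for each format (assumed standard ggml layouts).
BITS_PER_WEIGHT = {
    "F32": 32.0,
    "F16": 16.0,
    "BF16": 16.0,
    "Q8_0": (32 * 8 + 16) / 32,  # 32 int8 values + one 16-bit scale per block
}


def estimated_size_mb(reference_mb: float, reference_fmt: str, target_fmt: str) -> float:
    """Scale a known file size to another format by bits per weight."""
    return reference_mb * BITS_PER_WEIGHT[target_fmt] / BITS_PER_WEIGHT[reference_fmt]


if __name__ == "__main__":
    f16_mb = 541  # F16 size taken from the listing above
    for fmt in ("Q8_0", "F16", "BF16", "F32"):
        print(f"{fmt:5s} ~{estimated_size_mb(f16_mb, 'F16', fmt):.0f}M")
```

Starting from the 541M F16 file, this gives roughly 287M for Q8_0 and 1082M for F32, which lines up with the listing.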

It still has a way to go before it starts to converge, but I would think by step 1000 it will be pretty close:

image.png

566 responses in the previous thread! In the future we may be the reason for HF staff to implement a multi-page view of discussions.

This was posted on Hacker News today:

https://outsidetext.substack.com/p/how-does-a-blind-model-see-the-earth?selection=5413dcae-b9f4-4adb-8826-d48e3908de2a#:~:text=Wow%2C%20best%20rendition%20of%20the%20Global%20West%20so%20far

Absolutely fascinating!

That was really cool. Thanks for sharing!

Yeah, and llama-3.1:405b doing so well was quite a surprise too (and makes you a bit sad that everything seems to be moving away from large dense models).

Ah shit, here I go again...

I wish ik supported the SWA memory savings like mainline does.

IK's 2/3 bit quant of Mimo would be perfect for 128GB RAM systems, or maybe Pro in some larger configs, but the hit to KV cache size is considerable.

@Downtown-Case

I wish ik supported the SWA memory savings like mainline does.
IK's 2/3 bit quant of Mimo would be perfect for 128GB RAM systems, or maybe Pro in some larger configs, but the hit to KV cache size is considerable.

Ah, maybe that's why I have to use 4 GPUs for gemma-4 with ik vs 2 for mainline. I haven't looked into it yet.
Gemma-4 is broken with ik_llama anyway. If you give it a long prompt with > 10k tokens in a single turn, the context gets truncated.
It's a shame because -sm graph is much faster than -sm tensor
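
For a rough feel of why SWA-aware caching matters, here is a minimal sketch of the arithmetic. Every dimension below is hypothetical (not the real Gemma or MiMo config), and the function is not taken from either codebase. It only assumes that a sliding-window layer needs to keep about `window` tokens of K/V, while a full-attention layer keeps the whole context. Real implementations may keep somewhat more than the bare window.

```python
def kv_cache_bytes(n_ctx, n_full_layers, n_swa_layers, window,
                   n_kv_heads, head_dim, bytes_per_elem=2, swa_aware=True):
    """Estimate KV cache size: one K and one V vector per layer per cached token."""
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem  # K + V
    # Without SWA-aware caching, sliding-window layers store the full context too.
    swa_tokens = min(n_ctx, window) if swa_aware else n_ctx
    cached_tokens = n_full_layers * n_ctx + n_swa_layers * swa_tokens
    return per_token_per_layer * cached_tokens


if __name__ == "__main__":
    # Hypothetical model: 8 full-attention layers, 40 sliding-window layers.
    cfg = dict(n_ctx=32768, n_full_layers=8, n_swa_layers=40, window=1024,
               n_kv_heads=8, head_dim=128)
    aware = kv_cache_bytes(**cfg, swa_aware=True)
    naive = kv_cache_bytes(**cfg, swa_aware=False)
    print(f"SWA-aware: {aware / 1024**3:.2f} GiB")
    print(f"naive:     {naive / 1024**3:.2f} GiB")
```

With mostly sliding-window layers, the naive cache can be several times larger, which would be consistent with needing more GPUs for the same context.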

@jukofyork

https://github.com/ikawrakow/ik_llama.cpp/issues/1769

shit:

In any case, it wouldn't be a big deal to add to ik_llama.cpp the ability to load pre-merged attention tensors. After all, ik_llama.cpp has the ability to merge them on-the-fly when loading the model (in case Q, K and V are of the same quantization type) thus achieving the exact same result as with pre-merged, backwards incompatible models. But llama.cpp developers constantly breaking backwards compatibility for no real reason is just a bit too much.

Well, it looks like we won't be getting this any time soon...

prompt eval time =  499899.61 ms / 16798 tokens (   29.76 ms per token,    33.60 tokens per second)
       eval time =   31262.72 ms /   466 tokens (   67.09 ms per token,    14.91 tokens per second)
      total time =  531162.33 ms / 17264 tokens

That's MiMo Pro IQ2_S on my rig. With mainline, prompt processing is PCIe bandwidth-bound, with GPU0 maxing out the lane.

MiMo is much "calmer" than Kimi and has much less slop than before with proper bans (still sloppier than Kimi), yet it still lacks knowledge, which makes it not very useful for me. I got too accustomed to Kimi just getting the obscure references; I really can't downgrade.

It's probably very easy to hack convert_hf_to_gguf.py

It was easier / cheaper for me to just un-fuse existing quants rather than do the whole "safetensors -> bf16.gguf -> convert to q8 -> generate imatrix (ik_llama doesn't work with the newer .gguf imatrix format) -> create a quant I can actually run" dance.

I've uploaded some here: gghfez/MiMo-V2.5-Pro-unfused-test
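
Conceptually, a fused attention tensor is just the Q, K and V projection matrices stacked along the output (row) dimension, so un-fusing means slicing the rows back out. Below is a toy sketch with plain Python lists. The function names and shapes are hypothetical, and this is neither the script used for the upload above nor ik_llama.cpp's loader. It also assumes quantization blocks run along each row, so slicing by rows would not cut through a block.

```python
def fuse_qkv(q, k, v):
    """Stack Q, K and V weight rows into one fused matrix (list of rows)."""
    width = len(q[0])
    if any(len(row) != width for matrix in (q, k, v) for row in matrix):
        raise ValueError("Q, K and V must share the same input dimension")
    return q + k + v


def unfuse_qkv(fused, n_q_rows, n_k_rows, n_v_rows):
    """Split a fused matrix back into Q, K and V by row counts."""
    if len(fused) != n_q_rows + n_k_rows + n_v_rows:
        raise ValueError("row counts do not add up to the fused tensor height")
    q = fused[:n_q_rows]
    k = fused[n_q_rows:n_q_rows + n_k_rows]
    v = fused[n_q_rows + n_k_rows:]
    return q, k, v


if __name__ == "__main__":
    # Tiny made-up weights: 4 query rows, 2 key rows, 2 value rows, width 3.
    q = [[1, 2, 3]] * 4
    k = [[4, 5, 6]] * 2
    v = [[7, 8, 9]] * 2
    fused = fuse_qkv(q, k, v)
    assert unfuse_qkv(fused, 4, 2, 2) == (q, k, v)  # round-trips exactly
    print(len(fused), "fused rows")
```

Since no values change, only the tensor boundaries, this kind of split can in principle be applied to already-quantized data without re-quantizing.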

prompt eval time =  246776.36 ms / 18987 tokens (   13.00 ms per token,    76.94 tokens per second)
       eval time =   37741.93 ms /   581 tokens (   64.96 ms per token,    15.39 tokens per second)
      total time =  284518.29 ms / 19568 tokens

The same quant is 2x faster (prompt processing) now so I'll be able to actually try it lol.
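
The "2x" can be checked directly from the two prompt eval lines above. A small sketch that parses them and compares throughput (the regex is my own and only matches this particular log layout; the two runs also used different prompt lengths, so treat the ratio as approximate):

```python
import re

# Matches "= <ms> ms / <n> tokens" as printed in the timing lines above.
TIMING = re.compile(r"=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*tokens")


def tokens_per_second(line: str) -> float:
    """Recompute throughput from a llama.cpp-style timing line."""
    match = TIMING.search(line)
    if match is None:
        raise ValueError(f"not a timing line: {line!r}")
    elapsed_ms, n_tokens = float(match.group(1)), int(match.group(2))
    return n_tokens / (elapsed_ms / 1000.0)


BEFORE = "prompt eval time =  499899.61 ms / 16798 tokens"
AFTER = "prompt eval time =  246776.36 ms / 18987 tokens"

if __name__ == "__main__":
    before, after = tokens_per_second(BEFORE), tokens_per_second(AFTER)
    print(f"before: {before:.2f} t/s, after: {after:.2f} t/s, "
          f"speedup: {after / before:.2f}x")
```

That works out to a bit over 2x for prompt processing, while the generation speed in the two logs is nearly unchanged.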

https://github.com/ggml-org/llama.cpp/pull/22596

🚀


Also noticed ik_llama.cpp added support for split mode graph with MLA recently - has anybody tried this yet?

Yeah I saw that (graph split), but I don't think there are any worthwhile MLA models I could fit in 144GB VRAM, and my understanding is there would be no benefit if I have experts on the CPU.
