hypothetical
posted an update 3 days ago

I'd love to see this treatment applied to some of the larger models. I've been using G4 26B models and used to use 70B models; squashed down, those would remove the need to quantize at all. It would even make the 100B+ models workable.

(Note: I have 8 GB of VRAM, so memory is definitely the bottleneck.)
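
For a rough sense of scale, here is a minimal back-of-the-envelope sketch of the weight memory involved. It counts weights only (no KV cache, activations, or runtime overhead), uses 1 GB = 1e9 bytes, and the helper names are made up for illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_param / 8 / 1e9


def max_bits_per_param(num_params: float, budget_gb: float) -> float:
    """Average bits per parameter that would fit the weights in budget_gb."""
    return budget_gb * 1e9 * 8 / num_params


if __name__ == "__main__":
    budget_gb = 8.0  # the VRAM budget mentioned in the post
    for name, params in [("26B", 26e9), ("70B", 70e9), ("100B", 100e9)]:
        fp16 = weight_memory_gb(params, 16)
        int4 = weight_memory_gb(params, 4)
        bits = max_bits_per_param(params, budget_gb)
        print(f"{name}: 16-bit {fp16:.0f} GB, 4-bit {int4:.0f} GB, "
              f"fits {budget_gb:.0f} GB at about {bits:.2f} bits/param")
```

In practice part of the model can also be offloaded to system RAM, so these numbers describe only what fits entirely in VRAM.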


Yes, fully understand. Team is working on a new set of releases.

For now, a lot of compressed checkpoints are being released without proper evaluation.
The team's pipeline always starts by reproducing the "paper" results for the released models (which takes time, partly because it is sometimes hard to recover the full evaluation technique). We then evaluate the best existing checkpoints, and then run TheStage AI algorithms with evals to confirm the quality improvement.

Each step takes time compared to just releasing a set of quantized models. Our goal is to build meaningful, controllable compression and to release models that are not just small but can really do the work, with clearly stated limitations.
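
To make the three steps concrete, here is a minimal sketch of that kind of release gate. It is purely illustrative and not TheStage AI's actual pipeline: the `EvalResult` type, the function names, the single "higher is better" score, and the 0.5-point tolerance are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    name: str
    score: float  # hypothetical benchmark score, higher is better


def reproduces_paper(reproduced: float, paper: float, tol: float = 0.5) -> bool:
    """Step 1: the eval setup should match the published number within tol."""
    return abs(reproduced - paper) <= tol


def approve_release(
    paper_score: float,
    reproduced_score: float,
    baselines: list[EvalResult],
    candidate: EvalResult,
    tol: float = 0.5,
) -> bool:
    # Step 1: do not trust any comparison until the paper result is reproduced.
    if not reproduces_paper(reproduced_score, paper_score, tol):
        return False
    # Step 2: find the best existing compressed checkpoint under the same eval.
    best = max((b.score for b in baselines), default=float("-inf"))
    # Step 3: approve only if the new compressed model improves on it.
    return candidate.score > best
```

For example, `approve_release(70.0, 69.8, [EvalResult("existing-4bit", 66.0)], EvalResult("candidate", 67.5))` would pass, while the same call with a reproduced score of 60.0 would be rejected at step 1.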

You can check some high-level ideas of automated/controllable compression here: https://huggingface.co/spaces/TheStageAI/ANNA-LLM
