Helium-1-2B GGUF

🟢 Fits on: every GPU class, including integrated graphics. Runs on phones at Q2_K.

GGUF conversion of kyutai/helium-1-2b, Kyutai's lightweight 2B base language model targeting edge and mobile devices, with native support for all 24 official EU languages.

This is a community quantization. The base model is by Kyutai (creators of Mimi, Moshi, and the Kyutai TTS/STT family). Until now, only MLX (Apple Silicon) variants existed; this release fills the GGUF gap for llama.cpp and Ollama users.

Model details

Field | Value
Architecture | LlamaForCausalLM (standard Llama; works with stock llama.cpp)
Parameters | 2B
Layers | 28
Hidden size | 2048
Vocab | 64,000 (multilingual)
Context | 4K
Type | Base model (not instruction-tuned)
License | CC-BY-SA 4.0 + Gemma Terms of Use (Helium is distilled from Gemma 2)

Use case

  • Edge / mobile inference: fits comfortably on consumer hardware, including phones and small GPUs
  • EU multilingual base: train your own instruction-following model on top of this with the language coverage you need
  • Research: distillation lineage from Gemma 2 with a smaller footprint
  • Not for chat out of the box: this is a base model with no instruction tuning. For chat, fine-tune it first.

Quants

Quant | Size | Use case
Q2_K | ~0.8 GB | tiniest footprint: phones, 4 GB cards
Q3_K_M | ~1.0 GB | balance for 6 GB cards
Q4_K_M | ~1.2 GB | recommended default; fits anywhere
Q5_K_M | ~1.5 GB | quality bump if you have headroom
Q6_K | ~1.8 GB | near-lossless
Q8_0 | ~2.3 GB | reference quality
F16 | ~4.0 GB | unquantized 16-bit weights
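
As a rough aid for choosing from the sizes listed above, the sketch below picks the largest quant whose file size plus some working headroom fits a given memory budget. The helper name `pick_quant` and the 0.5 GB default headroom (for KV cache and runtime buffers) are assumptions for illustration, not measured values; real memory use depends on context length and backend.

```shell
#!/usr/bin/env bash
# Pick the largest quant that fits a memory budget (in GB).
# Sizes are the approximate file sizes from the quant list above.
# headroom_gb is an assumed allowance for KV cache and buffers, not a measurement.
pick_quant() {
  local budget_gb="$1" headroom_gb="${2:-0.5}"
  awk -v budget="$budget_gb" -v headroom="$headroom_gb" '
    BEGIN {
      # Ordered from largest to smallest so the first fit is the best fit.
      n = split("F16:4.0 Q8_0:2.3 Q6_K:1.8 Q5_K_M:1.5 Q4_K_M:1.2 Q3_K_M:1.0 Q2_K:0.8", rows, " ")
      for (i = 1; i <= n; i++) {
        split(rows[i], kv, ":")
        if (kv[2] + headroom <= budget) { print kv[1]; exit 0 }
      }
      print "none"
      exit 1
    }'
}

pick_quant 4      # budget of 4 GB with the default headroom
pick_quant 1.6    # a tighter budget
```

Passing a second argument overrides the headroom, for example `pick_quant 2 0.2` if you plan to run with a very short context.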

Usage: Ollama

hf download RhinoWithAcape/helium-1-2b-GGUF \
  helium-1-2b.Q4_K_M.gguf Modelfile --local-dir ./helium
cd ./helium
ollama create helium-1-2b:Q4_K_M -f Modelfile
ollama run helium-1-2b:Q4_K_M "Once upon a time"
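
The contents of the repo's Modelfile are not reproduced here. For readers who want to write their own, the following is a minimal sketch of what a Modelfile for a base (non-chat) model could look like. The pass-through template, the `num_ctx` value (taken from the 4K context listed in the model details), and the output name `Modelfile.example` are illustrative assumptions, not the shipped configuration.

```shell
#!/usr/bin/env bash
# Write an illustrative Modelfile for a base model.
# This is a guess at a reasonable minimal config, NOT the file shipped in the repo.
cat > Modelfile.example <<'EOF'
FROM ./helium-1-2b.Q4_K_M.gguf
# Base model: pass the prompt through unchanged, with no chat template.
TEMPLATE """{{ .Prompt }}"""
# 4K context, matching the model details.
PARAMETER num_ctx 4096
PARAMETER temperature 0.6
EOF

# Show the result so it can be reviewed before running `ollama create`.
cat Modelfile.example
```

The pass-through template matters for a base model: wrapping the prompt in chat markers it was never trained on would likely degrade completions.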

Usage: llama.cpp

./build/bin/llama-completion \
    -m helium-1-2b.Q4_K_M.gguf \
    -p "The capital of France is" \
    -n 30 --temp 0.6

(Sample: "The capital of France is Paris...")

License notes

  • This conversion is CC-BY-SA 4.0 (matching the source release).
  • Helium-1 is distilled from Gemma 2, so use is also subject to the Gemma Terms of Use.
  • This GGUF inherits both terms.

Conversion details

  • Source: kyutai/helium-1-2b (downloaded 2026-04-29; Q2_K + Q3_K_M backfilled 2026-05-02)
  • Tools: stock llama.cpp (no patches required; standard Llama arch)
  • Steps: convert_hf_to_gguf.py → llama-quantize
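
The two steps above can be sketched roughly as follows. The exact commands used for this release are not recorded here, so treat the paths (`LLAMA_CPP`, `SRC_DIR`), the intermediate file name `helium-1-2b.F16.gguf`, and the `run` helper as assumptions. By default the script only prints the commands; set `DRY_RUN=0` to execute them against a built llama.cpp checkout.

```shell
#!/usr/bin/env bash
# Sketch of the two-step pipeline: HF checkpoint -> F16 GGUF -> quantized GGUF.
# All paths are hypothetical; adjust them to your setup.
LLAMA_CPP="${LLAMA_CPP:-./llama.cpp}"   # assumed location of a built llama.cpp checkout
SRC_DIR="${SRC_DIR:-./helium-1-2b}"     # assumed local copy of kyutai/helium-1-2b
DRY_RUN="${DRY_RUN:-1}"                 # 1 = print commands only, 0 = run them

run() {
  # Print the command; execute it only when DRY_RUN is 0.
  printf '%s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

convert_and_quantize() {
  local quant="$1"
  # Step 1: convert the Hugging Face checkpoint to an unquantized F16 GGUF.
  run python "$LLAMA_CPP/convert_hf_to_gguf.py" "$SRC_DIR" \
      --outfile helium-1-2b.F16.gguf --outtype f16
  # Step 2: quantize the F16 file to the requested type.
  run "$LLAMA_CPP/build/bin/llama-quantize" \
      helium-1-2b.F16.gguf "helium-1-2b.$quant.gguf" "$quant"
}

convert_and_quantize Q4_K_M
```

Calling `convert_and_quantize` once per quant type repeats the conversion step. In practice you would convert once and loop only over the `llama-quantize` call.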

More from RhinoWithAcape

We're a small AI lab making powerful models actually run on consumer GPUs. We publish curated GGUFs with the full Q2/Q3/Q4 ladder for 12-16 GB cards, plus first-mover conversions for new architectures.

→ Full catalogue at huggingface.co/RhinoWithAcape

Acknowledgments

  • Kyutai for the open release of Helium-1, targeting under-served EU language coverage at edge scale
  • Google DeepMind for the Gemma 2 base from which Helium was distilled
  • llama.cpp maintainers