BYOL Chichewa 4B CPT

This model was produced with BYOL, a framework for extending LLMs to low-resource languages.

Model Description

This is a continually pre-trained (CPT) language model adapted for Chichewa (nya). Starting from Gemma 3 4B, the model was further trained on a curated bilingual corpus of Chichewa and English text using the BYOL framework. This continued pre-training extends the base model's knowledge of and fluency in Chichewa while retaining its English capabilities.
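The full training recipe is described in the BYOL paper. As a rough illustration only, continual pre-training on a bilingual corpus with the Hugging Face Trainer can look like the sketch below; the dataset name, base checkpoint ID, sequence length, and hyperparameters are placeholders, not the actual BYOL configuration.

# Illustrative sketch of continual pre-training (CPT) on a bilingual corpus.
# NOTE: dataset name, checkpoint ID, and hyperparameters are placeholders,
# not the settings used to train this model.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_id = "google/gemma-3-4b-pt"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Hypothetical Chichewa/English corpus with a "text" column.
corpus = load_dataset("my-org/chichewa-english-corpus", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = corpus.map(tokenize, batched=True, remove_columns=corpus.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="byol-nya-4b-cpt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=tokenized,
    # Causal LM objective: mlm=False gives standard next-token prediction.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()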

As a base (non-instruction-tuned) model, it is best suited for text completion tasks. For a chat/instruction-following model, see the merged variant.

Usage

Install the latest Transformers release:

pip install -U transformers

Then load the model and generate a completion:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ai-for-good-lab/byol-nya-4b-cpt"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", dtype=torch.bfloat16)

# Text completion (base model)
prompt = "Dziko la Malawi ndi dziko la"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
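
Greedy decoding works well for short factual completions; for more varied text you can sample instead. The generation settings below are illustrative defaults, not tuned values from the BYOL paper.

# Sampled generation for more varied completions (illustrative settings)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))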

Citation

@article{zamir2026byolbringlanguagellms,
    title={BYOL: Bring Your Own Language Into LLMs},
    author={Syed Waqas Zamir and Wassim Hamidouche and Boulbaba Ben Amor and Luana Marotti and Inbal Becker-Reshef and Juan Lavista Ferres},
    year={2026},
    journal={arXiv preprint arXiv:2601.10804},
    url={https://arxiv.org/abs/2601.10804},
}