# Sinhala-Extended LLaMA 3.2 1B Tokenizer
An extended tokenizer for LLaMA 3.2 1B with a Sinhala-specific vocabulary added to the base tokenizer. Developed as part of a diversity-driven Sinhala language model adaptation study.
## Motivation
The base LLaMA 3.2 1B tokenizer has very limited coverage of Sinhala script, resulting in severe over-tokenization: Sinhala words are split into character-level or sub-character byte fragments. This extended tokenizer adds 7,843 Sinhala-specific tokens trained on a large Sinhala corpus, reducing the token count on Sinhala text by approximately 70.4%.
## Tokenizer Details
| Metric | Value |
|---|---|
| Base tokenizer | meta-llama/Llama-3.2-1B |
| Base vocabulary size | 128,256 |
| Sinhala tokens added | 7,843 |
| Extended vocabulary size | 136,099 |
| Token reduction on Sinhala text | ~70.4% |
| Training corpus | Minuri/diverse_sinhala_dataset (12.38M sentences) |
| SentencePiece model type | Unigram |
| SentencePiece vocab size | 8,000 |
| Special tokens | `<\|begin_of_text\|>`, `<\|end_of_text\|>` |
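The extended vocabulary size in the table follows directly from the base size plus the added Sinhala tokens; a one-line consistency check:

```python
# Consistency check for the vocabulary sizes reported above.
base_vocab_size = 128_256      # LLaMA 3.2 1B base vocabulary
sinhala_tokens_added = 7_843   # Sinhala-specific tokens added
extended_vocab_size = base_vocab_size + sinhala_tokens_added
print(extended_vocab_size)  # 136099
```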
## Tokenization Example
The example below shows the dramatic improvement in Sinhala tokenization over the base tokenizer:
```python
from transformers import AutoTokenizer

text = "ශ්රී ලංකාවේ භාෂා විද්යාව"

# Base LLaMA 3.2 1B tokenizer - 47 tokens (garbled byte-level fragments)
base = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
print(base.tokenize(text))
# ['à·', 'ģ', 'à·', 'Ĭ', 'âĢį', 'à¶', '»', 'à·', 'ĵ', 'Ġà', '¶', '½', ...]
# Token count: 47

# Extended Sinhala tokenizer - 13 tokens (readable Sinhala subwords)
ext = AutoTokenizer.from_pretrained("Minuri/sinhala-llama-3.2-1b-tokenizer")
print(ext.tokenize(text))
# ['ශ්', '\u200d', 'රී', 'Ġ', 'ලංකාවේ', 'Ġ', 'භා', 'ෂා', 'Ġ', 'විද', '්', '\u200d', 'යාව']
# Token count: 13
```
For this example, the token count drops from 47 to 13, a 72.3% reduction. The base tokenizer produces garbled byte-level fragments with no Sinhala script awareness, while the extended tokenizer produces meaningful Sinhala subword units.
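The per-example figure above and the corpus-level ~70.4% figure are both computed the same way; a small helper makes the arithmetic explicit:

```python
def reduction_pct(base_tokens: int, extended_tokens: int) -> float:
    """Percentage reduction in token count from base to extended tokenizer."""
    return (base_tokens - extended_tokens) / base_tokens * 100

# The example above: 47 base tokens down to 13 extended tokens.
print(round(reduction_pct(47, 13), 1))  # 72.3
```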
## Usage
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Minuri/sinhala-llama-3.2-1b-tokenizer")

text = "ශ්රී ලංකාවේ භාෂා විද්යාව"
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
print("Token count:", len(tokens))
```
## Intended Uses
- Tokenization of Sinhala text for LLM pretraining and fine-tuning
- Drop-in replacement for the base LLaMA 3.2 1B tokenizer for Sinhala tasks
- Extending other LLaMA-family models with Sinhala vocabulary
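The last use case, grafting the Sinhala vocabulary onto another LLaMA-family tokenizer, comes down to appending the new tokens with fresh ids. A minimal sketch of that merge logic with plain dictionaries (in practice you would use `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))` from transformers; the toy vocabulary below is illustrative only):

```python
def extend_vocab(base_vocab: dict, new_tokens: list) -> tuple:
    """Append tokens not already in the base vocabulary, assigning fresh ids.

    Returns the merged vocabulary and the number of tokens actually added.
    """
    vocab = dict(base_vocab)
    next_id = max(vocab.values()) + 1
    added = 0
    for tok in new_tokens:
        if tok not in vocab:  # skip tokens the base tokenizer already has
            vocab[tok] = next_id
            next_id += 1
            added += 1
    return vocab, added

# Toy base vocabulary; real ids come from the base tokenizer.
base = {"<|begin_of_text|>": 0, "hello": 1}
merged, n = extend_vocab(base, ["ලංකාවේ", "hello", "භාෂා"])
print(n)                  # 2 (the duplicate "hello" is skipped)
print(merged["ලංකාවේ"])   # 2
```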
## Models Using This Tokenizer
| Repo | Description |
|---|---|
| Minuri/sinhala-llama-1b-corpus-news | CPT Model A - news-only |
| Minuri/sinhala-llama-1b-corpus-random | CPT Model B - random |
| Minuri/sinhala-llama-1b-corpus-diverse | CPT Model C - diversity-optimised |
| Minuri/sinhala-llama-1b-sft-news | SFT Model A |
| Minuri/sinhala-llama-1b-sft-random | SFT Model B |
| Minuri/sinhala-llama-1b-sft-diverse | SFT Model C |
## Related Repositories
| Repo | Description |
|---|---|
| Minuri/diverse_sinhala_dataset | Tokenizer training corpus (12.38M sentences) |
## License
This tokenizer is derived from meta-llama/Llama-3.2-1B and is subject to the LLaMA 3.2 Community License.