# Sinhala-Extended LLaMA 3.2 1B Tokenizer

An extended tokenizer for LLaMA 3.2 1B with a Sinhala-specific vocabulary added to the base tokenizer. Developed as part of a diversity-driven Sinhala language model adaptation study.

## Motivation

The base LLaMA 3.2 1B tokenizer has very limited coverage of the Sinhala script, so Sinhala text is severely over-tokenized: words are split into character-level or sub-character byte fragments. This extended tokenizer adds 7,843 Sinhala-specific tokens trained on a large Sinhala corpus, reducing the token count on Sinhala text by approximately 70.4%.

## Tokenizer Details

| Metric | Value |
|---|---|
| Base tokenizer | meta-llama/Llama-3.2-1B |
| Base vocabulary size | 128,256 |
| Sinhala tokens added | 7,843 |
| Extended vocabulary size | 136,099 |
| Token reduction on Sinhala text | ~70.4% |
| Training corpus | Minuri/diverse_sinhala_dataset (12.38M sentences) |
| SentencePiece model type | Unigram |
| SentencePiece vocab size | 8,000 |
| Special tokens | `<\|begin_of_text\|>`, `<\|end_of_text\|>` |
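
The exact training script is not published, but the table implies the pipeline: train an 8,000-piece SentencePiece Unigram model on the corpus, then merge the learned pieces into the base vocabulary (pieces already present are skipped, which is consistent with 7,843 rather than 8,000 tokens being added). The sketch below shows one plausible reproduction; `sinhala_corpus.txt` is a hypothetical path, and stripping the `▁` space marker is an assumption based on the tokenization example below, where spaces appear as separate `Ġ` tokens.

```python
import sentencepiece as spm
from transformers import AutoTokenizer

# Train an 8,000-piece Unigram model on the Sinhala corpus.
# "sinhala_corpus.txt" is a hypothetical path (one sentence per line).
spm.SentencePieceTrainer.train(
    input="sinhala_corpus.txt",
    model_prefix="sinhala_unigram",
    vocab_size=8000,
    model_type="unigram",
    character_coverage=1.0,  # keep every Sinhala character
)

# Collect the learned pieces, dropping SentencePiece control/unknown
# tokens and the "▁" space marker (assumption: the extended tokenizer
# emits spaces as separate "Ġ" tokens, so pieces carry no space marker).
sp = spm.SentencePieceProcessor(model_file="sinhala_unigram.model")
pieces = {
    sp.id_to_piece(i).replace("▁", "")
    for i in range(sp.get_piece_size())
    if not sp.is_control(i) and not sp.is_unknown(i)
}
pieces.discard("")

# add_tokens() skips pieces already in the base vocabulary, which is
# why fewer than 8,000 tokens end up being added.
base = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
print("Added:", base.add_tokens(sorted(pieces)))  # ~7,843
base.save_pretrained("sinhala-llama-3.2-1b-tokenizer")
```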

## Tokenization Example

The example below shows the dramatic improvement in Sinhala tokenization over the base tokenizer:

```python
from transformers import AutoTokenizer

text = "ශ්‍රී ලංකාවේ භාෂා විද්‍යාව"

# Base LLaMA 3.2 1B tokenizer: 47 tokens (garbled byte-level fragments)
base = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
print(base.tokenize(text))
# ['à·', 'ģ', 'à·', 'Ĭ', 'âĢį', 'à¶', '»', 'à·', 'ĵ', 'Ġà', '¶', '½', ...]
# Token count: 47

# Extended Sinhala tokenizer: 13 tokens (readable Sinhala subwords)
ext = AutoTokenizer.from_pretrained("Minuri/sinhala-llama-3.2-1b-tokenizer")
print(ext.tokenize(text))
# ['ශ්', '\u200d', 'රී', 'Ġ', 'ලංකාවේ', 'Ġ', 'භා', 'ෂා', 'Ġ', 'විද', '්', '\u200d', 'යාව']
# Token count: 13
```

This is a 72.3% reduction in token count (47 → 13) for this example. The base tokenizer produces garbled byte-level fragments with no Sinhala script awareness, while the extended tokenizer produces meaningful Sinhala subword units.
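
The corpus-level figure (~70.4%) can be checked with a short script along these lines. This is a sketch: the two stand-in sentences are illustrative, and any held-out Sinhala text works.

```python
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
ext = AutoTokenizer.from_pretrained("Minuri/sinhala-llama-3.2-1b-tokenizer")

# Illustrative stand-ins; substitute any held-out Sinhala sentences.
sinhala_sentences = [
    "ශ්‍රී ලංකාවේ භාෂා විද්‍යාව",
    "මෙය සිංහල වාක්‍යයකි",
]

base_total = sum(len(base.tokenize(s)) for s in sinhala_sentences)
ext_total = sum(len(ext.tokenize(s)) for s in sinhala_sentences)
print(f"Token reduction: {100 * (1 - ext_total / base_total):.1f}%")
```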

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Minuri/sinhala-llama-3.2-1b-tokenizer")

text = "ශ්‍රී ලංකාවේ භාෂා විද්‍යාව"
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
print("Token count:", len(tokens))
```

## Intended Uses

- Tokenization of Sinhala text for LLM pretraining and fine-tuning
- Drop-in replacement for the base LLaMA 3.2 1B tokenizer on Sinhala tasks (requires resizing model embeddings; see the sketch after this list)
- Extending other LLaMA-family models with Sinhala vocabulary
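
Using the extended tokenizer with a model requires growing the input and output embedding layers to the new vocabulary size. A minimal sketch with the standard `resize_token_embeddings` API follows; note the new rows are randomly initialised and only become useful after continued pretraining on Sinhala text.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Minuri/sinhala-llama-3.2-1b-tokenizer")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

# Grow input/output embeddings from 128,256 to 136,099 rows. The new
# rows are randomly initialised and need continued pretraining before
# they carry useful representations.
model.resize_token_embeddings(len(tokenizer))
```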

## Models Using This Tokenizer

| Repo | Description |
|---|---|
| Minuri/sinhala-llama-1b-corpus-news | CPT Model A (news-only) |
| Minuri/sinhala-llama-1b-corpus-random | CPT Model B (random) |
| Minuri/sinhala-llama-1b-corpus-diverse | CPT Model C (diversity-optimised) |
| Minuri/sinhala-llama-1b-sft-news | SFT Model A |
| Minuri/sinhala-llama-1b-sft-random | SFT Model B |
| Minuri/sinhala-llama-1b-sft-diverse | SFT Model C |

## Related Repositories

| Repo | Description |
|---|---|
| Minuri/diverse_sinhala_dataset | Tokenizer training corpus (12.38M sentences) |

## License

This tokenizer is derived from meta-llama/Llama-3.2-1B and is subject to the LLaMA 3.2 Community License.
