# Sinhala-Extended LLaMA 3.2 1B Tokenizer

An extended tokenizer for LLaMA 3.2 1B with a Sinhala-specific vocabulary added to the base tokenizer. Developed as part of a diversity-driven Sinhala language model adaptation study.

## Motivation

The base LLaMA 3.2 1B tokenizer has very limited coverage of the Sinhala script, so Sinhala text is severely over-tokenized: words are split into character-level or sub-character byte fragments. This extended tokenizer adds 7,843 Sinhala-specific tokens trained on a large Sinhala corpus, reducing the token count on Sinhala text by approximately 70.4%.

## Tokenizer Details

| Metric | Value |
|---|---|
| Base tokenizer | meta-llama/Llama-3.2-1B |
| Base vocabulary size | 128,256 |
| Sinhala tokens added | 7,843 |
| Extended vocabulary size | 136,099 |
| Token reduction on Sinhala text | ~70.4% |
| Training corpus | Minuri/diverse_sinhala_dataset (12.38M sentences) |
| SentencePiece model type | Unigram |
| SentencePiece vocab size | 8,000 |
| Special tokens | `<\|begin_of_text\|>`, `<\|end_of_text\|>` |
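
The exact training script is not published, but the table implies the pipeline: train an 8,000-piece SentencePiece Unigram model on the corpus, then merge the learned pieces into the base vocabulary (pieces already present are skipped, which is consistent with 7,843 rather than 8,000 tokens being added). The sketch below shows one plausible reproduction; `sinhala_corpus.txt` is a hypothetical path, and stripping the `▁` space marker is an assumption based on the tokenization example below, where spaces appear as separate `Ġ` tokens.

```python
import sentencepiece as spm
from transformers import AutoTokenizer

# Train an 8,000-piece Unigram model on the Sinhala corpus.
# "sinhala_corpus.txt" is a hypothetical path (one sentence per line).
spm.SentencePieceTrainer.train(
    input="sinhala_corpus.txt",
    model_prefix="sinhala_unigram",
    vocab_size=8000,
    model_type="unigram",
    character_coverage=1.0,  # keep every Sinhala character
)

# Collect the learned pieces, dropping SentencePiece control/unknown
# tokens and the "▁" space marker (assumption: the extended tokenizer
# emits spaces as separate "Ġ" tokens, so pieces carry no space marker).
sp = spm.SentencePieceProcessor(model_file="sinhala_unigram.model")
pieces = {
    sp.id_to_piece(i).replace("▁", "")
    for i in range(sp.get_piece_size())
    if not sp.is_control(i) and not sp.is_unknown(i)
}
pieces.discard("")

# add_tokens() skips pieces already in the base vocabulary, which is
# why fewer than 8,000 tokens end up being added.
base = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
print("Added:", base.add_tokens(sorted(pieces)))  # ~7,843
base.save_pretrained("sinhala-llama-3.2-1b-tokenizer")
```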

## Tokenization Example

The example below shows the dramatic improvement in Sinhala tokenization over the base tokenizer:

```python
from transformers import AutoTokenizer

text = "ශ්‍රී ලංකාවේ භාෂා විද්‍යාව"

# Base LLaMA 3.2 1B tokenizer: 47 tokens (garbled byte-level fragments)
base = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
print(base.tokenize(text))
# ['à·', 'ģ', 'à·', 'Ĭ', 'âĢį', 'à¶', '»', 'à·', 'ĵ', 'Ġà', '¶', '½', ...]
# Token count: 47

# Extended Sinhala tokenizer: 13 tokens (readable Sinhala subwords)
ext = AutoTokenizer.from_pretrained("Minuri/sinhala-llama-3.2-1b-tokenizer")
print(ext.tokenize(text))
# ['ශ්', '\u200d', 'රී', 'Ġ', 'ලංකාවේ', 'Ġ', 'භා', 'ෂා', 'Ġ', 'විද', '්', '\u200d', 'යාව']
# Token count: 13
```

This is a 72.3% reduction in token count (47 → 13) for this example. The base tokenizer produces garbled byte-level fragments with no Sinhala script awareness, while the extended tokenizer produces meaningful Sinhala subword units.
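
The corpus-level figure (~70.4%) can be checked with a short script along these lines. This is a sketch: the two stand-in sentences are illustrative, and any held-out Sinhala text works.

```python
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
ext = AutoTokenizer.from_pretrained("Minuri/sinhala-llama-3.2-1b-tokenizer")

# Illustrative stand-ins; substitute any held-out Sinhala sentences.
sinhala_sentences = [
    "ශ්‍රී ලංකාවේ භාෂා විද්‍යාව",
    "මෙය සිංහල වාක්‍යයකි",
]

base_total = sum(len(base.tokenize(s)) for s in sinhala_sentences)
ext_total = sum(len(ext.tokenize(s)) for s in sinhala_sentences)
print(f"Token reduction: {100 * (1 - ext_total / base_total):.1f}%")
```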

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Minuri/sinhala-llama-3.2-1b-tokenizer")

text = "ශ්‍රී ලංකාවේ භාෂා විද්‍යාව"
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
print("Token count:", len(tokens))
```

## Intended Uses

- Tokenization of Sinhala text for LLM pretraining and fine-tuning
- Drop-in replacement for the base LLaMA 3.2 1B tokenizer on Sinhala tasks (requires resizing model embeddings; see the sketch after this list)
- Extending other LLaMA-family models with Sinhala vocabulary
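
Using the extended tokenizer with a model requires growing the input and output embedding layers to the new vocabulary size. A minimal sketch with the standard `resize_token_embeddings` API follows; note the new rows are randomly initialised and only become useful after continued pretraining on Sinhala text.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Minuri/sinhala-llama-3.2-1b-tokenizer")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

# Grow input/output embeddings from 128,256 to 136,099 rows. The new
# rows are randomly initialised and need continued pretraining before
# they carry useful representations.
model.resize_token_embeddings(len(tokenizer))
```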

## Models Using This Tokenizer

| Repo | Description |
|---|---|
| Minuri/sinhala-llama-1b-corpus-news | CPT Model A (news-only) |
| Minuri/sinhala-llama-1b-corpus-random | CPT Model B (random) |
| Minuri/sinhala-llama-1b-corpus-diverse | CPT Model C (diversity-optimised) |
| Minuri/sinhala-llama-1b-sft-news | SFT Model A |
| Minuri/sinhala-llama-1b-sft-random | SFT Model B |
| Minuri/sinhala-llama-1b-sft-diverse | SFT Model C |

## Related Repositories

| Repo | Description |
|---|---|
| Minuri/diverse_sinhala_dataset | Tokenizer training corpus (12.38M sentences) |

## License

This tokenizer is derived from meta-llama/Llama-3.2-1B and is subject to the LLaMA 3.2 Community License.
