Subword tokenization efficiency for Bengali
Research paper evaluating how efficiently different tokenizers handle Bengali text. The key findings: multilingual LLM tokenizers are 5-9x less efficient for Bengali than dedicated ones, and a single Unicode character missing from the vocabulary (the Bengali Nukta, U+09BC) accounts for 89.8% of all byte-fallback tokens.
Evaluated 14 tokenizers across 4 categories: Bengali encoders (BanglaBERT, BanglaT5), Bengali LLMs (Kotha-1, LilTii), newly trained SentencePiece models (BPE and Unigram at 32K/49K/64K vocab sizes), and multilingual LLMs (Qwen-2.5, BLOOM, TinyLlama, Phi-2, StableLM-2).
The training script generates 13 tokenizer configs spanning BPE and Unigram algorithms, four vocab sizes (16K-64K), and Bengali-only vs. Bengali+English corpora. All models are trained with SentencePiece using byte-fallback and identity normalization on 1.5 GB of Bengali text.
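A minimal sketch of how such a config grid could be enumerated. The corpus filenames and the two intermediate vocab sizes are assumptions (the text gives 16K and 64K as endpoints and mentions 32K/49K elsewhere), and the full grid has 16 points, so three are presumably dropped to reach the stated 13; the flags mirror SentencePiece's `byte_fallback` and `normalization_rule_name` training options:

```python
from itertools import product

ALGOS = ["bpe", "unigram"]
VOCAB_SIZES = [16_000, 32_000, 49_000, 64_000]  # endpoints from the text; middle sizes assumed
CORPORA = ["bn_only", "bn_en_mix"]              # hypothetical corpus names

def make_config(algo: str, vocab_size: int, corpus: str) -> dict:
    """One SentencePiece training config, matching the paper's stated settings."""
    return {
        "input": f"{corpus}.txt",               # hypothetical filename
        "model_type": algo,
        "vocab_size": vocab_size,
        "byte_fallback": True,                   # unknown chars become raw-byte tokens
        "normalization_rule_name": "identity",   # no Unicode normalization at train time
    }

# Full 2 x 4 x 2 grid; each dict could be passed to
# sentencepiece.SentencePieceTrainer.train(**cfg).
configs = [make_config(a, v, c) for a, v, c in product(ALGOS, VOCAB_SIZES, CORPORA)]
print(len(configs))  # 16
```

The identity normalization rule is what makes the vocabulary sensitive to how the training corpus itself was normalized, which matters for the Nukta finding below.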
The evaluation measures six metrics: fertility (tokens per word), compression ratio (bytes per token), byte-fallback rate, script-purity distribution, average document token count, and morphological segmentation on a probe set of 20 curated Bengali words.
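The first three metrics can be computed from token lists alone. A sketch, assuming SentencePiece's convention of rendering byte-fallback pieces as `<0xNN>` strings:

```python
def fertility(tokens: list[str], words: list[str]) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    return len(tokens) / len(words)

def compression_ratio(text: str, tokens: list[str]) -> float:
    """UTF-8 bytes of the original text per emitted token (higher = denser)."""
    return len(text.encode("utf-8")) / len(tokens)

def byte_fallback_rate(tokens: list[str]) -> float:
    """Fraction of tokens that are raw-byte pieces like '<0xE0>'."""
    fallback = sum(t.startswith("<0x") and t.endswith(">") for t in tokens)
    return fallback / len(tokens)
```

For example, `byte_fallback_rate(["▁ড", "<0xE0>", "<0xA6>", "<0xBC>"])` returns 0.75: three of the four tokens are byte pieces.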
The Nukta finding: the newly trained tokenizers showed ~20% byte-fallback versus Kotha-1's 1.9%. Root cause: U+09BC (the Bengali Nukta combining mark) was absent from their vocabularies due to NFD-decomposed training data. Because U+09BC occupies three bytes in UTF-8, each occurrence produces three byte-fallback tokens. Excluding U+09BC drops the rate to 2.03%, matching Kotha-1.
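The mechanics are reproducible with the standard library: NFD splits a precomposed Nukta letter into a base letter plus U+09BC, and U+09BC is a three-byte UTF-8 sequence, hence three byte-fallback tokens per occurrence. Here ড় (U+09DC) is used as an illustrative letter:

```python
import unicodedata

# ড় (U+09DC, BENGALI LETTER RRA) canonically decomposes under NFD
# into ড (U+09A1) followed by the combining Nukta (U+09BC).
decomposed = unicodedata.normalize("NFD", "\u09dc")
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+09A1', 'U+09BC']

# U+09BC occupies three bytes in UTF-8, so a vocabulary missing it
# forces three byte-fallback tokens (<0xE0> <0xA6> <0xBC>) per Nukta.
print("\u09bc".encode("utf-8"))  # b'\xe0\xa6\xbc'
```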
The paper follows the ACL format: abstract, introduction, background (script characteristics, subword tokenization, related work), methodology, results (the efficiency gap, vocab-size impact, BPE vs. Unigram), analysis (the Nukta problem, the efficiency-compositionality tradeoff), discussion, and conclusion.