Python / SentencePiece / HuggingFace Tokenizers

Bengali Tokenizer Research

Subword tokenization efficiency for Bengali

View on GitHub

Overview

Research paper evaluating how efficiently different tokenizers handle Bengali text. The key finding: multilingual LLM tokenizers are 5-9x less efficient for Bengali than dedicated ones, and a single missing Unicode character (the Bengali Nukta, U+09BC) accounts for 89.8% of all byte-fallback tokens.

Architecture

1. Evaluated 14 tokenizers across 4 categories: Bengali encoders (BanglaBERT, BanglaT5), Bengali LLMs (Kotha-1, LilTii), newly trained SentencePiece models (BPE and Unigram at 32K/49K/64K vocab sizes), and multilingual LLMs (Qwen-2.5, BLOOM, TinyLlama, Phi-2, StableLM-2).
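For reference, a minimal sketch of how tokenizers from these categories can be loaded; the HuggingFace model IDs and the SentencePiece model path are illustrative assumptions, not necessarily the exact checkpoints used:

```python
from transformers import AutoTokenizer
import sentencepiece as spm

# Pretrained tokenizers load through HuggingFace (IDs are illustrative).
multilingual = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
bengali_encoder = AutoTokenizer.from_pretrained("csebuetnlp/banglabert")

# Newly trained SentencePiece models load directly from their .model file
# (hypothetical filename).
sp = spm.SentencePieceProcessor(model_file="bn_bpe_32k.model")
```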

2. The training script generates 13 tokenizer configs spanning BPE and Unigram, 4 vocab sizes (16K-64K), and Bengali-only vs Bengali+English mixes, all trained with SentencePiece using byte fallback and identity normalization on 1.5GB of Bengali text.
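A minimal sketch of one such config. The file names and character coverage are illustrative, but the byte-fallback and identity-normalization flags mirror the setup described above:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="bengali_corpus.txt",          # ~1.5GB Bengali text (hypothetical path)
    model_prefix="bn_bpe_32k",
    model_type="bpe",                    # the Unigram configs use "unigram"
    vocab_size=32000,                    # swept from 16K to 64K across configs
    byte_fallback=True,                  # unseen chars decompose to UTF-8 bytes
    normalization_rule_name="identity",  # no built-in Unicode normalization
    character_coverage=0.9995,           # illustrative value
)
```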

3. Evaluation measures 6 metrics: fertility (tokens/word), compression ratio (bytes/token), byte-fallback rate, script purity distribution, average document token count, and morphological probe segmentation on 20 curated Bengali words.
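A sketch of the first three metrics, assuming a SentencePiece model and whitespace word splitting (a simplification; the paper's exact definition of a "word" may differ):

```python
import re
import sentencepiece as spm

# SentencePiece renders byte-fallback pieces as <0xNN>.
BYTE_PIECE = re.compile(r"^<0x[0-9A-F]{2}>$")

def tokenizer_metrics(sp: spm.SentencePieceProcessor, docs: list[str]) -> dict:
    n_tokens = n_words = n_bytes = n_byte_pieces = 0
    for doc in docs:
        pieces = sp.encode(doc, out_type=str)
        n_tokens += len(pieces)
        n_words += len(doc.split())                 # whitespace words
        n_bytes += len(doc.encode("utf-8"))
        n_byte_pieces += sum(bool(BYTE_PIECE.match(p)) for p in pieces)
    return {
        "fertility": n_tokens / n_words,            # tokens per word
        "compression_ratio": n_bytes / n_tokens,    # bytes per token
        "byte_fallback_rate": n_byte_pieces / n_tokens,
    }
```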

4. The Nukta finding: newly trained tokenizers showed ~20% byte-fallback rates versus Kotha-1's 1.9%. The root cause: U+09BC (the Bengali Nukta combining character) was absent from their vocabularies because the training data was NFD-decomposed, and each Nukta occurrence therefore produces 3 byte-fallback tokens. Excluding U+09BC drops the rate to 2.03%, matching Kotha-1.
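The three-token cost follows directly from UTF-8: codepoints in the Bengali block encode as three bytes each, so one vocabulary miss emits three byte pieces:

```python
# U+09BC occupies three bytes in UTF-8, so each occurrence that misses
# the vocabulary falls back to three byte tokens: <0xE0> <0xA6> <0xBC>.
nukta = "\u09bc"
assert nukta.encode("utf-8") == b"\xe0\xa6\xbc"
```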

5. Paper structure follows ACL format: abstract, introduction, background (script characteristics, subword tokenization, related work), methodology, results (efficiency gap, vocab size impact, BPE vs Unigram), analysis (the Nukta problem, the efficiency-compositionality tradeoff), discussion, conclusion.

Design Decisions

6 metrics over fewer or more
Fertility alone doesn't capture tokenization quality. Compression ratio measures storage efficiency. Byte-fallback rate catches vocabulary gaps. Script purity detects cross-script contamination. Document token count predicts context window usage. Morphological probes test linguistic alignment. Together they give a complete picture.
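As an illustration of the script purity metric, a minimal classifier over token pieces; the Bengali block boundaries are standard Unicode, but the neutral/mixed bucketing is an assumption about how the paper bins tokens:

```python
def piece_script(piece: str) -> str:
    """Classify a token piece as Bengali, Latin, mixed, or neutral."""
    letters = [c for c in piece if c.isalpha()]
    if not letters:
        return "neutral"  # digits, punctuation, markers like "▁"
    bengali = sum("\u0980" <= c <= "\u09ff" for c in letters)  # Bengali block
    if bengali == len(letters):
        return "bengali"
    return "latin" if bengali == 0 else "mixed"
```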
SentencePiece with byte-fallback over character-level fallback
Byte fallback guarantees there are no unknown tokens: any character missing from the vocabulary decomposes into its UTF-8 bytes. This is critical for Bengali, where rare combining marks (like the Nukta) would otherwise produce <UNK> tokens. The tradeoff is that every unrecognized character inflates the token count 3x, since Bengali codepoints are three bytes in UTF-8.
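A quick sketch of the mechanism, assuming a trained model file (hypothetical path) whose vocabulary lacks U+09BC:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="bn_bpe_32k.model")
pieces = sp.encode("\u09bc", out_type=str)  # a standalone Nukta
# If U+09BC is out-of-vocabulary, byte fallback yields something like
# ['▁', '<0xE0>', '<0xA6>', '<0xBC>'] instead of a single <unk>.
print(pieces)
```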
NFC normalization for the Kotha-1 tokenizer
The paper demonstrates that the Unicode normalization form of the training data matters enormously. NFD (decomposed) text splits the Nukta from its base character, so U+09BC never appears in the vocabulary; NFC (composed) text keeps the nukta letter as a single codepoint that gets learned as a subword. This single choice accounts for a 10x difference in byte-fallback rate.
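The decomposition half of this is easy to verify in Python; a minimal illustration with the precomposed letter ড় (U+09DC):

```python
import unicodedata

# NFD splits the precomposed nukta letter into base + combining Nukta,
# so NFD-normalized training text exposes U+09BC only as a separate
# codepoint trailing its base character.
rra = "\u09dc"  # ড় BENGALI LETTER RRA
nfd = unicodedata.normalize("NFD", rra)
print([f"U+{ord(c):04X}" for c in nfd])  # ['U+09A1', 'U+09BC']
```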

Tech Stack

Python / SentencePiece / HuggingFace Tokenizers / LaTeX (ACL format) / 14 tokenizers evaluated / 6 metrics / 1.5GB Bengali corpus