Your 4K Context Window Might Only Be 300 Bengali Words
Everybody talks about model size, context length, and scaling laws.
But for Bengali, there is a more basic failure mode hiding underneath all of that:
your tokenizer can waste most of the context window before the model even gets a chance to think.
I released a repo measuring exactly that: github.com/oneKn8/bengali-tokenizer-eval.
The main result is not subtle.
On Bengali text, tokenizer choice alone changed efficiency by 5x to 9x.
Not “a few percent.” Not “small but statistically significant.” A completely different cost structure.

[Figure: Fertility comparison from the release. Lower is better.]
The number that matters
I evaluated 14 tokenizers on 3,000 held-out Bengali documents.
Here is the simplest way to think about the result:
| Tokenizer | Fertility (tokens/word) | Bengali text in 4,096 tokens (KB) | Approx. Bengali words in 4,096 tokens |
|---|---|---|---|
| BanglaBERT | 1.47 | 49.7 KB | 2,786 |
| Kotha-1 | 1.53 | 47.8 KB | 2,677 |
| BLOOM | 1.75 | 41.9 KB | 2,341 |
| Qwen-2.5 | 7.32 | 10.0 KB | 560 |
| Phi-2 | 13.69 | 5.4 KB | 299 |
That is the whole story.
With one tokenizer, a 4K window holds a real Bengali passage. With another, it holds a short paragraph.
People usually describe context windows as if they are model properties. They are not. They are model-plus-tokenizer properties.
If your tokenizer is inefficient on Bengali, your expensive context window is not really your context window.
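If you want to sanity-check this on your own corpus, the metric is cheap to compute. Here is a minimal sketch, assuming Hugging Face tokenizers; the model IDs are illustrative stand-ins rather than the exact checkpoints from the paper, and the tiny document list is a placeholder for real held-out text.

```python
# Minimal sketch of the fertility measurement: tokens per whitespace-separated word.
# Model IDs and the docs list are placeholders, not the repo's exact setup.
from transformers import AutoTokenizer

def fertility(tokenizer, docs):
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = total_words = 0
    for doc in docs:
        total_tokens += len(tokenizer.encode(doc, add_special_tokens=False))
        total_words += len(doc.split())
    return total_tokens / max(total_words, 1)

docs = ["আমি বাংলায় গান গাই"]  # replace with held-out Bengali documents

for name in ["csebuetnlp/banglabert", "bigscience/bloom", "Qwen/Qwen2.5-7B"]:
    tok = AutoTokenizer.from_pretrained(name)
    f = fertility(tok, docs)
    print(f"{name}: {f:.2f} tokens/word, ~{4096 / f:.0f} Bengali words per 4,096-token window")
```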
What surprised me
I expected Bengali-specific tokenizers to do better than generic multilingual ones.
I did not expect the gap to be this large.
Most multilingual tokenizers I checked were terrible on Bengali. They split words too aggressively, burning through context and turning a language modeling problem into a token budget problem.
There was one important exception: BLOOM.
BLOOM stayed competitive, which matters because it shows this is not some unavoidable “multilingual models are doomed on Bengali” story. The problem is not multilinguality by itself. The problem is coverage and tokenizer design.
That distinction matters. It means the penalty is not inevitable. It is an engineering choice.
The most interesting bug was Unicode
The strangest part of the project came from my own training runs.
I trained several new SentencePiece tokenizers and found a brutal byte-fallback rate: around 20%.
That was bad enough to be obvious, but the reason was even more interesting.
A single combining character, U+09BC BENGALI SIGN NUKTA, explained 89.8% of all byte-fallback tokens in the affected models.
That one character encodes to three bytes in UTF-8, so it turned up as three byte tokens every time it appeared. The consequence was ugly:
- byte fallback jumped from about 2% to about 20%
- the tokenizer looked dramatically worse than it should have
- the root cause was not “Bengali is hard”
- the root cause was a normalization and character coverage failure
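To make that concrete, here is roughly what an audit like this looks like with the sentencepiece Python API; the model path is a placeholder, not one of the released tokenizers.

```python
# Sketch of a byte-fallback audit with the sentencepiece Python API.
# "bn.model" is a placeholder path, not one of the released tokenizers.
from collections import Counter
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="bn.model")

def byte_fallback_report(docs):
    total, fallback = 0, Counter()
    for doc in docs:
        for piece in sp.encode(doc, out_type=str):
            total += 1
            # byte-fallback pieces are emitted as "<0xNN>" tokens
            if piece.startswith("<0x") and piece.endswith(">"):
                fallback[piece] += 1
    rate = sum(fallback.values()) / max(total, 1)
    return rate, fallback.most_common(10)

# Quick probe: does U+09BC survive as a real piece, or shatter into byte tokens?
print(sp.encode("\u09bc", out_type=str))  # e.g. ['▁', '<0xE0>', '<0xA6>', '<0xBC>'] when coverage is broken
```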
This is exactly the kind of bug that disappears inside preprocessing and then reappears later as a systems tax on the whole model.
If you work with Bengali or other combining-mark-heavy scripts, this is the part I would pay attention to. People talk a lot about vocab size and merge strategy. They talk much less about whether the tokenizer actually learned the characters that matter.
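For reference, these are the SentencePiece training knobs that decide whether a character like the nukta makes it into the vocabulary at all. The values below are illustrative, not the repo's exact configuration; the point is that coverage and normalization are explicit choices, not defaults you can ignore.

```python
# Sketch of the trainer options that control character coverage and normalization,
# assuming the sentencepiece Python trainer; paths and values are illustrative only.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="bengali_corpus.txt",          # placeholder corpus path
    model_prefix="bn_unigram",
    vocab_size=32000,
    model_type="unigram",
    character_coverage=0.9999,           # too-low coverage is how rare-but-crucial marks get dropped
    required_chars="\u09bc",             # force the nukta into the character set regardless of coverage
    byte_fallback=True,                  # fallback stays on, but should now be the exception
    normalization_rule_name="nmt_nfkc",  # normalization decides which codepoint sequences the trainer even sees
)
```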
Why this changes how I think about continued pretraining
A lot of Bengali model work starts from a multilingual base model and continues pretraining.
That is a sensible strategy in general. But if the inherited tokenizer already spends 5x to 9x as many tokens on the same Bengali text as a dedicated alternative, then part of your training budget is getting spent on fragmentation instead of signal.
That does not mean continued pretraining is pointless.
It means the tokenizer bottleneck needs to be measured instead of hand-waved away.
If one setup fits roughly 2,800 Bengali words into a 4K window and another fits roughly 300, those are not comparable learning conditions. Even before downstream evaluation, the model is seeing a very different amount of language per forward pass.
That is why I think tokenizer evaluation deserves to be a first-class step in language adaptation, especially for scripts that are poorly represented in mainstream LLM tokenizers.
What is in the repo
I wanted this to be more than a paper PDF.
So the repo includes:
- the paper source and compiled PDF
- the evaluation framework
- 13 released SentencePiece tokenizers
- result artifacts
- a portable comparison workflow for reproducing paper-style runs
I also cleaned up the release so it is easier to use on another machine instead of only on mine.
What I think people are missing
The common framing is:
“We need bigger models for Bengali.”
Sometimes that is true.
But a more immediate framing is:
“We may already be wasting the models we have because the tokenizer is bad.”
That is a much cheaper problem to detect, and sometimes a much cheaper problem to fix.
The repo does not solve Bengali language modeling by itself. But it does give a concrete baseline for a question that is usually left fuzzy:
How much context are we actually getting for Bengali once tokenization is accounted for?
For the systems side of NLP, that is not a detail. That is the budget.
The release
Repo: github.com/oneKn8/bengali-tokenizer-eval
If you are building Bengali LLMs, evaluating multilingual tokenizers, or working on Unicode-heavy language pipelines, I think this is worth looking at.
If nothing else, check what your tokenizer is doing before you spend another few billion tokens training around it.