LLM4N4: The art of tokenization
How computers understand text, types of tokenization, BPE, and so much more
Hey everyone, and welcome back to LLM4N: LLM For Noobs!
A 16-part series designed to take you from zero to one in understanding Large Language Models.
In our journey so far, we’ve made some incredible progress. We started by simply counting words with N-grams (LLM4N1), then gave words meaning with Word2Vec (LLM4N2), and gave our models a form of memory with RNNs and LSTMs (LLM4N2, LLM4N3). We’ve tackled semantics and sequence, memory and vanishing gradients.
Across all these stages, however, we’ve been operating on a silent assumption. We’ve assumed that before our model ever sees the data, the raw, messy human text has already been converted into a clean, structured, numerical format it can comprehend.
Today, we pause and zoom in on that critical, often overlooked first step. Before a model can find meaning, remember context, or predict the next word, it must first learn to read. This is the bridge between our world and the machine's world. This is the art of tokenization.
So, How Does a Model "See" a Sentence?
To build a strong intuition, let's use an analogy that involves cooking (I like cooking :P). Think of our language model as a master chef and a raw text document as a collection of unprepared ingredients. The chef cannot begin cooking with a whole, unwashed potato or an entire side of meat. A crucial preparatory phase is required first, known in the culinary world as mise en place.
Tokenization is this mise en place. It's the process of taking the raw ingredients of language and chopping them into consistent, usable units called tokens. Following this, the next step we’ve already discussed, vectorization (LLM4N2), is like pre-cooking, such as marinating the meat, which imbues the tokens with semantic properties.
Think of tokenization less like filing paperwork and more like choosing the lens for a camera. It's a fundamental act of abstraction that shapes the entire world the model perceives. This new reality is defined numerically; the output of a tokenizer is a sequence of numerical IDs, and from that point forward, the model never again sees the original characters or words. If the tokenizer decides "antidisestablishmentarianism" is a single unit, the model is forever blind to its constituent parts like "anti-" and "establish". This choice imposes a permanent "worldview" on the model, defining the atomic units from which all understanding must be constructed.
The spectrum of tokenization: From Words to Characters
Converting text to tokens is a challenging task, and it can be approached from several angles. At one end is the most intuitive approach, word-level tokenization, and at the other, the most granular, character-level tokenization. Understanding the severe limitations of these two extremes is crucial for appreciating the modern solutions that lie in the middle.
The obvious way: Word-level tokenization (and why it’s a disaster)
The most straightforward method is to split text by whitespace and punctuation. It's extremely simple, but this approach suffers from three fatal flaws when applied to large-scale, real-world text.
Vocabulary explosion: With word-level tokenization, every unique word form in the training corpus becomes an entry in the vocabulary. This is especially problematic in morphologically rich languages like German or Turkish, where words change form frequently, leading to a combinatorial explosion of unique words. A massive vocabulary requires an equally massive embedding matrix (the lookup table for vectors), dramatically increasing the model's memory footprint and computational complexity. For instance, the Transformer-XL model, using a word-level approach, had a vocabulary of 267,735 words, a scale now considered unmanageable.
Out-of-Vocabulary (OOV) crisis: A model's vocabulary is fixed after training. When it encounters a new word (a technical term, a name, or a misspelling like "knowldge"), it has no representation for it. Such words are typically mapped to a generic [UNK] (unknown) token, resulting in a catastrophic loss of information. The model is effectively blind to that word, which severely hampers its ability to understand the text.
Language incompatibility: The assumption of splitting text on spaces is fundamentally language-specific. Many of the world's languages, including Chinese, Japanese, and Thai, do not use spaces to delineate words. For these languages, whitespace tokenization is entirely non-functional.
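To make these flaws concrete, here is a minimal sketch of a word-level tokenizer. The function names (word_tokenize, build_vocab, encode), the tiny training corpus, and the choice of ID 0 for [UNK] are all assumptions made for illustration, not a reference implementation.

```python
import re

def word_tokenize(text):
    # Lowercase, then grab runs of word characters or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(corpus):
    # Every unique word form gets its own ID; ID 0 is reserved for unknown words.
    vocab = {"[UNK]": 0}
    for sentence in corpus:
        for token in word_tokenize(sentence):
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab):
    # Anything not seen while building the vocabulary collapses to [UNK].
    return [vocab.get(token, vocab["[UNK]"]) for token in word_tokenize(text)]

# Hypothetical two-sentence training corpus.
train_corpus = ["Knowledge is power.", "Power is nothing without control."]
vocab = build_vocab(train_corpus)

# The misspelling "Knowldge" and the unseen "!" both become ID 0.
ids = encode("Knowldge is power!", vocab)
print(ids)  # [0, 2, 3, 0]

# A sentence written without spaces comes back as one giant "word".
print(word_tokenize("我喜欢学习"))
```

Notice that the model receiving [0, 2, 3, 0] cannot tell the misspelled word from the exclamation mark: both are just "unknown".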
The opposite extreme: Character-level tokenization
Given the failures of word-level tokenization, what if we treat each individual character as a token?
The advantage: This completely eliminates the OOV problem, as any word is just a sequence of known characters. The vocabulary size is also tiny and fixed (e.g., a few hundred characters), which is great for memory.
But robustness comes at a steep price.
Extreme sequence length: By breaking words into characters, the length of our token sequences explodes. A 10-word sentence might become 60 character tokens. This is a major issue for modern Transformer architectures, where the cost of the self-attention mechanism scales quadratically with sequence length. Doubling the sequence length quadruples the computation.
Semantically poor: Individual characters carry very little meaning. The model must expend a vast amount of its learning capacity simply to recognize that certain character sequences form common words, a task word-level tokenization does by default.
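A quick back-of-the-envelope sketch shows how fast this adds up. The example sentence and the helper attention_pairs are made up for illustration; the only assumption about the model is the quadratic attention cost described above.

```python
def word_tokens(text):
    # Naive whitespace split.
    return text.split()

def char_tokens(text):
    # Every character, spaces included, becomes its own token.
    return list(text)

def attention_pairs(seq_len):
    # Self-attention scores every token against every token.
    return seq_len * seq_len

sentence = "the model must first learn to read"
n_words = len(word_tokens(sentence))
n_chars = len(char_tokens(sentence))

print(n_words, attention_pairs(n_words))  # 7 49
print(n_chars, attention_pairs(n_chars))  # 34 1156
```

Going from 7 tokens to 34 makes the sequence roughly 5 times longer, but the number of attention scores grows by roughly 24 times.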
The solution: Subword tokenization
The tension between word- and character-level approaches is a classic example of the bias-variance trade-off.
Word level is a high-bias approach: it makes a strong assumption that space-separated words are the fundamental units of meaning. When this is true, it’s efficient; when it’s violated (OOV words), it fails completely.
Character-level is a high-variance approach: it makes few assumptions, giving it the flexibility to represent anything. However, this forces the model to learn everything from scratch.
The ideal solution lies in the middle: a method that represents common words as single tokens while breaking down rare words into smaller, meaningful sub-units. This is what we call subword tokenization.
The algorithm that learns to read: Byte-Pair Encoding (BPE)
The most foundational algorithm for learning subwords is Byte-Pair Encoding (BPE). Interestingly, BPE originated not in NLP but as a data compression algorithm in a 1994 paper. Its goal was to find the most frequent pair of consecutive bytes in data and replace them with a single, unused byte, thus compressing the file. This same frequency-driven logic is perfect for discovering the statistical patterns of language.
The modern BPE algorithm for tokenization follows four stages:
Initialization: Begin by defining a base vocabulary consisting of all unique individual characters present in the training corpus.
Pre-tokenization and word representation: The corpus is first split into words (e.g., by whitespace). Then, a special end-of-word symbol such as </w> is appended to each word. This is vital as it allows the final model to distinguish between a subword that occurs mid-word (like "er" in "perfect") and one that occurs at the end (like "er" in "teacher").
Iterative merging: This is the core of the algorithm. For a predetermined number of merge operations (which defines the final vocabulary size), the following steps are repeated:
Count all adjacent pairs of symbols in the corpus.
Find the single most frequent pair.
Merge this pair into a single new symbol (a new token) and add it to the vocabulary.
Go back through the entire corpus and replace every occurrence of the original pair with the new merged token.
Tokenization: The process stops when the specified number of merges is complete. The final vocabulary consists of the initial characters plus all the new tokens created through merging. New text is then tokenized by applying the learned merges in the order they were learned.
BPE is a greedy algorithm. It makes the best possible (most frequent) merge at each step without considering future steps. This is efficient and works surprisingly well.
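The stages above can be sketched in a few lines of Python. This is a teaching sketch, not how production tokenizers are implemented: the function names (get_pair_counts, merge_pair, learn_bpe) and the toy word counts are assumptions, and ties between equally frequent pairs are broken by whichever pair is encountered first.

```python
from collections import Counter

def get_pair_counts(words):
    # words maps a tuple of symbols to how often that word occurs.
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(pair, words):
    # Replace every adjacent occurrence of `pair` with one merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(word_counts, num_merges):
    # Split words into characters and append the end-of-word marker.
    words = {tuple(w) + ("</w>",): f for w, f in word_counts.items()}
    vocab = {s for symbols in words for s in symbols}  # base vocabulary
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # greedy; first pair wins ties
        words = merge_pair(best, words)
        vocab.add(best[0] + best[1])
        merges.append(best)
    return merges, vocab

# Hypothetical toy corpus: word -> count.
toy_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, vocab = learn_bpe(toy_counts, num_merges=3)
print(merges)  # [('u', 'g'), ('u', 'n'), ('un', '</w>')]
```

The ordered list of merges is the real output of training: it is what gets replayed later to tokenize new text.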
A worked-out example of BPE
Alright, enough theory. Let’s get our hands dirty!
Let’s trace five merges on a small corpus with word counts:
{"low</w>": 5, "lower</w>": 2, "newest</w>": 6, "widest</w>": 3}
Initial State
Base Vocabulary: {l, o, w, e, r, n, s, t, i, d, </w>}
Corpus representation: l o w </w> (5 times), l o w e r </w> (2 times), n e w e s t </w> (6 times), w i d e s t </w> (3 times).
Merge 1:
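The most frequent pair is (e, s), appearing 9 times (newest × 6 + widest × 3). The pairs (s, t) and (t, </w>) also appear 9 times, so we break the tie by taking the pair that appears first.
Action: Merge e and s into es. Add es to the vocabulary.
Updated Corpus: ... n e w es t </w> ... w i d es t </w> ...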
Merge 2:
The most frequent pair is now (es, t), also with a frequency of 9.
Action: Merge es and t into est. Add est to the vocabulary.
Updated Corpus: ... n e w est </w> ... w i d est </w> ...
Merge 3-5:
This continues. With ties broken in favor of the pair that appears first, the next merges are:
(est, </w>) → est</w>, then (l, o) → lo, and (lo, w) → low.
After just a few merges, BPE has learned the common word “low” and the common suffix “-est” as single tokens. Now, if it encounters the unseen word “lowest”, it can elegantly tokenize it as [“low”, “est</w>”], handling the OOV word gracefully.
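You can replay this by hand or with a few lines of Python. The sketch below hard-codes the five merges from the walkthrough and applies them, in order, to a new word. The function name bpe_encode is an assumption for illustration.

```python
def bpe_encode(word, merges):
    # Start from characters plus the end-of-word marker.
    symbols = list(word) + ["</w>"]
    # Replay the merges in the order they were learned.
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# The five merges traced in the example above.
merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w")]

print(bpe_encode("lowest", merges))  # ['low', 'est</w>']
print(bpe_encode("newest", merges))  # ['n', 'e', 'w', 'est</w>']
```

Note that "newest", the most frequent word in the corpus, is still four tokens after only five merges. A real tokenizer runs tens of thousands of merges, so frequent words eventually become single tokens.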
The universal solution: Byte-Level BPE (BBPE)
Modern models like GPT-2 and GPT-3 use an even more robust version called Byte-Level BPE. Instead of starting with characters, BBPE operates on raw UTF-8 bytes. Since any text can be represented as a sequence of bytes and there are only 256 possible byte values, the initial vocabulary is a fixed, universal set of 256 tokens. This elegant modification guarantees that any possible string, in any language, with any emoji or symbol, can be tokenized without ever encountering an unknown unit, creating a truly universal tokenizer.
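Here is a small sketch of that starting point. It only shows the base byte vocabulary, before any merges are learned, and the sample string is made up for illustration.

```python
# "naïve" plus a space and a noodle-bowl emoji, written with escapes
# so the exact code points are unambiguous.
text = "na\u00efve \U0001F35C"

# Every string becomes a sequence of integers in the range 0..255.
byte_ids = list(text.encode("utf-8"))

print(len(text))      # 7 characters
print(len(byte_ids))  # 11 bytes: the accented letter takes 2, the emoji takes 4
print(all(0 <= b < 256 for b in byte_ids))  # True

# The mapping is lossless: the bytes decode back to the original text.
restored = bytes(byte_ids).decode("utf-8")
```

BPE merges are then learned over these byte sequences exactly as they were over characters, so common byte runs still end up as single tokens.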
A core system design trade-off
The choice of a tokenizer is not an abstract linguistic exercise. It’s a core component of ML System Design. It creates a direct and unavoidable trade-off between a model's static memory cost and its dynamic computational cost.
Vocabulary Size (|V|): This is the total number of unique tokens. It directly determines the size of the model's embedding matrix, which has dimensions |V| × d_model (e.g., 50,000 tokens × 4096 dimensions = 204.8 million parameters). A larger |V| means a larger memory footprint.
Sequence Length (L): This is how many tokens a piece of text is converted into. As we noted, the computational cost of self-attention in Transformers scales quadratically with L, making this a critical factor for speed and expense.
What’s the trade-off?
Increasing |V| allows the vocabulary to include more common words as single tokens, which generally decreases the average sequence length L, thereby reducing the expensive compute cost. However, this comes at the price of a larger memory footprint. This is a fundamental optimization problem where the best balance depends entirely on the deployment context, from memory-constrained mobile devices to high-throughput inference servers.
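To put rough numbers on both sides of the trade-off, here is a small arithmetic sketch. The two tokenizer configurations are hypothetical, made-up numbers chosen only to show the direction of the effect, and the helper names are assumptions.

```python
def embedding_params(vocab_size, d_model):
    # One d_model-sized vector per vocabulary entry.
    return vocab_size * d_model

def attention_scores(seq_len):
    # One score per (query, key) pair; this repeats in every head and layer.
    return seq_len ** 2

# The example from the text: 50,000 tokens x 4096 dimensions.
print(embedding_params(50_000, 4096))  # 204800000

# Two hypothetical tokenizers encoding the same document (made-up numbers).
d_model = 4096
configs = {
    "small vocab": {"vocab_size": 32_000, "seq_len": 1_200},
    "large vocab": {"vocab_size": 128_000, "seq_len": 900},
}
for name, cfg in configs.items():
    params = embedding_params(cfg["vocab_size"], d_model)
    scores = attention_scores(cfg["seq_len"])
    print(f"{name}: {params:,} embedding params, {scores:,} attention scores")
```

In this made-up comparison the larger vocabulary costs four times the embedding parameters but cuts the attention scores from 1,440,000 to 810,000. Which side matters more depends on where the model is going to run.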
Conclusion
Tokenization is far more than a simple preprocessing step; it is a foundational design choice in the architecture of any LLM. The journey from words to characters and the elegant subword compromise of BPE highlights a deep tension between semantic meaning, computational cost, and linguistic flexibility. Algorithms like WordPiece (which merges pairs based on data likelihood rather than frequency) and the Unigram model (which starts large and prunes down) offer different philosophies, but all strive for this same balance.
We now have our text broken down into a sequence of manageable tokens. But our ingredients aren't fully prepped for the master chef yet. These tokens are just IDs. Our model needs to know two more things about each one:
What does it mean? (Its rich, multidimensional vector.)
Where is it? (Its position in the sequence. Is it the first token? The seventh?)
In our next issue, we will finally leave the sequential processing world of RNNs and LSTMs behind and step into the architecture that defines modern AI. We'll see how the Transformer model simultaneously solves the meaning and position problems, using the tokenized sequence we just created as its raw material. We'll explore embedding layers and a clever trick called positional encoding to prepare this data for the star of the show: attention.
Get ready to pay attention! Because “Attention is all you need!”
See you next week!





