LLM4N1: Why did N-gram models fail?
A brief primer on N-gram models, their applications, and limitations
Welcome to another edition of DataPeCharcha, where we break down topics related to Data Science/ML/Stats/Probability and so much more.
We’re starting a new mini-series called LLM4N (LLM For Noobs), a 16-part series designed to take you from zero to one in understanding LLMs.
Each LLM4N issue is going to be suffixed by the issue number so it’s easier for all of us to track. Welcome to LLM4N1!
Today, we’re stripping back the layers of complexity surrounding LLMs and going back to a foundational concept that started it all: the simple, yet profound, act of counting.
Have you ever wondered how your phone’s keyboard seems to read your mind, suggesting the perfect next word? That little magic has its roots in a statistical method called an N-gram model.
Long before the era of massive neural networks, N-grams were the workhorses of NLP. And understanding them is key to grasping the core problem that even today’s most advanced LLMs are trying to solve.
We’ll start with intuition, do some math, and then see why this technique hit a wall and ended up paving the way for the AI revolution.
You ready? Let’s begin!
Language as LEGOs
At its core, an N-gram is just a sequence of ‘N’ consecutive words from a text. Think of language as an intricate LEGO castle.
A unigram (where N=1) is like looking at each LEGO brick individually. You can count them, but you don’t know how they connect.
A bigram (N=2) is like looking at every pair of connected bricks.
A trigram (N=3) is a sequence of three connected bricks.
By counting these small, connected chunks, we start to see the “blueprints” of the language. We learn that “San” is very often followed by “Francisco”, and “the quick brown” is often followed by “fox”. This simple act of counting allows a machine to learn the statistical patterns of a language, turning the abstract chaos of human expressions into a structured, quantifiable problem.
Let’s break down this sentence: “The cat sat on the mat.”
- Unigrams (N=1): "The", "cat", "sat", "on", "the", "mat".
- Bigrams (N=2): "The cat", "cat sat", "sat on", "on the", "the mat".
- Trigrams (N=3): "The cat sat", "cat sat on", "sat on the", "on the mat".
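If you want to try the breakdown above yourself, here is a minimal sketch. The helper name `ngrams` and the naive whitespace tokenization are my own assumptions for illustration; real tokenizers handle punctuation and casing far more carefully.

```python
def ngrams(text, n):
    """Return the n-grams (as space-joined strings) of a whitespace-tokenized text."""
    # Naive tokenization: drop a trailing period, then split on whitespace.
    tokens = text.rstrip(".").split()
    # Slide a window of size n across the tokens.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The cat sat on the mat."
for n in (1, 2, 3):
    print(n, ngrams(sentence, n))
```

A sentence with 6 words yields 6 unigrams, 5 bigrams, and 4 trigrams, since the window can start at one fewer position each time N grows by one.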
This process is the first step in teaching a computer to “understand” language. Not by comprehending meaning, but by recognizing and counting common patterns.
Let’s do some math
The ultimate goal of a language model is to assign a probability to a sequence of words. In other words, is the sentence “the cat sat on the mat” more likely than “mat the on sat cat the”?
The answer is obvious to us, but a machine needs a way to compute it.
Mathematically, the probability of a sentence is the probability of the first word, times the probability of the second word given the first, times the probability of the third word given the first two, and so on. This is known as the Chain Rule of Probability.
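In mathy terms:
$$P(w_1, w_2, \dots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \dots, w_{n-1})$$
or it is more commonly written as:
$$P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})$$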
But this formula is practically impossible to use. The history of words gets too long, and we’d never have enough data to calculate the probability of a word given a unique, ten-word-long context.
This is where we have a clever shortcut called the Markov Assumption.
We simply assume that the probability of a word only depends on the previous N-1 words.
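For a bigram model (N=2), we assume a word only depends on the one word before it. The math suddenly becomes much simpler and beautifully intuitive:
$$P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}$$
Here $C(\cdot)$ is simply how many times a word (or a pair of words) shows up in the training text.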
This is much easier to read and understand.
In plain English: The probability of seeing a word is just the number of times we saw the two words together, divided by the number of times we saw the previous word.
Small example
Imagine our entire “corpus” (the text we train on) is just these three sentences:
I am Sam
Sam I am
I do not like green eggs and ham
What’s the probability of the bigram “I am”?
Count how many times “I am” appears: 2
Count how many times “I” appears: 3
Divide them: P(am|I) = 2/3 ≈ 0.67
It’s that simple really. Our model has learned that, based on this tiny world, there’s a 67% chance the word “am” will follow the word “I”.
That’s it! That’s the core of how N-gram models work. By counting and dividing.
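Here is a small sketch of that count-and-divide recipe on the same three-sentence corpus. The names `unigram_counts`, `bigram_counts`, and `bigram_prob` are hypothetical, and the sketch splits on whitespace and skips sentence-boundary markers to stay close to the example above.

```python
from collections import Counter

corpus = [
    "I am Sam",
    "Sam I am",
    "I do not like green eggs and ham",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    # Pair each token with its successor within the same sentence.
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """P(word | prev) = count(prev word) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0  # the previous word was never seen at all
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(round(bigram_prob("I", "am"), 2))  # 0.67
```

The whole "model" is just two lookup tables of counts, which is why N-gram models are so cheap to train.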
So why did it fail?
While powerful, N-gram models have some fundamental flaws that ultimately led to the development of modern neural networks. These are conceptual limits that reveal why simply counting words isn’t enough to understand language.
1. Sparsity problem
Language is creative. For any text that you use for training, there will be a vast number of perfectly valid word combinations that simply don’t appear. This is called data sparsity. The direct consequence of this is the zero-frequency problem.
According to our formula, if an N-gram’s count is zero, its probability is also zero.
This is a catastrophic failure.
The probability of a sentence is the product of its N-gram probabilities. If even one of those N-grams has a probability of zero, the probability of the entire sentence collapses to zero. The model incorrectly deems a perfectly valid sentence "impossible" just because it contains a new combination. This makes the model brittle and unreliable for any real-world use.
This is not a robust system.
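To see the collapse in action, here is a minimal sketch using the toy Dr. Seuss corpus from earlier. The function name `sentence_prob` is my own, and for simplicity it ignores the probability of the first word and multiplies only the unsmoothed bigram probabilities.

```python
from collections import Counter

corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def sentence_prob(sentence):
    """Product of unsmoothed bigram probabilities across the sentence."""
    tokens = sentence.split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0  # unseen previous word: no estimate available
        prob *= bigrams[(prev, word)] / unigrams[prev]
    return prob

print(sentence_prob("I am Sam"))           # every bigram was seen: non-zero
print(sentence_prob("I like green eggs"))  # "I like" was never seen: 0.0
```

"I like green eggs" is perfectly good English, and two of its three bigrams are in the corpus, yet the single unseen pair "I like" wipes out the whole product.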
2. Curse of dimensionality
We do also have the problem of context, right? So why don’t we just increase N to capture more context?
Well…that’s a brilliant suggestion to increase the context. But the problem is that the number of possible word combinations grows exponentially. This is what data scientists call the curse of dimensionality.
Consider a modest vocabulary of 10,000 words:
- Bigrams (N=2): 10,000² = 100 million possible combinations
- Trigrams (N=3): 10,000³ = 1 trillion possible combinations
As N increases, the “feature space” of possible N-grams expands so rapidly that even a massive training corpus covers only an infinitesimally small, sparse fraction of that space.
This leads to a fundamental trade-off: the desire for more context (higher N) is in direct conflict with the need for reliable statistics (which requires dense data, only possible with lower N).
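A quick back-of-the-envelope sketch makes the blow-up concrete. The vocabulary size comes from the example above; the one-billion-token corpus is a hypothetical number I picked purely for illustration, and the function names are my own.

```python
def possible_ngrams(vocab_size, n):
    """Number of distinct N-grams over a vocabulary: V ** N."""
    return vocab_size ** n

def max_coverage(corpus_tokens, vocab_size, n):
    """Upper bound on the fraction of possible N-grams a corpus can contain.

    A corpus of T tokens cannot contain more than T distinct N-grams.
    """
    return min(1.0, corpus_tokens / possible_ngrams(vocab_size, n))

VOCAB_SIZE = 10_000            # vocabulary size from the text
CORPUS_TOKENS = 1_000_000_000  # hypothetical corpus size, for illustration only

for n in range(1, 5):
    print(f"N={n}: {possible_ngrams(VOCAB_SIZE, n):,} possible, "
          f"at most {max_coverage(CORPUS_TOKENS, VOCAB_SIZE, n):.4%} covered")
```

Even in the best case, where every N-gram in that billion-token corpus is unique, it could cover at most 0.1% of all possible trigrams, and the fraction shrinks by another factor of 10,000 for every extra word of context.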
3. Semantic blindness
This is the biggest conceptual hurdle.
To an N-gram model, words are just discrete, unrelated symbols. It has no concept of semantics or similarity. If it learns "the king sat on the throne," that knowledge gives it zero information to help it understand "the queen sat on the throne." From the model's perspective, "king" and "queen" are as different as "king" and "chainsaw."
This also leads to context blindness. The model can't distinguish between the "bank" in "she went to the river bank" and the "bank" in "she went to the financial bank." Its understanding is trapped within the local N-gram, preventing it from using the wider sentence to resolve ambiguity.
4. Fixed context window
The Markov assumption itself is a major weakness. The model is blind to anything that happened before its fixed N-1 word window, so it cannot capture long-range dependencies.
Consider this sentence:
“The woman who wrote several critically acclaimed books over the last decade is coming to speak.”
To correctly predict the verb "is," a model needs to know the subject is the singular noun "woman." But a trigram model trying to predict "is" only has access to the context "last decade." It has completely forgotten the word "woman," which appeared much earlier. The crucial grammatical link is broken, making it impossible for the model to enforce proper subject-verb agreement.
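The sketch below shows exactly what a model sees when predicting "is" in that sentence. The helper `context_for` is hypothetical, and tokens are split on whitespace with the final period dropped.

```python
def context_for(tokens, target_index, n):
    """Return the N-1 tokens an N-gram model sees before tokens[target_index]."""
    start = max(0, target_index - (n - 1))
    return tokens[start:target_index]

sentence = ("The woman who wrote several critically acclaimed books "
            "over the last decade is coming to speak")
tokens = sentence.split()
target = tokens.index("is")

print(context_for(tokens, target, 3))  # ['last', 'decade']

# Smallest N whose window reaches back far enough to include the subject.
smallest_n = target - tokens.index("woman") + 1
print(smallest_n)  # 12
```

You would need a 12-gram model just to keep "woman" in view for this one sentence, and as the previous section showed, statistics for N-grams that long are hopelessly sparse.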
They aren’t all limited
For all the limitations we just discussed, N-grams aren’t just a part of history. They are still incredibly useful and widely deployed in the real world for tasks where their strengths shine.
Feature engineering
N-grams are excellent “features” for classical machine learning models like SVMs or logistic regression. In tasks like sentiment analysis, the bigram “not good” is a much stronger negative signal than the unigrams “not” and “good” separately. Libraries like scikit-learn, with its CountVectorizer, make it simple to build a “bag-of-n-grams” model that can dramatically improve classification accuracy for spam detection, topic modelling and more.
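To show what a “bag-of-n-grams” looks like, here is a plain-Python stand-in. It is not scikit-learn’s implementation; the function `bag_of_ngrams` is my own, and its `ngram_range` argument is only modelled on the CountVectorizer parameter of the same name. The two reviews are made up for illustration.

```python
from collections import Counter

def bag_of_ngrams(text, ngram_range=(1, 2)):
    """Count every n-gram of each size in ngram_range (inclusive)."""
    tokens = text.lower().split()
    low, high = ngram_range
    bag = Counter()
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            bag[" ".join(tokens[i:i + n])] += 1
    return bag

positive = bag_of_ngrams("the food was good")
negative = bag_of_ngrams("the food was not good")

# Features that appear only in the negative review.
print(sorted(set(negative) - set(positive)))
```

With unigrams alone, the only difference between the two reviews is the word "not". Once bigrams are included, "not good" becomes its own feature that a classifier can learn a strong negative weight for.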
Powering search and Information retrieval
Search engines use N-grams to better understand multi-word queries and improve the relevance of results. The Google Books Ngram Viewer is an amazing public tool that lets you chart the frequency of phrases over centuries of printed text. The same core idea is used for plagiarism detection, where systems compare the N-gram overlap between documents to find copied content.
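As a rough sketch of the overlap idea, the snippet below compares two texts by the share of trigrams they have in common (Jaccard similarity). This is one simple way to measure overlap, not a description of how any particular plagiarism checker works, and the function names and example sentences are my own.

```python
def ngram_set(text, n=3):
    """Set of word n-grams in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard_overlap(doc_a, doc_b, n=3):
    """Shared n-grams divided by all distinct n-grams across both texts."""
    a, b = ngram_set(doc_a, n), ngram_set(doc_b, n)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

original = "the quick brown fox jumps over the lazy dog"
suspect = "a quick brown fox jumps over a sleeping dog"
unrelated = "completely different words appear in this sentence"

print(jaccard_overlap(original, suspect))
print(jaccard_overlap(original, unrelated))
```

The lightly edited copy still shares a run of trigrams with the original, while the unrelated sentence shares none.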
Modern evaluation metrics
The core idea of N-gram overlap is still used to evaluate today's most sophisticated generative models. Metrics like BLEU (for machine translation) and ROUGE (for text summarization) work by comparing the N-grams in a machine-generated text to those in a high-quality human reference.
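Here is a minimal sketch of the building block behind BLEU: clipped N-gram precision, i.e. the share of the candidate’s N-grams that also appear in the reference. This is only one ingredient; the full BLEU score also combines several values of N and applies a brevity penalty, which this sketch leaves out. The function names and example sentences are my own.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count each n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference.

    Each n-gram is credited at most as often as it occurs in the reference.
    """
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total

reference = "the cat sat on the mat"
candidate = "the cat sat on a mat"

print(clipped_precision(candidate, reference, 1))  # 5 of 6 unigrams match
print(clipped_precision(candidate, reference, 2))  # 3 of 5 bigrams match
```

The clipping matters: without it, a candidate like "the the the" would score perfectly against any reference containing "the".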
That’s why we need more “context”
The limitations of N-grams are precisely what motivated the shift to neural networks and eventually, the Transformer architecture. The evolution can be seen as a quest to create a better representation of context.
For instance, neural models solved the "no sense of meaning" problem with word embeddings. Instead of treating words as discrete symbols, they represent them as vectors in a multi-dimensional space, where words with similar meanings (like "king" and "queen") are located close to each other. This allows the model to generalize and understand relationships it has never explicitly seen.
Understanding N-grams is the first step to understanding the “why” behind more sophisticated architectures and neural networks. It also makes you appreciate the first principles of language modelling a bit more.
In upcoming issues, we will break down how we “kinda” solved context with RNNs and LSTMs and, more importantly, the “why” behind everything.
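To make "close to each other" concrete, here is a toy sketch using cosine similarity. The three-dimensional vectors are hypothetical values I hand-picked for illustration; real embeddings are learned from data and typically have hundreds of dimensions.

```python
import math

# Hypothetical embeddings, hand-picked for illustration only (not learned).
embeddings = {
    "king":     [0.9, 0.8, 0.1],
    "queen":    [0.9, 0.7, 0.2],
    "chainsaw": [0.1, 0.0, 0.9],
}

def cosine(u, v):
    """Cosine similarity: near 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine(embeddings["king"], embeddings["queen"]))
print(cosine(embeddings["king"], embeddings["chainsaw"]))
```

With vectors like these, "king" and "queen" score far higher than "king" and "chainsaw", which is exactly the notion of similarity an N-gram model has no way to express.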
Resources
The map of all that we are going to cover: LLM4N roadmap
The Most Important Machine Learning Equations: Link



