LLM4N9: Let's train our Transformer
playing the training game, predicting next token, cross-entropy loss and more
Hey everyone!
Welcome back to another issue of LLM4N: LLM For Noobs!
A 16-part series designed to take you from zero to one in understanding Large Language Models.
In our journey so far, we have been architects.
We laid the statistical foundations with N-grams, built sequential memory with LSTMs, and, in recent installments, we assembled a modern marvel: the Transformer. We constructed its core engine, the Self-Attention mechanism, scaled it into a Multi-Head “panel of experts,” and stabilized the entire towering structure with Residual Connections, Layer Normalization, and Feed-Forward Networks.
What we have now is a magnificent machine. But the machine is silent.
Its billions of parameters are just collections of small, random numbers. It is a beautifully constructed piano that has never been played. It has the complete capacity for intelligence but no intelligence itself.
So, how do we teach it? How do we feed the entire internet into this structure and end up with something that can write a poem or solve an equation?
The answer is simple.
We don’t teach it in the traditional sense. We make it play a game. A simple, relentless, and powerful game called “Predict the Next Word.”
In this issue, we will break down this game and see how it leads to “intelligent” behaviour in these LLMs.
Let’s begin!
A child’s game
I hate to break it to you, but there is no AGI (yet). Your favourite LLM doesn’t actually understand what you mean. It has a very simple objective: Next-Token Prediction. Given a sequence of text, the model’s one and only job is to predict the most probable token (a word or sub-word) that comes next.
And somehow this next-token prediction makes us believe that these models are intelligent.
This training style is also called self-supervised learning.
To understand why this is so powerful, let’s contrast it with traditional supervised learning. To teach a model to identify cats, you need a human to label thousands of images as “cat” or “not cat.” This dataset is expensive and slow to create.
Here, the training data itself provides the answers.
Think of a child learning language. They aren’t given grammar textbooks. They are immersed in a world of conversation and stories. With every sentence they hear, their brain is implicitly predicting what word will come next.
They hear, “The cat sat on the...” and their little brain might guess “mat!” or “floor!”
If the next word is “mat,” their internal model is reinforced. “Great! My prediction was right.”
If the next word is “rug,” their model is updated. “Ah, noted. In this context, it’s ‘rug.’”
This constant cycle of prediction and correction is the engine of learning. The child isn’t trying to “learn grammar” or “absorb facts.” They are just trying to be less surprised by the world. And in doing so, they organically learn the detailed rules of grammar, the meanings of words, and even facts about reality.
Our LLM learns in exactly the same way, just on a vastly larger and faster scale. It reads trillions of words, and for every single one, it makes a prediction, checks the answer, and adjusts.
The true revolution of modern AI is just this. A radically simple goal, combined with an unprecedented scale of data and computation.
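To make “the training data itself provides the answers” concrete, here is a minimal sketch of how a single sentence turns into labelled training examples with no human annotation. The whitespace tokenizer and the function name `make_training_pairs` are illustrative assumptions; real models use sub-word tokenizers.

```python
def make_training_pairs(text):
    """Turn raw text into (context, next_token) examples with no human labels."""
    tokens = text.split()  # assumption: naive whitespace tokenization
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[:i]  # everything seen so far
        target = tokens[i]    # the token that actually came next
        pairs.append((context, target))
    return pairs


pairs = make_training_pairs("The cat sat on the mat")
for context, target in pairs:
    print(" ".join(context), "->", target)
```

Every position in the text becomes one round of the game, which is why a six-word sentence already yields five training examples.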
The rules of the game
Well...as math nerds, we of course cannot leave it up to language mumbo-jumbo. We need mathematical equations that make sense. To turn this game into something a computer can execute, let’s formalize it mathematically.
Goal is to maximize probability (that’s all!)
The model’s job is to learn a probability distribution. Let’s take a sequence of tokens, x_1, x_2, x_3, ..., x_T. The joint probability of this whole sequence can be broken down using the chain rule of probability:
$$P(x_1, x_2, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})$$
In plain English, this just means the probability of an entire sentence is the product of the probabilities of each word, given all the words that came before it. Our model, defined by its parameters θ, must learn a distribution that gets as close as possible to this true probability.
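The chain-rule factorization is easy to check numerically. The sketch below uses made-up conditional probabilities for a three-token sentence (they are illustrative, not outputs of any real model) and shows why implementations usually add log-probabilities instead of multiplying raw probabilities.

```python
import math

# Hypothetical conditional probabilities a model might assign to each
# token of "the cat sat", given the tokens before it (made-up numbers).
conditional_probs = [0.20, 0.05, 0.30]

# Chain rule: the joint probability is the product of the conditionals.
joint_prob = math.prod(conditional_probs)

# In practice we sum log-probabilities instead, because multiplying many
# small numbers underflows to zero for long sequences.
log_joint = sum(math.log(p) for p in conditional_probs)

print(joint_prob)
print(math.exp(log_joint))
```

Both printed values agree up to floating-point error; the log form is the one that survives sequences thousands of tokens long.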
The scorecard of the game
Alright, we know the “what” of the game.
But how do we measure “wrongness” mathematically?
We use a loss function called Categorical Cross-Entropy.
It measures the “surprise” the model feels when it sees the correct answer.
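For a single prediction, the loss is brutally simple:
$$\text{loss} = -\log(p)$$
where p is the probability the model assigned to the correct next token.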
Let’s break down the magic of this -log(p) function:
If the model is confident and correct (it assigns P = 0.99 to the right word), it’s barely surprised. So, -log(0.99) is a tiny loss, about 0.01.
If the model is confident and wrong (it assigns P = 0.01 to the right word), it’s utterly shocked. -log(0.01) is a huge loss, about 4.6.
This is the key.
The logarithmic nature of the loss heavily punishes confident mistakes. This strong signal is what forces the model to learn quickly and effectively. It’s not enough to be vaguely unsure; the model is driven to become confidently correct.
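The two numbers above are easy to reproduce. This is a small sketch using the natural logarithm; the helper name `surprise` is an illustrative choice, not a standard API.

```python
import math

def surprise(p_correct):
    """Cross-entropy loss for one prediction: -log of the probability
    the model assigned to the token that actually came next."""
    return -math.log(p_correct)

for p in (0.99, 0.5, 0.01):
    print(f"p = {p:<4}  loss = {surprise(p):.3f}")
```

Notice how the loss grows slowly while the model is roughly right and steeply as the probability of the correct token approaches zero.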
The choice of loss here is deeply connected to the statistical principle of Maximum Likelihood Estimation. Minimizing the cross-entropy loss across our entire dataset is the same as maximizing the probability the model assigns to that dataset. In other words, we are finding the parameters under which the text we actually observed is most likely.
The engineering is grounded in century-old statistics.
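The link to Maximum Likelihood Estimation can be written out in three lines. This is a sketch using the token notation from earlier in this issue; writing the model’s distribution as P_θ and the best parameters as θ* is a notational assumption.

```latex
\begin{aligned}
\theta^{*} &= \arg\max_{\theta} \prod_{t=1}^{T} P_{\theta}(x_t \mid x_1, \dots, x_{t-1}) \\
           &= \arg\max_{\theta} \sum_{t=1}^{T} \log P_{\theta}(x_t \mid x_1, \dots, x_{t-1}) \\
           &= \arg\min_{\theta} \; \underbrace{-\frac{1}{T} \sum_{t=1}^{T} \log P_{\theta}(x_t \mid x_1, \dots, x_{t-1})}_{\text{average cross-entropy loss}}
\end{aligned}
```

The first step holds because the logarithm is monotonic, and the second because flipping the sign turns a maximum into a minimum while dividing by T does not move it.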
The Core Loop
So, we have our goal and our scorekeeper. Now, let’s look at the gameplay loop itself. This cycle is the heartbeat of deep learning, repeated billions of times until intelligence emerges.
Let’s walk through its five stages.
Stage 1: The Forward Pass
We take a batch of text sequences and feed them into our Transformer. The data flows through every layer we built: the embedding layer, then the deep stack of Multi-Head Attention and Feed-Forward blocks. After this journey, the final layer produces a vector of raw scores, called logits, one for every token in the vocabulary. The model has processed the input and is now ready to place its bet.
Stage 2: The Probability Conversion (Softmax)
The logits are just raw numbers. To turn them into a proper probability distribution, we use the softmax function. It transforms the scores so that they are all positive and sum to 1. Now, for each possible next token, we have a clear probability. The model has officially made its prediction.
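Here is a minimal softmax sketch in plain Python. The four-token vocabulary and the logit values are made up for illustration; subtracting the maximum logit is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that are positive and sum to 1."""
    # Subtracting the max does not change the result but avoids overflow
    # in exp() when logits are large.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits over a tiny 4-token vocabulary.
vocab = ["mat", "rug", "floor", "moon"]
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token:>5}: {p:.3f}")
```

The ordering of the logits is preserved, so the highest-scoring token also gets the highest probability.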
Stage 3: Loss Calculation
We now compare the model’s prediction against reality: the actual next word from the training data. Using our cross-entropy scorekeeper, we calculate the loss.
How surprised was the model?
This single number, typically the average over every prediction in the batch, summarizes how wrong the model was.
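As a sketch of how a batch collapses into one number, the snippet below averages -log(p) over three predictions. The probabilities and target indices are made up, and averaging (rather than summing) is an assumption here, though it is a common convention.

```python
import math

def batch_cross_entropy(batch_probs, targets):
    """Average -log(probability of the correct token) over a batch."""
    losses = [-math.log(probs[t]) for probs, t in zip(batch_probs, targets)]
    return sum(losses) / len(losses)

# Hypothetical softmax outputs for 3 positions over a 4-token vocabulary.
batch_probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.05, 0.05, 0.10, 0.80],
]
targets = [0, 2, 3]  # index of the token that actually came next

loss = batch_cross_entropy(batch_probs, targets)
print(f"batch loss = {loss:.3f}")
```

The middle prediction, where the model spread its bet evenly, contributes the most to the loss.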
Stage 4: Backward Pass
This is where the real magic happens.
The loss value tells us how wrong the model was, but not why.
Backpropagation is the algorithm that answers the “why.” Using calculus (the chain rule, applied layer by layer), it computes the gradient for every single one of the model’s billions of parameters. A gradient measures how much the loss would change if that parameter were nudged slightly, which tells us how much each parameter contributed to the final error.
This is where our architectural choices from LLM4N8 pay off!
In a very deep network, the error signal can fade away as it travels backward, a problem known as vanishing gradients. But our Residual Connections act as gradient superhighways, ensuring the learning signal flows powerfully from the final layer all the way back to the first. This is what makes training these deep behemoths possible.
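The “superhighway” effect can be seen in one line of calculus. This is a sketch assuming the standard residual form y = x + F(x), where F stands in for an attention or feed-forward sub-layer and L is the loss; the symbols are illustrative.

```latex
y = x + F(x)
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial x}
= \frac{\partial \mathcal{L}}{\partial y}\left(I + \frac{\partial F}{\partial x}\right)
= \frac{\partial \mathcal{L}}{\partial y}
+ \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial F}{\partial x}
```

The first term passes the incoming gradient through unchanged, so even when the sub-layer’s own derivative is tiny, the signal reaching earlier layers does not shrink to zero.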
Stage 5: Parameter Updates
Now that we know how much each parameter contributed to the error, we use a sophisticated optimizer like AdamW (blog on it coming soon btw!) to give each parameter a tiny nudge in the opposite direction of its gradient.
It’s a small step “downhill” on the loss landscape. Every weight matrix in every attention head and FFN is adjusted just a little. The model becomes a little bit better at the game.
And then the loop repeats. A new batch of text. A new forward pass. A new bet. A new calculation of error. A new backward pass. A new nudge.
This beautiful, brutal simplicity, played out over trillions of tokens across thousands of GPUs for weeks on end, is what forges intelligence from raw statistics.
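To see all five stages in one place, here is a deliberately tiny sketch. Two things are swapped in and should be read as assumptions: the model is a bigram table of logits (one row per previous token) rather than a Transformer, and the optimizer is plain gradient descent rather than AdamW. The gradient is derived by hand for this one-layer case instead of by backpropagation through many layers.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy corpus and vocabulary (assumption: whitespace tokens, tiny data).
tokens = "the cat sat on the mat . the cat sat on the rug .".split()
vocab = sorted(set(tokens))
idx = {tok: i for i, tok in enumerate(vocab)}
V = len(vocab)

# Parameters: one row of logits per previous token (a bigram table).
# Zeros instead of small random numbers keep the example deterministic.
W = [[0.0] * V for _ in range(V)]
pairs = [(idx[a], idx[b]) for a, b in zip(tokens, tokens[1:])]
learning_rate = 0.5

history = []
for step in range(200):
    grad = [[0.0] * V for _ in range(V)]
    loss = 0.0
    for prev, nxt in pairs:
        logits = W[prev]                   # Stage 1: forward pass
        probs = softmax(logits)            # Stage 2: softmax
        loss += -math.log(probs[nxt])      # Stage 3: cross-entropy
        for j in range(V):                 # Stage 4: backward pass
            # d(loss)/d(logit_j) is prob_j, minus 1 for the target token
            grad[prev][j] += probs[j] - (1.0 if j == nxt else 0.0)
    loss /= len(pairs)
    history.append(loss)
    for i in range(V):                     # Stage 5: parameter update
        for j in range(V):
            W[i][j] -= learning_rate * grad[i][j] / len(pairs)

print(f"loss at step 0:   {history[0]:.3f}")
print(f"loss at step 199: {history[-1]:.3f}")
```

The starting loss equals log(V), the surprise of a model that guesses uniformly, and it falls as the table learns which tokens follow which. It never reaches zero because “the” is genuinely followed by different words in the corpus.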
Emergent Intelligence
This raises a good question.
How does mastering “Predict the Next Word” lead to models that can reason, code, and create?
The answer is that to become a truly excellent predictor, the model is forced to learn much more than just surface-level statistics. It must implicitly build an internal model of the world to understand the content it’s predicting. The complex abilities we observe are not explicitly programmed; they are emergent abilities: sub-skills the model discovers as necessary to minimize its prediction error.
Think about it. To get really good at the game, the model must learn:
Grammar: To consistently predict “jumps” after “The quick brown fox...” it must internalize syntax.
Factual Knowledge: To predict “Paris” after “The capital of France is...”, it must absorb and store facts about the world.
Reasoning: To reliably predict “5” after “2 + 3 =”, it must learn the abstract rules of arithmetic.
This process is a form of massive, implicit compression. The training data is far too large to be memorized. To efficiently store the information, the model must learn the underlying generative rules that produce the data.
Rules about grammar, logic, and the causal structure of the world.
This leads to fascinating emergent abilities that appear suddenly at scale, such as:
In-Context Learning: The ability to perform a new task (like translation) from just a few examples in the prompt, which smaller models can’t do.
Chain-of-Thought Reasoning: The ability to solve complex logic problems by “thinking step-by-step,” a technique that only works on very large models.
This complexity is not programmed. It emerges as the most efficient solution for minimizing the prediction error on a massive, diverse dataset.
A model that’s really good at playing the game
The whole process we have broken down is called Pre-training. The result is a base model: a powerful, knowledgeable machine that’s really good at playing the next-token prediction game.
But it is a raw and untamed player. Its sole drive is to produce the most statistically probable sequence. But that’s not really all that helpful.
Why?
If you ask it a question, it might just continue the question. If you give it a harmful prompt, it might complete it with harmful text, simply because that’s what it saw on the internet. It has no inherent concept of following instructions, being truthful, or adhering to safety guidelines. It is a powerful engine without a steering wheel.
This brings us to our next great challenge.
We have created an incredibly knowledgeable but unpredictable entity. How do we now align this raw power with human values and intent? How do we teach it to follow instructions?
That is the goal of the next stage: Supervised Fine-Tuning (SFT), where we give our base model a curated curriculum to transform it from a brilliant predictor into a capable assistant. The architecture is built. The mind has been awakened. Now, we must teach it how to behave.
But that...is a story for LLM4N10.
See ya next week 👋👋







Wow, the analogy of the silent piano with billions of parameters really hit different for me. If intelligence emerges from just prediction, where do we even begin to draw the line for true understanding?