LLM4N8: The Complete Transformer Block
Combining MHA, LayerNorm, and a Feed-Forward Network to get the engine running.
Hey my few attentive readers, welcome back to LLM4N!
A 16-part series designed to take you from zero to one in understanding Large Language Models.
In our last issue, we built the Multi-Head Attention mechanism, a key engine that lets words in a sentence communicate directly. We gave our model a “panel of experts” by projecting our input sequence into multiple specialized subspaces using learned query, key, and value weight matrices, one set per head.
Each attention head then performed independent self-attention calculations, producing distinct context matrices. These matrices were finally synthesized into a unified representation through a learned projection.
But this brilliant communication system creates two fundamental challenges when we try to build deep networks by stacking layers:
The Stability Problem: How do we prevent signal degradation and vanishing gradients across dozens of layers?
The Processing Problem: Attention tells tokens what information matters, but how does each token non-linearly transform its newly acquired context into deeper understanding?
Today, we solve both challenges by constructing the complete Transformer Block, the fundamental building block that makes modern LLMs possible.
In this issue, we’ll assemble three crucial puzzle pieces that transform our attention mechanism into a stable, powerful computational unit:
First, we’ll tackle stability with two complementary techniques:
Residual Connections that create “information superhighways” ensuring gradients flow unimpeded through deep networks
Layer Normalization that stabilizes the statistical distribution of activations at each layer
Then, we’ll add computational depth with the:
Position-wise Feed-Forward Network that gives each token its own powerful non-linear transformation capability
Finally, we’ll assemble these components into the complete Transformer Block and explore how stacking these blocks enables hierarchical understanding of language, from local syntax in early layers to global semantics in deeper ones.
The architecture we’re building today isn’t just theoretical; it’s essentially the same structure used in GPT, BERT, and other foundation models, which differ mainly in details such as attention masking and where the normalization sits.
Let’s dive in and complete our understanding of the Transformer’s core machinery.
Building Stability and Why?
To build a bullet-proof engine, we need more than just the parts. We need a strong frame that connects everything and ensures the machine is stable.
Stacking neural networks is no different. If we just pile our MHA layers on top of each other, we run headfirst into the same old villain we met back in the age of RNNs: the vanishing (and exploding) gradient problem.
The learning signal (the gradient) has to travel backward through all the layers during training. In a very deep network, this signal can either shrink to nothing (vanish), so the early layers stop learning, or grow uncontrollably (explode), so the weight updates become unstable. The Transformer’s architects addressed this with two elegant pieces of engineering that work in perfect harmony.
Residual Connections: The Information Highway
This is an extremely simple mathematical concept, but it has deep consequences. For any sub-layer Sublayer(x), we define:
Output = x + Sublayer(x)
During backprop, when we compute the gradient of the loss L with respect to x, we get:
∂L/∂x = ∂L/∂Output · (I + ∂Sublayer(x)/∂x)
The I term is the key. It creates a direct, uninterrupted channel for the gradient. Even if the gradient through the sub-layer (∂Sublayer(x)/∂x) becomes very small, this I ensures that the gradient from the output can flow directly back to the input, preventing the signal from vanishing entirely.
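To see the effect of the I term numerically, here is a minimal NumPy sketch. It is a toy illustration under simplifying assumptions, not a real Transformer: each sub-layer is replaced by a hypothetical linear map whose Jacobian has a largest singular value of 0.1, and the helper names (make_small_jacobian, backprop_through_stack) are invented for this example.

```python
import numpy as np

def make_small_jacobian(rng, dim, scale=0.1):
    """Random matrix rescaled so its largest singular value equals `scale`."""
    a = rng.standard_normal((dim, dim))
    return scale * a / np.linalg.norm(a, 2)

def backprop_through_stack(jacobians, residual):
    """Multiply the per-layer Jacobians together, as the chain rule does in backprop."""
    dim = jacobians[0].shape[0]
    total = np.eye(dim)
    for j in jacobians:
        # With a residual connection each layer contributes I + dSublayer/dx,
        # without one it contributes only dSublayer/dx.
        layer = np.eye(dim) + j if residual else j
        total = layer @ total
    return total

rng = np.random.default_rng(0)
depth, dim = 20, 8
jacobians = [make_small_jacobian(rng, dim) for _ in range(depth)]

plain_norm = np.linalg.norm(backprop_through_stack(jacobians, residual=False), 2)
residual_norm = np.linalg.norm(backprop_through_stack(jacobians, residual=True), 2)

print(f"gradient scale without residuals: {plain_norm:.2e}")
print(f"gradient scale with residuals:    {residual_norm:.2e}")
```

In this toy setup the plain stack shrinks the signal by a factor of at least 10 per layer, while the residual stack keeps it at a usable scale after 20 layers.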
Counterintuitively, making plain networks deeper can sometimes make performance worse, even on the training data. This is an optimization failure rather than overfitting: the network struggles to learn identity mappings where they would be optimal.
Residual connections make identity mappings trivial to learn. If the sub-layer’s output F(x) is 0, then x + F(x) = x. The network can easily learn to “do nothing” when that’s the best option, so adding more layers is far less likely to hurt performance.
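The “do nothing” case is easy to check in code. The sketch below assumes a hypothetical two-layer sub-layer whose output weights are set to zero purely for illustration; the names residual_block, w_in, and w_out are not from any library.

```python
import numpy as np

def residual_block(x, w_in, w_out):
    """x + F(x), where F is a small two-layer network with a ReLU in between."""
    return x + np.maximum(0.0, x @ w_in) @ w_out

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8))        # 3 tokens, 8 features each
w_in = rng.standard_normal((8, 32))
w_out = np.zeros((32, 8))              # F(x) = 0 for every input

y = residual_block(x, w_in, w_out)
print(np.array_equal(y, x))            # the block passes its input through unchanged
```

Without the skip connection, the same weights would wipe out the input entirely instead of preserving it.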
Now that we know what residual connections do, let’s move on to LayerNorm.
The Moving Target Problem & LayerNorm
As network weights update during training, the scale and distribution of inputs to each layer shifts constantly. This phenomenon is called Internal Covariate Shift. Each layer is trying to hit a moving target, slowing down convergence and requiring careful parameter initialization.
Layer Normalization is a technique that stabilizes these distributions by enforcing consistent statistics. The operation:
LayerNorm(x) = γ · (x − μ) / √(σ² + ε) + β
ensures that regardless of how the previous layer’s distribution shifts, the input to the next layer has consistent mean and variance. Here μ and σ² are the mean and variance computed over the features of each token vector, and ε is a small constant for numerical stability. The learnable parameters γ and β then restore the representational power, allowing the network to learn whatever scaling and shifting is actually useful.
This is a crucial detail. Forcing every layer’s input to a strict mean of 0 and variance of 1 might be too restrictive. By introducing these learnable parameters, we give the network the flexibility to scale and shift the normalized distribution to whatever is optimal. It can effectively learn to “undo” the normalization if that’s what’s best for minimizing the overall loss.
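Here is a minimal NumPy sketch of the operation, assuming the statistics are computed over the last (feature) axis of each token vector. The function name layer_norm and the eps default are choices made for this example.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
# 4 tokens with 16 features each, deliberately shifted and scaled
x = 5.0 + 3.0 * rng.standard_normal((4, 16))

# gamma = 1, beta = 0 gives plain normalization
normalized = layer_norm(x, gamma=np.ones(16), beta=np.zeros(16))
print(normalized.mean(axis=-1))  # each entry close to 0
print(normalized.std(axis=-1))   # each entry close to 1

# Other values of gamma and beta move the distribution wherever the network needs it
rescaled = layer_norm(x, gamma=np.full(16, 2.0), beta=np.full(16, 0.5))
print(rescaled.mean(axis=-1))    # each entry close to 0.5
print(rescaled.std(axis=-1))     # each entry close to 2
```

In a real model γ and β are learned per feature during training; the constant values here are only to make the effect visible.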
Geometric benefits for Attention
LayerNorm provides an additional, subtle benefit for attention mechanisms. By normalizing each token’s vector to have approximately the same norm, it prevents attention scores from being dominated by a few high-magnitude vectors. This ensures that the softmax distribution reflects genuine semantic relevance rather than vector magnitude artifacts.
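A small hand-built example makes this concrete. It is a simplified sketch: the query and keys are used directly as token vectors (no learned projections), the normalization has no γ or β, and the numbers are chosen by hand so that one key is only partly aligned with the query but very large.

```python
import numpy as np

def normalize(x, eps=1e-5):
    """LayerNorm without the learnable gamma/beta, applied along the last axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def attention_weights(query, keys):
    """Scaled dot-product scores followed by a softmax over the keys."""
    scores = keys @ query / np.sqrt(query.shape[-1])
    scores = scores - scores.max()  # subtract the max for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()

query = np.array([2.0, 0.0, -2.0, 0.0])
keys = np.array([
    [1.0, 0.0, -1.0, 0.0],       # same direction as the query, small magnitude
    [30.0, 10.0, -10.0, -30.0],  # only partly aligned, but very large magnitude
    [0.0, 1.0, 0.0, -1.0],       # orthogonal to the query
])

raw = attention_weights(query, keys)
normed = attention_weights(normalize(query), normalize(keys))
print("raw weights:       ", raw.round(3))
print("normalized weights:", normed.round(3))
```

On the raw vectors the large key takes almost all of the attention; after normalization the best-aligned key gets the largest weight.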
Residual connections and LayerNorm work together perfectly. Residual connections ensure gradient flow, while LayerNorm ensures the data flowing through those connections remains well-conditioned. Without this combination, training Transformers with dozens or hundreds of layers would be far harder, if not impractical.
These are fundamental requirements for making deep networks work.
Positioning the LayerNorm
The placement of the Layer Normalization relative to the residual connection has a significant impact on training stability. There are mainly two kinds:
Post-LN (Original Architecture): The original “Attention Is All You Need” paper implemented what is now known as the Post-LN architecture. Here, the normalization is applied after the residual addition:
Output = LayerNorm(x + Sublayer(x))
This architecture was found to be somewhat unstable to train and often requires a “learning rate warmup” phase, where the learning rate is slowly increased at the beginning of training to prevent divergence.
Pre-LN (Modern Architecture): Later research demonstrated that applying normalization before the sub-layer leads to more stable training. This is the Pre-LN architecture:
Output = x + Sublayer(LayerNorm(x))
By normalizing the input to the sub-layer, the gradients are better behaved at initialization, and the optimization landscape is smoother. This often eliminates the need for a learning rate warmup, simplifying the training process and improving stability. Most modern Transformer implementations have adopted the Pre-LN variant for this reason.
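The two placements differ by a single line of code. The sketch below uses a hypothetical stand-in sub-layer (a tanh of a small linear map) and a LayerNorm without γ and β to keep things short; post_ln and pre_ln are names made up for this example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """LayerNorm over the feature axis, without learnable parameters for brevity."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln(x, sublayer):
    """Original placement: normalize after the residual addition."""
    return layer_norm(x + sublayer(x))

def pre_ln(x, sublayer):
    """Modern placement: normalize only the sub-layer's input."""
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
w = 0.1 * rng.standard_normal((8, 8))

def sublayer(h):
    """Hypothetical stand-in for attention or the FFN."""
    return np.tanh(h @ w)

x = rng.standard_normal((3, 8))
y_post = post_ln(x, sublayer)
y_pre = pre_ln(x, sublayer)

print(y_post.mean(axis=-1))  # close to 0: the whole output was normalized
print(np.allclose(y_pre - x, sublayer(layer_norm(x))))  # x itself passed through untouched
```

In the Pre-LN version the x term reaches the output without going through the normalization, so the residual path stays a clean identity. That is one common explanation for its better-behaved gradients.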
Now that we know the “WHY” behind Residual Connections and LayerNorm, it’s time to move to the next step.
Position-wise Feed-Forward Network (FFN)
Now, let’s solve the second problem: processing.
We’ve used Multi-Head Attention to let tokens communicate. It’s a lively group discussion where every word can listen to every other word. But after the meeting, each individual needs private time to think and process what they’ve learned. That’s the job of the Position-wise Feed-Forward Network (FFN).
Think of it this way.
Attention decides what information is important. The FFN decides what it means.
The FFN is a small but powerful neural network applied independently to every single token’s representation. Its architecture follows a clever “expand-and-compress” strategy, and the reasoning behind it is profound:
Expand: It takes your token vector (e.g., 512-dimensional) and projects it into a much larger space (e.g., 2048-dimensional).
Why?
This expansion gives the model the necessary “room to think”: the computational headroom to construct complex, non-linear transformations that are hard to express in the original, cramped space.
Activate: It applies a non-linear activation function. The original Transformer used ReLU, max(0, x); many newer models use smooth variants such as GELU.
Why?
This is the source of its real power. Without this non-linearity, the entire two-layer network could be collapsed into a single, simple linear transformation. ReLU breaks this linearity, allowing the model to learn the intricate, piece-wise linear functions that capture the nuances of language. It also acts like a filter, sparsely activating only the most relevant patterns.
Compress: It then projects the result back down to the original dimension (e.g., 512).
Why?
This step distills the new, rich understanding into a dense, refined representation that’s the right size to be passed to the next layer. It also makes the output compatible with the residual connection, which requires the input and output to be the same size.
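The three steps fit in a few lines of NumPy. This is a minimal sketch using the 512 and 2048 dimensions from the example above; the random weights and the 0.02 scale are placeholders, not trained values. The last lines also check the claim about linearity: without the ReLU, the two matrices collapse into one.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def ffn(x, w1, b1, w2, b2):
    """Expand to d_ff, apply ReLU, compress back to d_model."""
    hidden = relu(x @ w1 + b1)   # expand + activate: (tokens, d_ff)
    return hidden @ w2 + b2      # compress:          (tokens, d_model)

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
w1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
w2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

x = rng.standard_normal((10, d_model))   # 10 token vectors
out = ffn(x, w1, b1, w2, b2)
print(out.shape)                         # (10, 512)

# Without the ReLU, two linear layers are just one linear layer in disguise
two_step = (x @ w1) @ w2
one_step = x @ (w1 @ w2)
print(np.allclose(two_step, one_step))
```

Because the output has the same shape as the input, it can be added straight back onto the residual stream.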
Why “Position-wise”?
It’s an important detail. It means the exact same FFN, with the same weights, is applied to every token in the sequence. However, each token processes its own unique, context-aware vector completely independently. This makes the computation massively parallelizable and incredibly efficient on hardware like GPUs.
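A quick way to convince yourself of the “position-wise” property is to run the FFN on the whole sequence at once and then token by token. The sketch below uses a small bias-free FFN with made-up sizes to keep it short.

```python
import numpy as np

def ffn(x, w1, w2):
    """Bias-free FFN: works on a single token vector or on a whole sequence."""
    return np.maximum(0.0, x @ w1) @ w2

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 5
w1 = rng.standard_normal((d_model, d_ff))
w2 = rng.standard_normal((d_ff, d_model))
x = rng.standard_normal((seq_len, d_model))

batched = ffn(x, w1, w2)                                  # all tokens in one matrix multiply
one_by_one = np.stack([ffn(tok, w1, w2) for tok in x])    # same weights, one token at a time
print(np.allclose(batched, one_by_one))

# Changing one token does not affect the FFN output of any other token
x_changed = x.copy()
x_changed[0] += 1.0
changed = ffn(x_changed, w1, w2)
print(np.allclose(changed[1:], batched[1:]))
```

The batched form is what runs on a GPU in practice; the loop is only there to show that the two are the same computation.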
A cool way to understand the FFN is to imagine it as the model’s knowledge base.
A more advanced interpretation is that the FFN acts as a giant, distributed key-value memory.
The first layer (xW_1) acts as a set of learned “keys”: thousands of specialized pattern detectors.
The ReLU activation determines which of these patterns are relevant to the current token.
The second layer (W_2) then outputs the “values”: the information associated with those activated patterns.
From this perspective, the FFN is not just processing; it’s recalling information. This fits with the fact that the FFN layers hold the majority of each block’s parameters, about two-thirds with the usual 4x expansion. They store a huge amount of the model’s knowledge about grammar, facts, and concepts.
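The key-value reading is just a different way of writing the same matrix math. The sketch below assumes a bias-free FFN, treats column i of W_1 as “key” i and row i of W_2 as “value” i, and checks that the memory-style loop matches the usual formula. The parameter count at the end ignores biases and LayerNorm and assumes a 4x expansion.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
w1 = rng.standard_normal((d_model, d_ff))   # column i is "key" i
w2 = rng.standard_normal((d_ff, d_model))   # row i is "value" i
x = rng.standard_normal(d_model)            # one token vector

# Standard FFN formula
standard = np.maximum(0.0, x @ w1) @ w2

# Memory view: each key fires with some strength, and the output is the
# strength-weighted sum of the corresponding values
memory = np.zeros(d_model)
for i in range(d_ff):
    match = max(0.0, float(x @ w1[:, i]))   # ReLU keeps only keys that match
    memory += match * w2[i]

print(np.allclose(standard, memory))

# Rough parameter count per block for d_model = 512, d_ff = 4 * d_model
d = 512
attention_params = 4 * d * d       # query, key, value, and output projections
ffn_params = 2 * d * (4 * d)       # W_1 and W_2
ffn_share = ffn_params / (attention_params + ffn_params)
print(f"FFN share of block weights: {ffn_share:.2f}")
```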
Now that we have all the building blocks ready, let’s combine them.
Putting it all together
We now have all the pieces: Multi-Head Attention for communication, the FFN for computation, and the Add & Norm layers for stabilization. Let’s assemble them into one complete block, tracing the journey of our data. We will use the modern and more stable Pre-LayerNorm configuration.
Here’s the step-by-step dataflow for a single Transformer block:
Input (X): A sequence of token vectors enters the block, either from the embedding layer or from the previous Transformer block.
Sub-Layer 1: MHA with Stabilization
Step A (Norm): We first normalize the input using LayerNorm.
Step B (Attention): The normalized input goes through MHA, where tokens gather context from each other.
Step C (Residual): The original input (pre-normalization) is added to the attention output.
Sub-Layer 2: FFN with Stabilization
Step A (Norm): We take the output of the first sub-layer and normalize it again.
Step B (Processing): The normalized data passes through the FFN, where each token “thinks” individually.
Step C (Residual): The input to this sub-layer is added to the FFN’s output, preserving the signal.
Output: A new sequence of vectors emerges, now richer and more context-aware, ready to be fed into the next identical Transformer Block.
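The dataflow above can be sketched end to end in NumPy. Treat this as a minimal illustration under stated assumptions, not a production implementation: weights are small random placeholders, there is no attention mask (a GPT-style model would add a causal one), no dropout, no batch dimension, and the parameter names in the dictionary are invented for this example.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, p, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ p["wq"]), split(x @ p["wk"]), split(x @ p["wv"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, seq, seq)
    context = softmax(scores) @ v                          # (n_heads, seq, d_head)
    merged = context.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ p["wo"]

def ffn(x, p):
    return np.maximum(0.0, x @ p["w1"] + p["b1"]) @ p["w2"] + p["b2"]

def transformer_block(x, p, n_heads):
    # Sub-layer 1: norm -> attention -> add the un-normalized input back
    x = x + multi_head_attention(layer_norm(x, p["gamma1"], p["beta1"]), p, n_heads)
    # Sub-layer 2: norm -> FFN -> add the sub-layer's input back
    x = x + ffn(layer_norm(x, p["gamma2"], p["beta2"]), p)
    return x

def init_params(rng, d_model, d_ff):
    """Small random weights; LayerNorm starts as gamma = 1, beta = 0."""
    def w(*shape):
        return rng.standard_normal(shape) * 0.02
    return {
        "wq": w(d_model, d_model), "wk": w(d_model, d_model),
        "wv": w(d_model, d_model), "wo": w(d_model, d_model),
        "w1": w(d_model, d_ff), "b1": np.zeros(d_ff),
        "w2": w(d_ff, d_model), "b2": np.zeros(d_model),
        "gamma1": np.ones(d_model), "beta1": np.zeros(d_model),
        "gamma2": np.ones(d_model), "beta2": np.zeros(d_model),
    }

rng = np.random.default_rng(0)
d_model, d_ff, n_heads, seq_len = 16, 64, 4, 5
params = init_params(rng, d_model, d_ff)
x = rng.standard_normal((seq_len, d_model))
y = transformer_block(x, params, n_heads)
print(y.shape)   # (5, 16): same shape in, same shape out
```

Because the output has the same shape as the input, the result can be passed directly into another block with its own parameters.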
Stacking one upon another
A single Transformer Block is amazing, but true magic happens when we stack them. The original Transformer used 6 encoder blocks. Modern models use dozens.
With the stabilizers (residual connections and layer norm) in place, we can safely stack these blocks very deep. The model is performing iterative refinement on a single, evolving representation, getting closer to the “truth” with each step.
A complete learning machine
What we’ve built is more than just a component; it’s a complete, self-contained learning machine. The Transformer Block masterfully combines:
Communication (Multi-Head Attention)
Computation (Feed-Forward Network)
Stability (Residual Connections & LayerNorm)
This block is the fundamental atom of modern AI. Every GPT, BERT, and T5 model is, at its core, just a tall tower of these identical blocks stacked one upon the other. Each block refines the understanding of the last, building a hierarchy of meaning from simple grammar to meaningful semantics.
So…what’s next?
Well, we have built the machine. But it’s still not working, is it?
The architecture is complete, but the intelligence is absent.
So. How do we breathe life into this beautiful machine? How do we teach these matrices and non-linearities to understand the meaning of a poem, or the logic of code?
In our next issue, we will cover just this.
We will uncover The Training Game, where we’ll learn how the simple and powerful objective of “predict the next word,” a child’s game in essence, is all it takes to awaken this architecture into a mind.
The machine is built. Now, let the learning begin.
See you in the next issue. Goodbye 👋👋




