LLM4N7: Upgrading to Multi-Head Attention
Why one attention head isn’t enough, how to combine multiple heads, and how to build a panel of experts
Hey everyone, and welcome back to LLM4N: LLM For Noobs!
A 16-part series designed to take you from zero to one in understanding Large Language Models.
We made a lot of progress in our last issue. We built the engine at the heart of the modern AI revolution: the self-attention mechanism. With a single, elegant equation, we created a system that lets any word in a sentence directly look at and learn from any other word, finally solving the long-range dependency problem that troubled RNNs and LSTMs.
But, as we hinted at the end of LLM4N6, our powerful new engine has a subtle but critical limitation. By using only one set of Query, Key, and Value matrices, it’s forced to learn a single, “averaged” attention pattern for every task. It has to blend grammatical relationships, semantic similarities, and pronoun references all into one compromised representation. It’s a jack-of-all-trades, but a master of none.
This leads us to the core question for today.
How can we allow our model to specialize?
How can we let it discover and focus on different types of linguistic relationships at the same time?
Let’s solve this puzzle together.
Panel of Experts
The solution to this “averaging” problem is wonderfully intuitive. If one generalist isn’t enough, why not assemble a panel of specialists? This is the core idea behind Multi-Head Attention (MHA).
Instead of one attention mechanism trying to do everything, we’ll run multiple self-attention processes in parallel, each with its own unique perspective. Imagine we’re analyzing the sentence:
“The animal didn’t cross the street because it was too tired”.
Our panel of experts would get to work:
The Grammarian Head: This specialist focuses only on syntax. It immediately identifies that “tired” is an adjective describing the subject “it” and is linked by the verb “was”. It cares about the sentence’s structure.
The Reference Resolver Head: This expert’s only job is to link pronouns to what they refer to (their antecedents). It scans the sentence and draws a strong attention line directly from “it” back to “animal,” solving the ambiguity.
The Semantic Head: This head looks for related concepts. It might connect “tired” to “animal” as a state or focus on the local relationship between “too” and “tired”.
And here’s the most beautiful part of this whole setup: we don’t assign these roles. We don’t tell one head, “You will be the grammarian,” and another, “You will resolve pronouns.”
This specialization is not hard-coded by us. It emerges naturally from the math. By initializing multiple distinct sets of W_Q, W_K, W_V matrices with random values, we give each head a unique starting point: a different perspective in the high-dimensional space.
The training process, through gradient descent, then sculpts each of these perspectives. It tends to find that the global loss is minimized most effectively not if all heads do the same thing, but if they diversify their labor. One set of matrices may evolve to be exquisitely sensitive to syntactic patterns, while another evolves toward pronoun resolution. The model self-organizes into a team of specialists because dividing the work lowers the loss more than duplicating it, though the roles real heads learn are usually less tidy than our three named experts.
The mechanism: From one head to many
So, how do we actually build this panel of experts?
We do it by extending the self-attention architecture we already know in a few logical steps.
Step 1: Creating multiple perspectives
In single-head attention, we had one set of learned weight matrices to transform our input. To create h different heads (e.g., 8 in the original Transformer), we simply create h independent sets of these matrices.
Each set of matrices learns to project our input embeddings into a different, specialized representational subspace. Think of it like this: one set of matrices learns to project the sentence into a “view” where grammatical relationships are most obvious. Another set learns a “view” where pronoun references are clearest. These initial projections are the key: they give each head its unique world-view, enabling it to specialize.
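Here is a minimal NumPy sketch of Step 1, just to make the shapes concrete. The sizes (seq_len, d_model, h) are toy values I picked for illustration, the input X is random noise standing in for real embeddings, and the per-head width d_k = d_model / h follows the convention of the original Transformer. Nothing is trained here; we only create the h independent sets of matrices and apply them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, chosen only for illustration.
seq_len, d_model, h = 5, 16, 4
d_k = d_model // h  # per-head width (original Transformer convention)

# Random stand-in for the input embeddings, one row per token.
X = rng.normal(size=(seq_len, d_model))

# One independent, randomly initialized (W_Q, W_K, W_V) set per head.
heads = [
    {name: rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
     for name in ("W_Q", "W_K", "W_V")}
    for _ in range(h)
]

# Every head projects the SAME input into its own subspace.
projections = [
    (X @ p["W_Q"], X @ p["W_K"], X @ p["W_V"]) for p in heads
]

for i, (Q, K, V) in enumerate(projections):
    print(f"head {i}: Q {Q.shape}, K {K.shape}, V {V.shape}")
```

Each head sees the same five tokens, but through its own randomly initialized lens. Training is what would later turn those random lenses into useful ones.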
Step 2: Parallel Self-Attention
Once each head has its own specialized Query, Key, and Value matrices, they all get to work. Each of the h heads performs the exact same self-attention calculation we learned in LLM4N6, completely independently of the others and all at the same time.
This is a brilliant design because it’s perfect for modern GPUs, which excel at running many parallel computations at once. The result isn’t one context matrix (Z), but h different context matrices. Each matrix represents the “opinion” or “analysis report” from one of our specialists.
Visually, you can imagine our input matrix flowing into the system and splitting into h parallel streams. Each stream goes through its own self-attention block, and at the end, h distinct output streams emerge, each carrying a different perspective on the sentence’s meaning.
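To see the parallel streams in code, here is a small sketch that runs h heads over the same input. I’m assuming the single-head calculation is the usual scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V; the helper names (softmax, attention_head) and the toy sizes are mine, and the weights are random rather than trained. The Python loop is only for readability: real implementations batch all heads into one tensor operation so the GPU handles them together.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    # Scaled dot-product self-attention for a single head.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, h = 5, 16, 4  # hypothetical toy sizes
d_k = d_model // h
X = rng.normal(size=(seq_len, d_model))  # stand-in for embeddings

Z_heads, A_heads = [], []
for head in range(h):
    # Each head gets its own randomly initialized projection matrices.
    W_Q, W_K, W_V = (
        rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
        for _ in range(3)
    )
    Z, A = attention_head(X, W_Q, W_K, W_V)
    Z_heads.append(Z)  # this head's context matrix ("report")
    A_heads.append(A)  # this head's attention pattern

print(len(Z_heads), "context matrices, each of shape", Z_heads[0].shape)
```

Because every head starts from different weights, each one produces a different attention pattern over the same sentence, which is exactly the raw material that training later shapes into specialization.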
Combining the Experts’ findings
We’ve created our panel and they’ve all submitted their reports. This leads to a new, practical problem: we now have h different output matrices, but the next layer of our model expects a single, unified input.
How do we combine all these expert opinions into one coherent understanding?
The solution is a beautiful two-step process: Concatenation & Linear Projection.
First, we take all the output matrices from our heads (Z_1 through Z_h) and simply place them side-by-side into one big matrix.
Why?
This preserves the unique information from each head. No information has been mixed or lost yet.
Second, we take this large concatenated matrix and multiply it by one more learned weight matrix, which we’ll call W_O.
Why?
The final linear projection does two critical jobs at once. First, it returns the result to the model’s standard working dimension (d_model). If every head worked at full width, this would be a genuine compression of the wide, concatenated matrix. In the original Transformer, each head works in d_model / h dimensions, so the concatenation is already d_model wide and W_O is purely a mixing step. Second, and more importantly, it’s an intelligent synthesis. The W_O matrix doesn’t just resize things; it learns how to fuse the disparate perspectives. It figures out how to take a crucial pronoun link from head 2, a key grammatical signal from head 5, and a semantic connection from head 7, and merge them into a single, coherent vector that captures all of these relationships simultaneously. It’s the model learning how to build a consensus.
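The two steps, concatenation and linear projection, can be sketched in a few lines of NumPy. The head outputs below are random placeholders for Z_1 through Z_h, the sizes are toy values of my choosing, and W_O is randomly initialized here, whereas in a real model it is learned along with everything else.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, h = 5, 16, 4  # hypothetical toy sizes
d_k = d_model // h              # per-head width (original Transformer convention)

# Random placeholders for the h head outputs Z_1 ... Z_h.
Z_heads = [rng.normal(size=(seq_len, d_k)) for _ in range(h)]

# Step 1: concatenation. Heads sit side by side; nothing is mixed yet.
Z_concat = np.concatenate(Z_heads, axis=-1)  # (seq_len, h * d_k)

# Step 2: linear projection with W_O (learned in a real model).
W_O = rng.normal(scale=(h * d_k) ** -0.5, size=(h * d_k, d_model))
output = Z_concat @ W_O                      # (seq_len, d_model)

print("concatenated:", Z_concat.shape, "-> output:", output.shape)
```

Notice that the first d_k columns of Z_concat are still exactly head 1’s report. Only the multiplication by W_O lets information from different heads mix, since every output column is a weighted sum over all heads’ columns.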
Why this all matters
Let’s return to our sentence:
“The animal didn’t cross the street because it was too tired.”
A single-head model would struggle, its attention diluted between different tasks. But our multi-head model excels.
One head, our Reference Resolver, can dedicate its entire capacity to creating a strong attention score between “it” and “animal”.
Another head, maybe a Semantic one, can focus on the connection between the state of being “tired” and the “animal” itself.
The final output for the word “it” is no longer a single, compromised vector. It’s a rich, synthesized representation created by the final projection layer (W_O), which has learned to blend the definitive reference link from one head with the semantic state link from another. The model’s understanding becomes far more robust and nuanced.
The key takeaway is this: Multi-Head Attention is not just about having “more attention.” It’s about having diversified attention. This ability to look for many different kinds of patterns in parallel is a cornerstone of what allows Transformers to model the incredible complexity of human language.
What’s next? Assembling the Full Transformer Block
We’ve done it.
We’ve upgraded our self-attention engine to a powerful, multi-perspective mechanism that can generate a deeply contextualized representation of a sentence.
So, is this the whole story? Almost, but not quite. We’ve built the core miracle, the multi-head attention mechanism that lets words talk to each other. But if we just stacked these layers on top of each other, the model would be unstable and lack a certain kind of computational depth. We have a powerful way to communicate, but we’re missing the systems that make this communication stable and productive over many layers.
This forces us to tackle two key architectural questions:
Stability: How do we stack these attention layers without the signal degrading into noise?
Computation: After gathering context from others, how does each word individually “process” that new understanding?
That’s our next step.
In the next issue, we will construct the complete Transformer Block. We’ll add the architectural “scaffolding”: Residual Connections & Layer Normalization. And then we’ll give our model a chance to ‘think’ on its own with the Position-wise Feed-Forward Network.
The grand finale is next. See you in the next issue.





