LLM4N6: How self-attention works
What Q, K, and V mean geometrically, the WHY behind the formula, and tackling tricky questions
Hey everyone, and welcome back to LLM4N: LLM For Noobs!
A 16-part series designed to take you from zero to one in understanding Large Language Models.
What a journey it’s been. In our last issue, we left off with the stage perfectly set for the Transformer. We’d figured out how to prepare our data, turning raw text into a sequence of vectors—each one infused with its core meaning (through Token Embeddings) and a precise sense of its place in the sequence (through Positional Encodings).
We had our actors on stage, dressed and in their starting positions.
But a stage full of actors isn’t a play. For that, they need to interact. They need to understand their relationships, their motivations, and how they influence one another. This is the monumental leap the Transformer makes, and it all rests on answering one extremely simple question:
When trying to understand a single word, which other words in the sentence matter most?
Let’s take an example. Take the sentence: “The animal didn’t cross the street because it was too tired.”
You and I instantly know that “it” refers to the “animal,” not the “street.” But for a model that looks at all words simultaneously, this is a monumental task. There’s no step-by-step reading order to guide it. Beyond the position tag baked into each vector, the word “it” is just sitting there, surrounded by other words, with no running memory to tell it what came before.
This is the profound new problem created by the Transformer’s brilliant, parallel design. And the solution is the engine at its core: self-attention.
In this issue of LLM4N, we’re going to dissect this engine from first principles. We won’t just describe what it does; we’ll uncover the why behind every single step. You’ll learn what a Query, Key, and Value vector truly represent, why we need to scale our calculations, and how a series of elegant, justified operations allows a word to dynamically absorb the context of its peers.
Let’s begin.
From Static Vectors to Dynamic Context
Let’s pick up right where we left off in LLM4N5. Our starting point is the matrix X, a neat list of vectors we prepared last time. Each vector is the sum of a token’s core identity (its embedding) and its precise location in the sequence (its positional encoding).
These vectors are informative, but they’re also isolated.
Think of them as a group of experts who have just walked into a conference room. Each one has a name tag and a fixed seat, but they haven’t started talking yet. The expert named “it” has the same resume, whether the meeting is about a tired animal or a complex legal clause. It sits there, completely unaware of the other experts in the room and how they might define its role.
Our mission now is to break this isolation. We need to transform these static, self-contained vectors into dynamic, context-aware representations. We want to let every vector in the sequence listen to all the others, figure out who is most relevant to it, and then update itself accordingly.
Self-attention is the algorithm that facilitates this conversation.
It’s the function that takes in our matrix of static vectors and outputs a new matrix of the same size, where each vector has been dynamically refined. For every single word, self-attention runs a simple but powerful calculation: “Based on who I am and where I’m sitting, which of you should I be listening to most closely, and how does your expertise change my own understanding?”
The result is that the static vector for “it” is transformed. It’s no longer just “it”; it becomes “it, as informed by ‘animal’ and ‘tired’.” It has absorbed its context. This is the fundamental shift from a bag-of-words to a network-of-ideas, and it all happens through a sequence of justified, mathematical steps.
I will be breaking this down into 2 phases:
Understanding the “WHY”: giving you the geometry and the intuition behind things
Understanding the “HOW”: the math and the explanation of it
The foundational “WHY”: Query, Key, & Value
So, how does this conversation actually happen? How does the word “it” figure out who to listen to? The answer lies in a beautifully designed mechanism that, at its core, solves a fundamental problem of representation.
Our input is a list of static vectors. The vector for the word “it” is fixed; it contains the general meaning of a pronoun and its position. But this single vector is being asked to do three conflicting jobs at once:
Express its own identity (“I am a pronoun”)
Articulate what it needs (“I need to find what I represent.”)
Provide its own information (“Here is my semantic content for others to use.”)
Forcing one vector to handle all of this is like giving a single person in a startup three full-time roles: the machine learning guy, the devops guy, the data analyst (yes this is a jab at today’s job market :P).
It’s inefficient and leads to a confused outcome.
The Transformer’s elegant solution, introduced in “Attention Is All You Need,” is to split this single responsibility. It projects the one input vector into three separate, specialized representations. This is a fundamental design choice that gives the model immense flexibility. These are the Query, Key, and Value vectors.
Let’s break down each one, not just by what they do, but by why they must exist and what they truly represent.
The Query Vector (Q): The Seeker
What it is?
A projection of a token’s input vector that represents what it is actively looking for in other tokens.
Why we need it?
The word “it” needs to find its reference; the verb “chased” needs to find a subject. The Query vector is this directed question. It’s the part of the word that says, “Given who I am, what am I missing?”
The Geometry of a Question
Imagine our high-dimensional space of word meanings.
Our initial vector for “chased” is just a point representing that action. The transformation that turns this point into a Query isn’t random; it’s performed by a learned weight matrix W_Q. The model learns to multiply our input vector by W_Q (Q = X * W_Q) to produce the Query. Geometrically, this matrix rotates and stretches the vector. Why? Because the model discovers that certain directions in this space correspond to certain types of questions. The W_Q matrix learns to rotate the vector for a verb like “chased” so that its resulting Query vector points strongly in the “subject-seeking” direction.
It turns a static identity (“I am ‘chased’”) into an active probe (“Who is doing the chasing?”).
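To make the “rotates and stretches” picture concrete, here is a minimal sketch in two dimensions. Everything in it is an assumption made for illustration: x_chased is a made-up stand-in for the input vector of “chased,” and W_Q is hand-built (a 90-degree rotation followed by a stretch of 2). A real W_Q is learned during training and is generally not a clean rotation.

```python
import numpy as np

# Hypothetical 2-D input vector standing in for the token "chased".
x_chased = np.array([1.0, 0.0])

# Hand-built stand-in for W_Q: rotate by 90 degrees, then stretch by 2.
# A trained W_Q is learned from data and will not look this tidy.
theta = np.pi / 2
rotation = np.array([[np.cos(theta), np.sin(theta)],
                     [-np.sin(theta), np.cos(theta)]])
W_Q = 2.0 * rotation

# Row-vector convention, matching Q = X * W_Q in the text.
q_chased = x_chased @ W_Q

print("input vector:", x_chased)
print("query vector:", np.round(q_chased, 6))
print("length before:", np.linalg.norm(x_chased), "after:", np.linalg.norm(q_chased))
```

The input pointed along the first axis; the Query points along the second axis and is twice as long. That change of direction is what “pointing toward a type of question” means geometrically.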
The Key Vector (K): The Signpost
What it is?
A projection of a token’s input vector that represents what it is about. It’s a signpost advertising its relevance.
Why we need it?
It acts as a matchmaking profile for other tokens’ Queries. While the Query asks, the Key answers. It’s designed to be found. If the Query is “Who is doing the chasing?”, the Key for “cat” should be a clear “I am a thing that can chase.”
The Geometry of an Answer
Similarly, a second learned matrix, W_K, performs a different transformation (K = X * W_K).
Its job is to take an input vector and re-position it to be easily “discovered” by the right Queries. If Queries from verbs point in the “subject-seeking” direction, the model will learn a W_K that rotates nouns like “cat” to also point in that general direction. When a Query and a Key point in similar directions, their dot product is high.
The Key for “cat” effectively raises a signpost that is perfectly visible to the “subject-seeking” Query from “chased.”
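Here is a small sketch of why alignment matters. The three vectors are hand-picked for illustration (they are not taken from any trained model), and the first axis is simply assumed to play the role of the “subject-seeking” direction.

```python
import numpy as np

# Hypothetical 2-D vectors; axis 0 stands in for the "subject-seeking" direction.
q_chased = np.array([2.0, 0.5])   # Query from "chased"
k_cat = np.array([1.5, 0.2])      # Key for "cat": points mostly along axis 0
k_the = np.array([-0.3, 1.0])     # Key for "the": points somewhere else

score_cat = q_chased @ k_cat      # 2.0*1.5 + 0.5*0.2 = 3.1
score_the = q_chased @ k_the      # 2.0*(-0.3) + 0.5*1.0 = -0.1

print("chased -> cat:", score_cat)
print("chased -> the:", score_the)
```

The Key that points the same way as the Query gets the larger score. Note that the dot product also grows with the lengths of the vectors, not only with how well they align.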
The Value Vector (V): The Substance
What it is?
A projection that contains the actual information the token will contribute if selected.
Why it’s different from the Key?
This is the most crucial distinction. The Key is for matching; the Value is for meaning. The Key is a concise headline; the Value is the full article.
A word might be relevant for a simple reason (its Key matches), but the information it contributes should be its full, rich semantic content (its Value).
The Geometry of Substance
The third matrix, W_V, has a different goal (V = X * W_V). It’s not optimizing for dot products.
It’s learning to filter the initial input vector, preserving and refining the most useful semantic essence. It might strip away noise or positional data to create a purer representation of the word’s core meaning.
The Value vector answers: “If another word pays attention to me, what is the most valuable information I should give it?”
These are learnable
The most important part is that these three specialized vectors (Query, Key, and Value) are not pre-defined. They are generated for our entire input sequence by multiplying our input matrix X by three separate, learned weight matrices:
Q = X * W_Q, K = X * W_K, V = X * W_V
This means the model doesn’t just get three fixed roles; it learns the optimal way to create these roles to best solve its language task. During training, W_Q, W_K, and W_V are updated to become expert projection-makers, learning what “questions” are most useful to ask, what “signposts” are most effective for matching, and what “substance” is most valuable to share. This learnable projection is the foundation of the mechanism’s adaptability and power.
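As a rough sketch, the three projections are just three matrix multiplications. The sizes below are arbitrary choices for illustration, and the weight matrices are random stand-ins: in a real model they would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes. With a single head we keep the projected width equal
# to the input width, so the final output can match the shape of X.
seq_len, d_model = 5, 8
d_k = d_model

# X: one row per token (embedding + positional encoding), as in the text.
X = rng.normal(size=(seq_len, d_model))

# Random stand-ins for the learned matrices W_Q, W_K, W_V.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # what each token is looking for
K = X @ W_K   # what each token advertises
V = X @ W_V   # what each token will contribute

print(Q.shape, K.shape, V.shape)
```

The same X goes in three times and comes out as three different views of the sequence. The only thing that differs between them is the weight matrix.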
Let’s take an example
To make this concrete, let’s walk through an example.
Consider the sentence: “The cat chased the mouse”
For the word “chased”:
Its Query vector, q_chased, is geometrically oriented to ask: “Find me the agent and the object of this action.”
Its Key and Value are produced, ready to be used by other words.
For the word “cat”:
Its Key vector, k_cat, is positioned to be highly discoverable by q_chased’s “agent-seeking” component. Their dot product will be high.
Its Value vector, v_cat, contains the distilled essence of “cat”: its role as a predator, etc. This is unrelated to the matching process; it’s the payload.
The Result: The high score between q_chased and k_cat tells the mechanism to pay a lot of attention to “cat.” But the information it actually pulls is not the Key (k_cat), but the Value (v_cat). The final, updated representation for “chased” becomes a blend that includes the substance of “cat” as its agent. It’s no longer just “chased”; it’s “chased-by-a-cat.”
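The example above can be played out with toy numbers. All vectors here are hand-picked assumptions (a trained model would learn them), the sentence is trimmed to three tokens, and the scores are turned into weights with a plain softmax, leaving out the scaling step covered later.

```python
import numpy as np

tokens = ["cat", "chased", "mouse"]

# Hypothetical Query for "chased": axis 0 seeks an agent, axis 1 seeks an object.
q_chased = np.array([1.0, 1.0])

# Hypothetical Keys, one row per token.
K = np.array([[3.0, 0.0],    # k_cat: advertises "agent"
              [0.0, 0.0],    # k_chased: advertises neither
              [0.0, 2.0]])   # k_mouse: advertises "object"

# Hypothetical Values: the payload, deliberately unrelated to the Keys.
V = np.array([[1.0, 0.0, 0.0],    # v_cat
              [0.0, 1.0, 0.0],    # v_chased
              [0.0, 0.0, 1.0]])   # v_mouse

scores = K @ q_chased                              # query against every key
weights = np.exp(scores) / np.exp(scores).sum()    # plain softmax
z_chased = weights @ V                             # blend of Values, not Keys

for token, weight in zip(tokens, weights):
    print(f"{token:>7}: {weight:.3f}")
print("updated vector for 'chased':", np.round(z_chased, 3))
```

The Keys decide how much each token counts; the Values decide what actually gets mixed in. Here “cat” gets the largest share, “mouse” the second largest, and the output is built entirely from Value vectors.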
The Self-Attention Mechanism: The “How”
We now have our three specialized representations: the Queries (Q), Keys (K), and Values (V). The entire self-attention mechanism, which transforms our static input into a dynamic, context-aware output, is captured in one elegant (and dense) equation:
Attention(Q, K, V) = softmax( (QKᵀ) / √dₖ ) V
Take a moment to admire this equation. THIS is the core of what is running ALL major LLMs today. Let’s unpack it step by step, in the order the operations are computed.
Step 1: The Score Matrix (Why Multiplication?)
The first operation is QKᵀ (Q multiplied by the transpose of K).
The Action
For each token, we take its Query vector and compute the dot product with the Key vector of every token in the sequence, itself included. The result is a raw score matrix with one row and one column per token.
The “Why” → Why dot product?
The dot product is a fundamental measure of vector similarity. A high positive score means the Query’s “question” and the Key’s “answer” are well-aligned. When the Query from “it” (seeking a singular subject) meets the Key from “animal” (advertising one), the score is high. When it meets the Key from “street,” the score is lower. This matrix directly answers the core question for every possible pair: “How relevant is token J to token I?”
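A short sketch of this step, using random stand-ins for Q and K (the sizes are arbitrary assumptions), shows that one matrix multiplication produces every pairwise dot product at once.

```python
import numpy as np

rng = np.random.default_rng(1)

seq_len, d_k = 4, 3                      # hypothetical sizes
Q = rng.normal(size=(seq_len, d_k))      # random stand-in for the Queries
K = rng.normal(size=(seq_len, d_k))      # random stand-in for the Keys

scores = Q @ K.T                         # shape: (seq_len, seq_len)

# Entry [i, j] is the dot product of token i's Query with token j's Key.
i, j = 2, 0
assert np.isclose(scores[i, j], Q[i] @ K[j])

print(scores.shape)
```

Row i of the score matrix holds everything token i “asked,” and column j holds every score that token j’s Key received.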
Step 2: Scaling - Ensuring stable gradients
We then proceed to scale the product obtained.
The Action
We take the score matrix from Step 1 and divide every element by √dₖ, the square root of the dimension of the Key vectors.
The “Why” → Preventing softmax saturation
This step is critical for stable training. As the dimensionality dₖ grows, the dot products can become extremely large in magnitude: if the components of a Query and a Key are independent with mean 0 and variance 1, their dot product is a sum of dₖ such terms and has variance dₖ, so its typical size grows like √dₖ. The softmax function that follows is sensitive to these large values. Very large inputs can “saturate” the softmax, pushing its output toward a one-hot vector (where one value is nearly 1 and the rest are nearly 0). In this state, gradients become vanishingly small, and learning grinds to a halt.
Scaling counteracts this. By dividing by √dₖ, we keep the scores in a range where the softmax function remains sensitive, producing a softer, more nuanced attention distribution and ensuring healthy gradients flow back during training.
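You can see both effects in a quick experiment. This is only a sketch under a simplifying assumption: the Query and Key components are drawn as independent standard-normal numbers, which real trained vectors need not be.

```python
import numpy as np

def softmax(x):
    # Subtracting the max leaves the result unchanged but avoids overflow.
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n_samples = 2_000

# Part 1: the spread of q . k grows like sqrt(d_k); dividing brings it back near 1.
spread = {}
for d_k in (4, 64, 1024):
    q = rng.normal(size=(n_samples, d_k))
    k = rng.normal(size=(n_samples, d_k))
    dots = (q * k).sum(axis=1)            # n_samples independent dot products
    spread[d_k] = (dots.std(), (dots / np.sqrt(d_k)).std())
    print(f"d_k={d_k:5d}  raw std={spread[d_k][0]:6.2f}  scaled std={spread[d_k][1]:.2f}")

# Part 2: one query against 8 keys at d_k = 1024.
d_k = 1024
q = rng.normal(size=d_k)
K = rng.normal(size=(8, d_k))
raw = K @ q
print("largest attention weight, unscaled:", softmax(raw).max())
print("largest attention weight, scaled:  ", softmax(raw / np.sqrt(d_k)).max())
```

The raw spread tracks √dₖ while the scaled spread stays close to 1, and the scaled softmax always gives its top score a smaller share than the unscaled one does. That softer distribution is what keeps the gradients usable.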
Step 3: Softmax - Creating the attention map
We now proceed to apply the softmax function.
The Action
We apply the softmax function to each row of our scaled score matrix: softmax( (QKᵀ) / √dₖ ). The result is the attention weight matrix, often denoted as A.
The “Why” → Clear and Competitive “Attention Budget”
This step achieves two vital things:
Interpretability: It converts the arbitrary scaled scores into a clean probability distribution. Each value is between 0 and 1, and each row sums to 1. We now have a definitive “attention budget” for each token.
Competitiveness: The exponential nature of softmax is competitive. It dramatically amplifies the highest scores and suppresses the lower ones. This forces the model to be decisive, concentrating its focus on the most critical relationships. It transforms the scores into a definitive answer: “For token I, what percentage of its focus should be on token J?”
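Both properties are easy to check on a tiny example. The scores below are made-up numbers, and softmax_rows is a helper name chosen for this sketch; it subtracts each row’s maximum first, a common trick that avoids overflow without changing the result.

```python
import numpy as np

def softmax_rows(scores):
    # Apply softmax independently to each row of a 2-D score matrix.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical scaled scores for three tokens.
scaled_scores = np.array([[2.0, 1.0, 0.0],
                          [0.0, 0.0, 0.0],
                          [4.0, 1.0, 1.0]])

A = softmax_rows(scaled_scores)

print(np.round(A, 3))
print("row sums:", A.sum(axis=1))
```

The first row, with scores 2, 1, 0, turns into weights of roughly 0.67, 0.24, and 0.09: the gaps between scores are stretched by the exponential. The second row, where all scores tie, splits its budget evenly.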
Step 4: The Weighted Sum (AV)
The last step multiplies the attention weights by the Value matrix.
The Action
Finally, we multiply the attention weight matrix A by our Value matrix V: Z = AV. For each token, this is a weighted sum of all the Value vectors in the sequence.
The “Why” → The payoff
This is where information synthesis happens. The output for a token is no longer its original static vector. It is now a context-rich combination. The output vector for “it” is not just “it”; it’s the vector for “it” that is, for example, composed of 85% of the substance of “animal,” 5% of the substance of “tired,” and so on.
We are blending the Value vectors, the actual semantic payload, according to the relevance scores we so carefully calculated. This final output Z is our goal: a sequence of vectors where every single one has been updated based on its relationship with all the others.
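Here is the weighted sum for a single token, written out with toy numbers. The attention weights and the 2-D Value vectors are invented for illustration; only the 85% and 5% figures echo the example in the text.

```python
import numpy as np

tokens = ["animal", "tired", "it"]

# Hypothetical attention weights for the token "it" (one row of A); they sum to 1.
a_it = np.array([0.85, 0.05, 0.10])

# Hypothetical 2-D Value vectors, one row per token.
V = np.array([[1.0, 0.0],    # v_animal
              [0.0, 1.0],    # v_tired
              [0.5, 0.5]])   # v_it

# One row of Z = AV.
z_it = a_it @ V

# The same thing spelled out as an explicit weighted sum:
# 0.85*[1, 0] + 0.05*[0, 1] + 0.10*[0.5, 0.5] = [0.90, 0.10]
z_it_explicit = sum(weight * value for weight, value in zip(a_it, V))

print(np.round(z_it, 3))
```

Doing this for every row of A at once is exactly the matrix product AV.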
And with that, all the pieces click into place, completing our understanding of the full, elegant equation that powers modern LLMs:
Attention(Q, K, V) = softmax( (QKᵀ) / √dₖ ) V
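The four steps can be assembled into one short function. Treat this as a minimal NumPy sketch of single-head self-attention rather than how any particular library implements it: the function names, the sizes, and the random weight matrices are all assumptions made for illustration, and there is no masking or batching.

```python
import numpy as np

def softmax_rows(scores):
    # Row-wise softmax; subtracting the row max avoids overflow.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # Project the input into Queries, Keys, and Values.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]

    scores = Q @ K.T                 # Step 1: score matrix
    scaled = scores / np.sqrt(d_k)   # Step 2: scaling
    A = softmax_rows(scaled)         # Step 3: attention weights
    Z = A @ V                        # Step 4: weighted sum of Values
    return Z, A

rng = np.random.default_rng(42)
seq_len, d_model = 6, 8              # hypothetical sizes

X = rng.normal(size=(seq_len, d_model))
# Random stand-ins for the learned matrices (square, so Z matches X in shape).
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))

Z, A = self_attention(X, W_Q, W_K, W_V)
print(Z.shape, A.shape)
```

Z comes out with the same shape as X, and each row of A is that token’s attention budget over the whole sequence.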
The payoff and what’s next?
Let’s pause and appreciate what we have just built. With a sequence of matrix multiplications, a scaling step, and a softmax function, we have fundamentally solved the two core problems that plagued the Transformer’s predecessors.
The Architectural bottleneck is gone: We have completely removed the sequential chain that made RNNs and LSTMs slow and difficult to scale. We have created a mechanism perfectly suited for modern hardware.
Long-Range dependencies are solved: The path for information to travel from the first word to the last word in a sequence is no longer a long, fragile chain prone to the vanishing gradient. It is now a direct connection. The model can compare any two words in a single step, regardless of their distance.
The output, matrix Z, is the proof. It has the same shape as our input matrix X, but its contents are fundamentally different. We have transformed a list of static definitions into a network of contextualized ideas.
Crucially, the output vector for a word like “it”, z_it, is not simply a copy of the vector for “animal.” It is a new, unique vector: the original “it” vector, intelligently shifted and refined by the semantic substance of the “animal” Value vector. It has absorbed the context it needs to resolve its meaning, all while retaining its core identity as a pronoun.
However, this powerful “single-headed” mechanism has a subtle limitation. With only one set of W_Q, W_K, and W_V matrices, the model must learn a single, “averaged” strategy for attention. It has to compress every possible type of relationship (grammar, semantics, causality, co-reference) into one combined score. It’s a jack-of-all-trades, but a master of none.
This leads to a natural and powerful question: What if we didn’t have to choose?
What if we could run multiple, specialized self-attention operations in parallel? Instead of one general mechanism, we could have several, each with its own set of projection matrices, each learning to focus on a different aspect of language.
One head could learn to specialize in grammatical structure, expertly identifying subjects, verbs, and objects.
Another head could become a reference resolver, focusing almost entirely on linking pronouns like “it” and “they” to their antecedents.
A third could act as a semantic analyst, connecting words with similar meanings across the sentence.
This is the intuition behind Multi-Head Attention. It allows the model to develop a panel of specialists, each examining the same sentence from a different, learned perspective.
But this creates a new, exciting engineering challenge: once we have these multiple, specialized outputs, how do we combine them into a single, coherent representation that the next layer can use? How do we synthesize the insights of our panel of experts?
Assembling this more powerful engine is our next step. In the next issue, we will see how the Transformer elegantly solves this problem and stacks these layers to build a truly deep understanding of language.
Stay tuned for the next issue of LLM4N!







