LLM4N5: Embeddings and Positional Encoding
Setting the stage for Transformers: how and why positional encoding works, and what's next
Hey everyone, and welcome back to LLM4N: LLM For Noobs!
A 16-part series designed to take you from zero to one in understanding Large Language Models.
What a journey it has been so far. We started with the simple act of counting words with N-grams, learning the statistical blueprint of language. From there, we gave words meaning with Word2Vec, mapping them to vector space. Then, we tackled sequence. We gave our models a form of memory with Recurrent Neural Networks (RNNs). When their memory was too short-term, we engineered a better one with the brilliant gated system of Long Short-Term Memory (LSTM) networks.
And just last time, in LLM4N4, we solved a fundamental groundwork problem: how to convert messy, real-world text into a clean, numerical stream our models can digest. We learned about tokenization methods like Byte-Pair Encoding (BPE), which gracefully handles unknown words and keeps our vocabulary manageable.
So, let’s track this. We’ve solved meaning (embeddings), and memory (LSTMs), and preprocessing (tokenization). It seems like we’ve built our perfect model, right?
Not quite.
Because for all their cleverness, our models up to this point, especially our champion, the LSTM, share a fundamental, speed-limiting flaw: they are inherently sequential. They process text exactly like we do: one word at a time, in a strict, unbreakable order. To understand the 100th word in an essay, an LSTM must patiently churn through all 99 words that came before it. This creates a massive computational bottleneck, preventing us from leveraging the full parallel power of modern GPUs.
Even with our perfect tokens in hand, we’re still feeding them through a model that reads like a slow, deliberate scroll. To reach the next level, we needed to smash this sequential bottleneck. We needed a new architecture.
PS: We will be referencing the original “Attention Is All You Need” paper for this post.
The Transformer
Enter the Transformer.
Wait wait, not this one. It’s this one.
(It looks complex, but we will be breaking it down step by step in upcoming issues.)
Its core innovation is a radical departure from everything we’ve seen. It throws out the sequential handbook. Instead of processing a sentence word-by-word, the Transformer is designed to look at all the tokens at once.
This is a total game-changer for speed and scalability. But this brilliant idea, ditching the one-by-one sequence, immediately creates two brand-new problems we have to solve before the main show can begin:
The “What does it mean” Problem: If all tokens arrive simultaneously, how does the model know the inherent meaning of each one?
The “Where is it?” Problem: Without an inherent order, how does the model know if the “dog” bit the “man” or the “man” bit the “dog”?
Think of it like this: our tokens are a group of dancers. We’ve already given them their identities (via tokenization). But to perform, they need to get on stage. The Transformer’s stage, however, has a unique rule: all dancers must appear at the exact same time.
So, to make this work, we need to give each dancer two things as they step onto the stage:
A costume that tells everyone who they are (their semantic meaning)
A numbered spot on the floor that tells everyone where they stand (their position in the sequence)
Our job in this issue is to prepare the Transformer’s stage. Let’s get our dancers ready.
Part 1: The Costume: Token Embeddings
Let’s start with the easy part: the costume. This is the “what does it mean?” problem, and luckily, we’ve already met the solution back in LLM4N2. The idea is to convert each token ID (the output of our tokenizer from LLM4N4) into a rich, dense vector that captures its semantic meaning.
While Word2Vec was a separate model we trained beforehand, modern architectures integrate this process directly into the model itself with a Trainable Embedding Layer.
Think of it as a giant, learnable lookup table.
Input: A token ID (e.g., the number 837, which our tokenizer mapped to the token “cat”)
The Table: A huge matrix where the number of rows is our vocabulary size (e.g., 50,000 tokens) and the number of columns is our desired vector dimension, d_model (e.g., 512 for the original Transformer).
Output: The layer simply grabs the 837th row from the matrix. That 512-dimensional vector is the initial “costume” for our token.
The most powerful part is that this embedding matrix is trainable.
It starts as random numbers, but as the model learns, the gradients flow all the way back and update these vectors. The model learns the most useful “costumes” or meanings for each token to solve its specific task, whether that’s translation or question-answering.
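To make the lookup-table idea concrete, here is a minimal NumPy sketch. The sizes come from the examples above; the random initialization scale (0.02) and the helper name embed are assumptions for illustration, and there is no training step here, so the vectors stay random.

```python
import numpy as np

# Sizes taken from the examples in the text.
VOCAB_SIZE = 50_000
D_MODEL = 512

rng = np.random.default_rng(seed=0)

# The "lookup table": one row per token ID, randomly initialized.
# The 0.02 scale is an illustrative assumption. In a real model these
# values would be updated by gradients during training.
embedding_matrix = rng.standard_normal(size=(VOCAB_SIZE, D_MODEL), dtype=np.float32)
embedding_matrix *= 0.02


def embed(token_ids):
    """Return one D_MODEL-dimensional row per token ID (hypothetical helper)."""
    return embedding_matrix[np.asarray(token_ids)]


cat_vector = embed([837])[0]  # the row for token ID 837, "cat" in our example
print(cat_vector.shape)       # (512,)
```

Notice that nothing is computed here: embedding is pure indexing, which is why it is so cheap.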
Okay, each of our dancers now has their costume. But they are all jumbled up in the middle of the stage. We need to give them their positions.
Part 2: The Order: Positional Encoding
Order is everything in language. “Man bites dog” and “Dog bites man” use the exact same words, but the order changes the meaning entirely. A simple bag of words isn’t enough. Our dancers need their numbered spots on the stage, or the choreography becomes meaningless.
The Transformer has no built-in sense of recurrence or position, so we have to inject this information ourselves. This is done via Positional Encoding.
Most naive approach
Your first instinct might be to just assign a number to each position: the first word is 1, the second is 2, and so on. Let’s see why that’s a bad idea:
Just using the index (1, 2, 3...): What happens with long sentences? The positions could grow into very large numbers (500, 1000, etc.). The token embedding values are usually small, initialized in a range like [-1, 1]. Adding a huge number like 500 would completely drown out the semantic meaning of the embedding. The model would pay more attention to the position than the word’s meaning.
Normalize the index (0 to 1): Okay, let’s fix the scale issue by dividing each position by the sentence length. So in a 10-word sentence, position 5 becomes 5/10 = 0.5. This also fails, but more subtly. Now, the positional value is inconsistent. The 5th word gets a value of 0.5 in a 10-word sentence but 0.1 in a 50-word sentence. The model can never learn a stable meaning for what “position 5” represents.
We need something better. We need a system that is consistent regardless of sentence length, bounded within a small range, and gives each position a unique signature.
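Both failure modes are easy to see with a few lines of NumPy. This is a rough sketch: the token embedding is a random stand-in drawn from [-1, 1], and normalized_position is a hypothetical helper, not something from the paper.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
d_model = 512

# A stand-in token embedding with small values in [-1, 1].
token_embedding = rng.uniform(-1.0, 1.0, size=d_model)

# Naive approach 1: add the raw index (here 500) to every dimension.
raw_position_vector = np.full(d_model, 500.0)
norm_ratio = np.linalg.norm(raw_position_vector) / np.linalg.norm(token_embedding)
print(norm_ratio > 100)  # True: the position signal dwarfs the meaning signal


# Naive approach 2: divide the index by the sentence length.
def normalized_position(index, sentence_length):
    return index / sentence_length


print(normalized_position(5, 10))  # 0.5
print(normalized_position(5, 50))  # 0.1  (same position, different value)
```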
Let’s ride the sine wave!
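If you look closely, sine waves get us most of the way there: they stay bounded between -1 and 1, and their value at a given position does not depend on sentence length. Combine several of them at different frequencies and each position gets a unique signature too!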
The authors of the “Attention Is All You Need” paper proposed a brilliant solution using sine and cosine functions. Don’t be scared by the math; the intuition behind it is beautiful.
Here are the equations that generate the positional encoding vector for a token at position pos and dimension pair i:
PE(pos, 2i) = sin(pos / 10000^{2i/d_model})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})
Let’s break down the math:
pos: This is the position of the word in the sentence (0, 1, 2, 3, …)
i: This is the index of the dimension pair within the output vector (0, 1, 2, … up to d_model/2 - 1). We deal with dimensions in pairs (even 2i and odd 2i+1).
sin and cos: Using both sine and cosine is a clever trick. The key reason is that for any fixed offset k, PE(pos+k) can be represented as a linear transformation (a rotation) of PE(pos). This means the model doesn’t have to learn the position of every word from scratch; it can easily learn the concept of relative positions. The geometric distance between position 5 and 7 is the same as between 10 and 12, making it easy for the attention mechanism to learn rules like “look two words to the left.”
The 10000^{…} term: This is the real magic. This term creates waves of different frequencies for different dimension pairs i.
For small i (the first few dimensions of the vector), the wavelength is short, and the values oscillate very quickly. This is like the second hand on a clock.
For a large i (the last few dimensions), the wavelength is very long, and the values change very slowly over many positions. This is like the hour hand on a clock.
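For the curious, the rotation claim can be written out for a single sine/cosine pair. The shorthand \omega_i for the frequency of pair i is notation introduced here, not the paper's; the rest follows from the angle-addition identities.

```latex
% Shorthand (an assumption of this sketch): \omega_i = 1 / 10000^{2i/d_{model}}
\begin{pmatrix}
  \sin\big(\omega_i (pos + k)\big) \\
  \cos\big(\omega_i (pos + k)\big)
\end{pmatrix}
=
\begin{pmatrix}
  \cos(\omega_i k) & \sin(\omega_i k) \\
  -\sin(\omega_i k) & \cos(\omega_i k)
\end{pmatrix}
\begin{pmatrix}
  \sin(\omega_i \, pos) \\
  \cos(\omega_i \, pos)
\end{pmatrix}
```

The matrix depends only on the offset k and not on pos, which is what makes “two steps to the right” look the same everywhere in the sentence.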
By combining these fast and slow frequencies, every single position gets a unique numerical signature, or “timestamp,” across its 512 dimensions. If you were to visualize a heatmap of these positional vectors, you’d see beautiful, cascading wave patterns, giving the model a rich, unique map of the sequence’s order.
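Here is one way the sinusoidal encoding could be generated with NumPy. Treat it as a sketch: the function name and the requirement that d_model be even are my assumptions, and real libraries may organize this differently.

```python
import numpy as np


def sinusoidal_positional_encoding(seq_len, d_model=512):
    """Build a [seq_len, d_model] matrix of sinusoidal positional encodings."""
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    positions = np.arange(seq_len)[:, np.newaxis]        # shape [seq_len, 1]
    pair_index = np.arange(d_model // 2)[np.newaxis, :]  # i = 0 .. d_model/2 - 1
    # Each dimension pair gets its own frequency: fast for small i, slow for large i.
    angle_rates = 1.0 / np.power(10000.0, 2 * pair_index / d_model)
    angles = positions * angle_rates                     # shape [seq_len, d_model/2]

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)  # odd dimensions 2i+1
    return pe


pe = sinusoidal_positional_encoding(seq_len=50)
print(pe.shape)                       # (50, 512)
print(pe.min() >= -1, pe.max() <= 1)  # True True

# The distance between two positions depends only on their offset.
dist_5_7 = np.linalg.norm(pe[7] - pe[5])
dist_10_12 = np.linalg.norm(pe[12] - pe[10])
print(np.isclose(dist_5_7, dist_10_12))  # True
```

If you plot pe as a heatmap (positions on one axis, dimensions on the other), you should see the cascading wave pattern described above.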
Putting it All Together
So, we finally have our “costumes” (Token Embeddings) and our “positions” (Positional Encodings). Both are vectors of the same size d_model (e.g., 512 dimensions). The final step is to combine them.
The method chosen in the paper is a simple, element-wise addition:
Final_Input = Token_Embedding + Positional_Encoding
A question that should immediately pop into your head is, “Why add? Wouldn’t stacking them side-by-side (concatenation) be safer? That way, you keep the meaning and position signals separate.”
It’s a brilliant question, and the answer reveals a core principle of the Transformer’s architecture: elegance and efficiency.
Let’s consider the alternative. If we concatenate the two 512-dimensional vectors, we would get a single 1024-dimensional vector. This immediately creates a problem. The entire Transformer architecture, from the self-attention mechanism to the feed-forward networks, is built to expect a single, consistent dimension—d_model—flowing through all its layers. This consistency is vital for one of the Transformer’s most important tricks: residual connections, where the input of a layer is added to its output. You can’t add a 512-dim vector to a 1024-dim vector.
To fix the dimension mismatch from concatenation, we would have to add an extra linear projection layer (basically a single dense layer) right at the beginning, just to shrink the 1024-dimensional vector back down to 512. This adds complexity, introduces more trainable parameters, and is computationally less efficient.
Addition, on the other hand, is beautiful.
It’s parameter-free. It requires no extra weights or layers.
It preserves dimensionality. The output is the exact same shape as the inputs, fitting perfectly into the downstream architecture.
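A quick shape check makes the trade-off tangible. The vectors below are random placeholders, and the projection is a hypothetical bias-free linear layer, used only to count how many weights it would need.

```python
import numpy as np

d_model = 512
rng = np.random.default_rng(seed=0)

# Placeholder vectors standing in for one token's embedding and its position.
token_embedding = rng.normal(size=d_model)
positional_encoding = rng.normal(size=d_model)

added = token_embedding + positional_encoding
concatenated = np.concatenate([token_embedding, positional_encoding])

# Weights a hypothetical bias-free projection would need to map 1024 -> 512.
projection_weights = concatenated.shape[0] * d_model

print(added.shape)         # (512,)
print(concatenated.shape)  # (1024,)
print(projection_weights)  # 524288
```

So concatenation would cost roughly half a million extra parameters before the first real layer, while addition costs none.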
Conceptually, you can think of the addition as treating the positional signal as a “nudge” in the high-dimensional meaning space. We start with a vector that represents the token’s meaning. The positional encoding then shifts that vector to a unique point in the space that now represents both the original meaning and its specific position in the sentence.
The bet the Transformer makes is that its subsequent layers are powerful enough to learn to disentangle these two signals. It can learn that certain “directions” in this space correspond to position, while others correspond to semantics.
This disentanglement is mathematically plausible because in high-dimensional spaces, like our 512-dimensional embedding space, random vectors tend to be nearly orthogonal to each other. Think of it like pointing in different directions in a vast, 512-dimensional universe. The token embedding and positional encoding vectors, being initialized differently, likely point in distinct directions. This means that when we add them, we’re not just creating a messy soup of information; we’re creating a new vector that preserves much of the original directional relationships, allowing the model to learn weight transformations that can effectively separate the positional ‘signal’ from the semantic ‘signal’.
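You can check the near-orthogonality claim numerically. The sketch below uses purely random Gaussian vectors, which is an assumption: learned embeddings and sinusoidal encodings are not random, so take this as intuition rather than proof. The helper name mean_abs_cosine is made up for this example.

```python
import numpy as np


def mean_abs_cosine(dim, n_pairs=1000, seed=0):
    """Average |cosine similarity| between random Gaussian vector pairs."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n_pairs, dim))
    b = rng.normal(size=(n_pairs, dim))
    cosine = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return float(np.mean(np.abs(cosine)))


# A cosine near 0 means the two vectors are nearly orthogonal.
for dim in (2, 32, 512):
    print(dim, round(mean_abs_cosine(dim), 3))
```

The printed value should shrink as the dimension grows, which is the effect the paragraph above relies on.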
It’s an elegant solution that avoids architectural clutter and relies on the learning capacity of the model itself.
Let’s quickly trace a sentence through this pipeline: “LLMs are powerful.”
Tokenize: ['[CLS]', 'LLM', '##s', 'are', 'power', '##ful', '.', '[SEP]']
Get Token IDs: [101, 15720, 2043, 2024, 2393, 16541, 1012, 102]
Get Token Embeddings: Look up the vector for each ID. We get a matrix of size [8, 512].
Get Positional Encodings: Generate the unique sinusoidal vectors for positions 0 through 7. We get another matrix of size [8, 512].
Add Them: The two matrices are added together, resulting in our final input tensor, shaped [8, 512].
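The same trace can be sketched end to end in NumPy. The token IDs are the ones listed above; the vocabulary size (20,000, just large enough to contain those IDs), the random untrained embedding matrix, and the function name are assumptions for illustration.

```python
import numpy as np

D_MODEL = 512
VOCAB_SIZE = 20_000  # hypothetical: just large enough for the IDs below

# Token IDs from the walkthrough above.
token_ids = np.array([101, 15720, 2043, 2024, 2393, 16541, 1012, 102])

# Untrained, randomly initialized embedding table (stand-in for a learned one).
rng = np.random.default_rng(seed=0)
embedding_matrix = rng.standard_normal(size=(VOCAB_SIZE, D_MODEL), dtype=np.float32)


def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: sin on even dimensions, cos on odd dimensions."""
    positions = np.arange(seq_len)[:, np.newaxis]
    pair_index = np.arange(d_model // 2)[np.newaxis, :]
    angles = positions / np.power(10000.0, 2 * pair_index / d_model)
    pe = np.zeros((seq_len, d_model), dtype=np.float32)
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe


token_embeddings = embedding_matrix[token_ids]                        # [8, 512]
positional_encodings = positional_encoding(len(token_ids), D_MODEL)  # [8, 512]
final_input = token_embeddings + positional_encodings                # [8, 512]

print(token_embeddings.shape)      # (8, 512)
print(positional_encodings.shape)  # (8, 512)
print(final_input.shape)           # (8, 512)
```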
The final tensor is now ready for the main event.
The Stage is set, dancers are ready
And there we have it. We have successfully prepared the input for the Transformer. We’ve created a set of rich vectors where each one is packed with two critical pieces of information: what the token means (its costume) and where it is in the sequence (its spot on the stage).
Our dancers are in costume and in their positions. The lights are on. But the performance hasn’t begun.
Now that they can all see each other at once, how do they interact? How does the model learn that “it” in “The cat drank the milk because it was thirsty” refers to the “cat” and not the “milk”?
That is the power of Self-Attention, and it’s where the magic truly begins. It is one of the most important concepts in the field, used in nearly every state-of-the-art language model these days.
We will be breaking it down in the next edition of LLM4N! Stay tuned!
See you next week!
To dive even deeper into positional encoding, check out this blog post here.






