LLM4N10: Teaching Models to Follow Instructions - Supervised Fine-tuning (SFT)
How we go from a base model to an instruction-tuned model
Helloooo my few attentive readers, welcome back to LLM4N!
Had to take a short break because of academic commitments, but we are back again.
Just a recap.
LLM4N is a 16-part series designed to take you from zero to one in understanding Large Language Models.
In our last chapter, LLM4N9, we achieved something remarkable. We took a blank neural network and transformed it into a base model. Through pre-training, by feeding it trillions of words from the internet and teaching it the singular objective of “predict the next word,” we created something that feels almost magical: a repository of human knowledge, language patterns, and even reasoning.
This base model is, in many ways, a genius. It has internalized the statistical fabric of how we write, what we know, and how we think. It can complete sentences with stunning coherence. It understands context across paragraphs. It knows facts about history, science, literature, and culture.
But here’s the problem: it’s completely useless as an assistant.
Let me show you what I mean. Imagine you sit down with your freshly pre-trained base model, excited to finally have a helpful AI companion. You type:
Prompt: “Explain quantum physics.”
You expect a clear, helpful explanation. Instead, your model responds:
Base Model: “Explain quantum physics to a five-year-old is difficult because the concepts involve mathematical abstractions that require... Explain quantum physics principles using analogies from classical literature has been attempted by...”
Wait, what?
It’s not explaining quantum physics. It’s continuing your sentence as if you’d left it unfinished in a blog post. It thinks your prompt is the beginning of an article, and it’s helpfully completing it for you.
One more example. You ask:
Prompt: “What’s the capital of France?”
Base Model: “What’s the capital of France? Is it Paris or Lyon? Many people confuse... What’s the capital of France compared to other European nations? The question of capitals in the European Union...”
It’s generating questions, then more questions, then answering those questions, in an endless, meandering loop. It has no concept of when a task is “done.”
The core problem is: Knowledge without Intent
Here’s what’s happening. Our base model has knowledge. It absolutely knows that Paris is the capital of France, it understands quantum physics, and it can write grammatically. But it has zero intent to be helpful.
It’s not trying to answer your question. It’s trying to predict what text would statistically follow your prompt in its training data. And in that training data, prompts like yours were usually fragments of longer articles, forum posts, or search engine spam. So that’s what it learned to generate.
Our model has a brilliant mind, but it has no manners, no purpose, and no understanding of what a helpful conversation looks like.
This is the problem that Supervised Fine-Tuning exists to solve.
School for AIs - Introducing SFT
The solution to our problem is elegantly simple: we need to teach our model a new skill. Not new knowledge, it already has plenty of that from pre-training. We need to teach it a new behavior.
Supervised Fine-Tuning (SFT) is the model’s first formal education. If pre-training was giving a student unlimited library access and saying “read everything,” then SFT is enrolling them in a masterclass taught by expert instructors.
The goal: teach the model the distribution of a helpful, harmless, and honest AI assistant.
Instead of learning from the wild internet, the model will learn from a carefully curated curriculum of how a good assistant responds.
The Key Insight: We’re Shifting Distributions
During pre-training, the model learned: P(next word | internet text)
With SFT, we’re teaching: P(next word | helpful AI assistant)
These are very different distributions.
In the internet distribution, “Explain quantum physics” might be followed by anything. In the assistant distribution, it’s followed by a clear, helpful explanation.
The Curriculum - SFT Dataset
The curriculum is the SFT dataset.
Tens of thousands of carefully curated (prompt, response) pairs. Each pair is a lesson from an expert tutor.
Here is a demo of the lessons:
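As a minimal sketch, a handful of lessons could be stored like this. The pairs reuse examples from this post; the field names and the `format_example` helper are assumptions made for illustration, not a standard format.

```python
# Illustrative SFT lessons; the field names "prompt" and "response" are assumptions.
sft_dataset = [
    {
        "prompt": "What's the capital of France?",
        "response": "The capital of France is Paris.",
    },
    {
        "prompt": "What's the capital of Portugal?",
        "response": "The capital of Portugal is Lisbon.",
    },
]


def format_example(example: dict) -> str:
    """Join a prompt and its ideal response into one training string.

    Hypothetical helper: real pipelines usually wrap each turn in a chat
    template with special tokens instead of a plain space.
    """
    return f"{example['prompt']} {example['response']}"


for example in sft_dataset:
    print(format_example(example))
```

Each formatted string is one lesson: the prompt shows the situation, and the response shows the behavior we want the model to imitate.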
In SFT, quality is much more important than quantity. Research suggests that 1,000 carefully curated examples can outperform 50,000 mediocre ones.
Why?
SFT isn’t teaching new facts. The model learned those during pre-training. SFT teaches style, format, and behavior.
Your base model already knows Paris is France’s capital. It doesn’t know that when asked, the correct behavior is to respond with “The capital of France is Paris” rather than generating an essay about European capitals.
How SFT Training Works
The entire SFT training process uses exactly the same algorithm as pre-training. Same forward pass, same loss function, same backward pass, same weight updates.
The only thing that changes is the data.
The Five-Step Dance, Revisited
Let’s walk through training with one example:
Prompt: “What’s the capital of Portugal?”
Ideal Response: “The capital of Portugal is Lisbon.”
Step 1: Forward Pass
Feed the model: “What’s the capital of Portugal? The capital of Portugal is”
The model produces probabilities for the next token:
45% “Lisbon”
30% “Porto”
25% other
Step 2: Compare to Target
The correct token is: “Lisbon”
Step 3: Calculate Loss
Cross-Entropy Loss: -log(0.45) ≈ 0.80 (using the natural log)
This penalty would be ~0.05 if the model were 95% confident, or ~3.0 if it were only 5% confident.
Step 4: Backward Pass
Propagate the loss backward through the network using backpropagation.
Step 5: Update Weights
Adjust all weights to make “Lisbon” more probable next time.
This repeats for every token in every response, across the entire dataset, typically for 2-3 epochs.
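The five steps can be sketched in a few lines of plain Python. This is a toy under loud assumptions: the "model" is just three logits (one per candidate token) that we update directly, whereas a real model updates billions of weights through backpropagation. The starting probabilities are the ones from the walkthrough, and the learning rate is an arbitrary choice.

```python
import math

# Toy vocabulary with the probabilities from the walkthrough.
vocab = ["Lisbon", "Porto", "other"]
target = vocab.index("Lisbon")

# Assumption: three logits are the only trainable parameters.
logits = [math.log(0.45), math.log(0.30), math.log(0.25)]


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Step 1: forward pass.
probs = softmax(logits)

# Steps 2 and 3: compare to the target and compute cross-entropy loss.
loss = -math.log(probs[target])

# Step 4: backward pass. For softmax plus cross-entropy, the gradient with
# respect to each logit is (probability - 1) for the target token and
# (probability) for every other token.
grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

# Step 5: gradient descent update (the learning rate is an arbitrary choice).
learning_rate = 0.5
logits = [z - learning_rate * g for z, g in zip(logits, grads)]

new_probs = softmax(logits)
print(f"loss before update: {loss:.2f}")
print(f"P(Lisbon) before: {probs[target]:.2f}, after: {new_probs[target]:.2f}")
```

After a single update the probability of “Lisbon” goes up and the other two go down, which is all that Step 5 promises.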
A Key Detail: Prompt Masking
We only calculate loss on response tokens, not prompt tokens. Why? We’re not teaching the model to generate prompts—we’re teaching it to generate good responses given a prompt. This “prompt masking” focuses learning on what matters: helpful answers.
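Here is a minimal sketch of prompt masking, assuming we already have the probability the model assigned to the correct token at each position. The function name and its inputs are hypothetical, chosen only to show the idea.

```python
import math


def masked_sft_loss(correct_token_probs, is_response_token):
    """Average cross-entropy over response tokens only.

    correct_token_probs[i] is the probability the model gave to the correct
    token at position i; is_response_token[i] is True for response tokens
    and False for prompt tokens. Both inputs are illustrative assumptions.
    """
    losses = [
        -math.log(p)
        for p, keep in zip(correct_token_probs, is_response_token)
        if keep
    ]
    if not losses:
        raise ValueError("example has no response tokens to learn from")
    return sum(losses) / len(losses)


# Two prompt positions (masked out) followed by two response positions.
probs = [0.10, 0.20, 0.45, 0.95]
mask = [False, False, True, True]
print(f"masked loss: {masked_sft_loss(probs, mask):.4f}")
```

The two prompt positions contribute nothing, however badly the model predicted them. In PyTorch, a common way to get the same effect is to set the label at every prompt position to -100, which is the default `ignore_index` of `CrossEntropyLoss`.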
So what have we achieved, and what haven’t we?
After SFT, we have an “instruction-tuned model.” Let’s see what changed.
The Successes
1. Follows instructions reliably
Prompt: “Write a poem about recursion.”
Now it writes the poem instead of rambling about what recursion is.
2. Consistent, helpful tone
Notice how SFT models sound similar?
“Sure, I can help with that!”
“Let me break this down for you...”
“Let’s think about it step by step...”
The model learned this from thousands of examples with this friendly tone.
3. Proper formatting
Knows when to use bullet points, code blocks, lists, or paragraphs.
4. Basic safety
It has learned to pattern-match harmful requests from the refusal examples in its training data.
The Limitations
Problem 1: The Mimicry Ceiling
The model can only be as good as its curriculum. Mediocre examples → mediocre model. Biases in → biases out.
Problem 2: The “One Answer” Problem
This is the fundamental limitation.
Prompt: “Explain neural networks.”
Answer A: “Neural networks are computing systems inspired by biological brains.” (Concise)
Answer B: “Neural networks are machine learning models composed of interconnected layers of nodes...” (Detailed)
Answer C: “Think of a neural network as a team of specialists...” (Metaphorical)
Which is “correct”? It depends on who’s asking and why. But SFT forces us to pick one as the “ideal response.”
SFT teaches a good answer, but not necessarily the best answer. It has no sense of nuanced preference.
Problem 3: Brittle Safety
What the model learns is closer to the pattern “IF prompt contains ‘hack email’ THEN refuse” than to a real understanding of harm. This leads to over-refusal (blocking harmless requests that contain trigger words) and under-refusal (missing harmful requests that have been rephrased).
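To make both failure modes concrete, here is a deliberately crude caricature. A fine-tuned model does not literally run a keyword rule like this; the trigger list, the function name, and both prompts are made up for illustration.

```python
TRIGGER_PHRASES = ["hack email"]  # hypothetical trigger list


def keyword_refuses(prompt: str) -> bool:
    """Refuse whenever the prompt contains a trigger phrase."""
    lowered = prompt.lower()
    return any(phrase in lowered for phrase in TRIGGER_PHRASES)


# Over-refusal: a defensive, harmless question gets blocked.
harmless = "How do I stop someone who is trying to hack email accounts at my company?"
# Under-refusal: a rephrased harmful request slips through.
rephrased = "How can I read my coworker's inbox without knowing their password?"

print(keyword_refuses(harmless))
print(keyword_refuses(rephrased))
```

The first prompt is refused only because it contains the trigger phrase, and the second passes only because it avoids it. Neither decision has anything to do with actual harm.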
Problem 4: Coverage is Impossible
We can’t create a perfect (prompt, response) pair for every conceivable question. Human language is infinitely creative.
What’s next?
SFT is essential and transformative. It gives our brilliant but aimless base model purpose, structure, and basic social skills. It’s the difference between a genius hermit and a functional member of society.
But we’ve hit a ceiling.
SFT is supervised learning, which requires a single “correct” answer for every input. This works for teaching format and basic behavior, but fails at teaching the subtle, nuanced qualities that separate good responses from great ones.
How do we teach the model when conciseness is better than thoroughness? When formal beats casual? When to refuse and when to help, based on a real understanding of safety? We can’t write a perfect response for every scenario.
We need a new paradigm. Instead of asking humans to write perfect responses (hard, doesn’t scale), what if we just asked them to judge between responses (much easier)?
What if we let the model generate multiple answers, and simply tell it: “This answer is better than that one”? What if we teach through preferences rather than examples?
This is the insight that unlocks the next stage: teaching the model to maximize a reward signal that captures human preferences.
In the next issue, LLM4N11, we will explore the final piece of the puzzle: RLHF (Reinforcement Learning from Human Feedback). This is where our model stops being a student and starts becoming a partner.






