A language model does not “write sentences.”
It predicts the next token. One step at a time.
So the real question is:
How does it decide what matters right now?
That is why attention exists.
Core Idea
A Language Model = next-token probability estimator.
Given previous tokens, it predicts the next token.
Attention = mechanism that decides which past tokens matter more.
This is critical.
Because not all context is equally useful.
The Key Structure
Language Modeling can be reduced to:
P(x₁, x₂, ..., xₙ) = Πₜ₌₁ⁿ P(xₜ | x₁, ..., xₜ₋₁)
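A minimal sketch of that factorization, assuming a toy table of conditional probabilities.
The table `COND` and its numbers are made up for illustration; a real model computes these values.
Logs are summed instead of multiplying raw probabilities, which avoids underflow on long sequences.

```python
import math

# Hypothetical conditional probabilities P(next | prefix) for a toy model.
# The numbers are invented for illustration only.
COND = {
    (): {"I": 0.5, "You": 0.5},
    ("I",): {"love": 0.4, "like": 0.6},
    ("I", "love"): {"you": 0.6, "it": 0.2, "this": 0.1, "pizza": 0.1},
}

def sequence_log_prob(tokens):
    """Sum log P(x_t | x_1..x_{t-1}) over every position t."""
    total = 0.0
    for t, token in enumerate(tokens):
        prefix = tuple(tokens[:t])
        total += math.log(COND[prefix][token])
    return total

log_p = sequence_log_prob(["I", "love", "you"])
prob = math.exp(log_p)  # equals 0.5 * 0.4 * 0.6 up to rounding
```

The joint probability of the whole sentence is just the product of one-step predictions.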
And attention adds:
weighted context selection
More concretely:
Language Model = context + weighting + prediction
Without attention:
All context is compressed into one fixed-size vector.
With attention:
Context is dynamically re-weighted at every step.
Pseudo-code View
Autoregressive generation:
context = ["I", "love"]
while not finished:
    probs = model(context)
    next_token = sample(probs)
    context.append(next_token)
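A runnable sketch of the generation loop above, under loud assumptions.
`toy_model`, `TOY_TABLE`, and the `<eos>` stop token are hypothetical stand-ins, not a real model.
The stand-in looks only at the last token; a real model conditions on the whole context.

```python
import random

# Hypothetical next-token table keyed by the last token only.
TOY_TABLE = {
    "love": {"you": 0.6, "it": 0.2, "this": 0.1, "pizza": 0.1},
    "you": {"<eos>": 1.0},
    "it": {"<eos>": 1.0},
    "this": {"<eos>": 1.0},
    "pizza": {"<eos>": 1.0},
}

def toy_model(context):
    """Return a next-token distribution for the given context."""
    return TOY_TABLE[context[-1]]

def sample(probs, rng):
    """Draw one token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

def generate(context, rng, max_new_tokens=10):
    """Predict, append, repeat until <eos> or the token budget runs out."""
    context = list(context)  # copy so the caller's list is untouched
    for _ in range(max_new_tokens):
        next_token = sample(toy_model(context), rng)
        if next_token == "<eos>":
            break
        context.append(next_token)
    return context

result = generate(["I", "love"], random.Random(0))
```

The `max_new_tokens` cap plays the role of `finished` in the pseudo-code.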
Attention inside the model:
for each token t:
    score = compare(query_t, keys)
    weights = softmax(score)
    output_t = sum(weights * values)
Generation is the outer loop:
Predict → append → repeat.
Attention runs inside every prediction.
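The attention steps in the pseudo-code (score, softmax, weighted sum) can be sketched for a single token.
This assumes a plain dot product for `compare`, and the 2-d vectors are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Attention output for one token: a weighted average of the values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Made-up vectors for three past tokens.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]

weights, output = attend(query, keys, values)
```

Here the first and third keys match the query equally well, so they get equal weight.
The second key does not match, so its value contributes little.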
Concrete Example
Input:
"I love"
Possible next tokens:
you, it, this, pizza
The model assigns probabilities (illustrative values):
you → 0.6
it → 0.2
this → 0.1
pizza → 0.1
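A short sketch of how a token gets picked from that distribution.
Greedy decoding always takes the top token; sampling draws in proportion to probability.
The seed and the number of draws are arbitrary choices for the demo.

```python
import random
from collections import Counter

probs = {"you": 0.6, "it": 0.2, "this": 0.1, "pizza": 0.1}

# Greedy decoding: always take the most probable token.
greedy = max(probs, key=probs.get)

# Sampling: draw in proportion to probability, so other tokens still appear.
rng = random.Random(0)
draws = rng.choices(list(probs), weights=list(probs.values()), k=10_000)
counts = Counter(draws)
share_you = counts["you"] / len(draws)  # close to 0.6, not exactly 0.6
```

With sampling, "pizza" still shows up roughly one time in ten.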
Why does “you” win?
Because attention lets the prediction draw on both “I” and “love”, not just the last word.
“I” + “love” → a pattern seen often in training data → “you”
Now extend:
"I love you because"
The model must now decide:
What does “because” relate to?
Attention allows it to re-evaluate the entire context.
Not just the last token.
Why Attention Is Needed
Early Seq2Seq models (without attention):
- compress entire input into one vector
- lose information as sequence grows
Attention fixes this:
- keeps all token representations
- dynamically selects relevant ones
This matters because:
Long sentences break fixed representations.
Attention removes that bottleneck.
QKV Intuition
Attention uses three vectors:
Query, Key, Value
Think like search:
Query = what I want
Key = what each token offers
Value = the actual information
Flow:
- compare Query with Keys
- compute similarity scores
- normalize with softmax
- combine Values using weights
That is how context is selected.
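Where do Query, Key, and Value come from?
In self-attention, each is typically a learned linear projection of the same token representation.
A minimal sketch, assuming hand-picked matrices `W_Q`, `W_K`, `W_V` and made-up 2-d embeddings (in a trained model these are learned):

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical embeddings and projection matrices, invented for this demo.
embeddings = {"I": [1.0, 0.0], "love": [0.0, 1.0]}
W_Q = [[0.0, 1.0], [1.0, 0.0]]
W_K = [[1.0, 0.0], [0.0, 1.0]]
W_V = [[2.0, 0.0], [0.0, 3.0]]

tokens = ["I", "love"]
# Query: what the token "love" is looking for.
query = matvec(W_Q, embeddings["love"])
# Keys: what each token offers. Values: the information itself.
keys = [matvec(W_K, embeddings[t]) for t in tokens]
values = [matvec(W_V, embeddings[t]) for t in tokens]

# Compare Query with Keys, normalize, then combine Values.
scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
weights = softmax(scores)
mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(2)]
```

With these made-up matrices, the query from "love" matches the key of "I" best, so "I" gets the larger weight.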
The Core Formula
Attention is:
Attention(Q, K, V) = softmax(QKᵀ / √d) V
Here d is the dimension of the key vectors.
Meaning:
- match Query with Keys
- turn matches into probabilities
- use those probabilities to mix Values
Result:
Each token becomes context-aware.
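A sketch of the formula for several queries at once, in plain Python lists.
The `scale` flag is an addition for the demo, used to show what dividing by √d does.
The 4-d vectors are made up.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, scale=True):
    """softmax(Q K^T / sqrt(d)) V for matrices stored as lists of rows."""
    d = len(K[0])
    divisor = math.sqrt(d) if scale else 1.0
    outputs, all_weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / divisor for k in K]
        weights = softmax(scores)
        all_weights.append(weights)
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs, all_weights

# Made-up queries and keys with d = 4, so scores are divided by 2.
Q = [[1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 1.0, -1.0]]
K = [[1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 1.0, -1.0], [0.0, 0.0, 0.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

scaled_out, scaled_w = attention(Q, K, V, scale=True)
_, raw_w = attention(Q, K, V, scale=False)
```

Without scaling, the top weight for the first query is closer to 1.
Dividing by √d keeps the softmax from saturating as vectors get longer.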
Cross Attention and Context Vector
In encoder-decoder models:
Decoder does not rely only on its own tokens.
It looks at Encoder outputs.
Context vector:
c = Σ (attention_weight × encoder_hidden_state)
This is dynamic.
At every step, the model recomputes what matters.
Not a fixed summary.
A moving focus.
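A sketch of that moving focus.
The same encoder states produce a different context vector for each decoder state.
Dot-product scoring is an assumption here (other scoring functions exist), and all vectors are made up.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(decoder_state, encoder_states):
    """c = sum_i weight_i * encoder_state_i, with dot-product scores."""
    scores = [sum(d * h for d, h in zip(decoder_state, state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    return [sum(w * h[i] for w, h in zip(weights, encoder_states))
            for i in range(dim)]

# Made-up encoder hidden states for a three-token source sentence.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Two hypothetical decoder states from two different decoding steps.
c_step1 = context_vector([3.0, 0.0], encoder_states)
c_step2 = context_vector([0.0, 3.0], encoder_states)
```

The encoder outputs never change between steps.
Only the weights do, and the context vector follows.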
Naive vs Real View
Naive view:
Language model = next word generator
Real view:
Language model = dynamic context weighting system
Naive:
predict next token
Real:
compute attention
reweight context
then predict token
That difference is everything.
It is a large part of why Transformers outperform older recurrent models.
Important Constraints
Attention is powerful, but not free.
- compute cost grows quadratically with sequence length
- requires memory for the keys and values of all tokens
- depends on good tokenization
- still generates sequentially at inference
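A back-of-the-envelope sketch of the cost.
In standard self-attention every token is scored against every token, so the score count grows with the square of the length.
The 4-byte float size is an assumption, and the count is for one score matrix only (real models have many heads and layers, and some implementations avoid storing the full matrix).

```python
def attention_score_count(seq_len):
    """Number of query-key scores in full self-attention over seq_len tokens."""
    return seq_len * seq_len

def score_memory_bytes(seq_len, bytes_per_score=4):
    """Memory for one score matrix, assuming 4-byte floats."""
    return attention_score_count(seq_len) * bytes_per_score

# Doubling the sequence length quadruples the number of scores.
counts = {n: attention_score_count(n) for n in (1_000, 2_000, 4_000)}
```

Double the context, quadruple the scores.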
Also:
Attention does not understand meaning by itself.
It only learns patterns from data.
So quality depends on training.
Why This Matters (Again)
Early:
Without attention → information bottleneck
Now:
With attention → full context + selective focus
This is why modern LLMs work.
Not because they “know language.”
But because they manage context effectively.
Takeaway
Language Model = next-token prediction.
Attention = context selection.
QKV = mechanism for selecting information.
If you remember one thing:
Attention lets a model decide what to look at before predicting what to say.
That is the core of modern LLMs.
Discussion
When you think about LLM behavior, do you see it more as:
a probability engine or a context selection system?
Originally published at zeromathai.com
Original article: https://zeromathai.com/en/attention-language-modeling-basics-en/
GitHub Resources
AI diagrams, study notes, and visual guides:
https://github.com/zeromathai/zeromathai-ai