zeromathai

Posted on • Originally published at zeromathai.com

How Attention Actually Works — From Next-Token Prediction to QKV Intuition

A language model does not “write sentences.”

It predicts the next token. One step at a time.

So the real question is:

How does it decide what matters right now?

That is why attention exists.

Core Idea

A Language Model = next-token probability estimator.

Given previous tokens, it predicts the next token.

Attention = mechanism that decides which past tokens matter more.

This is critical.

Because not all context is equally useful.

The Key Structure

Language modeling factorizes the probability of a sequence with the chain rule:

P(x₁, x₂, ..., xₙ) = Πₜ₌₁ⁿ P(xₜ | x₁, ..., xₜ₋₁)

And attention adds:

weighted context selection

More concretely:

Language Model = context + weighting + prediction

Without attention:

All context is compressed into one fixed-size vector.

With attention:

Context is dynamically re-weighted at every step.
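
As a small numeric check of the chain-rule factorization above, the sketch below multiplies conditional probabilities for "I love you". The probability values and the name `sequence_log_prob` are illustrative assumptions, not output from a real model.

```python
import math

# Hypothetical conditional probabilities P(x_t | x_1 ... x_{t-1}) for "I love you".
# These numbers are made up for illustration only.
conditionals = [
    ("I", 0.05),     # P("I")
    ("love", 0.10),  # P("love" | "I")
    ("you", 0.60),   # P("you" | "I", "love")
]

def sequence_log_prob(steps):
    """Sum log-probabilities, which equals the log of the chain-rule product."""
    return sum(math.log(p) for _, p in steps)

log_p = sequence_log_prob(conditionals)
joint = math.exp(log_p)  # equals 0.05 * 0.10 * 0.60 up to floating-point rounding
print(f"log P = {log_p:.4f}, P = {joint:.6f}")
```

Summing logs instead of multiplying raw probabilities is the usual choice, because a product of many small numbers underflows quickly.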

Pseudo-code View

Autoregressive generation:

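context = ["I", "love"]

while not finished:                 # e.g. until an end-of-sequence token
    probs = model(context)          # distribution over the vocabulary
    next_token = sample(probs)      # pick one token from that distribution

    context.append(next_token)      # the new token becomes part of the context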

Attention inside the model:

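for each token t:
    scores = compare(query_t, keys)      # one similarity score per key

    weights = softmax(scores)            # scores become weights that sum to 1

    output_t = sum(weights * values)     # weighted mix of the value vectors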

Together, that is the core loop. Attention runs inside every model call.

Predict → append → repeat.
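
The generation loop above can be made runnable with a toy stand-in for the model. This is only a sketch: `TOY_TABLE`, `toy_model`, `generate`, and the `<eos>` end marker are hypothetical names, and the table looks only at the last token, whereas a real model conditions on the whole context.

```python
import random

# Toy stand-in for a trained model: next-token probabilities keyed by the last token.
# A real model conditions on the whole context; this table is purely illustrative.
TOY_TABLE = {
    "love": {"you": 0.6, "it": 0.2, "this": 0.1, "pizza": 0.1},
    "you": {"<eos>": 1.0},
    "it": {"<eos>": 1.0},
    "this": {"<eos>": 1.0},
    "pizza": {"<eos>": 1.0},
}

def toy_model(context):
    """Return a dict of next-token probabilities for the given context."""
    return TOY_TABLE[context[-1]]

def sample(probs, rng):
    """Draw one token according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens])[0]

def generate(context, rng, max_steps=10):
    """Predict, append, repeat until <eos> or max_steps is reached."""
    context = list(context)  # do not mutate the caller's list
    for _ in range(max_steps):
        next_token = sample(toy_model(context), rng)
        if next_token == "<eos>":
            break
        context.append(next_token)
    return context

print(generate(["I", "love"], random.Random(0)))
```

Passing a seeded `random.Random` keeps a run reproducible. Replacing `sample` with an argmax would give greedy decoding instead.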

Concrete Example

Input:

"I love"

Possible next tokens:

you, it, this, pizza

The model assigns probabilities:

you → 0.6

it → 0.2

this → 0.1

pizza → 0.1
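
Language models typically produce such probabilities by applying a softmax to raw scores (logits). The sketch below assumes made-up logits, chosen only so that the result reproduces the numbers above. They are not taken from any real model.

```python
import math

def softmax(logits):
    """Map a dict of raw scores to probabilities that sum to 1."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(x - m) for tok, x in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Made-up logits chosen so the result matches the probabilities above
logits = {"you": math.log(6.0), "it": math.log(2.0), "this": 0.0, "pizza": 0.0}

probs = softmax(logits)
greedy = max(probs, key=probs.get)  # the most likely token
print({tok: round(p, 2) for tok, p in probs.items()}, "->", greedy)
```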

Why does “you” win?

Because “I love you” is a pattern the model saw often in training, and attention lets it weigh “I” and “love” together.

“I” + “love” → strong pattern → “you”

Now extend:

"I love you because"

The model must now decide:

What does “because” relate to?

Attention allows it to re-evaluate the entire context.

Not just the last token.

Why Attention Is Needed

Early Seq2Seq models (recurrent encoder-decoders without attention):

  • compress the entire input into one vector
  • lose information as the sequence grows

Attention fixes this:

  • keeps all token representations
  • dynamically selects relevant ones

This matters because:

Long sentences break fixed representations.

Attention removes that bottleneck.

QKV Intuition

Attention derives three vectors for each token:

Query, Key, Value

Think like search:

Query = what I want

Key = what each token offers

Value = the actual information

Flow:

  1. compare Query with Keys
  2. compute similarity scores
  3. normalize with softmax
  4. combine Values using weights

That is how context is selected.
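
The four steps can be sketched for a single query in plain Python. This is a minimal illustration with made-up 2-dimensional vectors. It uses an unscaled dot product as the similarity, and `attend` is a hypothetical helper name.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """One attention step for a single query, following the four steps above."""
    # Steps 1-2: compare the query with every key using a dot product
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Step 3: normalize the scores with softmax
    weights = softmax(scores)
    # Step 4: combine the values using the weights
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy 2-dimensional vectors, made up for illustration
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]  # scores highest against the first and third keys

weights, output = attend(query, keys, values)
print([round(w, 3) for w in weights], [round(x, 3) for x in output])
```

The second key gets the smallest weight, so its value contributes least to the output.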

The Core Formula

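Attention is:

Attention(Q, K, V) = softmax(QKᵀ / √d) V

Here d is the dimension of the Key vectors. Dividing by √d keeps the scores from growing with the vector size.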

Meaning:

  • match Query with Keys
  • turn matches into probabilities
  • use those probabilities to mix Values

Result:

Each token becomes context-aware.
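
The formula can be sketched for a whole sequence at once. The version below is a plain-Python illustration, not an optimized implementation. The function name and the toy input `X` are assumptions, and the learned projection matrices that real models apply to get Q, K, and V are left out.

```python
import math

def softmax(row):
    """Turn one row of scores into weights that sum to 1."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d)) V for lists of row vectors."""
    d = len(K[0])  # dimension of each key vector
    out_dim = len(V[0])
    outputs = []
    for q in Q:
        # One row of Q K^T / sqrt(d): scaled similarity of q to every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted mix of the value vectors
        outputs.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(out_dim)])
    return outputs

# Self-attention on three toy token vectors: Q, K and V all come from the same tokens.
# Real models first apply learned projection matrices, omitted here.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = scaled_dot_product_attention(X, X, X)
for row in out:
    print([round(x, 3) for x in row])
```

Each output row has the same shape as a value vector, but it now blends information from every token in the sequence.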

Cross Attention and Context Vector

In encoder-decoder models:

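The decoder does not rely only on its own tokens.

It also attends to the encoder outputs: Queries come from the decoder, Keys and Values from the encoder.

Context vector:

cₜ = Σᵢ αₜᵢ hᵢ

where αₜᵢ is the attention weight that decoder step t assigns to encoder hidden state hᵢ.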

This is dynamic.

At every step, the model recomputes what matters.

Not a fixed summary.

A moving focus.
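
The moving focus can be seen in a small sketch that recomputes the context vector for two decoder steps. The encoder and decoder states are made-up numbers. Dot-product scoring is an assumption here, since other scoring functions are also used in encoder-decoder attention.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(decoder_state, encoder_states):
    """Weighted sum of encoder states, with weights from dot-product scores."""
    scores = [sum(d * h for d, h in zip(decoder_state, hs)) for hs in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    c = [sum(w * hs[i] for w, hs in zip(weights, encoder_states)) for i in range(dim)]
    return weights, c

# Hypothetical encoder hidden states for a 3-token input (values are made up)
encoder_states = [[3.0, 0.0], [0.0, 3.0], [1.0, 1.0]]

# Two hypothetical decoder states at consecutive steps
for step, decoder_state in enumerate([[1.0, 0.0], [0.0, 1.0]], start=1):
    weights, c = context_vector(decoder_state, encoder_states)
    focus = weights.index(max(weights))  # index of the most attended encoder token
    print(f"step {step}: focus on encoder token {focus}, c = {[round(x, 2) for x in c]}")
```

The encoder states never change between steps. Only the weights do, which is what makes the context vector dynamic.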

Naive vs Real View

Naive view:

Language model = next word generator

Real view:

Language model = dynamic context weighting system

Naive:

predict next token

Real:

compute attention
reweight context
then predict token

That difference matters.

It is a large part of why Transformers outperform older recurrent models on long sequences.

Important Constraints

Attention is powerful, but not free.

  • compute cost grows quadratically with sequence length in standard self-attention
  • requires keeping representations of all tokens in memory
  • depends on good tokenization
  • still generates sequentially at inference
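
To get a feel for the first constraint, the sketch below counts query-key scores for standard full self-attention, where every token is compared with every token. The helper name is hypothetical, the count is for a single attention head in a single layer, and sparse or otherwise approximate attention variants behave differently.

```python
def attention_score_count(seq_len):
    """Number of query-key scores in full self-attention: one per token pair."""
    return seq_len * seq_len

# Doubling the sequence length quadruples the number of scores
for n in (1_000, 2_000, 4_000):
    print(n, attention_score_count(n))
```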

Also:

Attention does not understand meaning by itself.

It only learns patterns from data.

So quality depends on training.

Why This Matters (Again)

Early:

Without attention → information bottleneck

Now:

With attention → full context + selective focus

This is why modern LLMs work.

Not because they “know language.”

But because they manage context selectively.

Takeaway

Language Model = next-token prediction.

Attention = context selection.

QKV = mechanism for selecting information.

If you remember one thing:

Attention lets a model decide what to look at before predicting what to say.

That is the core of modern LLMs.

Discussion

When you think about LLM behavior, do you see it more as:

a probability engine or a context selection system?

Original article: https://zeromathai.com/en/attention-language-modeling-basics-en/

GitHub Resources
AI diagrams, study notes, and visual guides:
https://github.com/zeromathai/zeromathai-ai
