Copyright © 2024 Adaptive ML, Inc. All rights reserved.






Attention, Visualized
Build an intuition for the attention mechanism, the core innovation behind transformers.
Language models are trained to predict the most likely next token. Let's say we have a sequence of input tokens: "the", "dog", "was". Each of these tokens is mapped to a unique vector representing its meaning, called a word embedding. The embedding of "dog" encodes everything the model knows about dogs.
These embeddings are fed to a neural network, which predicts the most likely next token—in this case, "barking".
Simple enough. Now consider: what about a hot dog?
We now have a new sequence: "the", "hot", "dog", "was".
The embedding of "dog" is the same as before, still encoding the meaning of dog. Similarly, "hot" gets its own embedding that encodes information like high temperature.
Given these embeddings, the model may predict "panting". A reasonable completion for a dog, but it misses the context. In this sentence, we're most likely talking about a hot dog, not a dog that's hot.
Ideally, the meaning of "dog" should be contextual. In this case, the food, not the animal. This is where attention comes in.
Before we get into attention, let's recap the dot product.
Given two vectors a and b, each with d components, the dot product tells us how similar they are. To compute it, we multiply corresponding components and add up the results:
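$$a \cdot b \;=\; a_1 b_1 + a_2 b_2 + \cdots + a_d b_d \;=\; \sum_{i=1}^{d} a_i b_i$$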
This can also be written compactly as $a^\top b$:
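$$a^\top b \;=\; \begin{bmatrix} a_1 & a_2 & \cdots & a_d \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_d \end{bmatrix}$$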
Similar vectors produce large dot products; opposite vectors produce negative ones. As the dimension d grows, these values tend to get larger, so we divide by $\sqrt{d}$ to keep them in a reasonable range:
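$$\frac{a^\top b}{\sqrt{d}}$$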
This scaled dot product is exactly what's used in our attention mechanism.
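If it helps to see it in code, here's the same computation as a quick NumPy sketch; the vectors and the dimension below are made-up toy values, not anything from a real model:

```python
import numpy as np

d = 4                                  # embedding dimension (toy value)
a = np.array([1.0, 0.5, -0.2, 0.3])    # two made-up "embedding" vectors
b = np.array([0.9, 0.4, -0.1, 0.2])

scaled_dot = (a @ b) / np.sqrt(d)      # scaled dot product
print(scaled_dot)                      # larger when a and b point in similar directions
```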
We use the dot product to measure how much each token should attend to the others. For "dog", we compute:
From "dog"'s embedding, we project a query vector {\color{query}q_3}—think of this as asking: "what am I looking for?"
From each other token, we project a key vector. For example, "hot" produces $k_2$, representing: "here's what I provide."
The dot product between query and key gives us an attention score—a single number measuring how relevant that token is to "dog".
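In symbols, and using $s_{3,j}$ purely as a label for the score between "dog" and token $j$:

$$s_{3,j} \;=\; q_3^\top k_j$$

For "hot", that score is $s_{3,2} = q_3^\top k_2$.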
The attention scores tell us how relevant each token is to "dog". But these raw numbers can be any magnitude. To turn them into something meaningful, we apply two operations.
First, we divide by $\sqrt{d}$, as we saw earlier. Then, softmax converts the scaled scores into attention weights—values between 0 and 1 that sum to 1:
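$$w_{3,j} \;=\; \frac{\exp\!\big(s_{3,j}/\sqrt{d}\big)}{\sum_{i=1}^{n} \exp\!\big(s_{3,i}/\sqrt{d}\big)}$$

Here $w_{3,j}$ is just our label for the weight that "dog" places on token $j$, and $n$ is the number of tokens in the sequence.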
These weights are a probability distribution over tokens. They tell us exactly how much "dog" should attend to each token in the sequence.
Each token also produces a value vector. While the key says "here's what I offer", the value contains the actual content to share.
We multiply our attention weights by the value vectors (stacked into a matrix $V$) to get a weighted sum of meaning from every token:
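$$\text{output}_3 \;=\; \sum_{j=1}^{n} w_{3,j}\, v_j$$

where $v_j$ is token $j$'s value vector and the $w_{3,j}$ are the attention weights we just computed.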
The output is a new vector for "dog" that understands it likely refers to the food, not the animal.
So far, we've focused on a single token. But in practice, every token computes its own query and attends to the full sequence—all in parallel. This is self-attention:
Each token projects a query (what it's looking for) and a key (what it offers). The dot product between them produces attention scores, which softmax normalizes into weights. These weights then combine the value vectors—the actual content each token contributes—into a single, context-aware representation.
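To tie the pieces together, here's a minimal NumPy sketch of single-head self-attention. The embeddings and projection matrices are random stand-ins for illustration only; a real transformer learns them during training:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sequence of embeddings X, shape (n, d)."""
    Q = X @ W_q                          # queries: what each token is looking for
    K = X @ W_k                          # keys: what each token offers
    V = X @ W_v                          # values: the content each token shares
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # scaled dot products, shape (n, n)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # one context-aware vector per token

# Toy example: 4 tokens ("the", "hot", "dog", "was") with embedding dimension 8.
rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))                          # stand-in word embeddings
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

out = self_attention(X, W_q, W_k, W_v)
print(out.shape)                                     # (4, 8)
```

Each row of `weights` is one token's probability distribution over the sequence, and each row of the output is that token's new, context-aware vector.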
That's attention: a learned way for tokens to share information based on relevance.




