Copyright © 2024 Adaptive ML, Inc. All rights reserved.






Attention, Visualized
Build an intuition for the attention mechanism, the core innovation behind transformers.
Language models are trained to predict the most likely next token. Let's say we have a sequence of input tokens: "the", "dog", "was". Each of these tokens is mapped to a unique vector representing its meaning, called a word embedding. The embedding of "dog" encodes everything the model knows about dogs.
These embeddings are fed to a neural network, which predicts the most likely next token—in this case, "barking".
Simple enough. Now consider: what about a hot dog?
We now have a new sequence: "the", "hot", "dog", "was".
The embedding of "dog" is the same as before, still encoding the meaning of dog. Similarly, "hot" gets its own embedding that encodes information like high temperature.
Given these embeddings, the model may predict "panting". A reasonable completion for a dog, but it misses the context. In this sentence, we're most likely talking about a hot dog, not a dog that's hot.
Ideally, the meaning of "dog" should be contextual. In this case, the food, not the animal. This is where attention comes in.
Before we get into attention, let's recap the dot product.
Given two vectors a and b, each with d components, the dot product tells us how similar they are. To compute it, we multiply corresponding components and add up the results:
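$$a \cdot b \;=\; a_1 b_1 + a_2 b_2 + \cdots + a_d b_d \;=\; \sum_{i=1}^{d} a_i b_i$$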
This can also be written compactly as $a^\top b$:
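$$a^\top b \;=\; \begin{bmatrix} a_1 & a_2 & \cdots & a_d \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_d \end{bmatrix}$$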
Similar vectors produce large dot products; opposite vectors produce negative ones. As the dimension d grows, these values tend to get larger, so we divide by $\sqrt{d}$ to keep them in a reasonable range:
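$$\frac{a^\top b}{\sqrt{d}}$$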
This scaled dot product is exactly what's used in our attention mechanism.
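If it helps to see it in code, here's the same computation as a quick NumPy sketch; the vectors and the dimension below are made-up toy values, not anything from a real model:

```python
import numpy as np

d = 4                                  # embedding dimension (toy value)
a = np.array([1.0, 0.5, -0.2, 0.3])    # two made-up "embedding" vectors
b = np.array([0.9, 0.4, -0.1, 0.2])

scaled_dot = (a @ b) / np.sqrt(d)      # scaled dot product
print(scaled_dot)                      # larger when a and b point in similar directions
```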
We use the dot product to measure how much each token should attend to the others. For "dog", we compute:
From "dog"'s embedding, we project a query vector {\color{query}q_3}—think of this as asking: "what am I looking for?"
From each other token, we project a key vector. For example, "hot" produces $k_2$, representing: "here's what I provide."
The dot product between query and key gives us an attention score—a single number measuring how relevant that token is to "dog".
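In symbols, and using $s_{3,j}$ purely as a label for the score between "dog" and token $j$:

$$s_{3,j} \;=\; q_3^\top k_j$$

For "hot", that score is $s_{3,2} = q_3^\top k_2$.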
The attention scores tell us how relevant each token is to "dog". But these raw numbers can be any magnitude. To turn them into something meaningful, we apply two operations.
First, we divide by $\sqrt{d}$, as we saw earlier. Then, softmax converts the scaled scores into attention weights—values between 0 and 1 that sum to 1:
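$$w_{3,j} \;=\; \frac{\exp\!\big(s_{3,j}/\sqrt{d}\big)}{\sum_{i=1}^{n} \exp\!\big(s_{3,i}/\sqrt{d}\big)}$$

Here $w_{3,j}$ is just our label for the weight that "dog" places on token $j$, and $n$ is the number of tokens in the sequence.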
These weights are a probability distribution over tokens. They tell us exactly how much "dog" should attend to each token in the sequence.
Each token also produces a value vector. While the key says "here's what I offer", the value contains the actual content to share.
We multiply our attention weights by the value vectors (stacked into a matrix $V$) to get a weighted sum of meaning from every token:
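$$\text{output}_3 \;=\; \sum_{j=1}^{n} w_{3,j}\, v_j$$

where $v_j$ is token $j$'s value vector and the $w_{3,j}$ are the attention weights we just computed.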
The output is a new vector for "dog" that understands it likely refers to the food, not the animal.
So far, we've focused on a single token. But in practice, every token computes its own query and attends to the full sequence—all in parallel. This is self-attention:
Each token projects a query (what it's looking for) and a key (what it offers). The dot product between them produces attention scores, which softmax normalizes into weights. These weights then combine the value vectors—the actual content each token contributes—into a single, context-aware representation.
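To tie the pieces together, here's a minimal NumPy sketch of single-head self-attention. The embeddings and projection matrices are random stand-ins for illustration only; a real transformer learns them during training:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sequence of embeddings X, shape (n, d)."""
    Q = X @ W_q                          # queries: what each token is looking for
    K = X @ W_k                          # keys: what each token offers
    V = X @ W_v                          # values: the content each token shares
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # scaled dot products, shape (n, n)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # one context-aware vector per token

# Toy example: 4 tokens ("the", "hot", "dog", "was") with embedding dimension 8.
rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))                          # stand-in word embeddings
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

out = self_attention(X, W_q, W_k, W_v)
print(out.shape)                                     # (4, 8)
```

Each row of `weights` is one token's probability distribution over the sequence, and each row of the output is that token's new, context-aware vector.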
That's attention: a learned way for tokens to share information based on relevance.




