
Static vs contextual embeddings: how transformers turn words into meaning

A practical guide to traditional word embeddings, contextual embeddings, why transformers rely on them, and how self-attention calculates them step by step.


Words are not fixed objects

When people first learn about embeddings, the idea sounds simple: take a word and convert it into a list of numbers so a model can work with it.

That is true, but it is only the beginning.

The word "bank" in "I deposited money in the bank" does not mean the same thing as "bank" in "We sat on the bank of the river." If a model gives both uses of bank the exact same vector, it is already starting with the wrong assumption.

That is the core reason contextual embeddings matter. They let a model represent a word based on the words around it, not as a fixed dictionary entry.

In this post, we will cover:

  1. What a general or static embedding is
  2. What a contextual embedding is
  3. Why transformers depend on contextual embeddings
  4. How contextual embeddings are calculated using self-attention
  5. A simple worked example with real intuition

One note on terminology: people sometimes say general embedding for the older style of embedding, but the more common technical term is static embedding. I will treat the two terms as interchangeable here.

What is an embedding?

An embedding is a dense numeric representation of a token, word, sentence, image, or other piece of data.

Instead of representing a word as a giant one-hot vector like:

Text
dog = [0, 0, 0, 1, 0, 0, 0, ...]

we represent it as a compact learned vector like:

Text
dog = [0.18, -0.42, 0.77, 0.05, ...]

The magic is that similar concepts often end up near each other in vector space. For example, dog, puppy, and pet may sit closer to one another than any of them does to airplane.

This gives neural networks a much richer starting point than raw IDs.
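To make "near each other" concrete, here is a minimal numpy sketch using made-up 4-dimensional vectors (real embeddings have hundreds of dimensions); cosine similarity is the usual closeness measure:

Python
import numpy as np

# Made-up 4-dimensional embeddings; real ones have hundreds of dimensions.
dog      = np.array([0.18, -0.42, 0.77, 0.05])
puppy    = np.array([0.20, -0.35, 0.70, 0.10])
airplane = np.array([-0.60, 0.55, -0.10, 0.40])

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(dog, puppy))     # ~0.996: very close
print(cosine_similarity(dog, airplane))  # ~-0.48: far apart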

Static embeddings: one word, one vector

Older NLP models such as Word2Vec, GloVe, and fastText learn a fixed embedding table (fastText composes word vectors from subword n-grams, but the vector for a given word form is still fixed).

That means:

  • every time the token bank appears, it starts with the same vector
  • every time the token apple appears, it starts with the same vector
  • the embedding does not change based on the sentence

So a static embedding table might look conceptually like this:

Word  | Embedding
bank  | [0.4, -0.2, 0.9, ...]
money | [0.8, 0.1, 0.3, ...]
river | [-0.5, 0.7, -0.1, ...]

This was a huge improvement over one-hot encoding because it captured useful semantic structure. Word analogies such as:

Text
king - man + woman ≈ queen

became possible.
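Here is a toy sketch of that analogy arithmetic, with hand-picked 3-dimensional vectors chosen so the analogy works out (real Word2Vec vectors are learned and have far more dimensions):

Python
import numpy as np

# Hand-picked 3-dimensional vectors chosen so the analogy works out.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "apple": np.array([0.1, 0.9, 0.5]),
}

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = emb["king"] - emb["man"] + emb["woman"]

# Nearest word by cosine similarity, excluding the query words themselves.
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cos(emb[w], target))
print(best)  # queen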

But static embeddings have a major weakness: one token gets one meaning.

Why static embeddings are not enough

Language is full of ambiguity.

  • bank can mean a financial institution or the side of a river
  • bat can mean an animal or a piece of sports equipment
  • light can mean illumination or something not heavy

Static embeddings struggle because they collapse all senses into one fixed vector.

They also miss other kinds of context:

  • sentiment: "This movie is sick" can be praise or criticism depending on usage
  • syntax: the role of a word changes depending on sentence structure
  • long-range dependency: a word may depend on another word far away in the sentence
  • task nuance: the meaning needed for translation, summarization, and question answering is not always the same

In short, static embeddings know something about language, but not enough about the current sentence.

Contextual embeddings: one token, many possible vectors

A contextual embedding is a representation of a token after the model has looked at its surrounding context.

That means the token bank can become one vector in:

Text
I deposited money in the bank.

and a different vector in:

Text
We sat on the bank of the river.

This is the key difference:

Property                      | Static embedding             | Contextual embedding
Vector for a word             | Fixed                        | Changes with context
Handles ambiguity well        | No                           | Yes
Captures sentence meaning     | Weakly                       | Strongly
Used in classic NLP pipelines | Often                        | Less often
Used in transformers          | Only as the starting lookup  | Yes, throughout the model

A very important detail: in transformers, we still begin with a token embedding lookup table. But that lookup is only the starting point. After self-attention layers mix information across tokens, the hidden state for each token becomes contextual.

That hidden state is what people usually mean by a contextual embedding.

Why transformers use contextual embeddings

Transformers were designed to let each token look at other relevant tokens in the sequence. This solves a deep problem in language: words do not carry full meaning by themselves.

Consider these sentences:

  1. "The bank approved the loan."
  2. "The fisherman rested on the bank."

If the model used only a static vector for bank, it would blur these two meanings together. But a transformer can make bank attend to words like approved and loan in the first sentence, and to fisherman in the second.

That is why contextual embeddings are so useful in transformers:

  • they resolve ambiguity
  • they capture relationships between words
  • they allow the same token to mean different things in different places
  • they improve downstream tasks like translation, search, QA, summarization, and generation

Without contextual embeddings, a transformer would lose one of its biggest strengths.

The big picture: where contextual embeddings come from

The heart of the transformer is self-attention.

The core equation is:

Text
Attention(Q, K, V) = softmax((QK^T) / sqrt(d_k)) V

If that looks intimidating, do not worry. The idea is simpler than the notation.

  • Q means Query: what this token is looking for
  • K means Key: what each token offers
  • V means Value: the information each token can pass along

A contextual embedding is produced when a token gathers weighted information from other tokens through that attention process.
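Here is the whole equation as a minimal numpy sketch; the step-by-step walkthrough below unpacks each part:

Python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    # Q, K, V each have shape (seq_len, d); the output has one row per token.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # how relevant each token is to each query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V                              # weighted mix of value vectors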

Step by step: how a contextual embedding is calculated

Let us walk through a sentence:

Text
I deposited money in the bank

1. Tokenize the input

The sentence is split into tokens:

Text
[I, deposited, money, in, the, bank]

Each token gets an ID.
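For this walkthrough, a toy word-level vocabulary is enough (real models use subword tokenizers such as BPE or WordPiece, and these IDs are made up):

Python
# Toy word-level vocabulary; real models use subword tokenizers (BPE, WordPiece).
vocab = {"I": 0, "deposited": 1, "money": 2, "in": 3, "the": 4, "bank": 5}

tokens = "I deposited money in the bank".split()
ids = [vocab[t] for t in tokens]
print(ids)  # [0, 1, 2, 3, 4, 5]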

2. Look up the initial token embeddings

Each token ID is mapped to a learned embedding vector from an embedding matrix.

Text
e_I, e_deposited, e_money, e_in, e_the, e_bank

At this point, bank is still just its basic lookup vector. It is not contextual yet.

3. Add positional information

Transformers need to know order, so we add positional embeddings or positional encodings:

Text
h_i^0 = e_i + p_i

Now each token has both content information and position information.
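A minimal numpy sketch of steps 2 and 3 together, using a random stand-in for the learned embedding matrix and the sinusoidal encodings from the original Transformer paper (many models learn the positional table instead):

Python
import numpy as np

vocab_size, d_model, seq_len = 6, 8, 6
rng = np.random.default_rng(0)

# Step 2: random stand-in for the learned embedding matrix, then the lookup.
E = rng.normal(size=(vocab_size, d_model))
ids = np.array([0, 1, 2, 3, 4, 5])        # "I deposited money in the bank"

# Step 3: sinusoidal positional encodings from the original Transformer paper;
# many models learn a positional embedding table instead.
pos = np.arange(seq_len)[:, None]
dim = np.arange(d_model)[None, :]
angles = pos / 10000 ** (2 * (dim // 2) / d_model)
P = np.where(dim % 2 == 0, np.sin(angles), np.cos(angles))

H0 = E[ids] + P                           # h_i^0 = e_i + p_i, shape (seq_len, d_model)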

4. Create Query, Key, and Value vectors

For each token representation h_i, the model applies learned linear projections:

Text
q_i = h_i W_Q
k_i = h_i W_K
v_i = h_i W_V

For the whole sentence, this becomes three matrices:

Text
Q = H W_Q
K = H W_K
V = H W_V

This is the stage often shown in attention diagrams: every token gets its own query, key, and value vector.

Here is the same flow in a more visual form:

[Diagram: Query, Key, and Value matrices interacting in self-attention to produce a context-aware output embedding]

The sketch shows the Q, K, and V projections, the score matrix from QK^T, the softmax weighting step, and the final weighted combination of the value vectors.

5. Compute attention scores

For each token, we compare its query with every key using a dot product.

For token i attending to token j:

Text
score(i, j) = (q_i · k_j) / sqrt(d_k)

This tells us how relevant token j is when token i builds its updated representation.

If we stack all pairwise scores together, we get a score matrix:

Text
S = (QK^T) / sqrt(d_k)

6. Apply softmax

The raw scores are turned into probabilities:

Text
A = softmax(S)

Each row of A sums to 1. A row tells us how much attention one token pays to all tokens in the sequence.

7. Mix the Value vectors

Now we take a weighted sum of the value vectors:

Text
Z = A V

This is the crucial step.

The updated representation for token i is the weighted combination of the value vectors from all tokens. So if bank strongly attends to money and deposited, its new vector will move toward a financial meaning.

That weighted sum is the beginning of the token's contextual embedding.
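Putting steps 4 through 7 together in one runnable numpy sketch (random weights stand in for the learned projections, so the numbers are arbitrary, but the mechanics are exactly the ones described above):

Python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 8, 8

H = rng.normal(size=(seq_len, d_model))       # step 3 output: token + position vectors

W_Q = rng.normal(size=(d_model, d_k))         # random stand-ins for learned projections
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = H @ W_Q, H @ W_K, H @ W_V           # step 4: queries, keys, values

S = Q @ K.T / np.sqrt(d_k)                    # step 5: scaled pairwise scores

A = np.exp(S - S.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)            # step 6: softmax, each row sums to 1

Z = A @ V                                     # step 7: weighted mix of value vectors
print(Z.shape)         # (6, 8): one updated, context-aware vector per token
print(A[-1].round(3))  # attention row for the last token ("bank" in our sentence)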

8. Use multiple heads

Transformers do not do this just once. They use multi-head attention.

That means several separate sets of W_Q, W_K, and W_V learn different relationships in parallel:

  • one head may focus on syntax
  • one may track coreference
  • one may focus on nearby words
  • one may capture long-range semantic links

The outputs of all heads are concatenated and projected again.
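A sketch of that concatenate-and-project pattern, again with random stand-in weights and just two heads for brevity:

Python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 8, 2
d_head = d_model // n_heads

H = rng.normal(size=(seq_len, d_model))

def attention(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    return (A / A.sum(axis=-1, keepdims=True)) @ V

heads = []
for _ in range(n_heads):
    # Each head gets its own projections and can learn a different relationship.
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(H @ W_Q, H @ W_K, H @ W_V))

W_O = rng.normal(size=(d_model, d_model))     # final output projection
out = np.concatenate(heads, axis=-1) @ W_O    # concatenate heads, project back
print(out.shape)  # (6, 8)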

9. Pass through feed-forward layers and repeat

After attention, the model applies:

  • residual connections
  • layer normalization
  • a position-wise feed-forward network

Then the process repeats across many layers.

By the final layers, each token representation contains rich information from the rest of the sentence. That final hidden state is usually what we call the contextual embedding.

A simple intuition for the attention diagram

If you think about a standard attention diagram with Q, K, and V blocks:

  1. QK^T tells you how strongly each token relates to every other token
  2. softmax turns those relationships into weights
  3. multiplying by V mixes information from all tokens
  4. the output is the new context-aware representation

So the contextual embedding is not stored in a static dictionary. It is computed on the fly from the current sequence.

Worked example: how "bank" changes with context

Let us compare two contexts for the token bank.

Sentence A

Text
money bank loan

Imagine the attention weights for the token bank become:

Text
[0.51, 0.19, 0.30]

meaning:

  • 51% attention to money
  • 19% to bank itself
  • 30% to loan

Now suppose the value vectors are:

Text
money = [1.0, 0.0]
bank  = [0.5, 0.5]
loan  = [0.8, 0.2]

The contextual embedding for bank becomes the weighted sum:

Text
0.51 * [1.0, 0.0]
+ 0.19 * [0.5, 0.5]
+ 0.30 * [0.8, 0.2]
= [0.845, 0.155]

That output leans strongly toward the first dimension, which we can imagine as the "financial" direction.

Sentence B

Text
river bank water

Now imagine the attention weights for bank are:

Text
[0.35, 0.15, 0.50]

and the value vectors are:

Text
river = [0.0, 1.0]
bank  = [0.5, 0.5]
water = [0.1, 0.9]

Then the new contextual embedding becomes:

Text
0.35 * [0.0, 1.0]
+ 0.15 * [0.5, 0.5]
+ 0.50 * [0.1, 0.9]
= [0.125, 0.875]

Now the vector leans strongly toward the second dimension, which we can imagine as the "river-side" direction.

The exact numbers in real transformers are much larger and learned automatically, but the logic is the same: same token, different context, different embedding.
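You can check both weighted sums with a few lines of numpy:

Python
import numpy as np

# Sentence A: "money bank loan"
weights_a = np.array([0.51, 0.19, 0.30])
values_a  = np.array([[1.0, 0.0],   # money
                      [0.5, 0.5],   # bank
                      [0.8, 0.2]])  # loan
print(weights_a @ values_a)  # [0.845 0.155] -> leans "financial"

# Sentence B: "river bank water"
weights_b = np.array([0.35, 0.15, 0.50])
values_b  = np.array([[0.0, 1.0],   # river
                      [0.5, 0.5],   # bank
                      [0.1, 0.9]])  # water
print(weights_b @ values_b)  # [0.125 0.875] -> leans "river-side"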

An important subtlety

Contextual embeddings are not usually created in a single dramatic step. They become richer layer by layer.

  • early layers often capture local and syntactic information
  • middle layers often capture phrase-level meaning
  • deeper layers often capture higher-level semantics and task-relevant structure

So if someone asks, "What is the contextual embedding of a word in a transformer?" the honest answer is often:

it is the hidden state of that token at a chosen layer, usually a later one or the final layer

Why this matters in practice

Contextual embeddings are one of the main reasons transformer models perform so well.

They help with:

  • search and retrieval: queries and documents can be compared in a context-aware way
  • translation: the same word can map to different outputs based on sentence meaning
  • question answering: the model can locate the relevant part of the passage
  • text generation: every next token prediction depends on contextual understanding
  • classification: sentiment, intent, and topic all depend on usage, not just raw words

This is also why modern NLP moved so aggressively from static embeddings to transformer-based representations.

Final takeaway

Static embeddings assign one learned vector to each token. They are useful, compact, and historically important, but they treat meaning as fixed.

Contextual embeddings do something much closer to how language actually works. A token begins with a base embedding, then the transformer updates it by looking at other tokens through self-attention:

Text
contextual embedding = softmax((QK^T) / sqrt(d_k)) V

plus multi-head attention, feed-forward layers, residual connections, and repeated stacking across layers.

That is why transformers use contextual embeddings: meaning is not stored in isolated words. Meaning emerges from relationships.