Transformer Self-Attention Space

Contextual Homonym Resolution in $\mathbb{R}^2$
Task Description: Self-attention measures how strongly each word's query vector aligns in direction with every word's key vector, and uses those alignments to build context. Drag the circular Query handles on the vector plot to change the hidden state $\mathbf{x}_i$ of each word. Note the homonyms: "Can (1)" (verb) and "can (6)" (noun) share the same spelling, yet each needs a completely different attention distribution over tokens such as "pick" and "up" to resolve its semantic role.
$$\text{Score}(i,j) = \frac{\mathbf{q}_i \mathbf{k}_j^T}{\sqrt{d_k}}, \quad A_{i,j} = \frac{e^{\text{Score}(i,j)}}{\sum_{m} e^{\text{Score}(i,m)}}$$
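The two formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the page's actual implementation: the function name `attention_matrix`, the three-token list, and the 2-D vectors are all hypothetical, and the sketch assumes queries and keys are simply the raw hidden states $\mathbf{x}_i$ with no learned projections.

```python
import math

def attention_matrix(queries, keys):
    """Row-wise softmax of scaled dot-product scores.

    queries, keys: lists of equal-length vectors (lists of floats).
    Returns A, where A[i][j] is the weight token i places on token j.
    """
    d_k = len(keys[0])
    scale = math.sqrt(d_k)
    A = []
    for q in queries:
        # Score(i, j) = (q_i . k_j) / sqrt(d_k)
        scores = [sum(qc * kc for qc, kc in zip(q, k)) / scale for k in keys]
        # Subtracting the row maximum leaves the softmax unchanged
        # but avoids overflow when scores are large.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        A.append([e / total for e in exps])
    return A

# Hypothetical 2-D hidden states for three tokens (illustrative values only).
# Assumption: q_i = k_i = x_i, i.e. no learned query/key projections.
tokens = ["Can", "pick", "can"]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

A = attention_matrix(x, x)
for tok, row in zip(tokens, A):
    print(tok, [round(a, 3) for a in row])
```

Each printed row is a probability distribution over the tokens, so its entries sum to 1. Changing a vector in `x` plays the same role as dragging a handle on the plot.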

Interactive Embedding Space

Drag colored nodes to rotate word vectors.
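To see why rotation matters, the sketch below rotates a query vector in $\mathbb{R}^2$ and recomputes its scaled dot-product score against a fixed key. The helper names `rotate` and `score` and the unit vectors are assumptions made for illustration. For unit-length vectors the score reduces to $\cos\theta / \sqrt{d_k}$, where $\theta$ is the angle between query and key.

```python
import math

def rotate(v, theta):
    """Rotate a 2-D vector counterclockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def score(q, k):
    """Scaled dot-product score for 2-D vectors (d_k = 2)."""
    return (q[0] * k[0] + q[1] * k[1]) / math.sqrt(2)

# Hypothetical unit vectors; the key stays fixed while the query rotates.
key = [1.0, 0.0]
query = [1.0, 0.0]

for deg in (0, 45, 90, 180):
    q = rotate(query, math.radians(deg))
    print(f"{deg:>3} degrees -> score {score(q, key):+.3f}")
```

The score is largest when the query points along the key, falls to zero when the two are perpendicular, and turns negative once they point in opposing directions.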

Attention Map Matrix ($A_{i,j}$)

Hover over a cell to examine the explicit softmax normalization steps behind its value.
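As a worked illustration of those normalization steps, take one row $i$ with three hypothetical scores, $\text{Score}(i,1) = 1$, $\text{Score}(i,2) = 0$, and $\text{Score}(i,3) = -1$. These values are chosen only for easy arithmetic and are not taken from the interactive plot. Exponentiate each score, sum the results, and divide:

$$e^{1} \approx 2.7183, \quad e^{0} = 1, \quad e^{-1} \approx 0.3679, \quad \sum_{m} e^{\text{Score}(i,m)} \approx 4.0862$$

$$A_{i,1} \approx \frac{2.7183}{4.0862} \approx 0.665, \quad A_{i,2} \approx \frac{1}{4.0862} \approx 0.245, \quad A_{i,3} \approx \frac{0.3679}{4.0862} \approx 0.090$$

The three weights sum to 1, and the ordering of the scores is preserved in the ordering of the weights.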