Self-Attention Mechanism

Task Description: Self-attention scores how closely each word's embedding aligns in direction with every other word's embedding, and uses those scores to build context. Drag the circular Query handles on the vector plot to change the hidden state $\mathbf{x}_i$ of each word. Note the homonyms "Can (1)" (verb) and "can (6)" (noun): they are spelled identically apart from capitalization, yet each needs a different attention distribution over tokens such as "pick" and "up" to resolve its semantic role.
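
To make the effect of dragging a handle concrete, here is a minimal NumPy sketch of how an attention map $A_{i,j}$ could be computed from the hidden states. It rests on several assumptions that the demo itself does not confirm: scores are scaled dot products passed through a row-wise softmax, each $\mathbf{x}_i$ serves directly as both query and key (no learned projection matrices), and the six-token sentence and 2-D coordinates are made up for illustration. The function name `attention_map` is likewise hypothetical.

```python
import numpy as np

def attention_map(X):
    """Return A, where A[i, j] is the weight token i places on token j.

    Assumption: each hidden state x_i is used as both query and key,
    with no learned projections.
    """
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                   # scaled dot products
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)  # each row sums to 1

# Illustrative sentence and made-up 2-D hidden states (not taken from the demo).
tokens = ["Can", "you", "pick", "up", "the", "can"]
X = np.array([
    [1.0, 0.0],   # Can (1)
    [0.0, 1.0],   # you
    [1.0, 1.0],   # pick
    [0.5, 1.0],   # up
    [-1.0, 0.0],  # the
    [1.0, 0.0],   # can (6): same vector as "Can (1)"
])

# With identical hidden states, both homonyms get the same attention row.
A_same = attention_map(X)

# "Dragging the handle" of can (6) to a new direction separates the two rows.
X_moved = X.copy()
X_moved[5] = [0.0, -1.0]
A_moved = attention_map(X_moved)

print(np.round(A_same, 3))
print(np.round(A_moved, 3))
```

In this sketch, rows 1 and 6 of `A_same` coincide because a row depends only on that token's own query vector and the shared set of keys. Moving one hidden state is what lets the two homonyms attend differently to "pick" and "up".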

Transformer Self-Attention Space

Interactive Embedding Space

Attention Map Matrix ($A_{i,j}$)