Text Tokenization Laboratory

Discrete Sequence Decomposition and Vocabulary Mapping
Task Description: Tokenization transforms a raw text string into an ordered sequence of discrete indices, each of which selects a row of a neural embedding layer. Modify the input text string below and switch between the Word, Character, and Subword (Byte-Pair Encoding proxy) tokenizers to observe how the sequence length and the vocabulary footprint change.
Metric Attribute | Computed Target Value
Sequence Token Length ($N$) | 0
Unique Vocabulary Count ($|\mathcal{V}_{\text{local}}|$) | 0
Average Token Length (chars) | 0.0
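
The three metrics in the table can be reproduced offline with a minimal sketch like the one below. The function names `tokenize` and `token_metrics`, the `piece_len` parameter, and the splitting rules are assumptions made for illustration, not the lab's actual implementation. In particular, the "subword" branch is only a crude stand-in that cuts words into fixed-size pieces; it does not learn merges the way real Byte-Pair Encoding does.

```python
import re


def tokenize(text, strategy="word", piece_len=3):
    """Split text into tokens using one of three illustrative strategies."""
    if strategy == "char":
        # Every character, including whitespace, becomes a token.
        return list(text)
    # Words and standalone punctuation marks; whitespace is dropped.
    words = re.findall(r"\w+|[^\w\s]", text)
    if strategy == "word":
        return words
    if strategy == "subword":
        # Hypothetical BPE proxy: cut each word into pieces of at most
        # piece_len characters. Real BPE learns its merges from data.
        return [w[i:i + piece_len] for w in words for i in range(0, len(w), piece_len)]
    raise ValueError(f"unknown strategy: {strategy!r}")


def token_metrics(tokens):
    """Return sequence length N, local vocabulary size, and mean token length."""
    n = len(tokens)
    avg_len = sum(len(t) for t in tokens) / n if n else 0.0
    return {"N": n, "vocab_size": len(set(tokens)), "avg_len": round(avg_len, 1)}


sample = "the cat sat on the mat"
for strategy in ("word", "char", "subword"):
    print(strategy, token_metrics(tokenize(sample, strategy)))
```

With an empty input string all three metrics are zero, which matches the initial values shown in the table. Character tokenization gives the longest sequence and the smallest vocabulary; word tokenization does the opposite, and subword schemes sit in between.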
Strategy:
$$\mathbf{x} = [t_1, t_2, \dots, t_N], \quad t_i \in \mathcal{V}$$
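
To connect the sequence $\mathbf{x}$ to the embedding indices mentioned in the task description, the following sketch builds a local vocabulary from the tokens and encodes the sequence as integers. The helpers `build_vocab` and `encode` are hypothetical names, and assigning indices in order of first appearance is an assumption; real tokenizers typically ship a fixed, pretrained vocabulary.

```python
def build_vocab(tokens):
    """Map each distinct token to an integer index, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace each token t_i with its index in the vocabulary."""
    return [vocab[tok] for tok in tokens]


tokens = "the cat sat on the mat".split()
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)
print(vocab)  # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(ids)    # [0, 1, 2, 3, 0, 4]
```

Here $N = 6$ and the local vocabulary has 5 entries, because "the" occurs twice and maps to the same index both times.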