AI Foundations

Tokens and Tokenization

You have just learned that a language model predicts the next token given preceding tokens. But what exactly is a token? It is not simply a word. Modern language models break text into subword units — pieces that are sometimes full words, sometimes word fragments, and sometimes single characters. This design choice, called tokenization, is not arbitrary. It solves a real engineering problem, and understanding it will help you reason clearly about both the capabilities and the odd failures of LLMs.

The Problem Tokenization Solves

Imagine you wanted to build a language model that worked at the word level, with each word in the vocabulary as one unit. English has hundreds of thousands of words, and any real corpus also contains names, technical terms, misspellings, URLs, and code. A purely word-level model faces a choice: include only the most common words (leaving much real text unrepresentable) or build an enormous vocabulary (making the model slow and memory-hungry). Worse, word-level models cannot handle words they have never seen: a new medication name or a newly coined slang term simply does not exist in the vocabulary.

At the other extreme, a character-level model that treats each letter as a token can represent any text, but it needs many more prediction steps to generate one sentence (several times as many as a word-level model), and each character carries very little information about meaning.

Subword tokenization is the compromise: split text into fragments that are common enough to be reused efficiently, but small enough that unusual words can be built from familiar pieces. The word 'unhelpfulness,' for example, might be tokenized as 'un,' 'help,' 'ful,' 'ness': four pieces, each of which appears in many other words.
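The idea of building an unusual word from familiar pieces can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the vocabulary `TOY_VOCAB` and the function `segment` are hypothetical, and the greedy longest-match strategy is a simplification chosen for clarity, not the BPE procedure real models use.

```python
# Hypothetical toy vocabulary: a few subword pieces plus single letters as a fallback.
TOY_VOCAB = {"un", "help", "ful", "ness", "kind"} | set("abcdefghijklmnopqrstuvwxyz")

def segment(word, vocab=TOY_VOCAB):
    """Split a word by greedy longest match (an illustration, not real BPE)."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until a known piece is found.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            raise ValueError(f"cannot represent {word[start]!r}")
    return pieces

print(segment("unhelpfulness"))  # ['un', 'help', 'ful', 'ness']
print(segment("unkindness"))     # ['un', 'kind', 'ness']
print(segment("zorb"))           # ['z', 'o', 'r', 'b']
```

The last call shows the fallback: a made-up word with no familiar pieces is still representable, just one character at a time.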

Subword Tokenization

Modern LLMs use subword tokenization algorithms such as Byte-Pair Encoding (BPE) or WordPiece. These algorithms analyze a training corpus and build a vocabulary, typically tens of thousands to a few hundred thousand subword units, that can represent any text, including unseen words, by decomposing them into known fragments.

Here is a concrete walk-through using Byte-Pair Encoding (BPE), the algorithm used by GPT-family models. BPE starts with a vocabulary of individual characters (or bytes). It then repeatedly finds the most frequent pair of adjacent symbols in the training corpus and merges them into a new symbol. After enough merges, typically tens of thousands for production models, the vocabulary contains common words as single units, common word pieces, and the original characters as a fallback.

Example tokenization (approximate GPT-4 tokenization):

  'Hello, world!' -> ['Hello', ',', ' world', '!']
  'tokenization' -> ['token', 'ization']
  'ChatGPT' -> ['Chat', 'G', 'PT']
  'antidisestablishmentarianism' -> ['anti', 'dis', 'est', 'ablish', 'ment', 'arian', 'ism']

Notice three things. Spaces are often attached to the following word as a leading space character (' world' rather than 'world'). Punctuation is usually its own token. Capitalization matters: 'hello' and 'Hello' may be different tokens.

Each token is then mapped to an integer ID. 'Hello' might be token 9906; ',' might be token 11; ' world' might be token 1917. The model never processes the raw characters. It processes these integer IDs, and its first layer converts them into vectors (numerical lists) before any further computation begins.
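The merge loop described above can be sketched as a toy trainer. This is a simplified illustration, not a production tokenizer: the function names (`count_pairs`, `merge_pair`, `train_bpe`) and the tiny corpus are made up for this example, words are split on whitespace so the leading-space convention is ignored, and ties are broken by first occurrence, which is an assumption (real implementations define their own rule).

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs across all words (each word is a list of symbols)."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in one word with the merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split toy corpus."""
    words = [list(word) for word in corpus.split()]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break  # every word is already a single symbol
        # Assumption: ties go to the pair seen first in the corpus.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = [merge_pair(symbols, best) for symbols in words]
    return merges, words

merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(words)   # [['low'], ['low'], ['low'], ['low', 'e', 'r'], ['low', 'e', 's', 't']]
```

After only two merges the frequent word 'low' has become a single unit, while the rarer endings 'er' and 'est' are still spelled out character by character.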

Complete the statements about tokenization.

BPE tokenization starts with characters and repeatedly merges the most ______ pair of symbols to build a vocabulary of subword units.

Why Tokenization Shapes Model Behavior

Tokenization is not merely an implementation detail. It has visible consequences for how models behave.

Arithmetic and spelling: Models often struggle with character-level tasks (counting letters, spelling unusual words) because they do not see characters directly; they see tokens. Asking a GPT-family model 'How many r's are in strawberry?' is tricky because 'strawberry' may be tokenized as 'straw' and 'berry,' hiding the individual letters. Numbers are often split into multi-digit chunks in the same way, which can make digit-by-digit arithmetic harder.

Language coverage: Languages written in non-Latin scripts (Chinese, Japanese, Arabic) or with highly inflected morphology (Turkish, Finnish) tend to need more tokens than English to express the same content. This makes these languages more expensive to process and sometimes reduces model quality compared to English.

Context window: The context window, the maximum amount of text a model can consider at once, is measured in tokens, not words or characters. A model with a 128,000-token context window can consider roughly 90,000-100,000 words of English, but fewer words of a morphologically rich language.

Prompt cost: APIs typically charge per token. Verbose prompts and outputs cost more than concise ones. Understanding tokenization helps you write efficient prompts.
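The context-window and cost arithmetic can be sketched as a rough estimator. Everything here is an assumption for illustration: the 0.75 words-per-token ratio is a rule of thumb consistent with the English figures above, the price is a made-up number, and the function names are hypothetical. For real budgeting, count tokens with the tokenizer of the model you are using.

```python
WORDS_PER_TOKEN = 0.75  # assumed rule of thumb for English; other languages differ

def estimate_tokens(word_count, words_per_token=WORDS_PER_TOKEN):
    """Rough token estimate from a word count."""
    return round(word_count / words_per_token)

def fits_in_context(token_count, context_window=128_000):
    """Check whether a token count fits within a given context window."""
    return token_count <= context_window

def estimate_cost(token_count, price_per_million_tokens):
    """Cost in currency units at a (hypothetical) per-million-token price."""
    return token_count * price_per_million_tokens / 1_000_000

tokens = estimate_tokens(90_000)
print(tokens)                   # 120000
print(fits_in_context(tokens))  # True
print(estimate_cost(tokens, price_per_million_tokens=2.0))  # 0.24
```

Lowering `words_per_token` to model a language that needs more tokens per word shows how the same document can exceed the window or cost more.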

The Spelling Illusion

Because LLMs operate on tokens — not characters — they do not inherently 'see' the letters inside a word. Tasks that seem trivial to a human (count the vowels, reverse the spelling, detect rhymes) can be surprisingly difficult for an LLM, because the model must reason about character structure from token-level representations.

Why do modern LLMs use subword tokenization rather than whole-word tokenization?

A student asks an LLM to count the number of letters in the word 'orange.' The model answers incorrectly. Which explanation is most consistent with what you know about tokenization?

Tokenize by Hand

Use the BPE logic to tokenize a short phrase step by step.

Phrase: 'rethinking learning'

  1. Write out each character as a separate unit: r-e-t-h-i-n-k-i-n-g (space) l-e-a-r-n-i-n-g
  2. Identify the most frequent character pair in this phrase. Count every adjacent pair across both words. Which pair appears most often?
  3. Merge that pair into a new unit. Rewrite the sequence with the merged unit.
  4. Repeat steps 2-3 two more times.
  5. Compare your result with a classmate. Did you get the same tokenization?
  6. Discuss: how would your result change if you had started with a much larger corpus, say a million words of English text? Which pairs would be merged earliest? What does that tell you about which subword units end up in a real model's vocabulary?
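Once you have tried the exercise by hand, a short script can check your pair counts for the first round. This is a minimal sketch that assumes pairs do not cross the space between the two words; the variable names are illustrative.

```python
from collections import Counter

phrase = "rethinking learning"

# Count adjacent character pairs inside each word (pairs never span the space).
pair_counts = Counter()
for word in phrase.split():
    for left, right in zip(word, word[1:]):
        pair_counts[left + right] += 1

# Show the pairs that occur more than once.
for pair, count in pair_counts.items():
    if count > 1:
        print(pair, count)
```

Most pairs occur only once in a phrase this short, so later merges often come down to how ties are broken, which is one reason two people can end up with different tokenizations.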