Frontier & Future AI


How AI Generates Text

When you type a question into an AI assistant and a well-structured paragraph appears in seconds, it can feel like magic. There is no magic — but the mechanism is genuinely fascinating. Language models generate text by answering one question repeatedly: given everything I have seen so far, what word (or word-piece) is most likely to come next? They answer that question thousands of times in rapid succession to build a sentence, a paragraph, or an entire essay.

Tokens: The Building Blocks of Language Models

Language models do not process text character by character or word by word in the everyday sense. They work with tokens: chunks of text that are usually a word, part of a word, or a punctuation mark. The word 'unbelievable' might be split into 'un', 'believ', and 'able'. The sentence 'The cat sat.' might be four tokens: 'The', ' cat', ' sat', and '.'. Tokenization lets the model handle rare words (by breaking them into familiar pieces), different languages (each with its own token patterns), and code (where symbols and keywords are often their own tokens). A large language model can process sequences of thousands of tokens at once, holding all of them in a working context, called the context window, as it decides which token to generate next.
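To make the splitting concrete, here is a minimal sketch of a tokenizer. It assumes a tiny hand-written vocabulary (`TOY_VOCAB`) and a simple greedy longest-match rule, both chosen only for illustration. Production tokenizers learn their vocabularies from data and use more elaborate algorithms, so treat this as a stand-in rather than how any particular model works.

```python
# Hypothetical toy vocabulary, invented for this example.
# Real tokenizers learn tens of thousands of pieces from training data.
TOY_VOCAB = {"un", "believ", "able", "The", " cat", " sat", "."}


def tokenize(text, vocab=TOY_VOCAB):
    """Split text by greedy longest match, falling back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece starting at position i first.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab:
                tokens.append(piece)
                i = end
                break
        else:
            # No known piece starts here, so emit one character as its own token.
            tokens.append(text[i])
            i += 1
    return tokens


print(tokenize("unbelievable"))
print(tokenize("The cat sat."))
```

With this vocabulary, 'unbelievable' splits into the three pieces described above and 'The cat sat.' into four. A word the vocabulary has never seen falls back to individual characters, which is a simplified version of how rare words end up broken into smaller familiar pieces.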

What Is a Token?

A token is the basic unit a language model reads and writes — typically a short word, a word fragment, or a punctuation symbol. Most English words are one or two tokens. Understanding tokens helps explain why AI sometimes makes strange errors at the level of syllables or unusual word endings.

Next-Token Prediction

The central operation of a language model is next-token prediction. Given a sequence of tokens, the model produces a probability distribution over its entire vocabulary (every possible next token) and then either samples a token from that distribution or selects the most likely one. For example, given the tokens 'The Eiffel Tower is located in', a well-trained model would assign a very high probability to 'Paris' and very low probabilities to 'elephants' or 'seventeen'. The model picks a token according to those probabilities, appends it to the sequence, and repeats the whole process for the next position. This continues until the model generates a special end-of-sequence token, reaches a length limit, or the system tells it to stop. A single answer to a complex question might require the model to run this prediction loop several hundred times.
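The prediction loop can be sketched in a few lines. The "model" below is a hypothetical lookup table (`TOY_MODEL`) with made-up probabilities, and it looks only at the most recent token. A real language model conditions on the whole sequence and computes its probabilities with a neural network, so this is an illustration of the loop, not of the model.

```python
import random

# Hypothetical hand-written "model": maps the last token to next-token probabilities.
# The tokens and numbers are invented for illustration only.
TOY_MODEL = {
    "The": {" Eiffel": 0.9, " cat": 0.1},
    " Eiffel": {" Tower": 1.0},
    " Tower": {" is": 1.0},
    " is": {" in": 1.0},
    " in": {" Paris": 0.97, " London": 0.03},
    " Paris": {"<eos>": 1.0},
    " London": {"<eos>": 1.0},
    " cat": {"<eos>": 1.0},
}


def generate(prompt_tokens, max_new_tokens=20, rng=None):
    """Repeatedly sample a next token and append it until a stop condition."""
    rng = rng or random.Random(0)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):          # stop condition: length limit
        distribution = TOY_MODEL[tokens[-1]]
        candidates = list(distribution)
        weights = list(distribution.values())
        next_token = rng.choices(candidates, weights=weights, k=1)[0]
        if next_token == "<eos>":            # stop condition: end-of-sequence token
            break
        tokens.append(next_token)
    return tokens


print("".join(generate(["The"])))
```

Each pass through the loop is one prediction step: look up a distribution, pick a token according to it, and append the result.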

A key parameter called temperature controls how the model samples. A temperature near zero makes the model pick the highest-probability token almost every time, producing predictable and sometimes repetitive text. A higher temperature introduces more randomness: the model picks less-likely tokens more often, producing more surprising or creative outputs. Creative writing tasks usually use higher temperatures; factual Q&A tasks usually use lower ones.
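One common way to implement temperature is to divide the model's raw scores (often called logits) by the temperature before converting them to probabilities with a softmax. The sketch below assumes that approach. The candidate words and scores are invented for illustration, and the function name `apply_temperature` is hypothetical.

```python
import math


def apply_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [score / temperature for score in logits]
    largest = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next words.
words = ["tea", "coffee", "cocoa"]
logits = [3.0, 2.5, 1.0]

for temperature in (0.1, 1.0, 10.0):
    probs = apply_temperature(logits, temperature)
    summary = ", ".join(f"{w}: {p:.3f}" for w, p in zip(words, probs))
    print(f"temperature={temperature}: {summary}")
```

At a low temperature nearly all of the probability lands on the top-scoring word, while at a high temperature the three probabilities move toward being equal. The ranking of the words never changes, only how strongly the model favors the leader.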

Temperature and Creativity

Temperature is a dial that trades predictability for creativity. Low temperature: safe, reliable, sometimes boring. High temperature: surprising and varied, sometimes incoherent. Most AI tools set this automatically depending on the task.

The Transformer Architecture

Almost every major language model today is built on an architecture called the Transformer, introduced in a 2017 research paper titled 'Attention Is All You Need'. The breakthrough idea of the Transformer is a mechanism called self-attention. Self-attention allows the model to look at every other token in the context window when deciding what to generate next — not just the most recent few. When processing the word 'it' in the sentence 'The trophy did not fit in the suitcase because it was too big', self-attention lets the model figure out that 'it' refers to 'trophy' by weighing the relationship between all the words simultaneously. This ability to model long-range relationships in text is what lets Transformers generate coherent multi-paragraph responses, follow complex instructions, and maintain context across a long conversation.
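The weighing step can be sketched as scaled dot-product attention, the core calculation described in the 2017 paper. This is a heavily simplified illustration: it uses a single attention head, skips the learned projection matrices a real Transformer applies, and the two-number vectors for each word are invented so that 'it' happens to resemble 'trophy'. The `causal` flag reflects the assumption that a text generator only lets each token look at itself and earlier tokens.

```python
import math


def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors, causal=True):
    """Single-head attention where queries, keys, and values are the input vectors."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for i, query in enumerate(vectors):
        # With causal=True, a token sees only itself and the tokens before it.
        visible = vectors[: i + 1] if causal else vectors
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        weights = softmax(scores)
        # The output is a weighted blend of the visible vectors.
        mixed = [sum(w * v[d] for w, v in zip(weights, visible)) for d in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical toy vectors, hand-picked for illustration.
TOKENS = ["trophy", "suitcase", "it"]
VECTORS = [[1.0, 0.2], [0.1, 1.0], [0.9, 0.3]]

outputs, weights = self_attention(VECTORS)
for token, weight in zip(TOKENS, weights[2]):
    print(f"'it' attends to '{token}' with weight {weight:.3f}")
```

In this toy setup, 'it' gives more weight to 'trophy' than to 'suitcase' because their vectors point in similar directions. In a trained model, those vectors and the projections applied to them are learned from data rather than chosen by hand.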

Match each text generation concept to its accurate description.

Terms

Token
Next-token prediction
Temperature
Self-attention
Context window

Definitions

The core operation: assigning probabilities to every possible next word-piece and selecting one
A parameter that controls how much randomness is introduced when selecting the next token
The basic unit a language model reads and writes, typically a word or word fragment
A mechanism that lets the model weigh the relationship between every token in the context simultaneously
The maximum number of tokens a model can consider at once when generating a response


Training a Language Model

A language model is trained on a massive corpus of text (web pages, books, code repositories, scientific papers, and more), often totaling trillions of tokens. During training, the model is given a sequence of tokens with the next one hidden, and it tries to predict that hidden token. It compares its prediction to the actual token, measures the error (called the loss), and adjusts its billions of internal parameters to do better next time. This process, called next-token prediction pretraining, works through the corpus using enormous clusters of specialized computer chips; because the corpus is so large, models often see most of it only once or a few times. After pretraining, many models go through additional phases, such as fine-tuning and reinforcement learning from human feedback (RLHF), that teach the model to follow instructions, be helpful, and avoid harmful outputs. The base model learns what language looks like; fine-tuning teaches it how a useful assistant should behave.
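The predict, measure, adjust cycle can be sketched with a three-word vocabulary. The sketch assumes the loss is cross-entropy (the negative log of the probability given to the correct token) and, to stay short, it nudges the raw scores directly with plain gradient descent. A real model instead adjusts the parameters of a neural network that produces those scores, and the vocabulary and learning rate here are made up.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Loss is low when the correct token gets high probability."""
    return -math.log(softmax(logits)[target_index])


def training_step(logits, target_index, learning_rate=0.5):
    """One gradient-descent update; the gradient is (probabilities - one-hot target)."""
    probs = softmax(logits)
    grads = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    return [x - learning_rate * g for x, g in zip(logits, grads)]


# Hypothetical three-token vocabulary; the hidden token is 'Paris'.
vocab = ["Paris", "elephants", "seventeen"]
target = vocab.index("Paris")
logits = [0.0, 0.0, 0.0]  # the untrained model treats every token as equally likely

for step in range(5):
    print(f"step {step}: loss = {cross_entropy(logits, target):.3f}")
    logits = training_step(logits, target)
```

The printed loss falls with each step as the score for 'Paris' rises relative to the others. Pretraining repeats this kind of update across an enormous number of examples.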

Pretraining vs. Fine-Tuning

Pretraining teaches a model the patterns of human language from trillions of tokens. Fine-tuning (including RLHF) shapes the model's behavior to be a helpful, honest, and safe assistant. Both stages are essential for the AI tools people use daily.

Complete the sentence about how language models generate text.

A language model generates text by repeatedly predicting the most likely next ____, using a mechanism called ____ to weigh relationships between all tokens in the ____.

What does a language model actually do at each step of text generation?

What is the advantage of the Transformer's self-attention mechanism over simpler text processing approaches?

Be the Language Model

  1. Read this sentence fragment and write down the five words you think are most likely to complete it, in order from most to least likely: 'Every morning she made herself a cup of hot ____.'
  2. Assign rough percentages to each of your five candidates (they should add up to 100 or close to it).
  3. Now imagine you set the temperature very low, so you always pick the most likely word. Write the resulting completed sentence.
  4. Imagine you set the temperature very high, so you pick randomly from your top five words with equal probability. Write three different completed sentences this way.
  5. Reflect: which temperature setting produces more interesting sentences? Which is more reliable? Write two sentences connecting this to why different AI tasks use different temperature settings.