How Text Generators Work
If you had to bet on what word comes after 'The cat sat on the,' you would say 'mat' or 'floor' or 'couch', not 'democracy' or 'seventeen.' You do this effortlessly because you have absorbed enormous amounts of language since before you can remember. You know which words fit, which sound natural, which carry the right meaning in context. Large language models learn a very similar skill, at a scale no human could match. They are trained on hundreds of billions of words or more: books, articles, websites, code repositories, scientific papers, conversations, and other text. From that training, they build a statistical model of language detailed enough to estimate, for any stretch of text, which piece of text is likely to come next. Then they apply that prediction repeatedly to generate entire sentences, paragraphs, and documents. That is the core of how text generation works.
Next-Token Prediction
A language model does not write a sentence the way you might, by thinking of the whole idea first and then choosing words. It works one unit at a time, where each unit is called a token (we will explore tokens deeply in the next lesson). Here is the process in slow motion. Suppose you ask the model: 'Explain gravity in one sentence.' The model takes that input and, in effect, asks: given everything I have learned, which tokens are likely to come next? It calculates a probability distribution over every token in its vocabulary, typically tens of thousands of options or more, and selects one based on those probabilities. Let's say it picks 'Gravity.' Now it has a new, longer input: 'Explain gravity in one sentence. Gravity' It runs the same process again and picks 'is.' Then 'a.' Then 'force.' And so on, until it produces a special stop token or reaches a length limit. The output feels fluent and coherent because the model has learned, from billions of examples, how human language flows. Every token choice is informed by the tokens that came before it in the conversation, up to the limit of how much text the model can consider at once.
A language model generates text by predicting one token at a time. At each step, it considers the entire prior context — your input plus everything it has already written — and selects the next token based on learned probabilities. It repeats this until the response is complete.
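The loop described above can be sketched in a few lines of Python. This is a toy illustration, not how a real model is built: TOY_MODEL is a hypothetical, hand-written probability table that looks only at the most recent token, whereas a real model computes its probabilities with a neural network over the whole context. The names next_token_distribution and generate are also made up for this sketch.

```python
import random

# Hypothetical hand-written "model": for each most-recent token, the
# probabilities of each possible next token. Invented for illustration.
TOY_MODEL = {
    "<start>": {"Gravity": 1.0},
    "Gravity": {"is": 0.9, "pulls": 0.1},
    "is": {"a": 0.8, "the": 0.2},
    "a": {"force": 1.0},
    "the": {"force": 1.0},
    "force": {"<stop>": 1.0},
    "pulls": {"objects": 1.0},
    "objects": {"together": 1.0},
    "together": {"<stop>": 1.0},
}


def next_token_distribution(context):
    """Return next-token probabilities for the tokens generated so far.

    Toy simplification: only the last token is used. A real model
    conditions on the entire context.
    """
    return TOY_MODEL[context[-1]]


def generate(rng, max_tokens=20):
    """Generate tokens one at a time until a stop token or length limit."""
    context = ["<start>"]
    for _ in range(max_tokens):
        dist = next_token_distribution(context)
        tokens = list(dist.keys())
        weights = list(dist.values())
        # Sample one token in proportion to its probability.
        token = rng.choices(tokens, weights=weights, k=1)[0]
        if token == "<stop>":
            break
        # The chosen token becomes part of the input for the next step.
        context.append(token)
    return context[1:]  # drop the <start> marker


if __name__ == "__main__":
    print(" ".join(generate(random.Random())))
```

Running it several times will usually print 'Gravity is a force' but occasionally one of the less likely sentences, because each step samples from a distribution rather than looking up a fixed answer.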
There is an important subtlety here: the model does not always pick the highest-probability token. Doing so tends to produce repetitive, boring output. Instead, a parameter called temperature controls how much randomness is mixed into the selection. Low temperature sharpens the probability distribution, so the model mostly picks the most probable token: safe, predictable, less creative. High temperature flattens it, so the model sometimes picks surprising, lower-probability tokens: riskier, but potentially more inventive. This is why asking the same question twice can yield different answers. The model is sampling from a probability distribution, not looking up a stored response. Each run is a fresh generation.
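One common way to implement temperature is to divide the model's raw scores by the temperature before converting them to probabilities with a softmax. The sketch below assumes that approach. The scores in LOGITS are hypothetical numbers invented for the 'The cat sat on the' example, and apply_temperature and sample are illustrative names, not functions from any particular library.

```python
import math
import random

# Hypothetical raw scores for the token after "The cat sat on the".
# Higher means the model considers the token more likely.
LOGITS = {"mat": 4.0, "floor": 3.0, "couch": 2.5, "democracy": -2.0}


def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities using a temperature-scaled softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits.values()]
    peak = max(scaled)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {token: e / total for token, e in zip(logits, exps)}


def sample(logits, temperature, rng):
    """Pick one token at random, weighted by its temperature-adjusted probability."""
    probs = apply_temperature(logits, temperature)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


if __name__ == "__main__":
    rng = random.Random()
    for temperature in (0.2, 1.0, 2.0):
        probs = apply_temperature(LOGITS, temperature)
        picks = [sample(LOGITS, temperature, rng) for _ in range(10)]
        print(temperature, {t: round(p, 3) for t, p in probs.items()}, picks)
```

At a temperature of 0.2 nearly all of the probability lands on 'mat'. At 2.0 the distribution is much flatter, so 'floor' and 'couch' appear often and even 'democracy' has a small chance of being picked.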
What Training Actually Taught the Model
When we say a model was 'trained on billions of words,' what exactly did it learn? It learned to predict the next piece of text so accurately that it had to build sophisticated internal representations of the patterns in that text, many of which reflect facts about the world. To predict whether the next word after 'the Eiffel Tower is located in' is 'Paris' rather than 'London,' you need to know something about geography. To predict the next line of a Python function, you need to capture something of programming logic. Researchers found that sufficiently large models, trained on enough data, develop emergent capabilities: abilities they were not explicitly trained for. They can translate between languages, solve math word problems, write code, explain analogies, and much more. These abilities were not programmed in; they emerged from the structure of language itself, absorbed at scale. This is both the power and the mystery of large language models. Their capabilities are not fully understood even by the researchers who built them.
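To make 'learning to predict the next word' concrete, here is a deliberately tiny sketch that learns next-word probabilities by counting word pairs. It is an illustrative stand-in only: real language models learn with neural networks, not count tables, and the three-sentence CORPUS and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical three-sentence "training set", invented for illustration.
CORPUS = [
    "the eiffel tower is located in paris",
    "the louvre is located in paris",
    "big ben is located in london",
]


def train_bigram_counts(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probabilities(counts, prev):
    """Convert the counts for one preceding word into probabilities."""
    total = sum(counts[prev].values())
    return {word: n / total for word, n in counts[prev].items()}


if __name__ == "__main__":
    counts = train_bigram_counts(CORPUS)
    # After "in", the table gives paris 2/3 and london 1/3.
    print(next_word_probabilities(counts, "in"))
```

Because this toy looks only at the single preceding word, it gives the same odds after 'in' whether the sentence was about the Eiffel Tower or Big Ben. Getting 'Paris' right for the Eiffel Tower requires using the rest of the context, which is the kind of pressure that pushes a large model toward richer internal representations.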
It can feel like a language model is looking things up, but it is not consulting a database. It is generating text that matches the statistical patterns of its training data. This means it can produce confident-sounding text that is factually wrong — a phenomenon called hallucination. We will cover this in Lesson 8.
Why does asking a language model the same question twice sometimes give different answers?
What does a language model fundamentally learn during training?
Be the Language Model
- Write this sentence starter on a piece of paper: 'The best thing about summer is'
- Without thinking too hard, quickly write the first five completions that come to mind.
- Now do the same thing but pretend you are being very 'safe' and conventional — pick the most predictable endings.
- Then try again, being more creative or unusual.
- You just experienced the difference between low-temperature and high-temperature generation.
- Discuss: when would you want a language model to be predictable? When would you want it to be creative?