What LLMs Cannot Do
Intellectual honesty requires equal rigor when discussing both capabilities and limits. This lesson is dedicated entirely to what large language models genuinely cannot do — and to why these are not temporary engineering bugs but structural properties of the technology as currently built. Some limits will be partially addressed by future improvements; others are intrinsic to the statistical learning paradigm. A user who understands these limits makes far better decisions about when and how to trust LLM output.
Hallucination: Confident Confabulation
The most important limitation of LLMs is hallucination: the generation of text that is fluent, specific, confident-sounding, and false. The term is borrowed from psychology (where it refers to perception without a corresponding external stimulus) and captures something real about the phenomenon: the model is producing output that feels real and is not.
Hallucination is not a special failure mode that occurs only when the model is 'confused.' It is a consequence of how LLMs work. A language model produces text that is statistically plausible given its context. Plausibility and truth are correlated (true statements appear frequently in text), but they are not identical. The model has no external verification mechanism. It cannot check whether the citation it just generated exists, whether the statistic it quoted is accurate, or whether the historical date it stated is correct. It simply predicts what text is likely.
Common patterns of hallucination:
- Fabricated citations: The model generates the name of a real-seeming paper with plausible-sounding authors and a journal that exists, but the specific paper does not.
- False statistics: Numerical claims that are specific enough to sound authoritative but were never established in the training data.
- Invented details: When asked about a real person or event, the model fills gaps in its knowledge with plausible-sounding details it has generated.
- Wrong answers delivered confidently: On factual questions with specific correct answers, the model may state an incorrect answer with the same tone and certainty it would use for a correct one.
The rate of hallucination varies significantly by task, domain, and prompt structure. Questions about widely documented, frequently discussed topics in the training data tend to elicit more reliable responses. Questions about niche topics, very recent events, or very specific details are higher risk.
An LLM that produces fluent, confident, specific text has not thereby produced accurate text. The model's training optimizes for predicting plausible text — not for factual truth. Every specific factual claim in LLM output — especially citations, statistics, dates, names, and numbers — should be independently verified before being relied upon.
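One way to act on this advice is to mark the spans in a piece of LLM output that need independent checking. The following is a minimal sketch, not a reliable detector: the names `PATTERNS` and `flag_claims` are hypothetical, and the regular expressions are rough heuristics that will miss many claims and flag some harmless text. It only points a reader at what to verify; it does not verify anything itself.

```python
import re

# Hypothetical, deliberately rough patterns for claim types that
# should be verified: years, percentages, and author-year citations.
PATTERNS = {
    "year": re.compile(r"\b(?:1[5-9]|20)\d{2}\b"),
    "percentage": re.compile(r"\b\d+(?:\.\d+)?\s?%"),
    "citation": re.compile(r"\b[A-Z][a-z]+(?: et al\.)?,? \(?\d{4}\)?"),
}


def flag_claims(text):
    """Return (kind, matched_text) pairs for spans a human should check."""
    flagged = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            flagged.append((kind, match.group(0)))
    return flagged


sample = "Smith et al. (2019) found that 42% of users agreed."
for kind, span in flag_claims(sample):
    print(f"verify {kind}: {span}")
```

A span can be flagged more than once (here the year is matched on its own and again inside the citation). That is acceptable for a checklist, since each flag is a prompt to look something up in an external source.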
No grounded access to the external world: A base LLM has no ability to access the internet, query a database, run code, or check external sources during inference. Everything it produces comes from patterns in its training data and the information in its current context window. Some systems augment LLMs with tools (web search, code execution, database queries), but these are additions to the architecture, not properties of the LLM itself.
No real-time information: Because LLM parameters are fixed at training time, the model cannot know about events after its training cutoff. This is not a minor inconvenience: in rapidly changing domains such as finance, medicine, current events, or technology, information that is eighteen months old can be substantially wrong.
Limited exact arithmetic: LLMs are not calculators. They can approximate arithmetic, especially for small numbers that appear frequently in training text, but they make errors on multi-step calculations, large numbers, and precise decimal arithmetic. A model that writes flawlessly about calculus may still miscompute 17 times 34. For arithmetic, use a calculator.
Context window limits: The model can only process and attend to text within its context window. Information outside that window (including its own earlier outputs in a long conversation) is unavailable. Very long conversations or documents can exceed this limit, causing the model to lose earlier context.
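The advice to use a calculator can be made concrete with a few lines of Python, which performs exact integer arithmetic. This is a small sketch: `check_product` is a hypothetical helper name, and the value 568 stands in for an invented wrong answer of the kind a model might produce.

```python
from decimal import Decimal


def check_product(a, b, claimed):
    """Compare a claimed product against exact integer arithmetic."""
    exact = a * b
    return exact == claimed, exact


print(check_product(17, 34, 578))  # (True, 578)
print(check_product(17, 34, 568))  # (False, 578)

# For precise decimal arithmetic, Decimal avoids binary floating-point rounding.
print(Decimal("0.1") + Decimal("0.2"))  # 0.3
```

The design point is that the check does not depend on the model at all: the number is recomputed by a tool that is built for exactness, then compared with what the model said.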
Sensitivity to Phrasing and Lack of Understanding
LLMs exhibit sensitivity to phrasing that is inconsistent with deep understanding. The same question asked in two different ways, identical in meaning, can produce substantially different answers. A small change in wording can move a model from a correct answer to an incorrect one, or vice versa.
This is not how genuine understanding works. A student who truly understands a concept can answer questions about it regardless of how the question is phrased. An LLM is producing statistically likely text given the exact sequence of tokens it received. If one phrasing is more common in its training data than another, the model may produce a different output for each, even though they mean the same thing.
Relatedly: LLMs can produce text that mimics the surface form of expert reasoning (step-by-step arguments, logical deductions, mathematical derivations) without the underlying reasoning being valid. The model has learned what reasoning looks like textually; it has not necessarily learned to reason. This is one of the deepest open questions in LLM research: to what extent do LLMs perform genuine reasoning, as opposed to sophisticated pattern matching over reasoning-shaped text?
No persistent memory by default: In most deployments, an LLM has no memory of previous conversations. Each conversation begins from scratch. The model does not remember that you asked it something yesterday, or that you corrected it last week. Any context it needs must be supplied in the current conversation.
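Because context has to be supplied again each time, and because it must also fit in a finite context window, an application sitting in front of the model typically decides what to include. The sketch below illustrates one simple policy (keep the newest messages, drop the oldest) under stated assumptions: `count_tokens` is a crude word count standing in for a real tokenizer, and `build_context` and `budget` are hypothetical names, not any particular system's API.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def build_context(history, new_message, budget):
    """Keep the newest messages that fit in `budget` tokens, dropping the oldest."""
    kept = [new_message]
    used = count_tokens(new_message)
    for message in reversed(history):
        cost = count_tokens(message)
        if used + cost > budget:
            break  # this message and everything older is not sent to the model
        kept.append(message)
        used += cost
    return list(reversed(kept))


history = ["my name is Ada", "I prefer metric units", "what is a meter"]
print(build_context(history, "convert five miles", budget=10))
# ['what is a meter', 'convert five miles']
```

With a budget of 10, the stated preference for metric units never reaches the model, so an answer that ignores it reflects missing context rather than a forgotten instruction.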
An LLM can produce text that looks like a logical argument, a mathematical proof, or a medical diagnosis — complete with numbered steps and confident conclusions — without the underlying process being logically valid. The form of reasoning is learned from training data; the substance may not be there. Always evaluate the reasoning, not just its presentation.
Flashcards
A student uses an LLM to write a research paper and includes three citations the LLM provided. None of the three papers exist. This is an example of:
Why might an LLM answer a math problem correctly when expressed one way but incorrectly when expressed another way?
Probe an LLM Limitation
- This is a structured investigation into a specific LLM limitation.
- Choose one of the following limitation types to investigate:
- A) Hallucination of citations or specific facts
- B) Arithmetic errors
- C) Phrasing sensitivity
- Step 1: Design three specific test prompts targeting your chosen limitation. For (A): ask for citations on a narrow topic. For (B): construct multi-step arithmetic problems. For (C): write the same question three different ways.
- Step 2: Predict what you expect to observe before testing. Write your predictions down.
- Step 3 (if you have access to an LLM): Run your prompts. Record the outputs verbatim.
- Step 3 (if no access): Describe in detail what you would predict the outputs to be, based on what you have learned about how LLMs work.
- Step 4: Analyze your results (or predictions). What pattern do you observe? Is it consistent with the mechanism described in this lesson?
- Step 5: Write a one-paragraph note to a hypothetical user who relies on an LLM for this kind of task, explaining the limitation and what they should do to work safely with it.
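For option (C), recording and comparing outputs can be partly automated. The following is a minimal sketch under stated assumptions: `consistency_report` is a hypothetical helper, and `fake_model` is an invented stand-in for a real LLM call so that the example runs offline. Its answers are made up for illustration and say nothing about how any real model behaves.

```python
from collections import Counter


def consistency_report(ask_model, paraphrases):
    """Ask every paraphrase and summarize how often the answers agree."""
    answers = {p: ask_model(p).strip().lower() for p in paraphrases}
    counts = Counter(answers.values())
    top_answer, top_count = counts.most_common(1)[0]
    return {
        "answers": answers,
        "majority_answer": top_answer,
        "agreement": top_count / len(paraphrases),
    }


def fake_model(prompt):
    """Hypothetical stand-in for an LLM: its answer depends on the wording."""
    return "578" if "times" in prompt else "568"


report = consistency_report(fake_model, [
    "What is 17 times 34?",
    "Compute 17 times 34.",
    "What is the product of 17 and 34?",
])
for prompt, answer in report["answers"].items():
    print(f"{answer}  <-  {prompt}")
print("majority answer:", report["majority_answer"])
```

To use it in Step 3, replace `fake_model` with a function that sends the prompt to whichever LLM you have access to and returns its reply as a string. An agreement value below 1.0 means at least one paraphrase produced a different answer, which is the pattern to analyze in Step 4. Note that the majority answer is not necessarily the correct one.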