Long-Term and Existential Questions
This lesson deals with questions that sit at the edge of what anyone can know with confidence: whether highly capable AI systems could pose risks to human welfare or survival on a civilizational scale, and what — if anything — should be done about it now. These are not questions with settled answers. Serious researchers disagree, and the uncertainty is real. The goal is not to convince you of a particular conclusion but to help you reason carefully about deeply uncertain risks — a skill that matters far beyond AI.
The Basic Argument for Concern
The case for taking long-term AI risks seriously rests on a chain of reasoning with three links. It begins with capabilities: AI systems have improved dramatically in cognitive task performance over a short time, and there is no known theoretical ceiling on further improvement. It continues with the alignment problem (covered in Lesson 3): we do not currently have reliable methods for ensuring that highly capable systems pursue goals consistent with human values. It concludes with stakes: a sufficiently capable misaligned system, one that pursues goals inconsistent with human welfare at large scale, could cause harm that is difficult or impossible to reverse.
This argument does not require believing that AI will become 'conscious' or 'malevolent' in a science-fiction sense. The concern is structural: a system optimizing powerfully for a misspecified goal could cause catastrophic harm without any intent, in the same way that a powerful storm damages property without intending anything at all.
AI safety researcher Stuart Russell has framed the core problem as follows: AI systems are conventionally designed to pursue a fixed objective completely and without doubt, as if that objective were known to be correct. A more robust design would make AI systems inherently uncertain about their objectives and deferential to human correction. This reframing, from AI that knows what to do to AI that is uncertain and corrigible, is the philosophical basis of an influential line of current safety research.
A corrigible AI system is one that accepts correction, oversight, and shutdown by its principals — it does not resist attempts to modify or constrain it. Corrigibility is not the default behavior of a system designed to maximize an objective, because being shut down would prevent it from achieving its goal. Making AI systems reliably corrigible is an active research problem.
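The link between uncertainty about objectives and willingness to accept correction can be illustrated with a toy calculation. This is only a sketch under strong assumptions, not a model of any real system: the agent holds a belief over how good its planned action really is, and a hypothetical overseer is assumed to know the true value and to shut the agent down exactly when that value is negative. The function names and all numbers are invented for illustration.

```python
from typing import List, Tuple

# A belief is a list of (probability, true utility of the planned action).
Belief = List[Tuple[float, float]]


def value_of_acting(belief: Belief) -> float:
    """Expected utility if the agent acts and ignores any oversight."""
    return sum(p * u for p, u in belief)


def value_of_deferring(belief: Belief) -> float:
    """Expected utility if the agent lets the overseer decide.

    Assumption: the overseer permits the action when its true utility is
    positive and otherwise shuts the agent down, which yields utility 0.
    """
    return sum(p * max(u, 0.0) for p, u in belief)


def prefers_deferring(belief: Belief) -> bool:
    """True only if deferring is strictly better than acting unilaterally."""
    return value_of_deferring(belief) > value_of_acting(belief)


# An agent that is certain its objective is right gains nothing from oversight.
certain: Belief = [(1.0, 10.0)]

# An agent that admits a 20% chance its objective is badly wrong does gain.
uncertain: Belief = [(0.8, 10.0), (0.2, -50.0)]

for name, belief in [("certain", certain), ("uncertain", uncertain)]:
    print(
        name,
        "act:", value_of_acting(belief),
        "defer:", value_of_deferring(belief),
        "prefers deferring:", prefers_deferring(belief),
    )
```

In this sketch the certain agent sees no benefit in being overseen, while the uncertain agent strictly prefers to defer because the overseer filters out the bad case. The result depends entirely on the assumption that the overseer judges correctly; weakening that assumption changes the numbers.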
Several research agendas are actively working on these problems:
- Scalable oversight: How do you supervise an AI system whose outputs are too complex or numerous for humans to evaluate directly? Proposed approaches include using AI systems to assist in evaluating other AI systems (recursive reward modeling), and training models to explain their reasoning in ways humans can audit.
- Interpretability research: If we could understand what goals a model has actually learned, by inspecting its internal representations, we could detect misalignment before deployment. This is the mechanistic interpretability agenda pursued by groups like Anthropic's interpretability team. Progress exists but remains partial: we can identify some circuits within neural networks, but a complete picture of what a large model 'believes' or 'wants' is far beyond current capability.
- Robust alignment methods: Constitutional AI, RLHF (Reinforcement Learning from Human Feedback), and related techniques attempt to train models to follow human values. These have produced notably safer-behaving large language models compared to earlier systems, but they remain vulnerable to adversarial inputs and do not provide formal guarantees.
- Institutional and coordination mechanisms: Technical safety alone may be insufficient if competitive pressures incentivize developers to deploy insufficiently safe systems. Governance mechanisms (safety standards, pre-deployment evaluation requirements, international agreements) are therefore a complement to, not a substitute for, technical safety work.
Honest Treatment of Uncertainty and Disagreement
It would be dishonest not to present the full range of serious opinion on these questions.
Those who think long-term risks deserve high priority argue that the potential magnitude of harm is enormous; that some of the researchers who understand these systems most deeply are among the most concerned; and that the cost of investing in safety research now is low compared to the potential benefit if the risks are real.
Those who are skeptical of near-term existential risk argue that current AI systems, whatever their capabilities, show no signs of the autonomous goal-directed behavior that the threat scenarios assume; that those scenarios extrapolate speculatively from today's systems to far more capable future ones; and that focusing on speculative long-term risks may divert attention from concrete, demonstrable harms happening now, such as bias, surveillance, misinformation, and labor displacement.
Some researchers occupy intermediate positions: they take the theoretical arguments seriously while remaining uncertain about timelines, and they advocate for safety research that addresses both near-term and long-term risks simultaneously.
How should you reason about this? A useful framework is expected value under uncertainty: even a modest probability of a very large harm may warrant significant precautionary investment. But expected-value reasoning requires estimates of both probability and magnitude, and here the disagreements are genuine: probability estimates for catastrophic AI risk among domain experts span several orders of magnitude.
Another consideration is option value: investing in safety research, interpretability, and governance now preserves the option to act more decisively later, as the technology and risks become clearer. This is a relatively low-cost precaution if risks turn out to be low, and a highly valuable precaution if they turn out to be high.
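The expected-value reasoning above can be made concrete with a small calculation. The sketch below is illustrative only: the harm magnitude, the cost of the precaution, the fraction of risk it removes, and the three probability estimates are all made-up numbers in arbitrary units, and the function names are hypothetical. Its purpose is to show how the verdict on a precaution can flip when probability estimates differ by orders of magnitude.

```python
def expected_loss(probability: float, magnitude: float) -> float:
    """Expected harm: probability of the bad outcome times its size."""
    return probability * magnitude


def precaution_worthwhile(
    probability: float, magnitude: float, risk_reduction: float, cost: float
) -> bool:
    """True if the expected harm averted exceeds the cost of the precaution.

    risk_reduction is the assumed fraction of the expected loss that the
    precaution removes (0.1 means it removes 10% of it).
    """
    return expected_loss(probability, magnitude) * risk_reduction > cost


def break_even_probability(
    magnitude: float, risk_reduction: float, cost: float
) -> float:
    """Probability at which the precaution exactly pays for itself."""
    return cost / (magnitude * risk_reduction)


# Hypothetical inputs, in arbitrary units.
MAGNITUDE = 1_000_000.0   # size of the harm if it occurs
RISK_REDUCTION = 0.1      # precaution removes 10% of the expected loss
COST = 10.0               # cost of the precaution

print("break-even probability:",
      break_even_probability(MAGNITUDE, RISK_REDUCTION, COST))

# Three estimates spanning several orders of magnitude.
for p in (1e-6, 1e-4, 1e-2):
    verdict = precaution_worthwhile(p, MAGNITUDE, RISK_REDUCTION, COST)
    print(f"p = {p:g}: worthwhile = {verdict}")
```

With these invented inputs, the lowest estimate says the precaution is not worth its cost and the highest says it clearly is. The arithmetic is trivial; the hard part is that the inputs themselves are contested, which is why the framework structures the disagreement without settling it.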
Two opposite failure modes in reasoning about long-term AI risk are dismissal ('this is science fiction') and panic ('AI will definitely destroy humanity'). Neither is justified by the evidence. The honest position acknowledges genuine uncertainty while taking the arguments seriously enough to act on them proportionately: investing in safety research, governance, and interpretability regardless of where one stands on probability estimates.
Prompt Challenge
Write a prompt asking an AI assistant to help you think through a long-term AI risk scenario rigorously.
Your prompt should…
- Ask about a specific scenario where advanced systems might pursue misaligned goals
- Tell the assistant to present multiple perspectives including skeptical views
- Mention that uncertainty about probability and timeline should be acknowledged
A researcher argues that a highly capable AI system optimizing for a misspecified goal could cause serious harm without any malevolent intent. Which concept from Lesson 3 does this argument MOST directly invoke?
A skeptic argues that existential AI risk concerns are 'science fiction' and should be ignored. What is the strongest counter-argument?
Reasoning Under Uncertainty
Consider the following scenario: A leading AI lab is close to releasing a new model that appears significantly more capable than any previous system. Internal safety evaluations found no clear red flags, but the evaluation methods are not formally proven to catch all risks. A regulator must decide whether to require a six-month delay for additional safety evaluation before deployment.
Write a structured analysis addressing:
1. Arguments FOR requiring the delay, including what additional evaluation might reveal.
2. Arguments AGAINST requiring the delay, including costs and opportunity costs.
3. Which factors would most change your recommendation in one direction or the other?
4. Your final recommendation, with a clear statement of the uncertainty in your reasoning.
Note: there is no objectively correct answer. You are being evaluated on the quality of your reasoning, your honesty about uncertainty, and your ability to take seriously arguments on both sides.