The Impossibility of Satisfying All Fairness Criteria
The previous lesson introduced four mathematical definitions of fairness. An obvious question follows: can a single classifier satisfy all of them at once? The answer, under any realistic conditions, is no — and this is not a limitation of current technology but a mathematical impossibility proven by multiple researchers independently in 2016. Understanding this result is one of the most important things you can learn in this module, because it means that every deployment of a classification system necessarily involves a value judgment about which form of fairness to prioritize.
The Core Impossibility Result
The impossibility result, proven by Chouldechova (2017) and by Kleinberg, Mullainathan, and Raghavan (2016), can be stated precisely: if the base rate of the positive outcome Y=1 differs between groups and the classifier produces at least some false positives, then it cannot satisfy predictive parity (equal positive predictive value across groups, the sense of "calibrated" used in this lesson) and equalized odds (equal true positive and false positive rates across groups) at the same time.
Let us unpack what this means with concrete numbers. Suppose we have two groups, Group A and Group B. In Group A, 20% of individuals have the property Y=1 (for example, will reoffend). In Group B, 40% do. These are the base rates. Now suppose the classifier is perfectly calibrated: when it assigns a risk score of 0.4, exactly 40% of those individuals (regardless of group) will have Y=1. This is predictive parity satisfied.
Now ask what would happen if the classifier also had the same true positive rate and false positive rate in both groups. In Group A (20% base rate) there are four Y=0 individuals for every Y=1 individual, so the same false positive rate produces many false positives relative to true positives, and the positive predictive value is low. In Group B (40% base rate) there are only 1.5 Y=0 individuals for every Y=1 individual, so the same error rates produce a higher positive predictive value. Equal error rates therefore force unequal predictive values. Conversely, holding the predictive value equal forces at least one of the error rates to differ across groups as long as the base rates differ.
Satisfying equalized odds while maintaining predictive parity is algebraically impossible under these conditions. The result is not about poorly trained models. It applies to every classifier that makes false positive errors whenever base rates differ; the only escape is a classifier whose positive predictions are never wrong, which no realistic prediction task permits.
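The arithmetic behind the two-group example can be sketched in a few lines of Python. This is an illustrative calculation, not part of the original lesson: the base rates (20% and 40%) come from the text, while the true positive rate, false positive rate, and target predictive value are hypothetical numbers chosen only to make the effect visible, and the function names `ppv` and `fpr_needed` are made up for this sketch.

```python
def ppv(base_rate, tpr, fpr):
    """Positive predictive value from Bayes' theorem."""
    true_pos = tpr * base_rate            # share of the group that is a true positive
    false_pos = fpr * (1.0 - base_rate)   # share of the group that is a false positive
    return true_pos / (true_pos + false_pos)


def fpr_needed(base_rate, tpr, target_ppv):
    """False positive rate that yields target_ppv at this base rate and TPR."""
    odds = base_rate / (1.0 - base_rate)
    return odds * (1.0 - target_ppv) / target_ppv * tpr


BASE_RATES = {"A": 0.20, "B": 0.40}   # base rates from the text
TPR = 0.60                             # hypothetical, same in both groups
FPR = 0.20                             # hypothetical, same in both groups
TARGET_PPV = 0.60                      # hypothetical, same in both groups

# Case 1: equalized odds holds (same TPR and FPR), so PPV must differ.
ppv_equal_errors = {g: ppv(r, TPR, FPR) for g, r in BASE_RATES.items()}

# Case 2: predictive parity holds (same PPV, same TPR), so FPR must differ.
fpr_equal_ppv = {g: fpr_needed(r, TPR, TARGET_PPV) for g, r in BASE_RATES.items()}

for group in BASE_RATES:
    print(
        f"Group {group}: PPV under equal error rates = {ppv_equal_errors[group]:.3f}, "
        f"FPR under equal PPV = {fpr_equal_ppv[group]:.3f}"
    )
```

With these assumed inputs, equal error rates give Group A a noticeably lower predictive value than Group B, and equal predictive values give Group A a noticeably lower false positive rate than Group B. Changing the hypothetical numbers changes the size of the gap but not its existence.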
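Let p_A and p_B be the true positive rates, f_A and f_B be the false positive rates, and v_A and v_B be the positive predictive values for groups A and B, and write r_A and r_B for the base rates. Predictive parity requires v_A = v_B. Bayes' theorem ties the quantities together within each group: v = (p × r) / (p × r + f × (1 − r)). If p and f are the same in both groups (equalized odds) and f is greater than zero, the right-hand side increases with r, so different base rates give different positive predictive values. With different base rates, predictive parity and equalized odds cannot hold simultaneously. Note that equal false positive rates alone can coexist with predictive parity, but only if the true positive rates (and therefore the false negative rates) differ. This is not a modeling failure; it is a consequence of probability theory.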
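For readers who want the algebra written out, the following is a short derivation sketch. It uses the lesson's symbols p (true positive rate), f (false positive rate), and v (positive predictive value) for a group g, plus r_g for the group's base rate, which is a symbol introduced here for convenience. It assumes 0 < r_g < 1, p_g > 0, and 0 < v_g < 1.

```latex
% Bayes' theorem within group g:
v_g = \frac{p_g \, r_g}{p_g \, r_g + f_g \,(1 - r_g)}

% Solving for the false positive rate:
f_g = \frac{r_g}{1 - r_g} \cdot \frac{1 - v_g}{v_g} \cdot p_g

% Impose predictive parity (v_A = v_B = v) and equal true positive
% rates (p_A = p_B = p). The ratio of false positive rates is then
\frac{f_A}{f_B} = \frac{r_A / (1 - r_A)}{r_B / (1 - r_B)}

% which equals 1 only if r_A = r_B. For r_A = 0.2 and r_B = 0.4:
\frac{f_A}{f_B} = \frac{0.25}{0.6\overline{6}} = 0.375
```

The second line is the relationship usually attributed to Chouldechova (2017), written with the true positive rate in place of one minus the false negative rate.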
The COMPAS controversy made this impossibility concrete. COMPAS is a commercial risk assessment tool used in criminal sentencing and parole decisions in the United States. A 2016 ProPublica investigation found that the tool had a higher false positive rate for Black defendants (labeling people as high-risk who would not reoffend) and a higher false negative rate for white defendants (labeling people as low-risk who would reoffend). Northpointe, the company that developed COMPAS, responded with data showing that the tool was calibrated: defendants who received the same score went on to reoffend at roughly the same rate whether they were Black or white. (A COMPAS decile score is a ranking, not a probability, so a score of 7 does not by itself mean a 70% chance of reoffending.) Both claims are correct. How can both be correct simultaneously? Because the base rates differ: in the dataset studied, Black defendants had a higher observed rate of reoffending, measured by re-arrest, and so received higher scores on average. Given these different base rates, Bayes' theorem makes it impossible for the tool to have equal false positive rates, equal false negative rates, and equal positive predictive values all at once. ProPublica applied an equalized-odds standard; Northpointe applied a predictive-parity standard. The tool satisfied one and violated the other, as it had to given the base rates.
Why Base Rates Differ — and Why This Matters
The impossibility result assumes base rates differ across groups. This is worth examining carefully, because base rates in criminal justice, lending, and health do differ — and the reasons those differences exist are ethically important. In criminal justice, higher prior conviction rates in some communities are partly explained by differential policing, differential prosecution, and differential plea-bargaining across racial lines — not solely by underlying behavior. The base rate difference is itself partly a product of systemic inequality. A model that uses historically generated base rates is, in part, perpetuating the consequences of that systemic inequality. This creates a dilemma: the base rate difference is empirically real (the data reflects it), and a calibrated model will reflect it. But the base rate difference is also, in part, a product of unjust historical processes, and a model that treats this difference as informative will perpetuate it into future decisions. There is no purely technical resolution to this dilemma. It requires normative decisions — about whether to treat the base rate as informative, whether to reweight data to partially correct for historical inequity, whether to abandon calibration in favor of equalized odds, or whether to abandon algorithmic decision-making in a given context altogether. These are political and ethical choices dressed in statistical language.
Choosing to satisfy predictive parity is a value judgment — it prioritizes calibration and equal score interpretation. Choosing to satisfy equalized odds is also a value judgment — it prioritizes equal error rates. Choosing not to deploy a classifier in a high-stakes domain is also a value judgment. There is no choice that avoids taking a position on whose interests are protected and whose are not. Technical neutrality is an illusion.
Match each fairness criterion to the moral intuition it prioritizes when they must be traded off.
Two groups have different base rates for loan default: Group X defaults 10% of the time; Group Y defaults 30% of the time. A perfectly calibrated risk model correctly reflects these rates in its scores. A regulator demands that the model also have equal false positive and false negative rates across groups. What does the impossibility theorem say about this demand?
The COMPAS risk assessment tool was shown to have different false positive rates for Black and white defendants, AND to be calibrated across racial groups. Critics and defenders each claimed the other was wrong. The impossibility theorem reveals that:
The Fairness Trade-Off Negotiation
- Your class will work in groups to simulate a policy decision. You are an advisory board for a county that is considering using an algorithmic risk tool to inform bail decisions. You have been provided with the following data about the proposed tool's performance:
- Group A (500 individuals, 15% true re-arrest rate):
- True positive rate: 60%
- False positive rate: 25%
- Positive predictive value: about 30%
- Selection rate: about 30%
- Group B (500 individuals, 45% true re-arrest rate):
- True positive rate: 55%
- False positive rate: 10%
- Positive predictive value: about 82%
- Selection rate: about 30%
- Round 1: Each group is assigned one of the following positions to argue.
- Position 1: The tool should not be used because it violates equalized odds (unequal FPR: 25% vs. 10%).
- Position 2: The tool should be used because it is well-calibrated (PPV reflects base rates appropriately).
- Position 3: The tool should only be used if modified to equalize FPRs, even if this requires accepting lower calibration.
- Round 2: Switch positions. Argue the opposite.
- Round 3: As a full class, attempt to reach a recommendation. Document: which fairness criterion your recommendation implicitly prioritizes, who bears the cost of that prioritization, and whether you can identify any technical modification that would reduce the trade-off (not eliminate it — the theorem proves elimination is impossible).
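Before arguing a position, groups may want to confirm that the performance figures given for Group A and Group B fit together, because the positive predictive value and selection rate are fully determined by the group size, base rate, true positive rate, and false positive rate. The sketch below is one way to do that check; the helper name `implied_metrics` is made up for this example, and the inputs are the group sizes, re-arrest rates, and error rates stated in the activity.

```python
def implied_metrics(n, base_rate, tpr, fpr):
    """Derive PPV and selection rate from group size, base rate, TPR, and FPR."""
    positives = n * base_rate         # people who will be re-arrested
    negatives = n - positives         # people who will not be re-arrested
    true_pos = tpr * positives        # correctly flagged as high-risk
    false_pos = fpr * negatives       # wrongly flagged as high-risk
    flagged = true_pos + false_pos    # everyone the tool flags
    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "ppv": true_pos / flagged,
        "selection_rate": flagged / n,
    }


# Inputs taken from the activity's description of each group.
group_a = implied_metrics(n=500, base_rate=0.15, tpr=0.60, fpr=0.25)
group_b = implied_metrics(n=500, base_rate=0.45, tpr=0.55, fpr=0.10)

for name, metrics in (("Group A", group_a), ("Group B", group_b)):
    print(
        f"{name}: PPV = {metrics['ppv']:.2%}, "
        f"selection rate = {metrics['selection_rate']:.2%}, "
        f"false positives = {metrics['false_pos']:.1f} people"
    )
```

Under these inputs the implied positive predictive value is roughly 30% for Group A and roughly 82% for Group B, and the implied selection rate is roughly 30% in both groups. The same helper can be reused in Round 3 to test a proposed modification: change the true positive or false positive rate for one group and see how the predictive value and the number of people wrongly flagged move in response.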