Real-World Cases and Harms
The formal fairness definitions of the previous lessons are not abstract exercises. They were developed precisely because algorithmic systems had already caused documented, measurable harm to real people before researchers began formalizing the problems. This lesson examines five landmark cases in depth. Each is chosen because it is well-documented, has been verified by independent researchers, and illustrates a distinct mechanism through which bias in ML systems produces real-world harm. Understanding these cases is essential preparation for the work of auditing and mitigating bias.
Case 1: COMPAS and Criminal Risk Assessment
COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is a commercial risk assessment instrument used by courts in multiple U.S. states to inform bail, sentencing, and parole decisions. The tool assigns defendants a recidivism risk score from 1 to 10. Judges are not required to follow the score, but research indicates that scores substantially influence outcomes.
In 2016, the investigative newsroom ProPublica published 'Machine Bias,' an analysis of COMPAS scores assigned to more than 7,000 defendants in Broward County, Florida. Black defendants who did not go on to reoffend were nearly twice as likely to be rated high-risk as white defendants who did not reoffend (false positive rate: 44.9% for Black defendants vs. 23.5% for white defendants). Meanwhile, white defendants who did reoffend were more often rated low-risk than Black defendants who reoffended (false negative rate: 47.7% for white defendants vs. 28.0% for Black defendants).
Northpointe, the company that developed COMPAS, responded, accurately, that the tool was calibrated across racial groups: a score of 7 predicted the same probability of reoffending regardless of race. As Lesson 4 showed, both findings are consistent with each other given different base rates. But the human consequences are not symmetrical: a Black defendant falsely labeled high-risk faces longer pretrial detention, harsher sentences, and potentially lost employment. These are real, concrete harms. The impossibility theorem explains the mathematics; it does not resolve the justice question.
When a risk assessment tool generates a false positive for a defendant, a real person may spend more time in pretrial detention, accept a harsher plea bargain out of fear, lose their job, or lose custody of their children. The statistical disparity in false positive rates translates directly into differential harm experienced by real human beings. Reporting aggregate statistics without examining the distribution of harm understates the severity of the problem.
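The claim that equal predictive value and unequal error rates can coexist when base rates differ can be checked with a few lines of arithmetic. The sketch below uses the identity FPR = p/(1-p) × (1-PPV)/PPV × (1-FNR), which follows from the confusion-matrix definitions. PPV here is a binarized stand-in for calibration, the function name `implied_fpr` is made up for this example, and the input numbers are hypothetical, not the Broward County figures.

```python
def implied_fpr(base_rate: float, ppv: float, fnr: float) -> float:
    """False positive rate implied by a group's base rate, PPV, and FNR.

    Follows from the confusion-matrix definitions:
        FPR = p / (1 - p) * (1 - PPV) / PPV * (1 - FNR)
    where p is the group's base rate of reoffending.
    """
    if not (0 < base_rate < 1 and 0 < ppv <= 1 and 0 <= fnr <= 1):
        raise ValueError("inputs must be valid probabilities")
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)


# Hypothetical groups: identical PPV and FNR, different base rates.
fpr_group_a = implied_fpr(base_rate=0.5, ppv=0.6, fnr=0.3)
fpr_group_b = implied_fpr(base_rate=0.4, ppv=0.6, fnr=0.3)

print(round(fpr_group_a, 3), round(fpr_group_b, 3))  # 0.467 0.311
```

Under these assumptions, a high-risk label means the same thing in both groups, yet the group with the higher base rate has a markedly higher false positive rate. The arithmetic says nothing about which disparity is the more acceptable one to tolerate.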
Case 2: Gender Shades and Facial Recognition
In 2018, Joy Buolamwini, then a researcher at MIT, and Timnit Gebru published 'Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification.' They audited commercial facial analysis systems from IBM, Microsoft, and Face++ on a dataset of 1,270 faces balanced across gender and skin tone, using the Fitzpatrick skin type scale (a dermatological classification system with six types ranging from lightest to darkest).
Their findings were striking. Across all three commercial systems, error rates were highest for darker-skinned women and lowest for lighter-skinned men. One system misclassified darker-skinned women 34.7% of the time while misclassifying lighter-skinned men 0.3% of the time, a more than 100-fold disparity. The high overall accuracy figures advertised by these companies masked dramatic subgroup disparities.
The mechanism was clear: the datasets used to develop and benchmark these systems were heavily skewed toward lighter-skinned faces. One widely used benchmark dataset (IJB-A) was estimated to be 79.6% lighter-skinned. Systems trained to minimize overall error on this skewed population learned primarily from and for that population.
The Gender Shades study prompted IBM, Microsoft, and Face++ to significantly improve their systems' performance on darker-skinned faces within months of publication, demonstrating that the disparity was not inevitable but was a consequence of choices in data collection and benchmarking. The study also established a methodological standard: audit results must be disaggregated by intersectional subgroups, not just by race or gender in isolation, because disparities may be concentrated at intersections.
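The disaggregation standard can be illustrated with a minimal sketch. The records below are a made-up audit with 100 faces per subgroup, not the Gender Shades data, and the helper names `error_rates` and `make_records` are assumptions for this example. The same function is applied to the whole sample, to each attribute alone, and to the intersection.

```python
from collections import defaultdict


def error_rates(records, key):
    """Error rate per group, where key(record) names the record's group."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for rec in records:
        group = key(rec)
        totals[group] += 1
        if not rec["correct"]:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


def make_records(skin, gender, n_wrong, n=100):
    """Build n hypothetical audit records, the first n_wrong misclassified."""
    return [{"skin": skin, "gender": gender, "correct": i >= n_wrong}
            for i in range(n)]


# Hypothetical error counts per subgroup (illustrative only).
records = (make_records("lighter", "male", 1)
           + make_records("lighter", "female", 5)
           + make_records("darker", "male", 10)
           + make_records("darker", "female", 30))

overall = error_rates(records, key=lambda r: "all")["all"]
by_gender = error_rates(records, key=lambda r: r["gender"])
by_skin = error_rates(records, key=lambda r: r["skin"])
by_intersection = error_rates(records, key=lambda r: (r["skin"], r["gender"]))

print(overall)                                  # 0.115
print(by_gender["female"], by_skin["darker"])   # 0.175 0.2
print(by_intersection[("darker", "female")])    # 0.3
```

In this toy sample the worst intersectional error rate (30%) is higher than either single-attribute rate (17.5% and 20%), and far higher than the headline figure (11.5%). An audit that stopped at one attribute would understate the disparity.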
Case 3: Amazon's Hiring Algorithm
Amazon developed an internal machine learning tool beginning around 2014 to automate resume screening, intending to identify top engineering and technical candidates. The tool assigned resumes a star rating from one to five. Amazon has said the tool was never used to make production hiring decisions; internal testing revealed that it penalized resumes that included the word 'women's' (as in 'women's chess club' or 'women's college') and downgraded graduates of two all-women's colleges.
The mechanism was historical bias in training data: the system learned from a decade of past hiring decisions at Amazon, which, like most technology companies, had predominantly male technical staff. Resumes that resembled those of previously hired employees, who were mostly men, received higher scores. The system had effectively learned 'similarity to past hires' rather than 'job-relevant qualifications,' and past hires were not demographically representative of qualified candidates.
Amazon's engineers modified the tool to neutralize the gender-penalizing terms, but concluded they could not guarantee the system would not find other proxies for gender or discriminate in other ways. Amazon shut down the project in 2017. The case illustrates that post-hoc technical patches to biased systems are often inadequate: the bias had entered at the data collection and problem framing stages and could not be excised from the trained model.
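A toy sketch of the proxy mechanism follows. It is not Amazon's system: the resumes, the token `womens_chess_club`, the hire counts, and the helper names are all hypothetical, and the 'model' is just a conditional hire rate. The scoring function never reads the `gender` field, yet it still learns a penalty for a token that only women's resumes contain.

```python
def hire_rate_by_feature(resumes, feature):
    """Historical hire rate for resumes with and without a given token."""
    counts = {True: [0, 0], False: [0, 0]}  # present? -> [hired, total]
    for resume in resumes:
        bucket = counts[feature in resume["tokens"]]
        bucket[0] += int(resume["hired"])
        bucket[1] += 1
    return {present: hired / total
            for present, (hired, total) in counts.items() if total}


def batch(n, n_hired, tokens, gender):
    """Build n hypothetical resumes, the first n_hired marked as hired."""
    return [{"tokens": set(tokens), "hired": i < n_hired, "gender": gender}
            for i in range(n)]


# Hypothetical history: equally qualified pools, but women were hired
# at a lower rate (20%) than men (50%).
resumes = (batch(80, 40, {"python"}, "m")
           + batch(10, 2, {"python"}, "f")
           + batch(10, 2, {"python", "womens_chess_club"}, "f"))

# Note that the "gender" field is never read by the scoring function.
rates = hire_rate_by_feature(resumes, "womens_chess_club")
print(rates[True], round(rates[False], 3))  # 0.2 0.467
```

Deleting the token from the vocabulary would remove this particular signal, but any other feature correlated with gender in the historical data would carry the same information, which is the concern Amazon's engineers could not rule out.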
Match each landmark case to the primary mechanism through which bias entered the system.
Case 4: Healthcare Allocation and the Spending Proxy
In 2019, Obermeyer, Powers, Vogeli, and Mullainathan published a landmark study in Science examining a commercial algorithm used by health systems across the United States to identify high-risk patients for enrollment in 'care management' programs: intensive care coordination that can significantly improve outcomes for chronically ill patients.
The algorithm used healthcare costs as a proxy for health needs, on the assumption that sicker patients would spend more on healthcare. The system was trained on historical cost data and used to rank patients by predicted future costs, enrolling the highest-cost patients in the care management program.
The researchers found that the algorithm was far less likely to refer Black patients to the care management program than white patients with the same underlying health needs. At any given predicted risk score, Black patients were sicker than white patients with the same score. The algorithm effectively required Black patients to be substantially sicker to receive the same level of referral as white patients.
The cause was the proxy variable. Black patients historically spent less on healthcare at equal levels of illness, not because they were less sick, but because they had historically received less care, due to systemic barriers including insurance gaps, geographic inaccessibility of providers, and well-documented disparities in how symptoms are treated across racial lines. The spending proxy was faithful to historical patterns; those historical patterns reflected systemic inequity; the algorithm learned to perpetuate that inequity. The researchers estimated that correcting the algorithm by replacing the spending proxy with direct illness measures would more than double the number of Black patients identified for the care management program.
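The effect of the label choice can be shown with a small sketch. Everything here is hypothetical: two groups with identical illness distributions, an assumed spending gap (`SPEND_PER_CONDITION`), and true cost used directly as the score, whereas the real system used a model's predicted cost. Group membership is never used for ranking.

```python
def enroll_top_k(patients, score_key, k):
    """Return the k patients with the highest value of score_key."""
    return sorted(patients, key=lambda p: p[score_key], reverse=True)[:k]


def count_group(enrolled, group):
    return sum(p["group"] == group for p in enrolled)


# Hypothetical: same illness in both groups, but group B generates
# less spending per chronic condition because it receives less care.
SPEND_PER_CONDITION = {"A": 1000, "B": 700}
patients = [
    {"group": g, "conditions": c, "cost": c * SPEND_PER_CONDITION[g]}
    for g in ("A", "B")
    for c in range(10)
]

by_cost = enroll_top_k(patients, "cost", k=6)        # spending proxy
by_need = enroll_top_k(patients, "conditions", k=6)  # direct illness measure

print(count_group(by_cost, "B"), count_group(by_need, "B"))  # 2 3

# Least-sick patient admitted from each group under the cost ranking.
min_conditions_a = min(p["conditions"] for p in by_cost if p["group"] == "A")
min_conditions_b = min(p["conditions"] for p in by_cost if p["group"] == "B")
print(min_conditions_a, min_conditions_b)  # 6 8
```

Under the cost ranking, a group B patient in this toy population needs eight chronic conditions to be enrolled while a group A patient needs six. Because group is not an input at all, dropping a race variable would change nothing; the disparity lives in the label.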
Case 5: Predictive Policing and Feedback Loops
Predictive policing tools like PredPol (now called Geolitica) use historical crime data to forecast 'hot spots' where police should patrol. The tools are commercially deployed in dozens of U.S. cities. A 2016 study by Lum and Isaac showed how these systems can generate self-reinforcing feedback loops. When predictions direct police to certain neighborhoods, those neighborhoods receive more patrol, which generates more arrests, which generates more crime data, which reinforces the prediction that those neighborhoods need more patrol. The tool's predictions become self-fulfilling in areas where historical police presence is high, independent of underlying crime rates.
The racial dimension is significant because historical over-policing in the United States has been documented to be racially non-uniform. If historical data reflects targeted policing of predominantly Black and Latino neighborhoods, the predictive system will direct future patrol to those same neighborhoods, amplifying the historical disparity rather than measuring underlying behavior.
This case illustrates a failure mode specific to deployed systems that operate in feedback with the social world: the tool changes the data-generating process, making the training distribution increasingly invalid as the tool is used. Static fairness analysis performed at deployment does not capture this dynamic.
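The feedback loop can be sketched as a deliberately simplified, deterministic simulation. It is not PredPol's algorithm: it assumes a winner-take-all rule in which each day's patrol goes to the neighborhood with the most recorded incidents, and that incidents are only recorded where the patrol is. The function name and all numbers are hypothetical.

```python
def simulate_patrols(initial_records, true_daily_incidents, days):
    """Send each day's patrol to the neighborhood with the most records.

    Incidents are only recorded where the patrol is, so the data the
    next prediction relies on depends on the previous prediction.
    """
    records = list(initial_records)
    for _ in range(days):
        target = max(range(len(records)), key=lambda i: records[i])
        records[target] += true_daily_incidents[target]
    return records


# Two hypothetical neighborhoods with identical true incident rates;
# the first starts with one extra record from past patrol patterns.
records = simulate_patrols(initial_records=[11, 10],
                           true_daily_incidents=[2, 2],
                           days=50)
print(records)  # [111, 10]
```

A one-record head start turns into a recorded ratio of roughly eleven to one even though the underlying rates are equal. A fairness check run on the day-zero data would see two nearly identical neighborhoods and miss the dynamic entirely.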
The Gender Shades study found that commercial facial recognition systems had error rates exceeding 34% for darker-skinned women and below 1% for lighter-skinned men. Which description most accurately characterizes this as a fairness problem?
In the Obermeyer healthcare algorithm case, correcting the bias required replacing the healthcare spending proxy with direct illness measures. Why didn't simply removing the race variable from the model's inputs fix the problem?
Case Analysis: Harm and Mechanism
Select one of the five cases studied in this lesson and write a structured case analysis addressing each of the following.
1. System description: What decision was the system making? Who deployed it? Who were the intended beneficiaries?
2. Bias mechanism: At which stage(s) of the ML pipeline did bias enter? Be specific: name the variable, data source, or design choice that introduced the bias, and explain the causal mechanism.
3. Fairness criterion violated: Which of the formal fairness definitions from Lesson 3 does the documented disparity violate? Show, in terms of the lesson's notation, what the violation looks like.
4. Human harm: Describe concretely what happened to real people as a result. Who was harmed, in what way, and to what degree?
5. Remedy attempted or proposed: What fix was applied or recommended? Did the fix address the root cause at the stage it entered, or was it a patch downstream?
6. Open question: Identify one aspect of the case that remains unresolved: a harm that was not remedied, a question that the technical analysis cannot answer, or a systemic factor that no individual model fix can address.
Your analysis should be written as if presenting to a committee of policymakers who are technically literate but not ML specialists.