Sources of Bias in the ML Pipeline
When an ML system produces biased outcomes, the first instinct is often to look at the training algorithm or the model architecture. These are often not where the problem originates. Bias can enter an ML system at every stage of development: from the moment a team decides what problem to solve, through data collection, labeling, feature engineering, model training, and deployment. Understanding the sources of bias rigorously requires walking through the pipeline systematically, because the remedies differ dramatically depending on where the problem enters.
Stage 1: Problem Framing
Bias can enter before a single line of code is written. The choice of what problem to solve, whose needs to prioritize, and what outcomes to optimize encodes values and can systematically disadvantage groups.
Consider predictive policing tools that forecast where crimes will occur. If the underlying goal is framed as 'predict where police reports will be filed' rather than 'predict where crimes actually occur,' the system learns to predict police activity, which is determined partly by where police are deployed, rather than underlying crime. In neighborhoods with historically higher policing, more reports are filed. The model then predicts more crime there, which directs more police there, which generates more reports. This feedback loop can amplify historical over-policing regardless of actual crime rates. The bias was introduced at the framing stage: 'crime' was operationalized as 'reported crime,' which is partly a function of police presence.
Framing bias also appears in decisions about whose outcomes to optimize. A system that maximizes average accuracy across all users may achieve that while performing poorly for a small demographic minority: high average performance is compatible with near-complete failure for underrepresented groups.
Every abstract concept (crime, creditworthiness, health risk, academic potential) must be translated into a measurable variable before a model can learn it. This translation — operationalization — is not neutral. How you measure a concept determines which groups are helped and harmed by the resulting system. Asking 'what are we actually measuring?' is one of the most important questions in any ML project.
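The predictive-policing feedback loop described in this stage can be illustrated with a small deterministic simulation. This is a minimal sketch under stated assumptions, not a model of any real system: the function name, the two neighborhoods, the starting patrol split, and the reallocation rule (move a fixed fraction of patrols toward whichever neighborhood filed more reports) are all hypothetical. Both neighborhoods are given identical true crime levels, so any divergence comes only from how 'crime' was operationalized.

```python
def simulate_patrol_feedback(initial_share_a=0.6, shift=0.1, rounds=5,
                             true_crime_a=100, true_crime_b=100):
    """Track neighborhood A's share of patrols over several allocation rounds.

    Assumptions (all hypothetical): reports filed in a neighborhood are
    proportional to true crime times patrol share, and after each round the
    department moves a fraction `shift` of the other neighborhood's patrols
    to whichever neighborhood filed more reports.
    """
    share_a = initial_share_a
    history = [share_a]
    for _ in range(rounds):
        # The "model" only sees reports, which depend on where police already are.
        reports_a = true_crime_a * share_a
        reports_b = true_crime_b * (1 - share_a)
        if reports_a > reports_b:
            share_a += shift * (1 - share_a)
        elif reports_b > reports_a:
            share_a -= shift * share_a
        history.append(share_a)
    return history


history = simulate_patrol_feedback()
for round_number, share in enumerate(history):
    print(f"round {round_number}: neighborhood A patrol share = {share:.3f}")
```

Under these assumptions, neighborhood A's patrol share rises every round even though true crime is equal in both places. Starting from an even split (`initial_share_a=0.5`) leaves the allocation unchanged, which suggests the drift comes from the historical imbalance plus the choice of target variable, not from any difference in crime.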
Stage 2: Data Collection and Historical Bias
Historical bias enters when the world being measured is itself unequal. If past hiring decisions were discriminatory, a model trained on historical hiring data will learn to replicate those decisions. The data is 'accurate' in the sense that it faithfully records what happened, but what happened was unjust, and the model learns to perpetuate it.
Amazon's internal resume screening tool, developed starting around 2014 and reportedly abandoned by 2017, is one of the most cited examples. According to press reports, the system was trained on roughly a decade of resumes submitted to Amazon. Because the tech industry was and is male-dominated, the large majority of those resumes came from men. The model learned to favor resumes resembling those of past successful candidates, and it reportedly penalized resumes that included the word 'women's' (as in 'women's chess club') and downgraded graduates of two all-women's colleges. The training data faithfully reflected historical hiring patterns; those patterns were discriminatory; the model learned to continue them.
Representation bias is related but distinct. It occurs when some groups are underrepresented in the data, not necessarily because of historical discrimination but because of who had access to the technology or context that generated the data. Early facial recognition datasets were composed overwhelmingly of light-skinned faces, in part because most of the researchers who built them, and the image sources they drew on, were based in wealthy, majority-white countries. The resulting systems performed worse on darker-skinned faces, not out of malice but out of data myopia.
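A first check for representation bias is to compare each group's share of the dataset with its share of the population the system is meant to serve. The sketch below is illustrative only: the function name, the group labels, the 90/10 split, the reference shares, and the `tolerance` rule are assumptions chosen for the example, not figures from any real dataset.

```python
from collections import Counter


def representation_gaps(sample_groups, reference_shares, tolerance=0.5):
    """Compare each group's dataset share with a reference population share.

    A group is flagged when its dataset share falls below
    `tolerance * reference_share` (a hypothetical rule of thumb).
    """
    counts = Counter(sample_groups)
    total = len(sample_groups)
    gaps = {}
    for group, reference_share in reference_shares.items():
        dataset_share = counts.get(group, 0) / total
        gaps[group] = {
            "dataset_share": dataset_share,
            "reference_share": reference_share,
            "underrepresented": dataset_share < tolerance * reference_share,
        }
    return gaps


# Hypothetical face dataset: 90 lighter-skinned and 10 darker-skinned images.
sample_groups = ["lighter"] * 90 + ["darker"] * 10
# Hypothetical target population for the deployed system.
reference_shares = {"lighter": 0.5, "darker": 0.5}

gaps = representation_gaps(sample_groups, reference_shares)
for group, info in gaps.items():
    print(group, info)
```

A check like this can surface representation bias, but it says nothing about historical bias: a dataset can match the population perfectly and still record unjust outcomes.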
Stage 3: Labeling and Measurement Bias
Supervised learning requires labels. Labels are created by people, and people bring biases to the labeling process.
Annotation bias occurs when human annotators apply different standards to the same content based on perceived group membership. Research on toxicity detection models has found that text written in African American Vernacular English (AAVE) is more frequently labeled as toxic by annotators than comparable text written in Standard American English. Models trained on these labels then predict higher toxicity scores for AAVE text, not because AAVE is more toxic, but because annotators were not consistent across dialects.
Measurement bias occurs when the measurement instrument itself is less accurate for some groups. Pulse oximeters, devices that measure blood oxygen by shining light through a fingertip, are systematically less accurate for patients with darker skin tones because they were calibrated primarily on lighter-skinned populations. If a health model is trained using pulse oximeter readings as a proxy for true blood oxygen, it will have higher error for darker-skinned patients. The bias is in the instrument, not the model, but the model inherits it.
Label quality also varies by context: ground-truth labels for 'recidivism' (re-arrest within a set period) are themselves influenced by policing patterns. A person who reoffends but is never re-arrested, because they are not policed again, receives the same label as someone who genuinely reformed, but these are different underlying states.
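How an instrument's error passes into downstream decisions can be shown with a few made-up numbers. In this sketch every value is hypothetical and not clinical data: the true oxygen saturation values, the assumption that the device reads 3 points too high for group B and exactly right for group A, and the alert threshold of 92 are all chosen only to illustrate the mechanism.

```python
def mean_abs_error(true_values, readings):
    """Average absolute gap between the true value and the instrument reading."""
    return sum(abs(t - r) for t, r in zip(true_values, readings)) / len(true_values)


def missed_low_oxygen(true_values, readings, threshold=92):
    """Count patients truly below the threshold whose reading is not below it."""
    return sum(
        1
        for true, read in zip(true_values, readings)
        if true < threshold and read >= threshold
    )


# Hypothetical true saturation values, identical for both groups.
true_spo2 = [88, 90, 92, 94, 96]

# Assumption for illustration: accurate for group A, reads 3 points high for group B.
OFFSET_GROUP_B = 3
readings_group_a = list(true_spo2)
readings_group_b = [value + OFFSET_GROUP_B for value in true_spo2]

for name, readings in [("A", readings_group_a), ("B", readings_group_b)]:
    error = mean_abs_error(true_spo2, readings)
    missed = missed_low_oxygen(true_spo2, readings)
    print(f"group {name}: mean abs error = {error}, missed low-oxygen cases = {missed}")
```

With these assumed numbers, both groups have the same underlying health, yet one truly low-oxygen patient in group B is never flagged. A model trained on the readings instead of the true values would inherit the same gap.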
Stages 4–6: Features, Training, and Deployment
Feature engineering bias arises from decisions about which variables to include, how to encode them, and what transformations to apply. As the previous lesson established, proxy variables can carry protected attribute information even when the protected attribute itself is excluded. But feature choices also determine what the model can and cannot represent: a model that lacks features capturing relevant context for one group (for example, no features capturing informal credit history for people excluded from traditional banking) cannot account for factors that would make its predictions more accurate for that group.
Training bias can emerge from the objective function: a model optimizing overall accuracy may sacrifice accuracy on small subgroups to maximize the aggregate. If 95% of the examples in a dataset come from a majority group and 5% from a minority group, a model minimizing average error will tend to perform well on the majority and can perform poorly on the minority. This is not a failure of the algorithm; it is doing exactly what the loss function asks. Changing what the algorithm optimizes can change outcomes.
Deployment bias arises from context mismatch: using a system on data outside the distribution on which it was trained, or using it for a purpose different from the one for which it was designed. A model trained on clinical data from one health system may perform worse when deployed in a different health system serving a different patient population. An emotion-recognition system trained for entertainment applications may be deployed in job interviews, producing consequential outputs in a domain its training did not prepare it for.
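The 95%/5% scenario can be made concrete by reporting accuracy per group alongside the aggregate. The sketch below uses invented labels and predictions: a hypothetical model that is correct on all 95 majority-group examples and on only 1 of the 5 minority-group examples. The function names are illustrative and do not come from any particular library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy_by_group(y_true, y_pred, groups):
    """Accuracy computed separately for each group label."""
    result = {}
    for group in sorted(set(groups)):
        indices = [i for i, g in enumerate(groups) if g == group]
        result[group] = accuracy(
            [y_true[i] for i in indices],
            [y_pred[i] for i in indices],
        )
    return result


# Hypothetical data: 95 majority-group examples and 5 minority-group examples.
groups = ["majority"] * 95 + ["minority"] * 5
y_true = [1] * 100
# Hypothetical predictions: all correct for the majority, 1 of 5 correct for the minority.
y_pred = [1] * 95 + [1, 0, 0, 0, 0]

overall = accuracy(y_true, y_pred)
per_group = accuracy_by_group(y_true, y_pred, groups)
print(f"overall accuracy: {overall:.2f}")
for group, value in per_group.items():
    print(f"{group} accuracy: {value:.2f}")
```

Here the aggregate is 96% while minority-group accuracy is 20%, so a report that shows only the overall figure would hide the failure. Disaggregated reporting does not fix the objective by itself, but it makes visible what the loss function was trading away.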
Biases introduced at each stage of the pipeline do not cancel out — they tend to compound. A problem framed around a biased proxy variable, using historical data reflecting past discrimination, with labels produced by inconsistent annotators, features that carry proxy information, a loss function optimizing aggregate accuracy, deployed in a new context — each stage adds to the inequity. Fairness is not achieved by fixing one stage in isolation.
A health system trains a model to predict which patients need additional care. The model uses healthcare spending as a proxy for illness severity. Later analysis finds that, at equal levels of illness, the healthcare system has historically spent less on Black patients than on white patients. What type of bias has entered and at which stage?
A sentiment analysis model trained on social media data from 2015–2020 is deployed in 2026 to moderate content on a new platform. Users on the new platform include a large number of non-native English speakers whose writing patterns were not well-represented in the 2015–2020 dataset. What is the most precise description of this failure mode?
Pipeline Audit
- Choose one of the following AI systems and perform a structured pipeline audit, identifying at least one potential source of bias at each of the six stages discussed in this lesson.
- Option A: A university admissions algorithm that scores applicants using high school GPA, standardized test scores, extracurricular activity records, and a written personal statement.
- Option B: A parole decision support tool that predicts whether a person recently released from prison will be re-arrested within two years.
- Option C: A hiring tool that screens job applications for a large technology company using resume text and self-reported work history.
- For each stage (problem framing, data collection, labeling, feature engineering, training, deployment), write:
  1. A specific, plausible way bias might enter at that stage for your chosen system
  2. The group most likely harmed by that bias
  3. One concrete step that could reduce (but not necessarily eliminate) that source of bias
- After completing the audit, discuss: is there any stage at which bias can be eliminated entirely? What does this imply for how such systems should be overseen?