Auditing AI Systems
Claiming that a system is fair is easy. Demonstrating it with evidence is hard. An AI audit is a structured, evidence-based assessment of a system's behavior against specified criteria. Auditing is the primary mechanism by which accountability for AI fairness is operationalized — and understanding how audits work, what they can prove, and where they fall short is essential for anyone who will build, evaluate, or be governed by AI systems.
Types of AI Audits
Not all audits are equivalent. They differ in who conducts them, what access the auditor has, and what questions they can answer.
Internal audits are conducted by the organization that built or deployed the system. The advantage is access: internal auditors can examine training data, model architecture, code, and deployment logs. The disadvantage is independence: an organization auditing its own system has an incentive to find it acceptable. Internal audits are valuable but insufficient for public accountability.
Regulatory audits are conducted by government agencies with legal authority. In the United States, agencies such as the Consumer Financial Protection Bureau (CFPB), Equal Employment Opportunity Commission (EEOC), and Department of Housing and Urban Development (HUD) have authority to examine AI systems used in lending, hiring, and housing. Regulatory audits have legal compulsion behind them but may lack technical expertise and are resource-constrained relative to the scale of deployment.
Third-party audits are conducted by independent organizations (academic researchers, civil society groups, or specialized audit firms) with no direct financial relationship to the audited organization. Third-party audits provide the strongest independence but face an access problem: private companies are not generally required to share training data, model weights, or proprietary system details with independent auditors. Most third-party audits are conducted under access constraints.
Black-box auditing, sometimes called external or API-based auditing, is a form of third-party audit conducted without access to the system's internals. The auditor probes the system by providing carefully designed inputs and observing outputs, using the input-output relationship to infer properties of the system's behavior. This is the approach used in the Gender Shades study and by journalists at ProPublica in the COMPAS analysis.
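The input-output probing described above can be sketched as a paired test. This is a minimal illustration under stated assumptions, not the procedure used in Gender Shades or the COMPAS analysis: `query_system`, the applicant fields, and the toy scoring rule inside it are hypothetical stand-ins for a real system the auditor can only call, not inspect.

```python
import random

def query_system(applicant: dict) -> bool:
    """Hypothetical stand-in for the opaque system under audit.

    In a real audit this would be a call to the deployed system.
    The toy rule hides a penalty on one neighborhood so that the
    probe has something to detect.
    """
    score = applicant["income"] / 1000
    if applicant["neighborhood"] == "N1":
        score -= 15  # hidden penalty, invisible to the auditor
    return score >= 50

def make_pairs(n: int, seed: int = 0) -> list:
    """Build matched pairs that differ only in the probed feature."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        base = {"income": rng.randint(30_000, 90_000)}
        pairs.append(({**base, "neighborhood": "N1"},
                      {**base, "neighborhood": "N2"}))
    return pairs

def probe(pairs: list) -> dict:
    """Approval rate on each side, plus the share of pairs whose outcome differs."""
    side_a = [query_system(a) for a, _ in pairs]
    side_b = [query_system(b) for _, b in pairs]
    n = len(pairs)
    return {
        "rate_a": sum(side_a) / n,
        "rate_b": sum(side_b) / n,
        "flip_rate": sum(x != y for x, y in zip(side_a, side_b)) / n,
    }

result = probe(make_pairs(2000))
print(result)
```

Because each pair is identical except for the probed feature, any difference in outcome within a pair is attributable to that feature. The probe reveals that the output depends on the feature; it does not reveal why, which is the limit of black-box access.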
The most rigorous audit requires access to training data, model architecture, and deployment context. Most systems of greatest public concern — commercial risk assessment tools, social media ranking algorithms, hiring filters — are proprietary. Independent auditors typically cannot access what they need most. This is why audit frameworks and regulatory compulsion to disclose are important complements to technical auditing methods.
What a Technical Audit Measures
A technical fairness audit typically follows a structured process. First, the auditor specifies the fairness criteria to be evaluated, choosing among demographic parity, equalized odds, equal opportunity, predictive parity, or other definitions, and documents this choice explicitly. Second, the auditor defines the protected groups to be compared and identifies a dataset on which to evaluate the system. Third, the auditor collects predictions from the system for all individuals in the evaluation dataset. Fourth, the auditor computes the relevant fairness metrics for each group and the differences between groups.
The core metrics in a typical technical audit include:
- Selection rate by group (for demographic parity)
- True positive rate by group (for equal opportunity)
- False positive rate by group (together with true positive rate, for equalized odds)
- Positive predictive value by group (for predictive parity)
- Calibration curves by group (for calibration quality)
- Accuracy by subgroup, including intersectional subgroups (race by gender, etc.)
Beyond group-level metrics, auditors may examine individual-level consistency: do similar individuals receive similar predictions regardless of protected attribute? Counterfactual fairness testing produces synthetic individuals identical to real subjects in all features except the protected attribute and asks whether the model's output changes. A change indicates that the output depends on the protected attribute itself; detecting dependence through a proxy requires also varying the features that correlate with the attribute.
Auditors also examine calibration across groups: does the model's predicted probability accurately reflect actual observed rates for each group? A model that is well-calibrated overall but poorly calibrated for specific subgroups may produce systematically misleading risk scores for those groups.
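The group-level metrics named above can all be derived from a per-group confusion matrix. The following is a minimal sketch, assuming binary labels and binary predictions; the function names and the toy data are invented for illustration and do not come from any real audit. Note that the true positive rate, false positive rate, and positive predictive value all require known true outcomes, while the selection rate does not.

```python
from collections import defaultdict

def group_metrics(y_true, y_pred, groups):
    """Per-group selection rate, TPR, FPR, and PPV from binary labels and predictions."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        # First letter: was the prediction correct? Second: was it positive?
        key = ("t" if t == p else "f") + ("p" if p == 1 else "n")
        counts[g][key] += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    out = {}
    for g, c in counts.items():
        n = sum(c.values())
        out[g] = {
            "selection_rate": ratio(c["tp"] + c["fp"], n),  # demographic parity
            "tpr": ratio(c["tp"], c["tp"] + c["fn"]),       # equal opportunity
            "fpr": ratio(c["fp"], c["fp"] + c["tn"]),       # with TPR: equalized odds
            "ppv": ratio(c["tp"], c["tp"] + c["fp"]),       # predictive parity
        }
    return out

def gap(metrics, name):
    """Largest between-group difference for one metric."""
    values = [m[name] for m in metrics.values()]
    return max(values) - min(values)

# Toy evaluation set: eight individuals in each of two groups.
y_true = [1, 1, 1, 1, 0, 0, 0, 0] + [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0] + [1, 1, 1, 1, 0, 0, 0, 0]
groups = ["A"] * 8 + ["B"] * 8

metrics = group_metrics(y_true, y_pred, groups)
for name in ("selection_rate", "tpr", "fpr", "ppv"):
    print(name, {g: round(m[name], 3) for g, m in metrics.items()},
          "gap:", round(gap(metrics, name), 3))
```

In this toy data both groups have the same selection rate, so the system satisfies demographic parity, yet the true positive rate, false positive rate, and positive predictive value all differ between groups. The choice of which metrics to report therefore determines what the audit can conclude.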
What Audits Cannot Establish
A rigorous audit can determine whether a system satisfies a specified fairness criterion on a specific evaluation dataset at a specific point in time. This is valuable but leaves important questions open.
Audits cannot establish overall fairness. A system that satisfies demographic parity may violate equalized odds. An audit that measures only demographic parity provides no evidence about equalized odds. The criteria measured, and the criteria omitted, are audit design choices that determine what conclusions can be drawn.
Audits cannot guarantee future performance. An audit reflects the system's behavior on an evaluation dataset collected at a specific time. Distribution shift (changes in the population, context, or environment after the audit) can cause a previously audited system to develop new fairness failures. The Gender Shades findings prompted improvements, and those improvements were documented, but the companies' systems continue to evolve, and ongoing monitoring is necessary.
Audits cannot certify the appropriateness of a system's use. An audit determines whether a system meets its specified fairness criteria. It says nothing about whether the system should be used in a given context at all, whether its outputs are appropriately used by decision-makers, or whether the decision domain requires human judgment that no algorithm can substitute.
Audits can miss intersectional disparities if they are not designed to look for them. An audit that disaggregates by race and by gender separately may miss disparities concentrated at the intersection. An audit that covers only protected attributes specified in existing law may miss disparities along other important dimensions.
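The intersectional point can be made concrete with a small worked example. The counts below are hypothetical, chosen so that the arithmetic is easy to follow; the group labels and the helper `rate_by` are illustrative only.

```python
# Hypothetical approval counts: (race, gender) -> (approved, total)
counts = {
    ("A", "F"): (20, 100),
    ("A", "M"): (80, 100),
    ("B", "F"): (80, 100),
    ("B", "M"): (20, 100),
}

def rate_by(counts, index):
    """Selection rate after collapsing subgroups onto one attribute (0 = race, 1 = gender)."""
    agg = {}
    for key, (approved, total) in counts.items():
        a, t = agg.get(key[index], (0, 0))
        agg[key[index]] = (a + approved, t + total)
    return {k: a / t for k, (a, t) in agg.items()}

by_race = rate_by(counts, 0)
by_gender = rate_by(counts, 1)
by_intersection = {k: a / t for k, (a, t) in counts.items()}

print("by race:        ", by_race)
print("by gender:      ", by_gender)
print("by intersection:", by_intersection)
```

Disaggregated by race alone or by gender alone, every group is approved at the same rate, so an audit that stops there reports no disparity. Only the race-by-gender breakdown shows that two subgroups are approved at 20% while the other two are approved at 80%.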
A housing algorithm is audited by an independent researcher using the API-probing method: submitting thousands of synthetic rental applications and recording approvals and rejections. The researcher finds no disparate impact by race or gender. A housing advocacy group challenges the audit's conclusions. On what grounds might the challenge be valid?
A company conducts an internal fairness audit and publishes a report finding its AI hiring tool meets demographic parity. A regulatory agency then conducts an independent audit using a different evaluation dataset and finds the tool produces higher false negative rates for female applicants. How should these conflicting results be interpreted?
Design an Audit Protocol
You have been asked to design a fairness audit protocol for a county government's automated benefits eligibility system. The system processes applications for housing assistance and scores applicants from 0 to 100, with scores above 70 triggering automatic eligibility approval. Scores below 70 require case worker review.
Your audit protocol must specify:
1. Audit type: Will this be an internal, regulatory, or third-party audit? What level of access to training data, model internals, and deployment logs will you require? What access do you expect you will actually have, and how will you adapt if full access is denied?
2. Protected groups: Which groups will you compare? List the protected attributes you will analyze, including at least two intersectional combinations.
3. Fairness criteria: Specify at least three fairness criteria you will evaluate. For each, write the formal criterion and explain in plain language what it measures and why it matters for a housing eligibility system.
4. Evaluation dataset: How will you obtain data with known true outcomes? What are the risks of using historical application data as your evaluation set?
5. Counterfactual tests: Design one counterfactual test appropriate for this system. Describe the synthetic variations you would create and what finding would constitute evidence of bias.
6. Limitations statement: Write a three-sentence paragraph that honestly describes what your audit can and cannot establish. This will appear at the front of your audit report.
7. Post-audit monitoring: The audit is a snapshot. Propose a monitoring mechanism that would detect fairness degradation over time after the audit.