import pandas as pd
from sklearn.metrics import roc_auc_score, recall_score

def subgroup_analysis(y_true, y_pred, y_prob, groups, group_col):
    """Compute metrics by subgroup."""
    results = []
    for group in groups[group_col].unique():
        mask = groups[group_col] == group
        n = mask.sum()
        if n < 10:  # skip subgroups too small for stable estimates
            continue
        results.append({
            'group': group,
            'n': n,
            'prevalence': y_true[mask].mean(),
            'auroc': roc_auc_score(y_true[mask], y_prob[mask]),
            'sensitivity': recall_score(y_true[mask], y_pred[mask]),
        })
    return pd.DataFrame(results)

# Example: analyze by age group
df_results = subgroup_analysis(
    y_true, y_pred, y_prob,
    demographics, 'age_group'
)
print(df_results)
20 Fairness, Bias & Health Equity
AI systems trained on historical healthcare data can inherit and amplify existing disparities. This chapter examines how bias enters clinical AI, how to measure it, and what can be done to build more equitable systems.
20.1 The Obermeyer Case Study
Clinical Context: In 2019, Obermeyer et al. published a landmark study in Science revealing that a widely used algorithm for identifying high-risk patients was systematically discriminating against Black patients (Obermeyer et al. 2019). The algorithm was applied to roughly 200 million patients across the US healthcare system.
20.1.1 What Went Wrong
The algorithm predicted which patients would benefit from enrollment in a care management program. The label used for training? Healthcare costs in the following year.
The implicit assumption: sicker patients cost more. But this assumption ignores that Black patients, on average, have less access to healthcare due to systemic barriers. Equal illness does not produce equal healthcare spending when access is unequal.
The result: at any given risk score, Black patients were significantly sicker than White patients. To achieve the same predicted risk score, a Black patient needed to have more chronic conditions. The algorithm selected healthier White patients over sicker Black patients for care programs.
20.1.2 Quantifying the Disparity
The study found that reducing the bias would have increased the percentage of Black patients identified for extra care from 17.7% to 46.5%—the algorithm was missing more than half of the Black patients who should have qualified.
This wasn’t intentional discrimination. Race wasn’t even an input variable. The bias arose from the choice of prediction target: cost instead of health need. That choice encoded historical inequities into the algorithm.
20.1.3 Lessons Learned
- Labels encode values: The outcome you predict shapes who the model serves. “What to predict” is a moral choice, not just a technical one.
- Proxy discrimination: Excluding protected attributes doesn’t prevent discrimination. Correlated features (zip code, insurance type) can reconstruct them.
- Disparities compound: An algorithm deployed at scale amplifies small per-person biases into large population-level harms.
20.2 Fairness Definitions and Metrics
There is no single definition of fairness—different definitions correspond to different ethical principles, and some are mutually exclusive. Understanding the options helps you choose what matters for your application.
20.2.1 Demographic Parity
Demographic parity requires equal positive prediction rates across groups:
\[ P(\hat{Y}=1 | A=0) = P(\hat{Y}=1 | A=1) \]
where \(A\) is a protected attribute (e.g., race, sex).
Example: A hiring algorithm satisfies demographic parity if it recommends the same proportion of male and female candidates.
Limitation: Demographic parity ignores qualifications. If base rates truly differ (more men apply for engineering jobs), forcing equal selection may reduce overall accuracy.
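Checking demographic parity in practice amounts to comparing positive prediction rates by group. A minimal sketch, assuming y_pred holds binary predictions and A holds the protected attribute for the same patients (both illustrative names):
import pandas as pd

# Demographic parity: compare positive prediction rates across groups.
# y_pred (0/1 predictions) and A (protected attribute) are assumed to be
# equal-length arrays for the same patients.
rates = pd.DataFrame({'y_pred': y_pred, 'A': A}).groupby('A')['y_pred'].mean()
print(rates)
print("Demographic parity gap:", rates.max() - rates.min())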
20.2.2 Equalized Odds
Equalized odds requires equal true positive rates and false positive rates across groups:
\[ P(\hat{Y}=1 | Y=1, A=0) = P(\hat{Y}=1 | Y=1, A=1) \]
\[ P(\hat{Y}=1 | Y=0, A=0) = P(\hat{Y}=1 | Y=0, A=1) \]
This means the model makes errors at the same rates for positive and negative cases in both groups. A relaxed version, equal opportunity, requires only equal true positive rates.
For medical diagnosis: equalized odds ensures that sick patients have equal probability of being correctly identified regardless of demographic group.
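To check equalized odds empirically, compare true positive and false positive rates by group. A minimal sketch using the same illustrative y_true, y_pred, and A arrays as above:
import pandas as pd

# Equalized odds: compare TPR and FPR across groups.
df = pd.DataFrame({'y_true': y_true, 'y_pred': y_pred, 'A': A})
for group, g in df.groupby('A'):
    tpr = g.loc[g['y_true'] == 1, 'y_pred'].mean()  # P(Y_hat = 1 | Y = 1)
    fpr = g.loc[g['y_true'] == 0, 'y_pred'].mean()  # P(Y_hat = 1 | Y = 0)
    print(f"{group}: TPR = {tpr:.3f}, FPR = {fpr:.3f}")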
20.2.3 Calibration Across Groups
Calibration requires that predictions mean the same thing across groups:
\[ P(Y=1 | \hat{Y}=p, A=0) = P(Y=1 | \hat{Y}=p, A=1) = p \]
If the model outputs 70% risk for a Black patient and 70% risk for a White patient, both should have 70% probability of the outcome.
The Obermeyer algorithm was calibrated with respect to its target (a given predicted cost corresponded to the same actual cost in every group), but it was not fair, because cost was an inappropriate proxy for health need.
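A quick check of calibration within groups is to bin the predicted probabilities and compare observed outcome rates per bin. A sketch, with y_true, y_prob, and A as illustrative names as before:
import numpy as np
import pandas as pd

# Calibration within groups: in each risk bin, the observed outcome rate
# should be close to the bin's predicted risk for every group.
df = pd.DataFrame({'y_true': y_true, 'y_prob': y_prob, 'A': A})
df['risk_bin'] = pd.cut(df['y_prob'], bins=np.linspace(0, 1, 6), include_lowest=True)
print(df.groupby(['A', 'risk_bin'], observed=True)['y_true'].agg(['mean', 'count']))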
20.2.4 The Impossibility Theorem
A sobering result: when base rates differ between groups, you cannot simultaneously achieve demographic parity, equalized odds, and calibration; in fact, even calibration and equalized odds together are only possible for a perfect classifier. You must choose which fairness criteria matter most for your context.
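One way to see the tension: within a group whose base rate is \(p\), Bayes' rule fixes what a positive prediction is worth:
\[ P(Y=1 \mid \hat{Y}=1) = \frac{p \cdot \mathrm{TPR}}{p \cdot \mathrm{TPR} + (1-p) \cdot \mathrm{FPR}} \]
If equalized odds holds (equal TPR and FPR across groups) but the base rates differ, this quantity must differ between groups, so a positive prediction cannot mean the same thing in both, unless the classifier is perfect (TPR = 1, FPR = 0).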
For clinical AI:
- Screening: Often prioritize sensitivity (equal opportunity)
- Treatment allocation: Often prioritize calibration (predictions mean the same thing)
- Resource allocation: May require demographic parity if access disparities exist
20.3 Sources of Bias in Clinical Data
Bias can enter at every stage of the data pipeline. Understanding the sources helps identify mitigation strategies.
20.3.1 Sampling Bias
Who is in your training data?
- Patients from academic medical centers may differ from community hospitals
- Datasets from one country may not generalize to others
- Patients with insurance are overrepresented vs. uninsured
- Clinical trial participants are often younger, healthier, and less diverse
If your training population doesn’t match the deployment population, model performance will suffer for underrepresented groups.
20.3.2 Measurement Bias
Are outcomes measured consistently across groups?
- Pain assessment tools developed on White patients may underestimate pain in Black patients
- Pulse oximeters are less accurate on darker skin, leading to missed hypoxemia
- Diagnostic criteria normed on men may miss disease presentations in women (e.g., atypical heart attack symptoms)
If the ground truth labels are biased, the model learns biased predictions.
20.3.3 Historical Bias
Even accurate labels can encode past discrimination:
- Referral patterns reflect which patients physicians historically sent for advanced care
- Diagnosis rates reflect who had access to specialists
- Treatment records reflect unequal insurance coverage and drug formularies
The Obermeyer case is a prime example: historical cost data accurately reflected past spending but not health needs.
20.3.4 Label Bias
The outcome definition may not capture what you care about:
- Using “readmission” as a proxy for quality disadvantages hospitals serving sicker populations
- Using “prescription filled” as adherence misses patients who can’t afford medication
- Using “death within 30 days” as the outcome misses patients who die on day 32
20.4 Subgroup Analysis
Clinical Context: Your model has 90% accuracy overall. But what about for elderly patients? For patients with rare conditions? For patients from underrepresented racial groups? Subgroup analysis reveals disparities hidden by aggregate metrics.
20.4.1 Standard Practice
Subgroup analysis should be routine, not optional. For every model, report performance stratified by at least the following (a code sketch follows the list):
- Age groups (e.g., <40, 40-60, >60)
- Sex/gender
- Race/ethnicity (where available and appropriate)
- Disease severity or comorbidity count
- Institution or data source
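With the subgroup_analysis helper defined earlier, producing such a report is a single loop. A sketch, where the column names in demographics are illustrative:
# Stratified performance report across several attributes.
# Column names in 'demographics' are illustrative.
for col in ['age_group', 'sex', 'race_ethnicity', 'site']:
    print(f"\n=== Stratified by {col} ===")
    print(subgroup_analysis(y_true, y_pred, y_prob, demographics, col))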
20.4.2 HW4 Example: Age Group Analysis
In the diabetes prediction homework, you analyze model performance across age groups:
# Define age groups
age_groups = pd.cut(X_test['age'],
                    bins=[0, 30, 50, 100],
                    labels=['<30', '30-50', '>50'])

# Compute AUROC by age group
for group in ['<30', '30-50', '>50']:
    mask = age_groups == group
    auroc = roc_auc_score(y_test[mask], y_prob[mask])
    print(f"Age {group}: AUROC = {auroc:.3f} (n={mask.sum()})")
If AUROC for age >50 is 0.82 but age <30 is 0.65, investigate why. Common causes:
- Fewer young patients in training data
- Different disease presentations by age
- Missing age-relevant features
20.4.3 Statistical Considerations
Small subgroups have high variance. Report:
- Confidence intervals for all metrics
- Sample sizes per group
- Whether differences are statistically significant
A 5-point AUROC gap may not be significant if one group has only 50 samples.
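A percentile bootstrap is one simple way to attach confidence intervals to subgroup metrics. A minimal sketch (the function name and defaults are illustrative):
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_ci(y_true, y_prob, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for AUROC."""
    rng = np.random.default_rng(seed)
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    aurocs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:  # resample must contain both classes
            continue
        aurocs.append(roc_auc_score(y_true[idx], y_prob[idx]))
    lo, hi = np.percentile(aurocs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
Applied per subgroup (e.g., bootstrap_auroc_ci(y_test[mask], y_prob[mask])), this makes it clear when an apparent gap is within sampling noise.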
20.5 Mitigation Strategies
No silver bullet exists. Different strategies work for different bias sources.
20.5.1 Preprocessing: Fix the Data
- Resampling: Oversample underrepresented groups or undersample majority groups
- Reweighting: Assign higher loss weights to underrepresented groups (see the sketch after this list)
- Data augmentation: Generate synthetic examples for minority groups
- Better labels: Define outcomes that capture true health status, not healthcare utilization
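Reweighting is often the lightest-touch option, because most scikit-learn estimators accept per-sample weights. A minimal sketch, assuming the same demographics frame with a 'race' column used in the oversampling example below, and using LogisticRegression purely as an illustrative estimator:
from sklearn.linear_model import LogisticRegression

# Reweighting: weight each training example inversely to its group's frequency,
# then pass the weights to any estimator that accepts sample_weight.
group = demographics['race']  # assumed aligned with X_train, as below
sample_weights = group.map(1.0 / group.value_counts(normalize=True))
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train, sample_weight=sample_weights)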
from sklearn.utils import resample

# Oversample minority group
minority_mask = demographics['race'] == 'minority'
X_minority = X_train[minority_mask]
y_minority = y_train[minority_mask]
X_upsampled, y_upsampled = resample(
    X_minority, y_minority,
    n_samples=len(X_train[~minority_mask]),
    random_state=42
)
X_balanced = pd.concat([X_train[~minority_mask], X_upsampled])
y_balanced = pd.concat([y_train[~minority_mask], y_upsampled])
20.5.2 In-Processing: Train Fairly
- Adversarial debiasing: Train a model that cannot predict protected attributes from its representations
- Fairness constraints: Add penalties to the loss function for fairness violations
- Multi-objective optimization: Optimize accuracy and fairness jointly
Libraries like Fairlearn provide implementations:
from sklearn.linear_model import LogisticRegression
from fairlearn.reductions import ExponentiatedGradient, EqualizedOdds

# Train with an equalized odds constraint
base_estimator = LogisticRegression(max_iter=1000)  # any sklearn classifier works here
mitigator = ExponentiatedGradient(
    base_estimator,
    constraints=EqualizedOdds()
)
mitigator.fit(X_train, y_train, sensitive_features=A_train)
y_pred_fair = mitigator.predict(X_test)
20.5.3 Post-Processing: Adjust Predictions
- Threshold adjustment: Use different classification thresholds per group to equalize metrics
- Calibration: Ensure predictions are well-calibrated within each group
- Reject option: Abstain from prediction when confidence differs substantially across groups
import numpy as np

# Different thresholds per group to equalize sensitivity
thresholds = {'group_A': 0.5, 'group_B': 0.35}
y_pred_adjusted = np.where(
    group == 'group_A',
    y_prob > thresholds['group_A'],
    y_prob > thresholds['group_B']
)
20.5.4 Limitations of Technical Fixes
Technical interventions can help but cannot solve structural problems:
- If training data doesn’t include a group, no algorithm can serve them fairly
- If ground truth labels are biased, fairness constraints just rearrange the bias
- If the deployment context differs from training, fairness may not transfer
Fairness in AI requires both technical tools and institutional change: better data collection, diverse development teams, community input, and ongoing monitoring.
20.6 Building Equitable AI Systems
Beyond technical metrics, consider:
- Who is harmed? Identify the stakeholders most at risk from errors
- Who benefits? Ensure gains are distributed equitably
- Who decides? Include affected communities in development
- What are the alternatives? Is AI the right solution, or would resources be better spent on direct care?
Document fairness considerations in model cards (Mitchell et al. 2019) and monitor for drift after deployment. Fairness is not a checkbox—it requires ongoing attention throughout the AI lifecycle.