import pandas as pd
from sklearn.metrics import roc_auc_score, recall_score

def subgroup_analysis(y_true, y_pred, y_prob, groups, group_col):
    """Compute metrics by subgroup."""
    results = []
    for group in groups[group_col].unique():
        mask = groups[group_col] == group
        n = mask.sum()
        if n < 10:  # skip subgroups too small for stable estimates
            continue
        results.append({
            'group': group,
            'n': n,
            'prevalence': y_true[mask].mean(),
            'auroc': roc_auc_score(y_true[mask], y_prob[mask]),
            'sensitivity': recall_score(y_true[mask], y_pred[mask]),
        })
    return pd.DataFrame(results)

# Example: analyze by age group
df_results = subgroup_analysis(
    y_true, y_pred, y_prob,
    demographics, 'age_group'
)
print(df_results)
20 Fairness, Bias & Health Equity
AI systems trained on historical healthcare data can inherit and amplify existing disparities. This chapter examines how bias enters clinical AI, how to measure it, and what can be done to build more equitable systems.
20.1 The Obermeyer Case Study
Clinical Context: In 2019, Obermeyer et al. published a landmark study in Science revealing that a widely used algorithm for identifying high-risk patients was systematically discriminating against Black patients (Obermeyer et al. 2019). The algorithm was applied to roughly 200 million patients across the US healthcare system.
20.1.1 What Went Wrong
The algorithm predicted which patients would benefit from enrollment in a care management program. The label used for training? Healthcare costs in the following year.
The implicit assumption: sicker patients cost more. But this assumption ignores that Black patients, on average, have less access to healthcare due to systemic barriers. Equal illness does not produce equal healthcare spending when access is unequal.
The result: at any given risk score, Black patients were significantly sicker than White patients. To achieve the same predicted risk score, a Black patient needed to have more chronic conditions. The algorithm selected healthier White patients over sicker Black patients for care programs.
20.1.2 Quantifying the Disparity
The study found that reducing the bias would have increased the percentage of Black patients identified for extra care from 17.7% to 46.5%—the algorithm was missing more than half of the Black patients who should have qualified.
This wasn’t intentional discrimination. Race wasn’t even an input variable. The bias arose from the choice of prediction target: cost instead of health need. That choice encoded historical inequities into the algorithm.
20.1.3 Lessons Learned
- Labels encode values: The outcome you predict shapes who the model serves. “What to predict” is a moral choice, not just a technical one.
- Proxy discrimination: Excluding protected attributes doesn’t prevent discrimination. Correlated features (zip code, insurance type) can reconstruct them.
- Disparities compound: An algorithm deployed at scale amplifies small per-person biases into large population-level harms.
20.2 Fairness Definitions and Metrics
There is no single definition of fairness—different definitions correspond to different ethical principles, and some are mutually exclusive. Understanding the options helps you choose what matters for your application.
20.2.1 Demographic Parity
Demographic parity requires equal positive prediction rates across groups:
\[ P(\hat{Y}=1 | A=0) = P(\hat{Y}=1 | A=1) \]
where \(A\) is a protected attribute (e.g., race, sex).
Example: A hiring algorithm satisfies demographic parity if it recommends the same proportion of male and female candidates.
Limitation: Demographic parity ignores qualifications. If base rates truly differ (more men apply for engineering jobs), forcing equal selection may reduce overall accuracy.
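Checking demographic parity in practice amounts to comparing positive prediction rates by group. A minimal sketch, assuming y_pred holds binary predictions and A holds the protected attribute for the same patients (both illustrative names):
import pandas as pd

# Demographic parity: compare positive prediction rates across groups.
# y_pred (0/1 predictions) and A (protected attribute) are assumed to be
# equal-length arrays for the same patients.
rates = pd.DataFrame({'y_pred': y_pred, 'A': A}).groupby('A')['y_pred'].mean()
print(rates)
print("Demographic parity gap:", rates.max() - rates.min())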
20.2.2 Equalized Odds
Equalized odds requires equal true positive rates and false positive rates across groups:
\[ P(\hat{Y}=1 | Y=1, A=0) = P(\hat{Y}=1 | Y=1, A=1) \]
\[ P(\hat{Y}=1 | Y=0, A=0) = P(\hat{Y}=1 | Y=0, A=1) \]
This means the model makes errors at the same rates for positive and negative cases in both groups. A relaxed version, equal opportunity, requires only equal true positive rates.
For medical diagnosis: equalized odds ensures that sick patients have equal probability of being correctly identified regardless of demographic group.
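To check equalized odds empirically, compare true positive and false positive rates by group. A minimal sketch using the same illustrative y_true, y_pred, and A arrays as above:
import pandas as pd

# Equalized odds: compare TPR and FPR across groups.
df = pd.DataFrame({'y_true': y_true, 'y_pred': y_pred, 'A': A})
for group, g in df.groupby('A'):
    tpr = g.loc[g['y_true'] == 1, 'y_pred'].mean()  # P(Y_hat = 1 | Y = 1)
    fpr = g.loc[g['y_true'] == 0, 'y_pred'].mean()  # P(Y_hat = 1 | Y = 0)
    print(f"{group}: TPR = {tpr:.3f}, FPR = {fpr:.3f}")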
20.2.3 Calibration Across Groups
Calibration requires that predictions mean the same thing across groups:
\[ P(Y=1 | \hat{Y}=p, A=0) = P(Y=1 | \hat{Y}=p, A=1) = p \]
If the model outputs 70% risk for a Black patient and 70% risk for a White patient, both should have 70% probability of the outcome.
The Obermeyer algorithm was calibrated with respect to its target (a given predicted cost corresponded to the same actual cost in every group), but it was not fair, because cost was an inappropriate proxy for health need.
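A quick check of calibration within groups is to bin the predicted probabilities and compare observed outcome rates per bin. A sketch, with y_true, y_prob, and A as illustrative names as before:
import numpy as np
import pandas as pd

# Calibration within groups: in each risk bin, the observed outcome rate
# should be close to the bin's predicted risk for every group.
df = pd.DataFrame({'y_true': y_true, 'y_prob': y_prob, 'A': A})
df['risk_bin'] = pd.cut(df['y_prob'], bins=np.linspace(0, 1, 6), include_lowest=True)
print(df.groupby(['A', 'risk_bin'], observed=True)['y_true'].agg(['mean', 'count']))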
20.2.4 The Impossibility Theorem
A sobering result: when base rates differ between groups, you cannot simultaneously achieve demographic parity, equalized odds, and calibration; in fact, even calibration and equalized odds together are only possible for a perfect classifier. You must choose which fairness criteria matter most for your context.
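One way to see the tension: within a group whose base rate is \(p\), Bayes' rule fixes what a positive prediction is worth:
\[ P(Y=1 \mid \hat{Y}=1) = \frac{p \cdot \mathrm{TPR}}{p \cdot \mathrm{TPR} + (1-p) \cdot \mathrm{FPR}} \]
If equalized odds holds (equal TPR and FPR across groups) but the base rates differ, this quantity must differ between groups, so a positive prediction cannot mean the same thing in both, unless the classifier is perfect (TPR = 1, FPR = 0).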
For clinical AI:
- Screening: Often prioritize sensitivity (equal opportunity)
- Treatment allocation: Often prioritize calibration (predictions mean the same thing)
- Resource allocation: May require demographic parity if access disparities exist
20.3 Sources of Bias in Clinical Data
Bias can enter at every stage of the data pipeline. Understanding the sources helps identify mitigation strategies.
20.3.1 Sampling Bias
Who is in your training data?
- Patients from academic medical centers may differ from community hospitals
- Datasets from one country may not generalize to others
- Patients with insurance are overrepresented vs. uninsured
- Clinical trial participants are often younger, healthier, and less diverse
If your training population doesn’t match the deployment population, model performance will suffer for underrepresented groups.
20.3.2 Measurement Bias
Are outcomes measured consistently across groups?
- Pain assessment tools developed on White patients may underestimate pain in Black patients
- Pulse oximeters are less accurate on darker skin, leading to missed hypoxemia
- Diagnostic criteria normed on men may miss disease presentations in women (e.g., atypical heart attack symptoms)
If the ground truth labels are biased, the model learns biased predictions.
20.3.3 Historical Bias
Even accurate labels can encode past discrimination:
- Referral patterns reflect which patients physicians historically sent for advanced care
- Diagnosis rates reflect who had access to specialists
- Treatment records reflect unequal insurance coverage and drug formularies
The Obermeyer case is a prime example: historical cost data accurately reflected past spending but not health needs.
20.3.4 Label Bias
The outcome definition may not capture what you care about:
- Using “readmission” as a proxy for quality disadvantages hospitals serving sicker populations
- Using “prescription filled” as adherence misses patients who can’t afford medication
- Using “death within 30 days” as the outcome misses patients who die on day 32
20.4 Subgroup Analysis
Clinical Context: Your model has 90% accuracy overall. But what about for elderly patients? For patients with rare conditions? For patients from underrepresented racial groups? Subgroup analysis reveals disparities hidden by aggregate metrics.
20.4.1 Standard Practice
Subgroup analysis should be routine, not optional. For every model, report performance stratified by at least the following (a code sketch follows the list):
- Age groups (e.g., <40, 40-60, >60)
- Sex/gender
- Race/ethnicity (where available and appropriate)
- Disease severity or comorbidity count
- Institution or data source
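With the subgroup_analysis helper defined earlier, producing such a report is a single loop. A sketch, where the column names in demographics are illustrative:
# Stratified performance report across several attributes.
# Column names in 'demographics' are illustrative.
for col in ['age_group', 'sex', 'race_ethnicity', 'site']:
    print(f"\n=== Stratified by {col} ===")
    print(subgroup_analysis(y_true, y_pred, y_prob, demographics, col))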
20.4.2 HW4 Example: Age Group Analysis
In the diabetes prediction homework, you analyze model performance across age groups:
# Define age groups
age_groups = pd.cut(X_test['age'],
                    bins=[0, 30, 50, 100],
                    labels=['<30', '30-50', '>50'])

# Compute AUROC by age group
for group in ['<30', '30-50', '>50']:
    mask = age_groups == group
    auroc = roc_auc_score(y_test[mask], y_prob[mask])
    print(f"Age {group}: AUROC = {auroc:.3f} (n={mask.sum()})")
If AUROC for age >50 is 0.82 but age <30 is 0.65, investigate why. Common causes:
- Fewer young patients in training data
- Different disease presentations by age
- Missing age-relevant features
20.4.3 Statistical Considerations
Small subgroups have high variance. Report:
- Confidence intervals for all metrics
- Sample sizes per group
- Whether differences are statistically significant
A 5-point AUROC gap may not be significant if one group has only 50 samples.
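A percentile bootstrap is one simple way to attach confidence intervals to subgroup metrics. A minimal sketch (the function name and defaults are illustrative):
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_ci(y_true, y_prob, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for AUROC."""
    rng = np.random.default_rng(seed)
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    aurocs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:  # resample must contain both classes
            continue
        aurocs.append(roc_auc_score(y_true[idx], y_prob[idx]))
    lo, hi = np.percentile(aurocs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
Applied per subgroup (e.g., bootstrap_auroc_ci(y_test[mask], y_prob[mask])), this makes it clear when an apparent gap is within sampling noise.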
20.5 Mitigation Strategies
No silver bullet exists. Different strategies work for different bias sources.
20.5.1 Preprocessing: Fix the Data
- Resampling: Oversample underrepresented groups or undersample majority groups
- Reweighting: Assign higher loss weights to underrepresented groups (see the sketch after this list)
- Data augmentation: Generate synthetic examples for minority groups
- Better labels: Define outcomes that capture true health status, not healthcare utilization
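Reweighting is often the lightest-touch option, because most scikit-learn estimators accept per-sample weights. A minimal sketch, assuming the same demographics frame with a 'race' column used in the oversampling example below, and using LogisticRegression purely as an illustrative estimator:
from sklearn.linear_model import LogisticRegression

# Reweighting: weight each training example inversely to its group's frequency,
# then pass the weights to any estimator that accepts sample_weight.
group = demographics['race']  # assumed aligned with X_train, as below
sample_weights = group.map(1.0 / group.value_counts(normalize=True))
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train, sample_weight=sample_weights)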
from sklearn.utils import resample

# Oversample minority group
minority_mask = demographics['race'] == 'minority'
X_minority = X_train[minority_mask]
y_minority = y_train[minority_mask]
X_upsampled, y_upsampled = resample(
    X_minority, y_minority,
    n_samples=len(X_train[~minority_mask]),
    random_state=42
)
X_balanced = pd.concat([X_train[~minority_mask], X_upsampled])
y_balanced = pd.concat([y_train[~minority_mask], y_upsampled])
20.5.2 In-Processing: Train Fairly
- Adversarial debiasing: Train a model that cannot predict protected attributes from its representations
- Fairness constraints: Add penalties to the loss function for fairness violations
- Multi-objective optimization: Optimize accuracy and fairness jointly
Libraries like Fairlearn provide implementations:
from sklearn.linear_model import LogisticRegression
from fairlearn.reductions import ExponentiatedGradient, EqualizedOdds

# Train with an equalized odds constraint
base_estimator = LogisticRegression(max_iter=1000)  # any sklearn classifier works here
mitigator = ExponentiatedGradient(
    base_estimator,
    constraints=EqualizedOdds()
)
mitigator.fit(X_train, y_train, sensitive_features=A_train)
y_pred_fair = mitigator.predict(X_test)
20.5.3 Post-Processing: Adjust Predictions
- Threshold adjustment: Use different classification thresholds per group to equalize metrics
- Calibration: Ensure predictions are well-calibrated within each group
- Reject option: Abstain from prediction when confidence differs substantially across groups
import numpy as np

# Different thresholds per group to equalize sensitivity
thresholds = {'group_A': 0.5, 'group_B': 0.35}
y_pred_adjusted = np.where(
    group == 'group_A',
    y_prob > thresholds['group_A'],
    y_prob > thresholds['group_B']
)
20.5.4 Limitations of Technical Fixes
Technical interventions can help but cannot solve structural problems:
- If training data doesn’t include a group, no algorithm can serve them fairly
- If ground truth labels are biased, fairness constraints just rearrange the bias
- If the deployment context differs from training, fairness may not transfer
Fairness in AI requires both technical tools and institutional change: better data collection, diverse development teams, community input, and ongoing monitoring.
20.6 Building Equitable AI Systems
Beyond technical metrics, consider:
- Who is harmed? Identify the stakeholders most at risk from errors
- Who benefits? Ensure gains are distributed equitably
- Who decides? Include affected communities in development
- What are the alternatives? Is AI the right solution, or would resources be better spent on direct care?
Document fairness considerations in model cards (Mitchell et al. 2019) and monitor for drift after deployment. Fairness is not a checkbox—it requires ongoing attention throughout the AI lifecycle.