import numpy as np
from sklearn.metrics import confusion_matrix, classification_report
# Example predictions
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0]) # 4 positive, 6 negative
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 1, 0, 0]) # Model predictions
# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
# Output:
# [[5 1]
# [2 2]]
# TN=5, FP=1, FN=2, TP=2
print("\nClassification Report:")
print(classification_report(y_true, y_pred, target_names=['Negative', 'Positive']))

4 Statistics for Medical AI
Clinical Context: A machine learning paper claims 95% accuracy for detecting pancreatic cancer from CT scans. Is this impressive? Useful? Before you can answer, you need to know: 95% of what? On which patients? How was it validated? This chapter provides the statistical foundation to critically evaluate medical AI systems—and to build ones that actually work in practice.
Statistics and machine learning share deep roots, but they often speak different languages. Classical statistics focuses on inference—drawing conclusions about populations from samples, testing hypotheses, quantifying uncertainty. Machine learning focuses on prediction—building models that generalize to new data. In medical AI, we need both: inference to understand what our models are doing, and prediction to make them clinically useful.
This chapter bridges your existing statistics knowledge to the specific needs of medical AI. We’ll cover probability foundations with clinical applications, explain why traditional hypothesis testing doesn’t translate directly to model evaluation, and develop a thorough understanding of the metrics that matter in healthcare: sensitivity, specificity, PPV, NPV, ROC curves, and calibration.
4.1 Probability Foundations
4.1.1 Random Variables and Distributions
Clinical Context: A patient’s blood glucose level, whether a tumor is malignant, the number of ER visits in a year—these are all random variables. Understanding their distributions is fundamental to building models that predict them.
You’ve likely encountered probability distributions in biostatistics. Here’s a quick review focused on their ML applications.
Discrete distributions describe countable outcomes:
- Bernoulli: Single binary outcome (disease present/absent)
- Binomial: Number of successes in n trials (patients responding to treatment)
- Poisson: Count of rare events (adverse events per patient-year)
Continuous distributions describe measurements:
- Normal (Gaussian): Many biological measurements approximate this
- Log-normal: Right-skewed data like length of stay, drug concentrations
- Exponential: Time until an event (survival analysis)
In machine learning, we often model the conditional distribution \(P(Y|X)\)—the probability of an outcome \(Y\) given features \(X\). A logistic regression models the probability of disease given patient characteristics. A neural network might model the probability distribution over possible diagnoses.
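As a toy illustration (entirely synthetic features, not drawn from any dataset in this book), a fitted logistic regression exposes this conditional distribution through predict_proba:

import numpy as np
from sklearn.linear_model import LogisticRegression
# Synthetic patient features: age and glucose
rng = np.random.default_rng(0)
n = 500
age = rng.normal(60, 10, n)
glucose = rng.normal(110, 25, n)
# Synthetic outcome whose probability rises with age and glucose
logit = -10 + 0.08 * age + 0.03 * glucose
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
X = np.column_stack([age, glucose])
model = LogisticRegression(max_iter=1000).fit(X, y)
# predict_proba estimates P(Y=1 | X) for a new patient
new_patient = np.array([[72, 145]])  # age 72, glucose 145 mg/dL
print(f"Estimated P(disease | x): {model.predict_proba(new_patient)[0, 1]:.2f}")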
4.1.2 Conditional Probability and Bayes’ Theorem
Clinical Context: A screening test is positive. What’s the probability the patient actually has the disease? This question—deceptively simple—is where conditional probability becomes essential.
Bayes’ theorem relates conditional probabilities:
\[P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}\]
In medical terms, translating to disease (\(D\)) and positive test (\(T^+\)):
\[P(D|T^+) = \frac{P(T^+|D) \cdot P(D)}{P(T^+)}\]
The terms have clinical names:
- \(P(D|T^+)\): Positive predictive value (PPV) — probability of disease given positive test
- \(P(T^+|D)\): Sensitivity — probability of positive test given disease
- \(P(D)\): Prevalence or pre-test probability
- \(P(T^+)\): Probability of positive test (in anyone)
The denominator expands using the law of total probability:
\[P(T^+) = P(T^+|D) \cdot P(D) + P(T^+|\neg D) \cdot P(\neg D)\]
Where \(P(T^+|\neg D)\) is the false positive rate (1 - specificity).
4.1.3 The Mammography Problem
This classic example illustrates why base rates matter—even for AI systems.
Consider a breast cancer screening scenario:
- Prevalence in screening population: 1% (\(P(D) = 0.01\))
- Sensitivity of the test: 90% (\(P(T^+|D) = 0.90\))
- Specificity of the test: 90% (\(P(\neg T^+|\neg D) = 0.90\))
A woman tests positive. What’s the probability she has cancer?
Many people—including physicians—intuitively guess around 90%. Let’s calculate:
\[P(D|T^+) = \frac{0.90 \times 0.01}{0.90 \times 0.01 + 0.10 \times 0.99} = \frac{0.009}{0.009 + 0.099} = \frac{0.009}{0.108} \approx 0.083\]
The answer is about 8%, not 90%. Despite a “90% accurate” test, most positive results are false positives because the disease is rare.
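The same arithmetic in code, using only the numbers above:

# Mammography example: Bayes' theorem with the numbers above
prevalence = 0.01      # P(D)
sensitivity = 0.90     # P(T+ | D)
specificity = 0.90     # P(T- | not D)
p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)  # P(T+)
ppv = sensitivity * prevalence / p_pos  # P(D | T+)
print(f"P(T+) = {p_pos:.3f}, P(D | T+) = {ppv:.3f}")  # ~0.108 and ~0.083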
This has profound implications for medical AI:
- Performance metrics must account for prevalence: A model’s real-world utility depends on the population it’s deployed in
- High sensitivity isn’t enough: If specificity is imperfect, false positives overwhelm true positives in low-prevalence settings
- AI systems inherit this problem: A deep learning model with 95% sensitivity and 95% specificity for a 1% prevalence condition will still have a PPV of only 16%
4.1.4 Pre-Test and Post-Test Probability
Clinicians often think in terms of how a test changes their assessment:
- Pre-test probability: Prior belief about disease probability (based on symptoms, demographics, clinical judgment)
- Post-test probability: Updated probability after seeing test results
This is Bayesian reasoning in action. An AI prediction is just another “test”—it should update clinical judgment, not replace it.
Likelihood ratios provide a convenient way to update probabilities:
\[\text{Positive LR} = \frac{\text{Sensitivity}}{1 - \text{Specificity}}\]
\[\text{Post-test odds} = \text{Pre-test odds} \times \text{Likelihood ratio}\]
A positive LR of 10 means a positive result increases the odds of disease 10-fold. A clinically useful AI model should behave like a test with a strong positive likelihood ratio: a positive output should substantially shift the probability estimate.
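A minimal sketch of this update, converting pre-test probability to odds, applying the likelihood ratio, and converting back:

def post_test_probability(pre_test_prob, sensitivity, specificity, test_positive=True):
    """Update a pre-test probability using the likelihood ratio of a test result."""
    lr = sensitivity / (1 - specificity) if test_positive else (1 - sensitivity) / specificity
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

# 20% pre-test probability, test (or model) with 90% sensitivity and specificity
print(f"Positive result: {post_test_probability(0.20, 0.90, 0.90, True):.2f}")   # ~0.69
print(f"Negative result: {post_test_probability(0.20, 0.90, 0.90, False):.2f}")  # ~0.03

With a 20% pre-test probability, a positive result moves the probability to about 69% and a negative result to about 3%.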
4.1.5 Independence and Conditional Independence
Two events are independent if knowing one tells you nothing about the other:
\[P(A \cap B) = P(A) \cdot P(B)\]
Conditional independence is subtler: \(A\) and \(B\) are conditionally independent given \(C\) if:
\[P(A \cap B | C) = P(A|C) \cdot P(B|C)\]
This matters in medical AI because:
- Naive Bayes classifiers assume features are conditionally independent given the class label—often violated in medical data
- Graphical models represent conditional independence structures
- Confounding occurs when two variables appear related but are both caused by a third
Example: Ice cream sales correlate with drowning deaths. They’re not causally related—both are caused by summer weather. Conditional on season, they’re independent.
4.2 From Hypothesis Testing to Model Evaluation
Clinical Context: You’ve likely been trained to look for “p < 0.05” as the marker of a meaningful result. In machine learning, this framework doesn’t directly apply—and understanding why will help you avoid common evaluation mistakes.
4.2.1 The Classical Framework (Brief Review)
Traditional hypothesis testing proceeds:
- State a null hypothesis (\(H_0\): no effect)
- Collect data and compute a test statistic
- Calculate the p-value: probability of data this extreme if \(H_0\) is true
- Reject \(H_0\) if p-value < significance level (often 0.05)
This framework connects to diagnostic testing:
| Statistical Term | Diagnostic Testing Equivalent |
|---|---|
| Type I error (false positive) | False positive rate (1 - specificity) |
| Type II error (false negative) | False negative rate (1 - sensitivity) |
| Power (1 - Type II error rate) | Sensitivity |
| Significance level | False positive rate |
4.2.2 Why P-Values Don’t Evaluate Models
Here’s the fundamental disconnect: hypothesis testing asks whether an effect exists; machine learning asks how well a model predicts.
Consider comparing two models:
- Model A: AUC = 0.82
- Model B: AUC = 0.85
Is Model B better? A traditional approach might test:
\(H_0\): AUC\(_A\) = AUC\(_B\)
With enough data, you’ll get p < 0.05. But this tells you the difference is statistically detectable, not that it’s clinically meaningful. A p-value of 0.001 for a 0.01 AUC improvement isn’t impressive—it just means you have a lot of data.
Conversely, with small datasets, important differences might not reach significance. The p-value depends on sample size, not just effect size.
4.2.3 The Replication Crisis and Its Lessons
The past decade has revealed widespread problems with p-value-based research:
- Publication bias: Only “significant” results get published
- P-hacking: Trying multiple analyses until p < 0.05
- HARKing: Hypothesizing After Results are Known
- Overfitting to significance: Models tuned to achieve p < 0.05 on one dataset
Medical AI research isn’t immune. A 2019 analysis found that many published medical AI studies:
- Tested on the same data used for model development
- Reported only the best-performing model variant
- Used subgroup analyses that weren’t pre-specified
4.2.4 What ML Does Differently
Machine learning evaluation focuses on predictive performance on held-out data:
- Train/test split: Never evaluate on training data
- Cross-validation: Estimate performance across multiple splits
- External validation: Test on completely independent datasets
Instead of asking “is this effect real?”, we ask “how well does this model generalize?”
Key metrics replace p-values:
- Discrimination: Can the model rank-order risk? (AUC, concordance)
- Calibration: Are probability outputs accurate? (calibration curves, Brier score)
- Clinical utility: Does the model improve decisions? (net benefit, decision curves)
The shift in mindset: we’re not trying to prove something is true; we’re trying to build something that works.
4.2.5 Multiple Comparisons in Model Selection
When you compare many models, some will appear best by chance. This is the machine learning version of multiple testing.
Suppose you try 20 model variants and report the best test-set performance. Even if all variants were truly equivalent, the best of the 20 would appear noticeably better than the average purely by chance, and the smaller the test set, the larger that spurious edge.
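A small simulation makes the selection effect concrete; here every “model” is pure noise, so any apparent edge for the best one is luck (a sketch with synthetic labels):

import numpy as np
from sklearn.metrics import roc_auc_score
rng = np.random.default_rng(0)
n, n_models = 200, 20
y = rng.binomial(1, 0.3, n)
# 20 "models" that output pure noise: the true AUC of each is 0.5
aucs = [roc_auc_score(y, rng.uniform(size=n)) for _ in range(n_models)]
print(f"Mean AUC: {np.mean(aucs):.3f}, best of {n_models}: {np.max(aucs):.3f}")
# The best "model" typically lands well above 0.5 purely by chance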
Solutions:
- Nested cross-validation: Tune hyperparameters in inner loop, evaluate in outer loop
- Pre-registration: Specify your evaluation plan before looking at results
- Conservative reporting: Report confidence intervals, not just point estimates
- External validation: The ultimate test is independent data
4.3 Evaluation Metrics for Classification
Clinical Context: Your model outputs a probability of sepsis. How do you quantify whether these predictions are useful? The choice of metric shapes what “good” means—and different stakeholders care about different things.
4.3.1 The Confusion Matrix
All classification metrics derive from the confusion matrix:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |
From these four counts, we compute everything else.
4.3.2 Sensitivity and Specificity
Sensitivity (True Positive Rate, Recall): Of those with the condition, how many did we detect?
\[\text{Sensitivity} = \frac{TP}{TP + FN}\]
Specificity (True Negative Rate): Of those without the condition, how many did we correctly identify as negative?
\[\text{Specificity} = \frac{TN}{TN + FP}\]
Clinical interpretation:
- High sensitivity: Few missed cases (low false negatives). Critical for ruling OUT disease. A negative result is reassuring.
- High specificity: Few false alarms (low false positives). Critical for ruling IN disease. A positive result is convincing.
The clinical mnemonic: SnNout (Sensitive test, Negative result rules out) and SpPin (Specific test, Positive result rules in).
4.3.3 The Sensitivity-Specificity Tradeoff
Most models output a probability, and we choose a threshold to convert to binary predictions. Moving this threshold trades sensitivity for specificity:
- Lower threshold (call more things positive): Higher sensitivity, lower specificity
- Higher threshold (call fewer things positive): Lower sensitivity, higher specificity
The right threshold depends on clinical context:
| Scenario | Priority | Threshold |
|---|---|---|
| Cancer screening | Don’t miss cases | Low (favor sensitivity) |
| Surgical indication | Avoid unnecessary surgery | High (favor specificity) |
| ICU admission | Balance both | Depends on capacity |
from sklearn.metrics import precision_recall_curve, roc_curve
import numpy as np
# Model probability outputs (not binary predictions)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.4, 0.3, 0.2, 0.1, 0.15, 0.6, 0.05, 0.1])
# Different thresholds give different sensitivity/specificity
for threshold in [0.3, 0.5, 0.7]:
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    print(f"Threshold {threshold}: Sensitivity={sensitivity:.2f}, Specificity={specificity:.2f}")

4.3.4 PPV and NPV: What Clinicians Care About
While sensitivity and specificity describe the test, clinicians need to answer: “Given this result, what’s the probability of disease?”
Positive Predictive Value (PPV): If the test is positive, what’s the probability of disease?
\[\text{PPV} = \frac{TP}{TP + FP}\]
Negative Predictive Value (NPV): If the test is negative, what’s the probability of no disease?
\[\text{NPV} = \frac{TN}{TN + FN}\]
Critical insight: PPV and NPV depend on prevalence. The same test performs differently in different populations.
def calculate_ppv_npv(sensitivity, specificity, prevalence):
    """Calculate PPV and NPV from sensitivity, specificity, and prevalence."""
    # Using Bayes' theorem
    tp_rate = sensitivity * prevalence
    fp_rate = (1 - specificity) * (1 - prevalence)
    fn_rate = (1 - sensitivity) * prevalence
    tn_rate = specificity * (1 - prevalence)
    ppv = tp_rate / (tp_rate + fp_rate)
    npv = tn_rate / (tn_rate + fn_rate)
    return ppv, npv
# Same test (90% sensitivity, 90% specificity) in different populations
# Same test (90% sensitivity, 90% specificity) in different populations
for prevalence in [0.01, 0.10, 0.50]:
    ppv, npv = calculate_ppv_npv(0.90, 0.90, prevalence)
    print(f"Prevalence {prevalence*100:.0f}%: PPV={ppv:.2f}, NPV={npv:.2f}")
# Output:
# Prevalence 1%: PPV=0.08, NPV=1.00
# Prevalence 10%: PPV=0.50, NPV=0.99
# Prevalence 50%: PPV=0.90, NPV=0.90

This is why a screening test that works in a specialty clinic (high prevalence) may generate mostly false positives in primary care (low prevalence).
4.3.5 ROC Curves and AUC
The Receiver Operating Characteristic (ROC) curve plots sensitivity vs. (1 - specificity) across all possible thresholds.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
# Simulated model outputs
np.random.seed(42)
n = 1000
y_true = np.random.binomial(1, 0.3, n) # 30% positive
# Better model: higher scores for positives
y_prob_good = np.where(y_true == 1,
np.random.beta(5, 2, n), # Positives: higher scores
np.random.beta(2, 5, n)) # Negatives: lower scores
# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_prob_good)
auc = roc_auc_score(y_true, y_prob_good)
# Plot
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, 'b-', linewidth=2, label=f'Model (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.5)')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.title('ROC Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

AUC (Area Under the ROC Curve) summarizes discrimination in a single number:
- AUC = 0.5: No discrimination (random guessing)
- AUC = 0.7-0.8: Acceptable discrimination
- AUC = 0.8-0.9: Excellent discrimination
- AUC > 0.9: Outstanding discrimination
Interpretation: AUC is the probability that a randomly chosen positive case ranks higher than a randomly chosen negative case.
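You can check this interpretation on the simulated example above by counting correctly ordered positive/negative pairs (ties counted as half):

# Verify the pairwise interpretation of AUC on the simulated scores above
pos_scores = y_prob_good[y_true == 1]
neg_scores = y_prob_good[y_true == 0]
# Fraction of (positive, negative) pairs where the positive case scores higher
greater = (pos_scores[:, None] > neg_scores[None, :]).mean()
ties = (pos_scores[:, None] == neg_scores[None, :]).mean()
print(f"Pairwise probability: {greater + 0.5 * ties:.3f} vs AUC: {auc:.3f}")  # identical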
4.3.6 Limitations of AUC
AUC is popular but has important limitations:
- Ignores calibration: A model can have high AUC but output poorly calibrated probabilities
- Averages across all thresholds: Clinically, you’ll use one threshold—performance there matters more
- Insensitive to class imbalance: Can look good even when PPV is terrible
- Doesn’t reflect clinical utility: Two models with the same AUC might have very different clinical value
Use AUC as one metric among several, not the sole criterion.
4.3.7 Precision-Recall Curves
For imbalanced datasets (rare diseases), precision-recall curves are often more informative than ROC.
Precision = PPV = \(\frac{TP}{TP + FP}\)
Recall = Sensitivity = \(\frac{TP}{TP + FN}\)
from sklearn.metrics import precision_recall_curve, average_precision_score
# Imbalanced data: 5% positive
np.random.seed(42)
n = 1000
y_true_imbalanced = np.random.binomial(1, 0.05, n)
y_prob_imbalanced = np.where(y_true_imbalanced == 1,
np.random.beta(4, 2, n),
np.random.beta(2, 4, n))
precision, recall, thresholds = precision_recall_curve(y_true_imbalanced, y_prob_imbalanced)
ap = average_precision_score(y_true_imbalanced, y_prob_imbalanced)
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, 'b-', linewidth=2, label=f'Model (AP = {ap:.3f})')
plt.axhline(y=0.05, color='k', linestyle='--', label='Random (prevalence)')
plt.xlabel('Recall (Sensitivity)')
plt.ylabel('Precision (PPV)')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Average Precision (AP) summarizes the PR curve. For rare outcomes, this is often more meaningful than AUC.
4.3.8 Calibration: Are the Probabilities Right?
A model might correctly rank patients by risk but output wrong probability values. Calibration measures whether predicted probabilities match observed frequencies.
If a model says “30% risk of readmission” for a group of patients, about 30% should actually be readmitted.
from sklearn.calibration import calibration_curve
# Generate predictions and outcomes
np.random.seed(42)
n = 2000
y_prob = np.random.beta(2, 5, n) # Model predictions
# Create outcomes that are systematically different from predictions (miscalibrated)
y_true = np.random.binomial(1, np.clip(y_prob * 1.5, 0, 1), n)
# Compute calibration curve
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
plt.figure(figsize=(8, 6))
plt.plot(prob_pred, prob_true, 'bo-', label='Model')
plt.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
plt.xlabel('Mean Predicted Probability')
plt.ylabel('Fraction of Positives')
plt.title('Calibration Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Brier Score combines discrimination and calibration:
\[\text{Brier Score} = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)^2\]
Lower is better. Brier score is the mean squared error of probability predictions.
from sklearn.metrics import brier_score_loss
brier = brier_score_loss(y_true, y_prob)
print(f"Brier Score: {brier:.4f}")
# Perfect predictions: 0; always predicting p=0.5: 0.25

Why calibration matters clinically:
- Shared decision-making requires accurate probabilities (“Your risk is about 15%”)
- Clinical guidelines often use risk thresholds (“Treat if risk > 10%”)
- Poorly calibrated models can systematically over- or under-treat
4.3.9 Decision Curve Analysis and Net Benefit
Clinical Context: Your sepsis model has AUC 0.85 and is well-calibrated. But should clinicians actually use it? AUC measures discrimination; calibration measures probability accuracy. Neither directly answers: “Does using this model improve clinical decisions compared to alternatives?” Decision curve analysis (DCA) fills this gap (Vickers and Elkin 2006).
The Clinical Decision Problem
Consider a model predicting 30-day mortality. At what probability threshold should you change management (e.g., palliative care consultation)? This depends on the relative costs of:
- False positives: Unnecessary interventions, patient/family distress
- False negatives: Missed opportunities to improve end-of-life care
Different clinicians may have different thresholds depending on patient preferences and clinical context. DCA evaluates model utility across a range of thresholds.
Net Benefit
The net benefit at threshold \(p_t\) is:
\[ \text{Net Benefit} = \frac{TP}{n} - \frac{FP}{n} \times \frac{p_t}{1 - p_t} \]
The weighting factor \(\frac{p_t}{1-p_t}\) reflects the relative harm of false positives vs. false negatives implied by the threshold choice. At a threshold of 0.2, the factor is \(0.2/0.8 = 0.25\), so you’re implicitly saying a false positive is one-quarter as harmful as a false negative.
import numpy as np
import matplotlib.pyplot as plt
def calculate_net_benefit(y_true, y_prob, thresholds):
    """Calculate net benefit at each threshold."""
    n = len(y_true)
    net_benefits = []
    for pt in thresholds:
        y_pred = (y_prob >= pt).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        # Net benefit formula
        nb = (tp / n) - (fp / n) * (pt / (1 - pt))
        net_benefits.append(nb)
    return np.array(net_benefits)
def decision_curve_analysis(y_true, y_prob, model_name="Model"):
    """Generate decision curve analysis plot."""
    thresholds = np.arange(0.01, 0.99, 0.01)
    # Model net benefit
    nb_model = calculate_net_benefit(y_true, y_prob, thresholds)
    # "Treat all" strategy (everyone gets intervention)
    prevalence = y_true.mean()
    nb_treat_all = prevalence - (1 - prevalence) * thresholds / (1 - thresholds)
    # "Treat none" strategy (baseline = 0)
    nb_treat_none = np.zeros_like(thresholds)
    # Plot
    plt.figure(figsize=(10, 6))
    plt.plot(thresholds, nb_model, 'b-', linewidth=2, label=model_name)
    plt.plot(thresholds, nb_treat_all, 'r--', linewidth=1.5, label='Treat All')
    plt.plot(thresholds, nb_treat_none, 'k-', linewidth=1, label='Treat None')
    plt.xlabel('Threshold Probability')
    plt.ylabel('Net Benefit')
    plt.title('Decision Curve Analysis')
    plt.legend()
    plt.xlim([0, 1])
    plt.ylim([-0.1, max(prevalence, 0.3)])
    plt.grid(True, alpha=0.3)
    plt.show()
    return thresholds, nb_model

# Example usage
thresholds, nb = decision_curve_analysis(y_true, y_prob, "Sepsis Model")

Interpreting Decision Curves
The decision curve shows net benefit (y-axis) across threshold probabilities (x-axis). Compare your model to two reference strategies:
- Treat All: Intervene on everyone regardless of model
- Treat None: Never intervene (net benefit = 0)
Good model: Net benefit exceeds both “treat all” and “treat none” across clinically relevant thresholds.
Key insights:
- A model above “treat all” at low thresholds means it correctly identifies low-risk patients who don’t need intervention
- A model above “treat none” at high thresholds means it correctly identifies high-risk patients who need intervention
- The threshold range where the model is superior defines its clinical utility
Comparing Multiple Models
def compare_models_dca(y_true, models_dict):
    """Compare multiple models using decision curve analysis."""
    thresholds = np.arange(0.01, 0.99, 0.01)
    prevalence = y_true.mean()
    plt.figure(figsize=(10, 6))
    # Reference strategies
    nb_treat_all = prevalence - (1 - prevalence) * thresholds / (1 - thresholds)
    plt.plot(thresholds, nb_treat_all, 'r--', linewidth=1.5, label='Treat All')
    plt.plot(thresholds, np.zeros_like(thresholds), 'k-', linewidth=1, label='Treat None')
    # Each model
    colors = ['blue', 'green', 'orange', 'purple']
    for i, (name, y_prob) in enumerate(models_dict.items()):
        nb = calculate_net_benefit(y_true, y_prob, thresholds)
        plt.plot(thresholds, nb, color=colors[i % len(colors)],
                 linewidth=2, label=name)
    plt.xlabel('Threshold Probability')
    plt.ylabel('Net Benefit')
    plt.title('Decision Curve Analysis: Model Comparison')
    plt.legend()
    plt.xlim([0, 0.5])  # Focus on clinically relevant range
    plt.ylim([-0.05, max(prevalence, 0.3)])
    plt.grid(True, alpha=0.3)
    plt.show()

# Compare models (y_prob_lr, y_prob_rf, y_prob_xgb are predicted probabilities from previously fitted models)
models = {
    'Logistic Regression': y_prob_lr,
    'Random Forest': y_prob_rf,
    'XGBoost': y_prob_xgb
}
compare_models_dca(y_true, models)

When to Use DCA
Decision curve analysis is most valuable when:
- You need to justify clinical adoption (not just statistical performance)
- The intervention has meaningful costs/harms
- Different stakeholders may have different threshold preferences
- You’re comparing a new model against existing clinical practice
Limitations:
- Assumes threshold-based decisions (not all clinical decisions work this way)
- Doesn’t account for downstream outcomes of decisions
- Sensitive to prevalence in the evaluation cohort
DCA complements discrimination (AUC) and calibration—together they provide a complete picture of clinical utility.
4.4 Evaluation Metrics for Regression
Clinical Context: Predicting continuous outcomes—length of stay, lab values, time to event—requires different metrics than classification. Here we focus on regression evaluation.
4.4.1 Mean Squared Error and Variants
Mean Squared Error (MSE): Average squared difference between predictions and actual values.
\[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]
Root Mean Squared Error (RMSE): Square root of MSE, in original units.
\[\text{RMSE} = \sqrt{\text{MSE}}\]
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np
# Example: predicting length of stay (days)
y_true = np.array([3, 5, 2, 8, 4, 6, 3, 7, 5, 4])
y_pred = np.array([3.5, 4.5, 2.5, 7, 4.5, 5.5, 3, 8, 4.5, 4])
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
print(f"MSE: {mse:.3f}")
print(f"RMSE: {rmse:.3f} days")MSE penalizes large errors heavily (squared). This may or may not be desirable:
- Appropriate: If large errors are disproportionately bad (missing a 10-day stay by 5 days is worse than missing two 3-day stays by 2 days each)
- Less appropriate: If all errors matter equally
4.4.2 Mean Absolute Error
Mean Absolute Error (MAE): Average absolute difference.
\[\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|\]
MAE treats all errors linearly—a 2-day error counts twice as much as a 1-day error, not four times as much (like MSE).
mae = mean_absolute_error(y_true, y_pred)
print(f"MAE: {mae:.3f} days")MAE is more robust to outliers than MSE. For length of stay prediction with occasional 30+ day stays, MAE may be more clinically meaningful.
4.4.3 R-Squared and Its Limitations
R-squared (\(R^2\)): Proportion of variance explained by the model.
\[R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}\]
- \(R^2 = 1\): Perfect predictions
- \(R^2 = 0\): Model no better than predicting the mean
- \(R^2 < 0\): Model worse than predicting the mean (possible for bad models on test data)
r2 = r2_score(y_true, y_pred)
print(f"R-squared: {r2:.3f}")Limitations of \(R^2\):
- Scale-dependent comparison: Can’t compare across different outcomes
- Doesn’t indicate clinical utility: High \(R^2\) doesn’t mean predictions are useful
- Can be misleading with narrow outcome range: Easy to get high \(R^2\) when variance is small
4.4.4 Clinical Example: Risk Score Prediction
Many clinical risk scores (APACHE, SOFA, Framingham) predict continuous or semi-continuous outcomes. Evaluation should consider:
- Discrimination: Does the score rank-order risk correctly? (Compare via concordance index; see the sketch below)
- Calibration: Do predicted values match observed outcomes?
- Clinical utility: Do predictions at decision thresholds work?
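The concordance index mentioned above can be computed directly for uncensored outcomes; here is a minimal sketch (censored survival data needs a dedicated implementation from a survival-analysis library):

import numpy as np

def concordance_index(y_true, y_score):
    """Fraction of usable pairs where the higher predicted risk goes with the worse observed outcome.
    Sketch for uncensored outcomes only."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    concordant, permissible = 0.0, 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # tied outcomes are not informative about ranking
            permissible += 1
            if (y_score[i] - y_score[j]) * (y_true[i] - y_true[j]) > 0:
                concordant += 1       # correctly ordered pair
            elif y_score[i] == y_score[j]:
                concordant += 0.5     # tied predictions count as half
    return concordant / permissible

# Toy check: for a binary outcome the c-index equals the AUC
print(concordance_index([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75

For a binary outcome, the c-index reduces to the AUC.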
4.5 Cross-Validation and Model Selection
Clinical Context: You’ve built a sepsis prediction model that achieves 0.89 AUC on your development data. Will it work on tomorrow’s patients? Next month’s? At other hospitals? Proper validation is the only way to know.
4.5.1 The Train/Validation/Test Split
The fundamental rule: never evaluate on training data. Models memorize training examples; training performance is optimistically biased.
Standard three-way split:
- Training set (~60-70%): Fit model parameters
- Validation set (~15-20%): Tune hyperparameters, select model variants
- Test set (~15-20%): Final, one-time evaluation
from sklearn.model_selection import train_test_split
# First split: separate test set
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Second split: training and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)
# Results in 60% train, 20% val, 20% test

4.5.2 Why This Matters: Overfitting
Overfitting occurs when a model learns patterns specific to the training data that don’t generalize. Signs:
- Training performance much better than validation/test performance
- Performance degrades on new data
- Model captures noise rather than signal
Complex models overfit more easily. A deep neural network can memorize 1000 patient records perfectly—but this memorization won’t help with patient 1001.
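A quick demonstration on synthetic, signal-free data (a sketch, not a recipe) shows the characteristic gap between training and held-out performance:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
rng = np.random.default_rng(0)
# 200 "patients" with 50 random features and a random outcome: no real signal
X = rng.normal(size=(200, 50))
y = rng.binomial(1, 0.5, 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Train AUC: {train_auc:.2f}, Test AUC: {test_auc:.2f}")  # ~1.0 vs ~0.5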
4.5.3 K-Fold Cross-Validation
Cross-validation provides more robust performance estimates by averaging across multiple splits.
K-fold cross-validation:
1. Divide data into K equal folds
2. For each fold: train on K-1 folds, evaluate on the held-out fold
3. Average performance across all folds
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
# 5-fold cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
print(f"AUC: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")Stratified K-fold ensures each fold has approximately the same class distribution—essential for imbalanced medical datasets.
Common choices:
- K=5 or K=10: Standard choices
- Leave-one-out (K=n): Maximum use of data, but high variance and computationally expensive
4.5.4 Data Leakage: The Silent Killer
Data leakage occurs when information from the test set influences training. It’s the most common cause of models that work in development but fail in deployment.
Common sources in medical AI:
1. Patient-level leakage: Multiple samples from the same patient appear in both training and test sets.
from sklearn.model_selection import GroupKFold
# Wrong: random split might put same patient in train and test
# Right: group by patient ID
group_kfold = GroupKFold(n_splits=5)
for train_idx, test_idx in group_kfold.split(X, y, groups=patient_ids):
    # Now no patient appears in both train and test
    pass

2. Temporal leakage: Training on future data to predict the past.
from sklearn.model_selection import TimeSeriesSplit
# For time-series data: always train on past, predict future
ts_cv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in ts_cv.split(X):
    # train_idx always comes before test_idx in time
    pass

3. Feature leakage: Features computed using test set information, or features that encode the outcome.
Examples:
- Normalizing features using mean/std computed on the full dataset (including test)
- Including diagnosis codes that are assigned after the prediction timepoint
- Using length of stay to predict mortality (LOS encodes survival time)
4. Preprocessing leakage: Fitting transformers on full data before splitting.
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Wrong: fit scaler on all data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Leaks test set statistics
X_train, X_test = train_test_split(X_scaled)
# Right: fit scaler only on training data
X_train, X_test = train_test_split(X)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Transform only, don't fit

4.5.5 Nested Cross-Validation
When tuning hyperparameters, you need two levels of validation:
- Inner loop: Select best hyperparameters
- Outer loop: Estimate performance of the tuning procedure
from sklearn.model_selection import GridSearchCV, cross_val_score
# Inner CV for hyperparameter tuning
param_grid = {'C': [0.1, 1, 10]}
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=inner_cv, scoring='roc_auc')
# Outer CV for performance estimation
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv, scoring='roc_auc')
print(f"Nested CV AUC: {nested_scores.mean():.3f} (+/- {nested_scores.std()*2:.3f})")Without nested CV, you’re evaluating on the same data used to select hyperparameters—leading to optimistic estimates.
4.5.6 External Validation: The Reality Check
Clinical Context: Your sepsis model achieves AUC 0.89 with nested cross-validation at your hospital. You’re ready to publish and deploy. But will it work at the community hospital across town? At a pediatric hospital? In a different country? External validation—testing on completely independent data—is the only way to know (Steyerberg et al. 2010).
Why External Validation Matters
Internal validation (train/test splits, cross-validation) all use data from the same source. This shared source means:
- Same patient population demographics
- Same clinical workflows and documentation practices
- Same lab equipment and reference ranges
- Same EHR system and coding conventions
- Same time period with same disease patterns
Models learn all of these factors, not just the clinical signal. External validation reveals how much of your performance depends on site-specific artifacts.
The Performance Drop Is Real
Studies consistently show 5-20% AUC degradation when models move to new sites:
| Domain | Internal AUC | External AUC | Drop |
|---|---|---|---|
| Sepsis prediction | 0.89 | 0.78 | 12% |
| Chest X-ray diagnosis | 0.94 | 0.82 | 13% |
| Readmission risk | 0.76 | 0.68 | 11% |
| Mortality prediction | 0.91 | 0.82 | 10% |
These are representative ranges from the literature; actual drops vary by model and site.
This isn’t failure—it’s reality. A model that shows no external validation drop is either exceptionally robust or hasn’t been properly tested.
Types of External Validation
Temporal validation: Test on data from a later time period at the same site.
# Split by time, not randomly
train_data = df[df['admission_date'] < '2022-01-01']
test_data = df[df['admission_date'] >= '2022-01-01']
# Train on historical, test on recent
model.fit(train_data[features], train_data[outcome])
y_prob = model.predict_proba(test_data[features])[:, 1]
temporal_auc = roc_auc_score(test_data[outcome], y_prob)

This catches temporal drift but not site-specific factors.
Geographic validation: Test on data from different institutions.
# Train on Site A, test on Site B
site_a_data = df[df['site'] == 'A']
site_b_data = df[df['site'] == 'B']
model.fit(site_a_data[features], site_a_data[outcome])
y_prob = model.predict_proba(site_b_data[features])[:, 1]
geographic_auc = roc_auc_score(site_b_data[outcome], y_prob)

This is the gold standard but requires multi-site data access.
Population validation: Test on different patient subgroups (pediatric vs. adult, ICU vs. general ward).
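A sketch of subgroup evaluation, assuming the dataframe has a hypothetical subgroup column ('age_group') along with outcome and predicted-probability columns (the column names here are illustrative, not from the text):

from sklearn.metrics import roc_auc_score
# Hypothetical columns: 'age_group' (subgroup), 'y_true' (outcome), 'y_prob' (model output)
for group, subset in df.groupby('age_group'):
    if subset['y_true'].nunique() < 2:
        continue  # AUC is undefined when a subgroup contains only one class
    auc = roc_auc_score(subset['y_true'], subset['y_prob'])
    print(f"{group}: n={len(subset)}, AUC={auc:.3f}")

Large gaps between subgroups are a warning sign even when the overall AUC looks good.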
Prospective validation: Deploy the model and evaluate on new patients in real-time. The ultimate test, but requires deployment infrastructure.
What to Report
When publishing or deploying a model, report:
- Internal performance: Cross-validation on development data
- Temporal performance: Hold-out from later time period
- External performance: At least one independent site (if possible)
- Performance difference: Explicit comparison showing the drop
- Subgroup performance: Across demographics, severity, etc.
If You Can’t Get External Data
Multi-site data isn’t always available. Alternatives:
- Simulate site differences: Add noise, shift distributions, test robustness
- Leave-one-site-out: If you have multi-site data, train on N-1 sites, test on held-out site
- Temporal hold-out: Strictly time-separated test set
- Be honest: State clearly that external validation is needed before deployment
import pandas as pd

def leave_one_site_out_cv(df, sites_col, features, outcome, model_class):
    """Leave-one-site-out cross-validation."""
    sites = df[sites_col].unique()
    results = []
    for test_site in sites:
        train_data = df[df[sites_col] != test_site]
        test_data = df[df[sites_col] == test_site]
        model = model_class()
        model.fit(train_data[features], train_data[outcome])
        y_prob = model.predict_proba(test_data[features])[:, 1]
        auc = roc_auc_score(test_data[outcome], y_prob)
        results.append({
            'test_site': test_site,
            'n_test': len(test_data),
            'auc': auc
        })
    return pd.DataFrame(results)

# Evaluate generalization across sites
results = leave_one_site_out_cv(df, 'hospital_id', feature_cols, 'outcome', LogisticRegression)
print(f"Mean AUC: {results['auc'].mean():.3f} (range: {results['auc'].min():.3f}-{results['auc'].max():.3f})")

Red Flags in Published Work
Be skeptical of papers that:
- Report only internal validation
- Use random splits on multi-site data (without site-aware splitting)
- Claim generalization without testing it
- Show AUC > 0.95 on tasks where clinical experts achieve 0.80
External validation isn’t optional—it’s the minimum bar for clinical credibility.
4.6 Comparing Models and Statistical Testing
Clinical Context: Your new deep learning model achieves AUC 0.85; the existing logistic regression gets 0.82. Is the improvement real, or just noise?
4.6.1 Bootstrap Confidence Intervals
Bootstrap resampling provides confidence intervals for any metric:
from sklearn.metrics import roc_auc_score
from sklearn.utils import resample
import numpy as np
def bootstrap_auc(y_true, y_prob, n_bootstrap=1000):
    """Compute bootstrap confidence interval for AUC."""
    aucs = []
    n = len(y_true)
    for _ in range(n_bootstrap):
        # Sample with replacement
        indices = resample(range(n), n_samples=n, random_state=None)
        # Skip if only one class in bootstrap sample
        if len(np.unique(y_true[indices])) < 2:
            continue
        auc = roc_auc_score(y_true[indices], y_prob[indices])
        aucs.append(auc)
    # 95% confidence interval
    lower = np.percentile(aucs, 2.5)
    upper = np.percentile(aucs, 97.5)
    return np.mean(aucs), lower, upper

# Example usage
mean_auc, lower, upper = bootstrap_auc(y_true, y_prob)
print(f"AUC: {mean_auc:.3f} (95% CI: {lower:.3f}-{upper:.3f})")

4.6.2 Comparing Two Models
To compare models on the same test set, use paired comparisons that account for correlation.
DeLong’s test compares AUCs:
# Using scipy for a simple comparison
# For DeLong's test, consider packages like 'pROC' (R) or specialized Python implementations
# Simple paired bootstrap comparison
def compare_models_bootstrap(y_true, y_prob_a, y_prob_b, n_bootstrap=1000):
    """Compare two models using paired bootstrap."""
    diffs = []
    n = len(y_true)
    for _ in range(n_bootstrap):
        indices = resample(range(n), n_samples=n)
        if len(np.unique(y_true[indices])) < 2:
            continue
        auc_a = roc_auc_score(y_true[indices], y_prob_a[indices])
        auc_b = roc_auc_score(y_true[indices], y_prob_b[indices])
        diffs.append(auc_a - auc_b)
    # Check if CI excludes zero
    lower = np.percentile(diffs, 2.5)
    upper = np.percentile(diffs, 97.5)
    mean_diff = np.mean(diffs)
    return mean_diff, lower, upper

diff, lower, upper = compare_models_bootstrap(y_true, y_prob_model_a, y_prob_model_b)
print(f"AUC difference: {diff:.3f} (95% CI: {lower:.3f}-{upper:.3f})")
if lower > 0:
    print("Model A significantly better")
elif upper < 0:
    print("Model B significantly better")
else:
    print("No significant difference")

McNemar’s test compares binary predictions:
from scipy.stats import chi2
def mcnemar_test(y_true, y_pred_a, y_pred_b):
    """McNemar's test for comparing two classifiers."""
    # Count discordant pairs
    b = np.sum((y_pred_a == y_true) & (y_pred_b != y_true))  # A right, B wrong
    c = np.sum((y_pred_a != y_true) & (y_pred_b == y_true))  # A wrong, B right
    # McNemar statistic (with continuity correction)
    statistic = (abs(b - c) - 1)**2 / (b + c)
    p_value = 1 - chi2.cdf(statistic, df=1)
    return statistic, p_value, b, c

stat, p, b, c = mcnemar_test(y_true, y_pred_model_a, y_pred_model_b)
print(f"McNemar's chi-squared: {stat:.3f}, p-value: {p:.4f}")
print(f"Model A correct & B wrong: {b}, Model A wrong & B correct: {c}")

4.6.3 Effect Sizes and Clinical Significance
Statistical significance doesn’t imply clinical significance. Consider:
- Minimum clinically important difference: How much improvement matters?
- Effect size: Standardized measure of difference magnitude
- Decision-analytic perspective: Does the improvement change clinical decisions?
An AUC improvement from 0.82 to 0.83 might be statistically significant with enough data, but clinically meaningless. An improvement from 0.75 to 0.85 is likely both statistically and clinically significant.
The following section on information theory explains the mathematical foundation of neural network training. If you’re reading for conceptual understanding, you can skip to Chapter 7—the key takeaway is that “cross-entropy loss” measures how wrong a model’s probability predictions are, and training minimizes this error.
4.7 Information Theory Essentials
Clinical Context: When training neural networks, you’ll minimize “cross-entropy loss.” Understanding where this comes from helps you choose appropriate loss functions and interpret training dynamics.
4.7.1 Entropy: Measuring Uncertainty
Entropy quantifies uncertainty in a probability distribution:
\[H(X) = -\sum_x P(x) \log P(x)\]
For a binary outcome with probability \(p\):
\[H = -p \log p - (1-p) \log(1-p)\]
Properties:
- Maximum entropy when \(p = 0.5\) (maximum uncertainty)
- Zero entropy when \(p = 0\) or \(p = 1\) (certainty)
import numpy as np
import matplotlib.pyplot as plt
def binary_entropy(p):
    """Calculate entropy of a Bernoulli distribution."""
    if p == 0 or p == 1:
        return 0
    return -p * np.log2(p) - (1-p) * np.log2(1-p)
# Plot entropy vs probability
p_values = np.linspace(0.01, 0.99, 100)
entropy_values = [binary_entropy(p) for p in p_values]
plt.figure(figsize=(8, 5))
plt.plot(p_values, entropy_values, 'b-', linewidth=2)
plt.xlabel('Probability p')
plt.ylabel('Entropy (bits)')
plt.title('Binary Entropy Function')
plt.grid(True, alpha=0.3)
plt.show()

4.7.2 Cross-Entropy and Log Loss
Cross-entropy measures how well a predicted distribution \(Q\) matches the true distribution \(P\):
\[H(P, Q) = -\sum_x P(x) \log Q(x)\]
For classification, \(P\) is the true label (0 or 1) and \(Q\) is the predicted probability:
\[\text{Cross-Entropy Loss} = -\frac{1}{n}\sum_i [y_i \log(\hat{p}_i) + (1-y_i) \log(1-\hat{p}_i)]\]
This is also called log loss or binary cross-entropy.
Why cross-entropy for classification?
- Gradient properties: Provides useful gradients for optimization
- Probability interpretation: Minimizing cross-entropy = maximizing likelihood
- Calibration incentive: Encourages well-calibrated probability outputs
from sklearn.metrics import log_loss
# Cross-entropy loss
y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.1, 0.8, 0.7, 0.3])
loss = log_loss(y_true, y_prob)
print(f"Cross-entropy loss: {loss:.4f}")
# What happens with overconfident wrong predictions?
y_prob_bad = np.array([0.9, 0.9, 0.8, 0.7, 0.3]) # Second prediction is wrong and confident
loss_bad = log_loss(y_true, y_prob_bad)
print(f"With overconfident error: {loss_bad:.4f}") # Much higher loss4.7.3 Connection to Loss Functions
The loss functions in later chapters derive from information theory:
- Classification: Cross-entropy loss (log loss)
- Multi-class: Categorical cross-entropy
- Regression: MSE corresponds to a Gaussian likelihood (see the derivation below)
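To make the regression case concrete: if the outcome is modeled as Gaussian around the prediction, \(y_i \sim \mathcal{N}(\hat{y}_i, \sigma^2)\), the negative log-likelihood of one observation is
\[-\log p(y_i | \hat{y}_i) = \frac{(y_i - \hat{y}_i)^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)\]
For fixed \(\sigma^2\), the second term is constant, so minimizing MSE is the same as maximizing the Gaussian likelihood, just as minimizing cross-entropy maximizes the Bernoulli likelihood.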
Understanding this connection helps when:
- Choosing loss functions for custom problems
- Interpreting what “loss” measures
- Understanding why calibrated probabilities emerge
4.8 Common Pitfalls in Medical AI Evaluation
Clinical Context: Published medical AI papers often have methodological flaws that inflate reported performance. Knowing these pitfalls helps you critically read the literature—and avoid the same mistakes.
4.8.1 Testing on Training Data
Surprisingly common, especially in smaller studies. Signs:
- No mention of train/test split
- Performance reported on “the dataset” without specifying hold-out
- Implausibly high performance (AUC > 0.95 for difficult tasks)
Always ask: “What data were these numbers computed on?”
4.8.2 Data Leakage Patterns
Common in medical AI:
| Leakage Type | Example | How to Detect |
|---|---|---|
| Patient leakage | Multiple scans from same patient in train/test | Check patient IDs |
| Temporal leakage | Training on 2020, testing on 2019 | Check date distributions |
| Feature leakage | Including post-outcome variables | Review feature collection timing |
| Preprocessing leakage | Fitting scaler on all data | Review preprocessing code |
4.8.3 Overfitting to the Test Set
When researchers iterate on test results, they implicitly tune to the test set. Signs:
- Multiple papers on same dataset with steadily improving results
- Model selection based on test performance
- “Final” test set used multiple times
Solutions: Pre-registration, external validation, time-locked test sets.
4.8.4 Class Imbalance Mishandling
In medical data, outcomes are often rare (1-5% positive). Problems:
- Accuracy is misleading (99% accuracy means nothing if predicting “no disease” always achieves 99%)
- Random undersampling loses information
- Oversampling can create leakage if done before splitting
Better approaches:
- Use appropriate metrics (PR-AUC, F1, specificity at a fixed sensitivity—see the sketch below)
- Apply resampling within cross-validation folds
- Consider cost-sensitive learning
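As one example, here is a minimal sketch of “specificity at a fixed sensitivity”: pick the threshold that first meets the sensitivity target, then report the specificity there (reusing the imbalanced simulation from Section 4.3.7):

import numpy as np
from sklearn.metrics import roc_curve

def specificity_at_sensitivity(y_true, y_prob, target_sensitivity=0.90):
    """Specificity at the first ROC operating point whose sensitivity meets the target."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    idx = np.argmax(tpr >= target_sensitivity)  # tpr is non-decreasing along the ROC curve
    return 1 - fpr[idx], thresholds[idx]

spec, thr = specificity_at_sensitivity(y_true_imbalanced, y_prob_imbalanced, 0.90)
print(f"Specificity at 90% sensitivity: {spec:.2f} (threshold {thr:.2f})")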
4.8.5 Dataset Shift
Models trained on one population may fail on another due to:
- Covariate shift: Input distribution changes (different demographics)
- Prior shift: Outcome prevalence changes (screening vs. symptomatic population)
- Concept shift: Relationship between inputs and outcomes changes (different disease subtypes)
External validation is the only way to assess generalization.
4.8.6 Confusing Discrimination with Calibration
A model can perfectly rank-order risk (good discrimination) while outputting completely wrong probabilities (poor calibration). Many evaluations report only AUC.
Always assess calibration for models that output probabilities, especially if those probabilities will be communicated to patients or used for treatment thresholds.
4.9 Chapter Summary
Statistics provides the foundation for rigorous medical AI development:
Probability foundations:
- Bayes’ theorem connects test characteristics to clinical utility
- PPV and NPV depend on prevalence—not just sensitivity/specificity
- Pre-test probability matters even for AI predictions

From hypothesis testing to ML evaluation:
- P-values don’t evaluate model performance
- The focus shifts from “is this effect real?” to “how well does this generalize?”
- Multiple comparisons require careful handling

Classification metrics:
- The confusion matrix is the foundation
- Sensitivity/specificity describe the test; PPV/NPV answer clinical questions
- AUC summarizes discrimination; calibration assesses probability accuracy
- No single metric captures everything—use multiple

Validation strategy:
- Never evaluate on training data
- Cross-validation provides robust estimates
- Data leakage is subtle and devastating
- External validation is the gold standard

Model comparison:
- Bootstrap confidence intervals quantify uncertainty
- Statistical significance ≠ clinical significance
- Effect sizes matter more than p-values
These concepts appear throughout the book. Every model we build will be evaluated using these metrics; every result should be interpreted with these caveats in mind.
4.10 Exercises
1. A breast cancer screening AI has 90% sensitivity and 95% specificity. In a screening population with 0.5% prevalence, calculate the PPV. What if the same AI is used in a diagnostic setting with 30% prevalence?
2. You’re comparing two sepsis prediction models. Model A has AUC 0.82; Model B has AUC 0.79. Model A was developed using random patient splitting; Model B used patient-level splitting. Which model should you trust more? Why?
3. A paper reports 94% accuracy for detecting diabetic retinopathy. What questions would you ask before concluding this is a useful model?
4. Implement a function that computes sensitivity, specificity, PPV, and NPV from predictions and labels. Test it on synthetic data with varying prevalence.
5. Create a dataset with 1000 samples where 50 samples from 10 patients are duplicated across train/test splits. Compare cross-validation performance with and without patient-level splitting. What do you observe?