import numpy as np
from sklearn.metrics import confusion_matrix, classification_report
# Example predictions
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0]) # 4 positive, 6 negative
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 1, 0, 0]) # Model predictions
# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
# Output:
# [[5 1]
# [2 2]]
# TN=5, FP=1, FN=2, TP=2
print("\nClassification Report:")
print(classification_report(y_true, y_pred, target_names=['Negative', 'Positive']))

4 Statistics for Medical AI
Clinical Context: A machine learning paper claims 95% accuracy for detecting pancreatic cancer from CT scans. Is this impressive? Useful? Before you can answer, you need to know: 95% of what? On which patients? How was it validated? This chapter provides the statistical foundation to critically evaluate medical AI systems—and to build ones that actually work in practice.
Statistics and machine learning share deep roots, but they often speak different languages. Classical statistics focuses on inference—drawing conclusions about populations from samples, testing hypotheses, quantifying uncertainty. Machine learning focuses on prediction—building models that generalize to new data. In medical AI, we need both: inference to understand what our models are doing, and prediction to make them clinically useful.
This chapter bridges your existing statistics knowledge to the specific needs of medical AI. We’ll cover probability foundations with clinical applications, explain why traditional hypothesis testing doesn’t translate directly to model evaluation, and develop a thorough understanding of the metrics that matter in healthcare: sensitivity, specificity, PPV, NPV, ROC curves, and calibration.
4.1 Probability Foundations
4.1.1 Random Variables and Distributions
Clinical Context: A patient’s blood glucose level, whether a tumor is malignant, the number of ER visits in a year—these are all random variables. Understanding their distributions is fundamental to building models that predict them.
You’ve likely encountered probability distributions in biostatistics. Here’s a quick review focused on their ML applications.
Discrete distributions describe countable outcomes:
- Bernoulli: Single binary outcome (disease present/absent)
- Binomial: Number of successes in n trials (patients responding to treatment)
- Poisson: Count of rare events (adverse events per patient-year)
Continuous distributions describe measurements:
- Normal (Gaussian): Many biological measurements approximate this
- Log-normal: Right-skewed data like length of stay, drug concentrations
- Exponential: Time until an event (survival analysis)
In machine learning, we often model the conditional distribution \(P(Y|X)\)—the probability of an outcome \(Y\) given features \(X\). A logistic regression models the probability of disease given patient characteristics. A neural network might model the probability distribution over possible diagnoses.
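As a toy illustration (entirely synthetic features, not drawn from any dataset in this book), a fitted logistic regression exposes this conditional distribution through predict_proba:

import numpy as np
from sklearn.linear_model import LogisticRegression
# Synthetic patient features: age and glucose
rng = np.random.default_rng(0)
n = 500
age = rng.normal(60, 10, n)
glucose = rng.normal(110, 25, n)
# Synthetic outcome whose probability rises with age and glucose
logit = -10 + 0.08 * age + 0.03 * glucose
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
X = np.column_stack([age, glucose])
model = LogisticRegression(max_iter=1000).fit(X, y)
# predict_proba estimates P(Y=1 | X) for a new patient
new_patient = np.array([[72, 145]])  # age 72, glucose 145 mg/dL
print(f"Estimated P(disease | x): {model.predict_proba(new_patient)[0, 1]:.2f}")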
4.1.2 Conditional Probability and Bayes’ Theorem
Clinical Context: A screening test is positive. What’s the probability the patient actually has the disease? This question—deceptively simple—is where conditional probability becomes essential.
Bayes’ theorem relates conditional probabilities:
\[P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}\]
In medical terms, translating to disease (\(D\)) and positive test (\(T^+\)):
\[P(D|T^+) = \frac{P(T^+|D) \cdot P(D)}{P(T^+)}\]
The terms have clinical names:
- \(P(D|T^+)\): Positive predictive value (PPV) — probability of disease given positive test
- \(P(T^+|D)\): Sensitivity — probability of positive test given disease
- \(P(D)\): Prevalence or pre-test probability
- \(P(T^+)\): Probability of positive test (in anyone)
The denominator expands using the law of total probability:
\[P(T^+) = P(T^+|D) \cdot P(D) + P(T^+|\neg D) \cdot P(\neg D)\]
Where \(P(T^+|\neg D)\) is the false positive rate (1 - specificity).
4.1.3 The Mammography Problem
This classic example illustrates why base rates matter—even for AI systems.
Consider a breast cancer screening scenario:
- Prevalence in screening population: 1% (\(P(D) = 0.01\))
- Sensitivity of the test: 90% (\(P(T^+|D) = 0.90\))
- Specificity of the test: 90% (\(P(\neg T^+|\neg D) = 0.90\))
A woman tests positive. What’s the probability she has cancer?
Many people—including physicians—intuitively guess around 90%. Let’s calculate:
\[P(D|T^+) = \frac{0.90 \times 0.01}{0.90 \times 0.01 + 0.10 \times 0.99} = \frac{0.009}{0.009 + 0.099} = \frac{0.009}{0.108} \approx 0.083\]
The answer is about 8%, not 90%. Despite a “90% accurate” test, most positive results are false positives because the disease is rare.
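The same arithmetic in code, using only the numbers above:

# Mammography example: Bayes' theorem with the numbers above
prevalence = 0.01      # P(D)
sensitivity = 0.90     # P(T+ | D)
specificity = 0.90     # P(T- | not D)
p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)  # P(T+)
ppv = sensitivity * prevalence / p_pos  # P(D | T+)
print(f"P(T+) = {p_pos:.3f}, P(D | T+) = {ppv:.3f}")  # ~0.108 and ~0.083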
This has profound implications for medical AI:
- Performance metrics must account for prevalence: A model’s real-world utility depends on the population it’s deployed in
- High sensitivity isn’t enough: If specificity is imperfect, false positives overwhelm true positives in low-prevalence settings
- AI systems inherit this problem: A deep learning model with 95% sensitivity and 95% specificity for a 1% prevalence condition will still have a PPV of only 16%
4.1.4 Pre-Test and Post-Test Probability
Clinicians often think in terms of how a test changes their assessment:
- Pre-test probability: Prior belief about disease probability (based on symptoms, demographics, clinical judgment)
- Post-test probability: Updated probability after seeing test results
This is Bayesian reasoning in action. An AI prediction is just another “test”—it should update clinical judgment, not replace it.
Likelihood ratios provide a convenient way to update probabilities:
\[\text{Positive LR} = \frac{\text{Sensitivity}}{1 - \text{Specificity}}\]
\[\text{Post-test odds} = \text{Pre-test odds} \times \text{Likelihood ratio}\]
A positive LR of 10 means a positive result increases the odds of disease 10-fold. A clinically useful AI model should behave like a test with a strong positive likelihood ratio: a positive output should substantially shift the probability estimate.
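A minimal sketch of this update, converting pre-test probability to odds, applying the likelihood ratio, and converting back:

def post_test_probability(pre_test_prob, sensitivity, specificity, test_positive=True):
    """Update a pre-test probability using the likelihood ratio of a test result."""
    lr = sensitivity / (1 - specificity) if test_positive else (1 - sensitivity) / specificity
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

# 20% pre-test probability, test (or model) with 90% sensitivity and specificity
print(f"Positive result: {post_test_probability(0.20, 0.90, 0.90, True):.2f}")   # ~0.69
print(f"Negative result: {post_test_probability(0.20, 0.90, 0.90, False):.2f}")  # ~0.03

With a 20% pre-test probability, a positive result moves the probability to about 69% and a negative result to about 3%.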
4.1.5 Independence and Conditional Independence
Two events are independent if knowing one tells you nothing about the other:
\[P(A \cap B) = P(A) \cdot P(B)\]
Conditional independence is subtler: \(A\) and \(B\) are conditionally independent given \(C\) if:
\[P(A \cap B | C) = P(A|C) \cdot P(B|C)\]
This matters in medical AI because:
- Naive Bayes classifiers assume features are conditionally independent given the class label—often violated in medical data
- Graphical models represent conditional independence structures
- Confounding occurs when two variables appear related but are both caused by a third
Example: Ice cream sales correlate with drowning deaths. They’re not causally related—both are caused by summer weather. Conditional on season, they’re independent.
4.2 From Hypothesis Testing to Model Evaluation
Clinical Context: You’ve likely been trained to look for “p < 0.05” as the marker of a meaningful result. In machine learning, this framework doesn’t directly apply—and understanding why will help you avoid common evaluation mistakes.
4.2.1 The Classical Framework (Brief Review)
Traditional hypothesis testing proceeds:
- State a null hypothesis (\(H_0\): no effect)
- Collect data and compute a test statistic
- Calculate the p-value: probability of data this extreme if \(H_0\) is true
- Reject \(H_0\) if p-value < significance level (often 0.05)
This framework connects to diagnostic testing:
| Statistical Term | Diagnostic Testing Equivalent |
|---|---|
| Type I error (false positive) | False positive rate (1 - specificity) |
| Type II error (false negative) | False negative rate (1 - sensitivity) |
| Power (1 - Type II error rate) | Sensitivity |
| Significance level | False positive rate |
4.2.2 Why P-Values Don’t Evaluate Models
Here’s the fundamental disconnect: hypothesis testing asks whether an effect exists; machine learning asks how well a model predicts.
Consider comparing two models:
- Model A: AUC = 0.82
- Model B: AUC = 0.85
Is Model B better? A traditional approach might test:
\(H_0\): AUC\(_A\) = AUC\(_B\)
With enough data, you’ll get p < 0.05. But this tells you the difference is statistically detectable, not that it’s clinically meaningful. A p-value of 0.001 for a 0.01 AUC improvement isn’t impressive—it just means you have a lot of data.
Conversely, with small datasets, important differences might not reach significance. The p-value depends on sample size, not just effect size.
4.2.3 The Replication Crisis and Its Lessons
The past decade has revealed widespread problems with p-value-based research:
- Publication bias: Only “significant” results get published
- P-hacking: Trying multiple analyses until p < 0.05
- HARKing: Hypothesizing After Results are Known
- Overfitting to significance: Models tuned to achieve p < 0.05 on one dataset
Medical AI research isn’t immune. A 2019 analysis found that many published medical AI studies:
- Tested on the same data used for model development
- Reported only the best-performing model variant
- Used subgroup analyses that weren’t pre-specified
4.2.4 What ML Does Differently
Machine learning evaluation focuses on predictive performance on held-out data:
- Train/test split: Never evaluate on training data
- Cross-validation: Estimate performance across multiple splits
- External validation: Test on completely independent datasets
Instead of asking “is this effect real?”, we ask “how well does this model generalize?”
Key metrics replace p-values:
- Discrimination: Can the model rank-order risk? (AUC, concordance)
- Calibration: Are probability outputs accurate? (calibration curves, Brier score)
- Clinical utility: Does the model improve decisions? (net benefit, decision curves)
The shift in mindset: we’re not trying to prove something is true; we’re trying to build something that works.
4.2.5 Multiple Comparisons in Model Selection
When you compare many models, some will appear best by chance. This is the machine learning version of multiple testing.
Suppose you try 20 model variants and report the best test-set performance. Even if all variants were truly equivalent, the best of the 20 would appear noticeably better than the average purely by chance, and the smaller the test set, the larger that spurious edge.
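A small simulation makes the selection effect concrete; here every “model” is pure noise, so any apparent edge for the best one is luck (a sketch with synthetic labels):

import numpy as np
from sklearn.metrics import roc_auc_score
rng = np.random.default_rng(0)
n, n_models = 200, 20
y = rng.binomial(1, 0.3, n)
# 20 "models" that output pure noise: the true AUC of each is 0.5
aucs = [roc_auc_score(y, rng.uniform(size=n)) for _ in range(n_models)]
print(f"Mean AUC: {np.mean(aucs):.3f}, best of {n_models}: {np.max(aucs):.3f}")
# The best "model" typically lands well above 0.5 purely by chance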
Solutions:
- Nested cross-validation: Tune hyperparameters in inner loop, evaluate in outer loop
- Pre-registration: Specify your evaluation plan before looking at results
- Conservative reporting: Report confidence intervals, not just point estimates
- External validation: The ultimate test is independent data
4.3 Evaluation Metrics for Classification
Clinical Context: Your model outputs a probability of sepsis. How do you quantify whether these predictions are useful? The choice of metric shapes what “good” means—and different stakeholders care about different things.
4.3.1 The Confusion Matrix
All classification metrics derive from the confusion matrix:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |
From these four counts, we compute everything else.
4.3.2 Sensitivity and Specificity
Sensitivity (True Positive Rate, Recall): Of those with the condition, how many did we detect?
\[\text{Sensitivity} = \frac{TP}{TP + FN}\]
Specificity (True Negative Rate): Of those without the condition, how many did we correctly identify as negative?
\[\text{Specificity} = \frac{TN}{TN + FP}\]
Clinical interpretation:
- High sensitivity: Few missed cases (low false negatives). Critical for ruling OUT disease. A negative result is reassuring.
- High specificity: Few false alarms (low false positives). Critical for ruling IN disease. A positive result is convincing.
The clinical mnemonic: SnNout (Sensitive test, Negative result rules out) and SpPin (Specific test, Positive result rules in).
4.3.3 The Sensitivity-Specificity Tradeoff
Most models output a probability, and we choose a threshold to convert to binary predictions. Moving this threshold trades sensitivity for specificity:
- Lower threshold (call more things positive): Higher sensitivity, lower specificity
- Higher threshold (call fewer things positive): Lower sensitivity, higher specificity
The right threshold depends on clinical context:
| Scenario | Priority | Threshold |
|---|---|---|
| Cancer screening | Don’t miss cases | Low (favor sensitivity) |
| Surgical indication | Avoid unnecessary surgery | High (favor specificity) |
| ICU admission | Balance both | Depends on capacity |
from sklearn.metrics import precision_recall_curve, roc_curve
import numpy as np
# Model probability outputs (not binary predictions)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.4, 0.3, 0.2, 0.1, 0.15, 0.6, 0.05, 0.1])
# Different thresholds give different sensitivity/specificity
for threshold in [0.3, 0.5, 0.7]:
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    print(f"Threshold {threshold}: Sensitivity={sensitivity:.2f}, Specificity={specificity:.2f}")

4.3.4 PPV and NPV: What Clinicians Care About
While sensitivity and specificity describe the test, clinicians need to answer: “Given this result, what’s the probability of disease?”
Positive Predictive Value (PPV): If the test is positive, what’s the probability of disease?
\[\text{PPV} = \frac{TP}{TP + FP}\]
Negative Predictive Value (NPV): If the test is negative, what’s the probability of no disease?
\[\text{NPV} = \frac{TN}{TN + FN}\]
Critical insight: PPV and NPV depend on prevalence. The same test performs differently in different populations.
def calculate_ppv_npv(sensitivity, specificity, prevalence):
    """Calculate PPV and NPV from sensitivity, specificity, and prevalence."""
    # Using Bayes' theorem
    tp_rate = sensitivity * prevalence
    fp_rate = (1 - specificity) * (1 - prevalence)
    fn_rate = (1 - sensitivity) * prevalence
    tn_rate = specificity * (1 - prevalence)
    ppv = tp_rate / (tp_rate + fp_rate)
    npv = tn_rate / (tn_rate + fn_rate)
    return ppv, npv
# Same test (90% sensitivity, 90% specificity) in different populations
# Same test (90% sensitivity, 90% specificity) in different populations
for prevalence in [0.01, 0.10, 0.50]:
    ppv, npv = calculate_ppv_npv(0.90, 0.90, prevalence)
    print(f"Prevalence {prevalence*100:.0f}%: PPV={ppv:.2f}, NPV={npv:.2f}")
# Output:
# Prevalence 1%: PPV=0.08, NPV=1.00
# Prevalence 10%: PPV=0.50, NPV=0.99
# Prevalence 50%: PPV=0.90, NPV=0.90

This is why a screening test that works in a specialty clinic (high prevalence) may generate mostly false positives in primary care (low prevalence).
4.3.5 ROC Curves and AUC
The Receiver Operating Characteristic (ROC) curve plots sensitivity vs. (1 - specificity) across all possible thresholds.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
# Simulated model outputs
np.random.seed(42)
n = 1000
y_true = np.random.binomial(1, 0.3, n) # 30% positive
# Better model: higher scores for positives
y_prob_good = np.where(y_true == 1,
np.random.beta(5, 2, n), # Positives: higher scores
np.random.beta(2, 5, n)) # Negatives: lower scores
# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_prob_good)
auc = roc_auc_score(y_true, y_prob_good)
# Plot
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, 'b-', linewidth=2, label=f'Model (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.5)')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.title('ROC Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

AUC (Area Under the ROC Curve) summarizes discrimination in a single number:
- AUC = 0.5: No discrimination (random guessing)
- AUC = 0.7-0.8: Acceptable discrimination
- AUC = 0.8-0.9: Excellent discrimination
- AUC > 0.9: Outstanding discrimination
Interpretation: AUC is the probability that a randomly chosen positive case ranks higher than a randomly chosen negative case.
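You can check this interpretation on the simulated example above by counting correctly ordered positive/negative pairs (ties counted as half):

# Verify the pairwise interpretation of AUC on the simulated scores above
pos_scores = y_prob_good[y_true == 1]
neg_scores = y_prob_good[y_true == 0]
# Fraction of (positive, negative) pairs where the positive case scores higher
greater = (pos_scores[:, None] > neg_scores[None, :]).mean()
ties = (pos_scores[:, None] == neg_scores[None, :]).mean()
print(f"Pairwise probability: {greater + 0.5 * ties:.3f} vs AUC: {auc:.3f}")  # identical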
4.3.6 Limitations of AUC
AUC is popular but has important limitations:
- Ignores calibration: A model can have high AUC but output poorly calibrated probabilities
- Averages across all thresholds: Clinically, you’ll use one threshold—performance there matters more
- Insensitive to class imbalance: Can look good even when PPV is terrible
- Doesn’t reflect clinical utility: Two models with the same AUC might have very different clinical value
Use AUC as one metric among several, not the sole criterion.
4.3.7 Precision-Recall Curves
For imbalanced datasets (rare diseases), precision-recall curves are often more informative than ROC.
Precision = PPV = \(\frac{TP}{TP + FP}\)
Recall = Sensitivity = \(\frac{TP}{TP + FN}\)
from sklearn.metrics import precision_recall_curve, average_precision_score
# Imbalanced data: 5% positive
np.random.seed(42)
n = 1000
y_true_imbalanced = np.random.binomial(1, 0.05, n)
y_prob_imbalanced = np.where(y_true_imbalanced == 1,
np.random.beta(4, 2, n),
np.random.beta(2, 4, n))
precision, recall, thresholds = precision_recall_curve(y_true_imbalanced, y_prob_imbalanced)
ap = average_precision_score(y_true_imbalanced, y_prob_imbalanced)
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, 'b-', linewidth=2, label=f'Model (AP = {ap:.3f})')
plt.axhline(y=0.05, color='k', linestyle='--', label='Random (prevalence)')
plt.xlabel('Recall (Sensitivity)')
plt.ylabel('Precision (PPV)')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Average Precision (AP) summarizes the PR curve. For rare outcomes, this is often more meaningful than AUC.
4.3.8 Calibration: Are the Probabilities Right?
A model might correctly rank patients by risk but output wrong probability values. Calibration measures whether predicted probabilities match observed frequencies.
If a model says “30% risk of readmission” for a group of patients, about 30% should actually be readmitted.
from sklearn.calibration import calibration_curve
# Generate predictions and outcomes
np.random.seed(42)
n = 2000
y_prob = np.random.beta(2, 5, n) # Model predictions
# Create outcomes that are systematically different from predictions (miscalibrated)
y_true = np.random.binomial(1, np.clip(y_prob * 1.5, 0, 1), n)
# Compute calibration curve
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
plt.figure(figsize=(8, 6))
plt.plot(prob_pred, prob_true, 'bo-', label='Model')
plt.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
plt.xlabel('Mean Predicted Probability')
plt.ylabel('Fraction of Positives')
plt.title('Calibration Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Brier Score combines discrimination and calibration:
\[\text{Brier Score} = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)^2\]
Lower is better. Brier score is the mean squared error of probability predictions.
from sklearn.metrics import brier_score_loss
brier = brier_score_loss(y_true, y_prob)
print(f"Brier Score: {brier:.4f}")
# Perfect predictions: 0; always predicting p=0.5: 0.25

Why calibration matters clinically:
- Shared decision-making requires accurate probabilities (“Your risk is about 15%”)
- Clinical guidelines often use risk thresholds (“Treat if risk > 10%”)
- Poorly calibrated models can systematically over- or under-treat
4.3.9 Decision Curve Analysis and Net Benefit
Clinical Context: Your sepsis model has AUC 0.85 and is well-calibrated. But should clinicians actually use it? AUC measures discrimination; calibration measures probability accuracy. Neither directly answers: “Does using this model improve clinical decisions compared to alternatives?” Decision curve analysis (DCA) fills this gap (Vickers and Elkin 2006).
The Clinical Decision Problem
Consider a model predicting 30-day mortality. At what probability threshold should you change management (e.g., palliative care consultation)? This depends on the relative costs of:
- False positives: Unnecessary interventions, patient/family distress
- False negatives: Missed opportunities to improve end-of-life care
Different clinicians may have different thresholds depending on patient preferences and clinical context. DCA evaluates model utility across a range of thresholds.
Net Benefit
The net benefit at threshold \(p_t\) is:
\[ \text{Net Benefit} = \frac{TP}{n} - \frac{FP}{n} \times \frac{p_t}{1 - p_t} \]
The weighting factor \(\frac{p_t}{1-p_t}\) reflects the relative harm of false positives vs. false negatives implied by the threshold choice. At a threshold of 0.2, the factor is \(0.2/0.8 = 0.25\), so you’re implicitly saying a false positive is one-quarter as harmful as a false negative.
import numpy as np
import matplotlib.pyplot as plt
def calculate_net_benefit(y_true, y_prob, thresholds):
    """Calculate net benefit at each threshold."""
    n = len(y_true)
    net_benefits = []
    for pt in thresholds:
        y_pred = (y_prob >= pt).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        # Net benefit formula
        nb = (tp / n) - (fp / n) * (pt / (1 - pt))
        net_benefits.append(nb)
    return np.array(net_benefits)
def decision_curve_analysis(y_true, y_prob, model_name="Model"):
    """Generate decision curve analysis plot."""
    thresholds = np.arange(0.01, 0.99, 0.01)
    # Model net benefit
    nb_model = calculate_net_benefit(y_true, y_prob, thresholds)
    # "Treat all" strategy (everyone gets intervention)
    prevalence = y_true.mean()
    nb_treat_all = prevalence - (1 - prevalence) * thresholds / (1 - thresholds)
    # "Treat none" strategy (baseline = 0)
    nb_treat_none = np.zeros_like(thresholds)
    # Plot
    plt.figure(figsize=(10, 6))
    plt.plot(thresholds, nb_model, 'b-', linewidth=2, label=model_name)
    plt.plot(thresholds, nb_treat_all, 'r--', linewidth=1.5, label='Treat All')
    plt.plot(thresholds, nb_treat_none, 'k-', linewidth=1, label='Treat None')
    plt.xlabel('Threshold Probability')
    plt.ylabel('Net Benefit')
    plt.title('Decision Curve Analysis')
    plt.legend()
    plt.xlim([0, 1])
    plt.ylim([-0.1, max(prevalence, 0.3)])
    plt.grid(True, alpha=0.3)
    plt.show()
    return thresholds, nb_model

# Example usage
thresholds, nb = decision_curve_analysis(y_true, y_prob, "Sepsis Model")

Interpreting Decision Curves
The decision curve shows net benefit (y-axis) across threshold probabilities (x-axis). Compare your model to two reference strategies:
- Treat All: Intervene on everyone regardless of model
- Treat None: Never intervene (net benefit = 0)
Good model: Net benefit exceeds both “treat all” and “treat none” across clinically relevant thresholds.
Key insights:
- A model above “treat all” at low thresholds means it correctly identifies low-risk patients who don’t need intervention
- A model above “treat none” at high thresholds means it correctly identifies high-risk patients who need intervention
- The threshold range where the model is superior defines its clinical utility
Comparing Multiple Models
def compare_models_dca(y_true, models_dict):
    """Compare multiple models using decision curve analysis."""
    thresholds = np.arange(0.01, 0.99, 0.01)
    prevalence = y_true.mean()
    plt.figure(figsize=(10, 6))
    # Reference strategies
    nb_treat_all = prevalence - (1 - prevalence) * thresholds / (1 - thresholds)
    plt.plot(thresholds, nb_treat_all, 'r--', linewidth=1.5, label='Treat All')
    plt.plot(thresholds, np.zeros_like(thresholds), 'k-', linewidth=1, label='Treat None')
    # Each model
    colors = ['blue', 'green', 'orange', 'purple']
    for i, (name, y_prob) in enumerate(models_dict.items()):
        nb = calculate_net_benefit(y_true, y_prob, thresholds)
        plt.plot(thresholds, nb, color=colors[i % len(colors)],
                 linewidth=2, label=name)
    plt.xlabel('Threshold Probability')
    plt.ylabel('Net Benefit')
    plt.title('Decision Curve Analysis: Model Comparison')
    plt.legend()
    plt.xlim([0, 0.5])  # Focus on clinically relevant range
    plt.ylim([-0.05, max(prevalence, 0.3)])
    plt.grid(True, alpha=0.3)
    plt.show()

# Compare models (y_prob_lr, y_prob_rf, y_prob_xgb are predicted probabilities from previously fitted models)
models = {
    'Logistic Regression': y_prob_lr,
    'Random Forest': y_prob_rf,
    'XGBoost': y_prob_xgb
}
compare_models_dca(y_true, models)

When to Use DCA
Decision curve analysis is most valuable when:
- You need to justify clinical adoption (not just statistical performance)
- The intervention has meaningful costs/harms
- Different stakeholders may have different threshold preferences
- You’re comparing a new model against existing clinical practice
Limitations:
- Assumes threshold-based decisions (not all clinical decisions work this way)
- Doesn’t account for downstream outcomes of decisions
- Sensitive to prevalence in the evaluation cohort
DCA complements discrimination (AUC) and calibration—together they provide a complete picture of clinical utility.
4.4 Evaluation Metrics for Regression
Clinical Context: Predicting continuous outcomes—length of stay, lab values, time to event—requires different metrics than classification. Here we focus on regression evaluation.
4.4.1 Mean Squared Error and Variants
Mean Squared Error (MSE): Average squared difference between predictions and actual values.
\[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]
Root Mean Squared Error (RMSE): Square root of MSE, in original units.
\[\text{RMSE} = \sqrt{\text{MSE}}\]
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np
# Example: predicting length of stay (days)
y_true = np.array([3, 5, 2, 8, 4, 6, 3, 7, 5, 4])
y_pred = np.array([3.5, 4.5, 2.5, 7, 4.5, 5.5, 3, 8, 4.5, 4])
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
print(f"MSE: {mse:.3f}")
print(f"RMSE: {rmse:.3f} days")MSE penalizes large errors heavily (squared). This may or may not be desirable:
- Appropriate: If large errors are disproportionately bad (missing a 10-day stay by 5 days is worse than missing two 3-day stays by 2 days each)
- Less appropriate: If all errors matter equally
4.4.2 Mean Absolute Error
Mean Absolute Error (MAE): Average absolute difference.
\[\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|\]
MAE treats all errors linearly—a 2-day error counts twice as much as a 1-day error, not four times as much (like MSE).
mae = mean_absolute_error(y_true, y_pred)
print(f"MAE: {mae:.3f} days")MAE is more robust to outliers than MSE. For length of stay prediction with occasional 30+ day stays, MAE may be more clinically meaningful.
4.4.3 R-Squared and Its Limitations
R-squared (\(R^2\)): Proportion of variance explained by the model.
\[R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}\]
- \(R^2 = 1\): Perfect predictions
- \(R^2 = 0\): Model no better than predicting the mean
- \(R^2 < 0\): Model worse than predicting the mean (possible for bad models on test data)
r2 = r2_score(y_true, y_pred)
print(f"R-squared: {r2:.3f}")Limitations of \(R^2\):
- Scale-dependent comparison: Can’t compare across different outcomes
- Doesn’t indicate clinical utility: High \(R^2\) doesn’t mean predictions are useful
- Can be misleading with narrow outcome range: Easy to get high \(R^2\) when variance is small
4.4.4 Clinical Example: Risk Score Prediction
Many clinical risk scores (APACHE, SOFA, Framingham) predict continuous or semi-continuous outcomes. Evaluation should consider:
- Discrimination: Does the score rank-order risk correctly? (Compare via concordance index; see the sketch below)
- Calibration: Do predicted values match observed outcomes?
- Clinical utility: Do predictions at decision thresholds work?
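The concordance index mentioned above can be computed directly for uncensored outcomes; here is a minimal sketch (censored survival data needs a dedicated implementation from a survival-analysis library):

import numpy as np

def concordance_index(y_true, y_score):
    """Fraction of usable pairs where the higher predicted risk goes with the worse observed outcome.
    Sketch for uncensored outcomes only."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    concordant, permissible = 0.0, 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # tied outcomes are not informative about ranking
            permissible += 1
            if (y_score[i] - y_score[j]) * (y_true[i] - y_true[j]) > 0:
                concordant += 1       # correctly ordered pair
            elif y_score[i] == y_score[j]:
                concordant += 0.5     # tied predictions count as half
    return concordant / permissible

# Toy check: for a binary outcome the c-index equals the AUC
print(concordance_index([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75

For a binary outcome, the c-index reduces to the AUC.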
4.5 Cross-Validation and Model Selection
Clinical Context: You’ve built a sepsis prediction model that achieves 0.89 AUC on your development data. Will it work on tomorrow’s patients? Next month’s? At other hospitals? Proper validation is the only way to know.
4.5.1 The Train/Validation/Test Split
The fundamental rule: never evaluate on training data. Models memorize training examples; training performance is optimistically biased.
Standard three-way split:
- Training set (~60-70%): Fit model parameters
- Validation set (~15-20%): Tune hyperparameters, select model variants
- Test set (~15-20%): Final, one-time evaluation
from sklearn.model_selection import train_test_split
# First split: separate test set
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Second split: training and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)
# Results in 60% train, 20% val, 20% test

4.5.2 Why This Matters: Overfitting
Overfitting occurs when a model learns patterns specific to the training data that don’t generalize. Signs:
- Training performance much better than validation/test performance
- Performance degrades on new data
- Model captures noise rather than signal
Complex models overfit more easily. A deep neural network can memorize 1000 patient records perfectly—but this memorization won’t help with patient 1001.
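A quick demonstration on synthetic, signal-free data (a sketch, not a recipe) shows the characteristic gap between training and held-out performance:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
rng = np.random.default_rng(0)
# 200 "patients" with 50 random features and a random outcome: no real signal
X = rng.normal(size=(200, 50))
y = rng.binomial(1, 0.5, 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Train AUC: {train_auc:.2f}, Test AUC: {test_auc:.2f}")  # ~1.0 vs ~0.5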
4.5.3 K-Fold Cross-Validation
Cross-validation provides more robust performance estimates by averaging across multiple splits.
K-fold cross-validation:
1. Divide data into K equal folds
2. For each fold: train on K-1 folds, evaluate on the held-out fold
3. Average performance across all folds
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
# 5-fold cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
print(f"AUC: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")Stratified K-fold ensures each fold has approximately the same class distribution—essential for imbalanced medical datasets.
Common choices:
- K=5 or K=10: Standard choices
- Leave-one-out (K=n): Maximum use of data, but high variance and computationally expensive
4.5.4 Data Leakage: The Silent Killer
Data leakage occurs when information from the test set influences training. It’s the most common cause of models that work in development but fail in deployment.
Common sources in medical AI:
1. Patient-level leakage: Multiple samples from the same patient appear in both training and test sets.
from sklearn.model_selection import GroupKFold
# Wrong: random split might put same patient in train and test
# Right: group by patient ID
group_kfold = GroupKFold(n_splits=5)
for train_idx, test_idx in group_kfold.split(X, y, groups=patient_ids):
    # Now no patient appears in both train and test
    pass

2. Temporal leakage: Training on future data to predict the past.
from sklearn.model_selection import TimeSeriesSplit
# For time-series data: always train on past, predict future
ts_cv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in ts_cv.split(X):
    # train_idx always comes before test_idx in time
    pass

3. Feature leakage: Features computed using test set information, or features that encode the outcome.
Examples:
- Normalizing features using mean/std computed on the full dataset (including test)
- Including diagnosis codes that are assigned after the prediction timepoint
- Using length of stay to predict mortality (LOS encodes survival time)
4. Preprocessing leakage: Fitting transformers on full data before splitting.
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Wrong: fit scaler on all data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Leaks test set statistics
X_train, X_test = train_test_split(X_scaled)
# Right: fit scaler only on training data
X_train, X_test = train_test_split(X)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Transform only, don't fit

4.5.5 Nested Cross-Validation
When tuning hyperparameters, you need two levels of validation:
- Inner loop: Select best hyperparameters
- Outer loop: Estimate performance of the tuning procedure
from sklearn.model_selection import GridSearchCV, cross_val_score
# Inner CV for hyperparameter tuning
param_grid = {'C': [0.1, 1, 10]}
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=inner_cv, scoring='roc_auc')
# Outer CV for performance estimation
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv, scoring='roc_auc')
print(f"Nested CV AUC: {nested_scores.mean():.3f} (+/- {nested_scores.std()*2:.3f})")Without nested CV, you’re evaluating on the same data used to select hyperparameters—leading to optimistic estimates.
4.5.6 External Validation: The Reality Check
Clinical Context: Your sepsis model achieves AUC 0.89 with nested cross-validation at your hospital. You’re ready to publish and deploy. But will it work at the community hospital across town? At a pediatric hospital? In a different country? External validation—testing on completely independent data—is the only way to know (Steyerberg et al. 2010).
Why External Validation Matters
Internal validation (train/test splits, cross-validation) all use data from the same source. This shared source means:
- Same patient population demographics
- Same clinical workflows and documentation practices
- Same lab equipment and reference ranges
- Same EHR system and coding conventions
- Same time period with same disease patterns
Models learn all of these factors, not just the clinical signal. External validation reveals how much of your performance depends on site-specific artifacts.
The Performance Drop Is Real
Studies consistently show 5-20% AUC degradation when models move to new sites:
| Domain | Internal AUC | External AUC | Drop |
|---|---|---|---|
| Sepsis prediction | 0.89 | 0.78 | 12% |
| Chest X-ray diagnosis | 0.94 | 0.82 | 13% |
| Readmission risk | 0.76 | 0.68 | 11% |
| Mortality prediction | 0.91 | 0.82 | 10% |
These are representative ranges from the literature; actual drops vary by model and site.
This isn’t failure—it’s reality. A model that shows no external validation drop is either exceptionally robust or hasn’t been properly tested.
Types of External Validation
Temporal validation: Test on data from a later time period at the same site.
# Split by time, not randomly
train_data = df[df['admission_date'] < '2022-01-01']
test_data = df[df['admission_date'] >= '2022-01-01']
# Train on historical, test on recent
model.fit(train_data[features], train_data[outcome])
y_prob = model.predict_proba(test_data[features])[:, 1]
temporal_auc = roc_auc_score(test_data[outcome], y_prob)

This catches temporal drift but not site-specific factors.
Geographic validation: Test on data from different institutions.
# Train on Site A, test on Site B
site_a_data = df[df['site'] == 'A']
site_b_data = df[df['site'] == 'B']
model.fit(site_a_data[features], site_a_data[outcome])
y_prob = model.predict_proba(site_b_data[features])[:, 1]
geographic_auc = roc_auc_score(site_b_data[outcome], y_prob)

This is the gold standard but requires multi-site data access.
Population validation: Test on different patient subgroups (pediatric vs. adult, ICU vs. general ward).
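A sketch of subgroup evaluation, assuming the dataframe has a hypothetical subgroup column ('age_group') along with outcome and predicted-probability columns (the column names here are illustrative, not from the text):

from sklearn.metrics import roc_auc_score
# Hypothetical columns: 'age_group' (subgroup), 'y_true' (outcome), 'y_prob' (model output)
for group, subset in df.groupby('age_group'):
    if subset['y_true'].nunique() < 2:
        continue  # AUC is undefined when a subgroup contains only one class
    auc = roc_auc_score(subset['y_true'], subset['y_prob'])
    print(f"{group}: n={len(subset)}, AUC={auc:.3f}")

Large gaps between subgroups are a warning sign even when the overall AUC looks good.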
Prospective validation: Deploy the model and evaluate on new patients in real-time. The ultimate test, but requires deployment infrastructure.
What to Report
When publishing or deploying a model, report:
- Internal performance: Cross-validation on development data
- Temporal performance: Hold-out from later time period
- External performance: At least one independent site (if possible)
- Performance difference: Explicit comparison showing the drop
- Subgroup performance: Across demographics, severity, etc.
If You Can’t Get External Data
Multi-site data isn’t always available. Alternatives:
- Simulate site differences: Add noise, shift distributions, test robustness
- Leave-one-site-out: If you have multi-site data, train on N-1 sites, test on held-out site
- Temporal hold-out: Strictly time-separated test set
- Be honest: State clearly that external validation is needed before deployment
import pandas as pd

def leave_one_site_out_cv(df, sites_col, features, outcome, model_class):
    """Leave-one-site-out cross-validation."""
    sites = df[sites_col].unique()
    results = []
    for test_site in sites:
        train_data = df[df[sites_col] != test_site]
        test_data = df[df[sites_col] == test_site]
        model = model_class()
        model.fit(train_data[features], train_data[outcome])
        y_prob = model.predict_proba(test_data[features])[:, 1]
        auc = roc_auc_score(test_data[outcome], y_prob)
        results.append({
            'test_site': test_site,
            'n_test': len(test_data),
            'auc': auc
        })
    return pd.DataFrame(results)

# Evaluate generalization across sites
results = leave_one_site_out_cv(df, 'hospital_id', feature_cols, 'outcome', LogisticRegression)
print(f"Mean AUC: {results['auc'].mean():.3f} (range: {results['auc'].min():.3f}-{results['auc'].max():.3f})")

Red Flags in Published Work
Be skeptical of papers that:
- Report only internal validation
- Use random splits on multi-site data (without site-aware splitting)
- Claim generalization without testing it
- Show AUC > 0.95 on tasks where clinical experts achieve 0.80
External validation isn’t optional—it’s the minimum bar for clinical credibility.
4.6 Comparing Models and Statistical Testing
Clinical Context: Your new deep learning model achieves AUC 0.85; the existing logistic regression gets 0.82. Is the improvement real, or just noise?
4.6.1 Bootstrap Confidence Intervals
Bootstrap resampling provides confidence intervals for any metric:
from sklearn.metrics import roc_auc_score
from sklearn.utils import resample
import numpy as np
def bootstrap_auc(y_true, y_prob, n_bootstrap=1000):
    """Compute bootstrap confidence interval for AUC."""
    aucs = []
    n = len(y_true)
    for _ in range(n_bootstrap):
        # Sample with replacement
        indices = resample(range(n), n_samples=n, random_state=None)
        # Skip if only one class in bootstrap sample
        if len(np.unique(y_true[indices])) < 2:
            continue
        auc = roc_auc_score(y_true[indices], y_prob[indices])
        aucs.append(auc)
    # 95% confidence interval
    lower = np.percentile(aucs, 2.5)
    upper = np.percentile(aucs, 97.5)
    return np.mean(aucs), lower, upper

# Example usage
mean_auc, lower, upper = bootstrap_auc(y_true, y_prob)
print(f"AUC: {mean_auc:.3f} (95% CI: {lower:.3f}-{upper:.3f})")

4.6.2 Comparing Two Models
To compare models on the same test set, use paired comparisons that account for correlation.
DeLong’s test compares AUCs:
# Using scipy for a simple comparison
# For DeLong's test, consider packages like 'pROC' (R) or specialized Python implementations
# Simple paired bootstrap comparison
def compare_models_bootstrap(y_true, y_prob_a, y_prob_b, n_bootstrap=1000):
    """Compare two models using paired bootstrap."""
    diffs = []
    n = len(y_true)
    for _ in range(n_bootstrap):
        indices = resample(range(n), n_samples=n)
        if len(np.unique(y_true[indices])) < 2:
            continue
        auc_a = roc_auc_score(y_true[indices], y_prob_a[indices])
        auc_b = roc_auc_score(y_true[indices], y_prob_b[indices])
        diffs.append(auc_a - auc_b)
    # Check if CI excludes zero
    lower = np.percentile(diffs, 2.5)
    upper = np.percentile(diffs, 97.5)
    mean_diff = np.mean(diffs)
    return mean_diff, lower, upper

diff, lower, upper = compare_models_bootstrap(y_true, y_prob_model_a, y_prob_model_b)
print(f"AUC difference: {diff:.3f} (95% CI: {lower:.3f}-{upper:.3f})")
if lower > 0:
    print("Model A significantly better")
elif upper < 0:
    print("Model B significantly better")
else:
    print("No significant difference")

McNemar’s test compares binary predictions:
from scipy.stats import chi2
def mcnemar_test(y_true, y_pred_a, y_pred_b):
    """McNemar's test for comparing two classifiers."""
    # Count discordant pairs
    b = np.sum((y_pred_a == y_true) & (y_pred_b != y_true))  # A right, B wrong
    c = np.sum((y_pred_a != y_true) & (y_pred_b == y_true))  # A wrong, B right
    # McNemar statistic (with continuity correction)
    statistic = (abs(b - c) - 1)**2 / (b + c)
    p_value = 1 - chi2.cdf(statistic, df=1)
    return statistic, p_value, b, c

stat, p, b, c = mcnemar_test(y_true, y_pred_model_a, y_pred_model_b)
print(f"McNemar's chi-squared: {stat:.3f}, p-value: {p:.4f}")
print(f"Model A correct & B wrong: {b}, Model A wrong & B correct: {c}")

4.6.3 Effect Sizes and Clinical Significance
Statistical significance doesn’t imply clinical significance. Consider:
- Minimum clinically important difference: How much improvement matters?
- Effect size: Standardized measure of difference magnitude
- Decision-analytic perspective: Does the improvement change clinical decisions?
An AUC improvement from 0.82 to 0.83 might be statistically significant with enough data, but clinically meaningless. An improvement from 0.75 to 0.85 is likely both statistically and clinically significant.
The following section on information theory explains the mathematical foundation of neural network training. If you’re reading for conceptual understanding, you can skip to Chapter 7—the key takeaway is that “cross-entropy loss” measures how wrong a model’s probability predictions are, and training minimizes this error.
4.7 Information Theory Essentials
Clinical Context: When training neural networks, you’ll minimize “cross-entropy loss.” Understanding where this comes from helps you choose appropriate loss functions and interpret training dynamics.
4.7.1 Entropy: Measuring Uncertainty
Entropy quantifies uncertainty in a probability distribution:
\[H(X) = -\sum_x P(x) \log P(x)\]
For a binary outcome with probability \(p\):
\[H = -p \log p - (1-p) \log(1-p)\]
Properties:
- Maximum entropy when \(p = 0.5\) (maximum uncertainty)
- Zero entropy when \(p = 0\) or \(p = 1\) (certainty)
import numpy as np
import matplotlib.pyplot as plt
def binary_entropy(p):
    """Calculate entropy of a Bernoulli distribution."""
    if p == 0 or p == 1:
        return 0
    return -p * np.log2(p) - (1-p) * np.log2(1-p)
# Plot entropy vs probability
p_values = np.linspace(0.01, 0.99, 100)
entropy_values = [binary_entropy(p) for p in p_values]
plt.figure(figsize=(8, 5))
plt.plot(p_values, entropy_values, 'b-', linewidth=2)
plt.xlabel('Probability p')
plt.ylabel('Entropy (bits)')
plt.title('Binary Entropy Function')
plt.grid(True, alpha=0.3)
plt.show()

4.7.2 Cross-Entropy and Log Loss
Cross-entropy measures how well a predicted distribution \(Q\) matches the true distribution \(P\):
\[H(P, Q) = -\sum_x P(x) \log Q(x)\]
For classification, \(P\) is the true label (0 or 1) and \(Q\) is the predicted probability:
\[\text{Cross-Entropy Loss} = -\frac{1}{n}\sum_i [y_i \log(\hat{p}_i) + (1-y_i) \log(1-\hat{p}_i)]\]
This is also called log loss or binary cross-entropy.
Why cross-entropy for classification?
- Gradient properties: Provides useful gradients for optimization
- Probability interpretation: Minimizing cross-entropy = maximizing likelihood
- Calibration incentive: Encourages well-calibrated probability outputs
from sklearn.metrics import log_loss
# Cross-entropy loss
y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.1, 0.8, 0.7, 0.3])
loss = log_loss(y_true, y_prob)
print(f"Cross-entropy loss: {loss:.4f}")
# What happens with overconfident wrong predictions?
y_prob_bad = np.array([0.9, 0.9, 0.8, 0.7, 0.3]) # Second prediction is wrong and confident
loss_bad = log_loss(y_true, y_prob_bad)
print(f"With overconfident error: {loss_bad:.4f}") # Much higher loss4.7.3 Connection to Loss Functions
The loss functions in later chapters derive from information theory:
- Classification: Cross-entropy loss (log loss)
- Multi-class: Categorical cross-entropy
- Regression: MSE corresponds to a Gaussian likelihood (see the derivation below)
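To make the regression case concrete: if the outcome is modeled as Gaussian around the prediction, \(y_i \sim \mathcal{N}(\hat{y}_i, \sigma^2)\), the negative log-likelihood of one observation is
\[-\log p(y_i | \hat{y}_i) = \frac{(y_i - \hat{y}_i)^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)\]
For fixed \(\sigma^2\), the second term is constant, so minimizing MSE is the same as maximizing the Gaussian likelihood, just as minimizing cross-entropy maximizes the Bernoulli likelihood.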
Understanding this connection helps when:
- Choosing loss functions for custom problems
- Interpreting what “loss” measures
- Understanding why calibrated probabilities emerge
4.8 Common Pitfalls in Medical AI Evaluation
Clinical Context: Published medical AI papers often have methodological flaws that inflate reported performance. Knowing these pitfalls helps you critically read the literature—and avoid the same mistakes.
4.8.1 Testing on Training Data
Surprisingly common, especially in smaller studies. Signs:
- No mention of train/test split
- Performance reported on “the dataset” without specifying hold-out
- Implausibly high performance (AUC > 0.95 for difficult tasks)
Always ask: “What data were these numbers computed on?”
4.8.2 Data Leakage Patterns
Common in medical AI:
| Leakage Type | Example | How to Detect |
|---|---|---|
| Patient leakage | Multiple scans from same patient in train/test | Check patient IDs |
| Temporal leakage | Training on 2020, testing on 2019 | Check date distributions |
| Feature leakage | Including post-outcome variables | Review feature collection timing |
| Preprocessing leakage | Fitting scaler on all data | Review preprocessing code |
4.8.3 Overfitting to the Test Set
When researchers iterate on test results, they implicitly tune to the test set. Signs:
- Multiple papers on same dataset with steadily improving results
- Model selection based on test performance
- “Final” test set used multiple times
Solutions: Pre-registration, external validation, time-locked test sets.
4.8.4 Class Imbalance Mishandling
In medical data, outcomes are often rare (1-5% positive). Problems:
- Accuracy is misleading (99% accuracy means nothing if predicting “no disease” always achieves 99%)
- Random undersampling loses information
- Oversampling can create leakage if done before splitting
Better approaches:
- Use appropriate metrics (PR-AUC, F1, specificity at a fixed sensitivity—see the sketch below)
- Apply resampling within cross-validation folds
- Consider cost-sensitive learning
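As one example, here is a minimal sketch of “specificity at a fixed sensitivity”: pick the threshold that first meets the sensitivity target, then report the specificity there (reusing the imbalanced simulation from Section 4.3.7):

import numpy as np
from sklearn.metrics import roc_curve

def specificity_at_sensitivity(y_true, y_prob, target_sensitivity=0.90):
    """Specificity at the first ROC operating point whose sensitivity meets the target."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    idx = np.argmax(tpr >= target_sensitivity)  # tpr is non-decreasing along the ROC curve
    return 1 - fpr[idx], thresholds[idx]

spec, thr = specificity_at_sensitivity(y_true_imbalanced, y_prob_imbalanced, 0.90)
print(f"Specificity at 90% sensitivity: {spec:.2f} (threshold {thr:.2f})")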
4.8.5 Dataset Shift
Models trained on one population may fail on another due to:
- Covariate shift: Input distribution changes (different demographics)
- Prior shift: Outcome prevalence changes (screening vs. symptomatic population)
- Concept shift: Relationship between inputs and outcomes changes (different disease subtypes)
External validation is the only way to assess generalization.
4.8.6 Confusing Discrimination with Calibration
A model can perfectly rank-order risk (good discrimination) while outputting completely wrong probabilities (poor calibration). Many evaluations report only AUC.
Always assess calibration for models that output probabilities, especially if those probabilities will be communicated to patients or used for treatment thresholds.
4.9 Chapter Summary
Statistics provides the foundation for rigorous medical AI development:
Probability foundations:
- Bayes’ theorem connects test characteristics to clinical utility
- PPV and NPV depend on prevalence—not just sensitivity/specificity
- Pre-test probability matters even for AI predictions

From hypothesis testing to ML evaluation:
- P-values don’t evaluate model performance
- The focus shifts from “is this effect real?” to “how well does this generalize?”
- Multiple comparisons require careful handling

Classification metrics:
- The confusion matrix is the foundation
- Sensitivity/specificity describe the test; PPV/NPV answer clinical questions
- AUC summarizes discrimination; calibration assesses probability accuracy
- No single metric captures everything—use multiple

Validation strategy:
- Never evaluate on training data
- Cross-validation provides robust estimates
- Data leakage is subtle and devastating
- External validation is the gold standard

Model comparison:
- Bootstrap confidence intervals quantify uncertainty
- Statistical significance ≠ clinical significance
- Effect sizes matter more than p-values
These concepts appear throughout the book. Every model we build will be evaluated using these metrics; every result should be interpreted with these caveats in mind.
4.10 Exercises
1. A breast cancer screening AI has 90% sensitivity and 95% specificity. In a screening population with 0.5% prevalence, calculate the PPV. What if the same AI is used in a diagnostic setting with 30% prevalence?
2. You’re comparing two sepsis prediction models. Model A has AUC 0.82; Model B has AUC 0.79. Model A was developed using random patient splitting; Model B used patient-level splitting. Which model should you trust more? Why?
3. A paper reports 94% accuracy for detecting diabetic retinopathy. What questions would you ask before concluding this is a useful model?
4. Implement a function that computes sensitivity, specificity, PPV, and NPV from predictions and labels. Test it on synthetic data with varying prevalence.
5. Create a dataset with 1000 samples where 50 samples from 10 patients are duplicated across train/test splits. Compare cross-validation performance with and without patient-level splitting. What do you observe?