from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Simulated clinical data: predicting length of stay
np.random.seed(42)
n = 500
age = np.random.normal(65, 15, n)
comorbidity_count = np.random.poisson(2, n)
admission_severity = np.random.uniform(0, 10, n)
# True relationship (with noise)
los = 2 + 0.05 * age + 0.8 * comorbidity_count + 0.3 * admission_severity + np.random.normal(0, 1, n)
X = np.column_stack([age, comorbidity_count, admission_severity])
y = los
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit linear regression
model = LinearRegression()
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
print(f"R-squared: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f} days")
print(f"Coefficients: {model.coef_}")6 Classical Machine Learning
Clinical Context: A hospital wants to predict 30-day readmissions using data from their electronic health records—demographics, vital signs, lab values, diagnoses. Should they train a deep neural network? Probably not. For structured, tabular clinical data, classical machine learning methods often outperform deep learning while being faster to train, easier to interpret, and more robust with limited data. This chapter covers the algorithms that remain workhorses of medical AI.
Deep learning dominates headlines, but classical machine learning remains essential in medical AI. On tabular data—the structured rows and columns that make up most clinical datasets—algorithms like random forests and gradient boosting frequently match or exceed deep learning performance. They train in seconds rather than hours, require less data, and produce models that clinicians can actually interpret.
This chapter covers the foundational algorithms every medical AI practitioner must understand: linear models, decision trees, ensemble methods, and unsupervised techniques. We focus on the key concepts—what each algorithm optimizes and why—while providing practical sklearn implementations you can apply immediately.
6.1 Why Classical ML Still Matters
6.1.1 The Tabular Data Advantage
Clinical data comes in many forms, but structured tables dominate: patient demographics, vital signs, laboratory values, diagnosis codes, medication lists. For this tabular data, classical ML has surprising staying power.
A 2022 benchmarking study found that gradient boosting (XGBoost, LightGBM) matched or outperformed deep learning on most tabular datasets, while training orders of magnitude faster. The reasons:
- Feature engineering matters more than model complexity: Clinical domain knowledge—knowing which lab values to combine, how to represent time since admission—often trumps architectural sophistication
- Sample sizes are often modest: Deep learning shines with millions of examples; clinical datasets often have thousands
- Relationships are often simple: Many clinical predictions involve roughly linear relationships that don’t require deep networks to capture
- Missing data is endemic: Classical methods handle missing values more gracefully than neural networks
6.1.2 Interpretability and Clinical Trust
Clinicians need to understand predictions before acting on them. A random forest can tell you which features drove a prediction. A gradient boosting model can show feature importance rankings. Logistic regression coefficients have direct clinical interpretations as odds ratios.
This interpretability isn’t just nice to have—it’s often essential for:
- Debugging unexpected predictions
- Explaining decisions to patients
- Satisfying regulatory requirements
- Building clinician trust in AI tools
6.1.3 When to Choose Classical ML
As a general framework:
| Scenario | Recommendation |
|---|---|
| Tabular data, <100K samples | Classical ML (usually gradient boosting) |
| Tabular data, >100K samples | Try both, classical often still wins |
| Images, audio, video | Deep learning |
| Free text | Deep learning (transformers) |
| Mixed data | Classical for tabular, deep for unstructured |
| Interpretability required | Classical ML |
| Need a quick baseline | Classical ML |
6.2 Linear Models
6.2.1 Linear Regression: Quick Review
Clinical Context: Predicting a patient’s length of stay or expected lab value—continuous outcomes—starts with linear regression.
Linear regression finds the best linear relationship between features \(X\) and a continuous outcome \(y\):
\[\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p = X\beta\]
The model minimizes the mean squared error:
\[\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]
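Minimizing the MSE has a closed-form solution given by the normal equations, \(\hat{\beta} = (X^\top X)^{-1} X^\top y\). The following minimal sketch (assuming the X_train and y_train arrays from the length-of-stay simulation above are in scope) checks that the closed-form estimate matches sklearn's LinearRegression:
# Normal equations: beta_hat = (X^T X)^(-1) X^T y, with an explicit intercept column.
# Assumes X_train, y_train from the simulated length-of-stay data above.
import numpy as np
from sklearn.linear_model import LinearRegression
X_design = np.column_stack([np.ones(len(X_train)), X_train])
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y_train)
sk_model = LinearRegression().fit(X_train, y_train)
print(f"Closed-form estimates: {np.round(beta_hat, 3)}")
print(f"sklearn estimates: {np.round(np.concatenate([[sk_model.intercept_], sk_model.coef_]), 3)}")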
6.2.2 Regularization: Ridge and Lasso
When you have many features—potentially more features than samples—ordinary linear regression overfits. Regularization adds a penalty for large coefficients.
Ridge regression (L2) penalizes the sum of squared coefficients:
\[\text{Loss} = \text{MSE} + \alpha \sum_j \beta_j^2\]
Lasso regression (L1) penalizes the sum of absolute coefficients:
\[\text{Loss} = \text{MSE} + \alpha \sum_j |\beta_j|\]
Key difference: Lasso can shrink coefficients exactly to zero, performing automatic feature selection. Ridge shrinks but doesn’t zero out.
# Ridge regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
print(f"Ridge R-squared: {r2_score(y_test, ridge.predict(X_test)):.3f}")
# Lasso regression
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
print(f"Lasso R-squared: {r2_score(y_test, lasso.predict(X_test)):.3f}")
print(f"Lasso non-zero coefficients: {np.sum(lasso.coef_ != 0)}")6.2.3 Logistic Regression: The Classification Foundation
Clinical Context: Will this patient be readmitted within 30 days? Does this patient have sepsis? Binary outcomes require classification, and logistic regression is the starting point.
Logistic regression extends linear regression to classification by passing the linear combination through the sigmoid function:
\[P(y=1|X) = \sigma(X\beta) = \frac{1}{1 + e^{-X\beta}}\]
The sigmoid maps any real number to a probability between 0 and 1.
What logistic regression optimizes: The model maximizes the likelihood of the observed data, which is equivalent to minimizing cross-entropy loss (log loss):
\[\text{Loss} = -\frac{1}{n}\sum_i [y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)]\]
This loss heavily penalizes confident wrong predictions—predicting 0.99 probability for a negative case incurs much more loss than predicting 0.6.
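A quick numeric check, using hypothetical predicted probabilities for a single negative case (\(y = 0\)), makes the penalty concrete:
# Cross-entropy loss for one negative case (y = 0) at two predicted probabilities.
import numpy as np
for p in [0.6, 0.99]:
    loss = -np.log(1 - p)  # only the (1 - y) term survives when y = 0
    print(f"Predicted P(y=1) = {p:.2f} -> loss = {loss:.2f}")
# 0.92 vs. 4.61: the confident mistake costs roughly five times more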
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, classification_report
from sklearn.preprocessing import StandardScaler
# Simulated readmission data
np.random.seed(42)
n = 1000
age = np.random.normal(65, 15, n)
prior_admissions = np.random.poisson(1, n)
charlson_score = np.random.poisson(2, n)
discharge_disposition = np.random.binomial(1, 0.3, n) # 1 = SNF, 0 = home
# Generate probability of readmission
log_odds = -3 + 0.02 * age + 0.5 * prior_admissions + 0.3 * charlson_score + 0.8 * discharge_disposition
prob = 1 / (1 + np.exp(-log_odds))
readmitted = np.random.binomial(1, prob)
X = np.column_stack([age, prior_admissions, charlson_score, discharge_disposition])
y = readmitted
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale features (important for regularization)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Fit logistic regression with L2 regularization
model = LogisticRegression(C=1.0, random_state=42) # C = 1/alpha
model.fit(X_train_scaled, y_train)
# Evaluate
y_prob = model.predict_proba(X_test_scaled)[:, 1]
y_pred = model.predict(X_test_scaled)
print(f"AUC: {roc_auc_score(y_test, y_prob):.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
6.2.4 Interpreting Coefficients
Logistic regression coefficients have a direct clinical interpretation. The coefficient \(\beta_j\) represents the change in log-odds for a one-unit increase in feature \(x_j\).
The odds ratio is \(e^{\beta_j}\)—more intuitive for clinicians:
import pandas as pd
# Interpret coefficients
feature_names = ['Age', 'Prior Admissions', 'Charlson Score', 'Discharge to SNF']
coefficients = model.coef_[0]
odds_ratios = np.exp(coefficients)
interpretation = pd.DataFrame({
'Feature': feature_names,
'Coefficient': coefficients,
'Odds Ratio': odds_ratios
})
print(interpretation.to_string(index=False))
# Example interpretation:
# Odds Ratio of 1.5 for "Prior Admissions" means each additional prior admission
# increases the odds of readmission by 50%
6.2.5 Regularization in High Dimensions
With many features (hundreds of diagnosis codes, thousands of medication combinations), regularization becomes essential:
from sklearn.linear_model import LogisticRegressionCV
# Automatically tune regularization strength with cross-validation
model_cv = LogisticRegressionCV(
Cs=10, # Try 10 values of C
cv=5,
scoring='roc_auc',
random_state=42
)
model_cv.fit(X_train_scaled, y_train)
print(f"Best C: {model_cv.C_[0]:.4f}")
print(f"AUC with tuned C: {roc_auc_score(y_test, model_cv.predict_proba(X_test_scaled)[:, 1]):.3f}")6.3 Support Vector Machines
Clinical Context: SVMs were state-of-the-art for many medical classification tasks before deep learning. While less common now, understanding the maximum margin concept provides useful intuition.
6.3.1 The Maximum Margin Idea
Support Vector Machines find the decision boundary that maximizes the margin—the distance to the nearest training points (the “support vectors”).
For linearly separable data, many boundaries correctly classify all points. SVM chooses the one with maximum margin, which tends to generalize better.
The optimization problem:
\[\text{Minimize } \frac{1}{2}||\mathbf{w}||^2 \text{ subject to } y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1\]
For non-separable data, we allow some misclassification with a penalty:
\[\text{Minimize } \frac{1}{2}||\mathbf{w}||^2 + C\sum_i \xi_i\]
The parameter \(C\) controls the tradeoff between margin size and classification errors.
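As a rough illustration of that tradeoff, the sketch below (assuming X_train_scaled and y_train from the readmission example in Section 6.2 are still in scope) counts support vectors at several settings of \(C\); a smaller \(C\) allows more slack and leans on more support vectors:
# Effect of C on the soft margin: count support vectors at several settings.
# Assumes X_train_scaled and y_train from the readmission example above.
from sklearn.svm import SVC
for C in [0.01, 1.0, 100.0]:
    svm = SVC(C=C, kernel='linear')
    svm.fit(X_train_scaled, y_train)
    print(f"C = {C:>6}: {svm.n_support_.sum()} support vectors")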
6.3.2 The Kernel Trick
SVMs can learn non-linear boundaries through the kernel trick—implicitly mapping data to a higher-dimensional space where a linear boundary suffices.
Common kernels:
- Linear: No transformation (for linearly separable problems)
- RBF (Radial Basis Function): Flexible non-linear boundaries
- Polynomial: Captures polynomial interactions
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# SVM with RBF kernel
svm_model = SVC(C=1.0, kernel='rbf', gamma='scale', probability=True, random_state=42)
svm_model.fit(X_train_scaled, y_train)
y_prob_svm = svm_model.predict_proba(X_test_scaled)[:, 1]
print(f"SVM AUC: {roc_auc_score(y_test, y_prob_svm):.3f}")6.3.3 When SVMs Work (and Don’t)
Strengths:
- Effective in high-dimensional spaces
- Memory efficient (only stores support vectors)
- Works well when classes are well-separated
Weaknesses:
- Slow to train on large datasets (scales poorly beyond ~10K samples)
- Requires careful feature scaling
- Less interpretable than linear models
- Often outperformed by gradient boosting on tabular data
For most clinical tabular data today, gradient boosting is preferred over SVMs.
6.4 Decision Trees
Clinical Context: Decision trees mirror clinical reasoning—a series of yes/no questions leading to a diagnosis or risk category. Their interpretability makes them valuable for explaining predictions to clinicians.
6.4.1 Recursive Partitioning
A decision tree splits data recursively based on feature values, creating a tree of decisions:
Age > 65?
├── No  → Low Risk
└── Yes → Charlson > 2?
          ├── No  → Medium Risk
          └── Yes → High Risk
At each node, the algorithm chooses the split that best separates the outcome classes.
6.4.2 Splitting Criteria
How do we measure “best separates”? Two common criteria:
Gini Impurity: Probability of misclassifying a randomly chosen sample
\[\text{Gini} = 1 - \sum_k p_k^2\]
Entropy: Information-theoretic measure of disorder
\[\text{Entropy} = -\sum_k p_k \log_2(p_k)\]
Both are minimized when a node is “pure” (all samples in one class).
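A small sketch computing both criteria for a few hypothetical two-class distributions makes this behavior concrete:
# Gini impurity and entropy for hypothetical two-class distributions.
import numpy as np

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # treat 0 * log2(0) as 0
    return -np.sum(p * np.log2(p))

for dist in [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5]]:
    print(f"p = {dist}: Gini = {gini(dist):.3f}, Entropy = {entropy(dist):.3f}")
# A pure node scores 0 on both; a 50/50 node maximizes both (0.5 and 1.0)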
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt
# Fit a decision tree
tree_model = DecisionTreeClassifier(
max_depth=3, # Limit depth for interpretability
min_samples_leaf=20, # Require at least 20 samples per leaf
random_state=42
)
tree_model.fit(X_train, y_train)
# Evaluate
y_prob_tree = tree_model.predict_proba(X_test)[:, 1]
print(f"Decision Tree AUC: {roc_auc_score(y_test, y_prob_tree):.3f}")
# Visualize the tree
plt.figure(figsize=(20, 10))
plot_tree(tree_model,
feature_names=['Age', 'Prior Admissions', 'Charlson', 'Discharge SNF'],
class_names=['No Readmit', 'Readmit'],
filled=True,
rounded=True)
plt.title('Decision Tree for 30-Day Readmission')
plt.tight_layout()
plt.show()
6.4.3 Controlling Complexity
Decision trees easily overfit—a fully grown tree can memorize the training data. Key hyperparameters to control complexity:
| Parameter | Effect |
|---|---|
| `max_depth` | Maximum tree depth (most important) |
| `min_samples_split` | Minimum samples to split a node |
| `min_samples_leaf` | Minimum samples required in a leaf |
| `max_features` | Number of features to consider per split |
# Compare trees of different depths
for depth in [2, 5, 10, None]:
tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
tree.fit(X_train, y_train)
train_auc = roc_auc_score(y_train, tree.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1])
depth_str = str(depth) if depth else "None"
print(f"Depth {depth_str:>4}: Train AUC = {train_auc:.3f}, Test AUC = {test_auc:.3f}")6.4.4 Strengths and Weaknesses
Strengths:
- Highly interpretable (can visualize the tree)
- Handles mixed feature types (numeric and categorical)
- No feature scaling required
- Captures non-linear relationships and interactions
Weaknesses:
- High variance: small data changes can produce very different trees
- Prone to overfitting
- Generally lower accuracy than ensembles
- Biased toward features with many values
The instability of single trees motivates ensemble methods.
6.5 Ensemble Methods
Clinical Context: Why trust one model when you can combine many? Ensemble methods aggregate multiple models to produce more robust predictions. Random forests and gradient boosting are the workhorses of clinical prediction.
6.5.1 Why Ensembles Work
Ensemble methods improve predictions by combining multiple “weak learners.” The key insight is bias-variance decomposition:
\[\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Noise}\]
- High bias: Model is too simple, misses patterns
- High variance: Model is too sensitive to training data
Different ensemble strategies address these differently:
- Bagging (e.g., Random Forest): Reduces variance by averaging many high-variance models
- Boosting (e.g., XGBoost): Reduces bias by sequentially correcting errors
6.5.2 Random Forests
Random forests combine two ideas:
- Bagging: Train many trees on bootstrap samples (random samples with replacement)
- Feature randomness: At each split, consider only a random subset of features
This decorrelates the trees, so their errors don’t compound when averaged.
from sklearn.ensemble import RandomForestClassifier
# Fit random forest
rf_model = RandomForestClassifier(
n_estimators=100, # Number of trees
max_depth=10, # Maximum depth per tree
min_samples_leaf=5, # Minimum samples per leaf
max_features='sqrt', # Features per split = sqrt(total features)
random_state=42,
n_jobs=-1 # Use all CPU cores
)
rf_model.fit(X_train, y_train)
# Evaluate
y_prob_rf = rf_model.predict_proba(X_test)[:, 1]
print(f"Random Forest AUC: {roc_auc_score(y_test, y_prob_rf):.3f}")Out-of-bag evaluation: Each tree is trained on ~63% of samples (due to bootstrap). The remaining ~37% can evaluate that tree, providing a free validation estimate:
# Enable out-of-bag scoring
rf_oob = RandomForestClassifier(
n_estimators=100,
oob_score=True, # Enable OOB evaluation
random_state=42,
n_jobs=-1
)
rf_oob.fit(X_train, y_train)
print(f"Out-of-bag accuracy: {rf_oob.oob_score_:.3f}")Feature importance: Random forests provide built-in feature importance based on how much each feature decreases impurity:
# Feature importance
feature_names = ['Age', 'Prior Admissions', 'Charlson Score', 'Discharge to SNF']
importances = rf_model.feature_importances_
importance_df = pd.DataFrame({
'Feature': feature_names,
'Importance': importances
}).sort_values('Importance', ascending=False)
print("Feature Importances:")
print(importance_df.to_string(index=False))
6.5.3 Gradient Boosting
Gradient boosting builds trees sequentially, with each tree correcting the errors of the ensemble so far.
The algorithm:
1. Start with a simple prediction (e.g., the mean)
2. Compute residuals (errors)
3. Fit a tree to predict the residuals
4. Add this tree to the ensemble (with a learning rate)
5. Repeat
XGBoost and LightGBM are optimized implementations that dominate tabular data competitions.
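Before turning to library implementations, a minimal from-scratch sketch of the residual-fitting loop shows how little machinery the core idea requires; it uses its own small simulated regression problem, so the names X_demo and y_demo are illustrative only:
# Minimal boosting loop for regression: each shallow tree is fit to the current
# residuals, and its shrunken predictions are added to the ensemble.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, 300)  # non-linear target

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # step 1: start from the mean
for _ in range(100):
    residuals = y_demo - prediction                                   # step 2
    tree = DecisionTreeRegressor(max_depth=2).fit(X_demo, residuals)  # step 3
    prediction += learning_rate * tree.predict(X_demo)                # step 4
print(f"Training MSE after boosting: {np.mean((y_demo - prediction) ** 2):.3f}")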
from sklearn.ensemble import GradientBoostingClassifier
# Sklearn's gradient boosting
gb_model = GradientBoostingClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=3,
random_state=42
)
gb_model.fit(X_train, y_train)
y_prob_gb = gb_model.predict_proba(X_test)[:, 1]
print(f"Gradient Boosting AUC: {roc_auc_score(y_test, y_prob_gb):.3f}")XGBoost is often preferred for its speed and additional features:
# XGBoost (install with: pip install xgboost)
from xgboost import XGBClassifier
xgb_model = XGBClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=3,
subsample=0.8, # Use 80% of samples per tree
colsample_bytree=0.8, # Use 80% of features per tree
random_state=42,
use_label_encoder=False,
eval_metric='auc'
)
xgb_model.fit(X_train, y_train)
y_prob_xgb = xgb_model.predict_proba(X_test)[:, 1]
print(f"XGBoost AUC: {roc_auc_score(y_test, y_prob_xgb):.3f}")6.5.4 Key Hyperparameters
For gradient boosting, the most important hyperparameters:
| Parameter | Effect | Typical Range |
|---|---|---|
| `n_estimators` | Number of trees | 100-1000 |
| `learning_rate` | Step size | 0.01-0.3 |
| `max_depth` | Tree depth | 3-10 |
| `subsample` | Row sampling | 0.5-1.0 |
| `colsample_bytree` | Column sampling | 0.5-1.0 |
Lower learning rate + more trees = better but slower. Start with defaults, then tune.
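A small sketch of that tradeoff on the readmission data used in this section (assuming X_train, X_test, y_train, y_test are still in scope); the exact AUCs and timings will vary:
# Learning rate vs. number of trees: a lower rate usually needs more trees
# (and more time) to reach comparable performance.
import time
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

for lr, n_trees in [(0.3, 50), (0.1, 100), (0.01, 1000)]:
    start = time.time()
    gb = GradientBoostingClassifier(n_estimators=n_trees, learning_rate=lr,
                                    max_depth=3, random_state=42)
    gb.fit(X_train, y_train)
    auc = roc_auc_score(y_test, gb.predict_proba(X_test)[:, 1])
    print(f"lr={lr:<5} trees={n_trees:<5} AUC={auc:.3f} time={time.time() - start:.1f}s")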
6.5.5 Random Forest vs. Gradient Boosting
| Aspect | Random Forest | Gradient Boosting |
|---|---|---|
| Training | Parallelizable, fast | Sequential, slower |
| Overfitting | Resistant | More prone |
| Tuning | Few hyperparameters | Many hyperparameters |
| Performance | Very good | Often slightly better |
| Default choice | Great starting point | For squeezing out performance |
6.6 Unsupervised Learning Essentials
Clinical Context: Not all learning involves predicting outcomes. Sometimes we want to find structure in data—patient subtypes, latent variables, or reduced-dimension representations for visualization.
6.6.1 Principal Component Analysis (PCA)
PCA finds the directions of maximum variance in high-dimensional data, enabling:
- Dimensionality reduction: Compress many features into fewer
- Visualization: Project high-dimensional data to 2D or 3D
- Preprocessing: Decorrelate features before modeling
PCA finds orthogonal axes (principal components) ordered by variance explained:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Create higher-dimensional data for demonstration
np.random.seed(42)
n = 500
# 10 features with some correlation structure
X_highdim = np.random.randn(n, 10)
X_highdim[:, 1] = X_highdim[:, 0] + np.random.randn(n) * 0.5 # Correlated
X_highdim[:, 2] = X_highdim[:, 0] - X_highdim[:, 1] + np.random.randn(n) * 0.3
# Standardize (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_highdim)
# Fit PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
# Variance explained
print("Variance explained by each component:")
for i, var in enumerate(pca.explained_variance_ratio_[:5]):
print(f" PC{i+1}: {var:.3f} ({var*100:.1f}%)")
print(f"\nTotal variance in first 3 components: {sum(pca.explained_variance_ratio_[:3]):.3f}")Using PCA for visualization:
import matplotlib.pyplot as plt
# Reduce to 2 dimensions for visualization
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)
# Create synthetic labels for visualization
labels = (X_highdim[:, 0] > 0).astype(int)
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='coolwarm', alpha=0.6)
plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]*100:.1f}% variance)')
plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]*100:.1f}% variance)')
plt.title('PCA Visualization')
plt.colorbar(scatter)
plt.show()
6.6.2 K-Means Clustering
K-means partitions data into K clusters by minimizing within-cluster variance. Clinical applications include:
- Patient subtyping (phenotyping)
- Identifying treatment response groups
- Anomaly detection
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Fit K-means with different numbers of clusters
for k in [2, 3, 4, 5]:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_scaled)
# Silhouette score: measure of cluster quality (-1 to 1, higher is better)
silhouette = silhouette_score(X_scaled, cluster_labels)
inertia = kmeans.inertia_ # Within-cluster sum of squares
print(f"K={k}: Silhouette={silhouette:.3f}, Inertia={inertia:.1f}")Choosing K: Use the elbow method (plot inertia vs. K) or silhouette score. In clinical contexts, domain knowledge often guides the choice.
# Visualize clusters in PCA space
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_scaled)
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=cluster_labels, cmap='viridis', alpha=0.6)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('K-Means Clusters (K=3)')
plt.colorbar(scatter, label='Cluster')
plt.show()
6.6.3 Limitations and Alternatives
K-means limitations:
- Assumes spherical clusters of similar size
- Sensitive to initialization (use n_init > 1)
- Must specify K in advance
- Sensitive to outliers
Alternatives:
- Hierarchical clustering: No need to specify K, produces dendrogram
- DBSCAN: Finds clusters of arbitrary shape, identifies outliers
- Gaussian Mixture Models: Soft clustering with probability assignments
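A quick sketch of the latter two on the same standardized features (X_scaled); the DBSCAN eps value here is a hypothetical starting point and is data-dependent:
# DBSCAN labels outliers as -1; a Gaussian mixture gives soft cluster probabilities.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture

db = DBSCAN(eps=1.5, min_samples=10).fit(X_scaled)  # eps must be tuned per dataset
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(f"DBSCAN: {n_clusters} clusters, {np.sum(db.labels_ == -1)} points flagged as outliers")

gmm = GaussianMixture(n_components=3, random_state=42).fit(X_scaled)
probs = gmm.predict_proba(X_scaled)
print(f"GMM: first sample's cluster membership probabilities = {np.round(probs[0], 2)}")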
6.7 Model Selection and Hyperparameter Tuning
Clinical Context: You’ve tried logistic regression, random forest, and XGBoost. How do you choose? And once you’ve chosen, how do you find the best hyperparameters?
6.7.1 The Hyperparameter Tuning Problem
Model hyperparameters (learning rate, tree depth, regularization strength) control model complexity. Finding good values is essential but tricky:
- Too simple (high regularization, shallow trees): Underfitting
- Too complex (low regularization, deep trees): Overfitting
We can’t use test data for tuning—that would leak information. We need a validation strategy.
6.7.2 Grid Search vs. Random Search
Grid search tries all combinations of specified hyperparameter values:
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [3, 5, 7],
'learning_rate': [0.01, 0.1, 0.2]
}
# Grid search with cross-validation
grid_search = GridSearchCV(
XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='auc'),
param_grid,
cv=5,
scoring='roc_auc',
n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV AUC: {grid_search.best_score_:.3f}")Random search samples random combinations, often finding good solutions faster:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint
# Define parameter distributions
param_distributions = {
'n_estimators': randint(50, 300),
'max_depth': randint(2, 10),
'learning_rate': uniform(0.01, 0.3),
'subsample': uniform(0.6, 0.4)
}
# Random search
random_search = RandomizedSearchCV(
XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='auc'),
param_distributions,
n_iter=20, # Try 20 random combinations
cv=5,
scoring='roc_auc',
random_state=42,
n_jobs=-1
)
random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV AUC: {random_search.best_score_:.3f}")6.7.3 Sklearn Pipelines
Pipelines chain preprocessing and modeling steps, ensuring proper handling of train/test splits:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
# Create pipeline: impute missing → scale → model
pipeline = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(random_state=42))
])
# Now cross-validation applies each step correctly
# (imputer and scaler fit only on training folds)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='roc_auc')
print(f"Pipeline CV AUC: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")Tuning hyperparameters within a pipeline:
# Parameter names use step__parameter format
pipe_param_grid = {
'classifier__n_estimators': [50, 100],
'classifier__max_depth': [5, 10],
'imputer__strategy': ['mean', 'median']
}
pipe_search = GridSearchCV(pipeline, pipe_param_grid, cv=5, scoring='roc_auc')
pipe_search.fit(X_train, y_train)
print(f"Best pipeline parameters: {pipe_search.best_params_}")6.7.4 Avoiding Data Leakage
Common leakage mistakes and fixes, with a demonstration of the feature-selection case after the table:
| Mistake | Fix |
|---|---|
| Scaling before splitting | Use pipeline or scale after split |
| Imputing with global statistics | Impute within pipeline |
| Feature selection on all data | Feature selection within pipeline |
| Tuning on test set | Use CV for tuning, hold out test |
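To see why this matters, here is a small self-contained demonstration on pure noise; the names X_noise and y_noise are illustrative, and the exact numbers will vary:
# With purely random features, selecting the "best" features on ALL the data before
# cross-validation inflates the AUC, while doing the selection inside a pipeline
# (refit on each training fold only) correctly reports roughly chance performance.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)
X_noise = rng.normal(size=(200, 2000))   # random features, no real signal
y_noise = rng.integers(0, 2, size=200)

# Wrong: feature selection sees the labels of every sample, including future test folds
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_auc = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y_noise,
                            cv=5, scoring='roc_auc').mean()

# Right: selection happens inside the pipeline, so it is refit within each fold
pipe = Pipeline([('select', SelectKBest(f_classif, k=20)),
                 ('clf', LogisticRegression(max_iter=1000))])
clean_auc = cross_val_score(pipe, X_noise, y_noise, cv=5, scoring='roc_auc').mean()

print(f"Leaky CV AUC: {leaky_auc:.2f}")   # typically inflated well above 0.5
print(f"Clean CV AUC: {clean_auc:.2f}")   # typically near chance, as it should be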
6.8 Clinical Applications
Clinical Context: Let’s put it all together with a realistic example—predicting 30-day hospital readmissions using EHR data.
6.8.1 Feature Engineering for Clinical Data
Clinical prediction often involves substantial feature engineering:
import pandas as pd
import numpy as np
# Simulate realistic EHR data
np.random.seed(42)
n_patients = 2000
ehr_data = pd.DataFrame({
'patient_id': range(n_patients),
'age': np.random.normal(68, 15, n_patients).clip(18, 100),
'gender': np.random.choice(['M', 'F'], n_patients),
'race': np.random.choice(['White', 'Black', 'Asian', 'Hispanic', 'Other'],
n_patients, p=[0.6, 0.2, 0.1, 0.08, 0.02]),
'insurance': np.random.choice(['Medicare', 'Medicaid', 'Commercial', 'Self'],
n_patients, p=[0.5, 0.2, 0.25, 0.05]),
'length_of_stay': np.random.exponential(5, n_patients).clip(1, 30),
'num_prior_admissions_12mo': np.random.poisson(0.8, n_patients),
'num_ed_visits_12mo': np.random.poisson(1.2, n_patients),
'charlson_comorbidity_index': np.random.poisson(2, n_patients).clip(0, 10),
'num_medications': np.random.poisson(8, n_patients),
'discharge_disposition': np.random.choice(['Home', 'SNF', 'Rehab', 'Home Health'],
n_patients, p=[0.5, 0.25, 0.15, 0.1]),
'hemoglobin_discharge': np.random.normal(11, 2, n_patients).clip(5, 18),
'creatinine_discharge': np.random.exponential(1.2, n_patients).clip(0.5, 10),
'sodium_discharge': np.random.normal(138, 4, n_patients),
})
# Add some missing values (realistic)
ehr_data.loc[np.random.choice(n_patients, 100, replace=False), 'hemoglobin_discharge'] = np.nan
ehr_data.loc[np.random.choice(n_patients, 50, replace=False), 'creatinine_discharge'] = np.nan
# Generate outcome (30-day readmission)
readmit_prob = 1 / (1 + np.exp(-(
-3
+ 0.02 * ehr_data['age']
+ 0.4 * ehr_data['num_prior_admissions_12mo']
+ 0.2 * ehr_data['num_ed_visits_12mo']
+ 0.15 * ehr_data['charlson_comorbidity_index']
+ 0.05 * ehr_data['length_of_stay']
+ 0.5 * (ehr_data['discharge_disposition'] == 'SNF')
+ 0.3 * (ehr_data['creatinine_discharge'].fillna(1.2) > 1.5)
- 0.2 * (ehr_data['hemoglobin_discharge'].fillna(11) > 10)
)))
ehr_data['readmitted_30d'] = np.random.binomial(1, readmit_prob)
print(f"Dataset shape: {ehr_data.shape}")
print(f"Readmission rate: {ehr_data['readmitted_30d'].mean():.1%}")
print(f"\nMissing values:\n{ehr_data.isnull().sum()[ehr_data.isnull().sum() > 0]}")6.8.2 Preprocessing Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# Define feature groups
numeric_features = ['age', 'length_of_stay', 'num_prior_admissions_12mo',
'num_ed_visits_12mo', 'charlson_comorbidity_index',
'num_medications', 'hemoglobin_discharge',
'creatinine_discharge', 'sodium_discharge']
categorical_features = ['gender', 'race', 'insurance', 'discharge_disposition']
# Create preprocessing pipeline
preprocessor = ColumnTransformer(
transformers=[
('num', Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
]), numeric_features),
('cat', Pipeline([
('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
('onehot', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))
]), categorical_features)
])
# Prepare data
X = ehr_data[numeric_features + categorical_features]
y = ehr_data['readmitted_30d']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set: {X_train.shape[0]} patients")
print(f"Test set: {X_test.shape[0]} patients")6.8.3 Model Comparison
from sklearn.metrics import roc_auc_score, average_precision_score
# Define models to compare
models = {
'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'XGBoost': XGBClassifier(n_estimators=100, random_state=42,
use_label_encoder=False, eval_metric='auc')
}
# Evaluate each model
results = []
for name, model in models.items():
# Create full pipeline
pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', model)
])
# Cross-validation
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='roc_auc')
# Fit on full training set and evaluate on test
pipeline.fit(X_train, y_train)
y_prob = pipeline.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, y_prob)
test_ap = average_precision_score(y_test, y_prob)
results.append({
'Model': name,
'CV AUC': f"{cv_scores.mean():.3f} (+/- {cv_scores.std()*2:.3f})",
'Test AUC': test_auc,
'Test AP': test_ap
})
results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))
6.8.4 Feature Importance Analysis
# Get feature names after preprocessing
pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
pipeline.fit(X_train, y_train)
# Extract feature names
cat_encoder = pipeline.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot']
cat_feature_names = cat_encoder.get_feature_names_out(categorical_features)
all_feature_names = numeric_features + list(cat_feature_names)
# Get importances
importances = pipeline.named_steps['classifier'].feature_importances_
# Display top features
importance_df = pd.DataFrame({
'Feature': all_feature_names,
'Importance': importances
}).sort_values('Importance', ascending=False)
print("Top 10 Most Important Features:")
print(importance_df.head(10).to_string(index=False))
6.8.5 Handling Class Imbalance
Readmission rates are often 10-15%—moderately imbalanced. Strategies:
from sklearn.utils.class_weight import compute_class_weight
# Option 1: Class weights
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
weight_dict = dict(zip(np.unique(y_train), class_weights))
print(f"Class weights: {weight_dict}")
# Use in model
rf_weighted = RandomForestClassifier(
n_estimators=100,
class_weight='balanced', # Or pass weight_dict
random_state=42
)
# Option 2: Threshold adjustment
# Instead of default 0.5 threshold, choose based on desired sensitivity
pipeline.fit(X_train, y_train)
y_prob = pipeline.predict_proba(X_test)[:, 1]
for threshold in [0.3, 0.4, 0.5]:
y_pred = (y_prob >= threshold).astype(int)
tp = np.sum((y_test == 1) & (y_pred == 1))
fn = np.sum((y_test == 1) & (y_pred == 0))
fp = np.sum((y_test == 0) & (y_pred == 1))
sensitivity = tp / (tp + fn)
ppv = tp / (tp + fp) if (tp + fp) > 0 else 0
print(f"Threshold {threshold}: Sensitivity={sensitivity:.2f}, PPV={ppv:.2f}")6.9 When to Use What: A Decision Framework
6.9.1 Algorithm Selection Guide
Is your data tabular (structured rows/columns)?
├── No  → Consider deep learning (images, text, sequences)
└── Yes → Classical ML is a strong choice
    │
    ├── Need interpretability?
    │   ├── Yes → Logistic regression or decision tree
    │   └── No  → Continue below
    │
    ├── Sample size?
    │   ├── < 1,000 → Logistic regression or Random Forest
    │   ├── 1,000 - 100,000 → Random Forest or Gradient Boosting
    │   └── > 100,000 → Gradient Boosting (XGBoost/LightGBM)
    │
    └── Need a quick baseline?
        └── Random Forest with defaults
6.9.2 Classical ML vs. Deep Learning
| Factor | Classical ML | Deep Learning |
|---|---|---|
| Data type | Tabular | Images, text, sequences |
| Sample size | Works with 100s-1000s | Needs 10,000s+ |
| Training time | Minutes | Hours to days |
| Interpretability | Good (especially logistic, trees) | Poor (black box) |
| Feature engineering | Required | Learned automatically |
| Compute requirements | CPU sufficient | Often needs GPU |
6.9.3 Practical Recommendations
- Always start with a baseline: Logistic regression or random forest with defaults
- Try gradient boosting: Often the best for tabular data with minimal tuning
- Use cross-validation: Never trust single train/test split
- Consider interpretability needs: Sometimes a slightly worse but interpretable model is better
- Profile your errors: Look at which patients are misclassified—is there a pattern? (A quick sketch of this follows.)
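As one way to profile errors, the sketch below (assuming the fitted readmission pipeline, X_test, and y_test from Section 6.8 are still in scope) compares the characteristics of missed readmissions to everyone else in the test set:
# Error profiling sketch: compare feature averages for missed readmissions
# (false negatives) versus all other test patients.
import pandas as pd

y_prob = pipeline.predict_proba(X_test)[:, 1]
y_pred = (y_prob >= 0.5).astype(int)

profile = X_test.copy()
profile['false_negative'] = (y_test.values == 1) & (y_pred == 0)

numeric_cols = profile.select_dtypes('number').columns.tolist()
print(profile.groupby('false_negative')[numeric_cols].mean().round(2).T)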
6.10 Chapter Summary
Classical machine learning remains essential for medical AI:
Linear models:
- Logistic regression provides interpretable baselines
- Coefficients translate to odds ratios
- Regularization handles high-dimensional data
Decision trees:
- Mirror clinical reasoning
- Highly interpretable but unstable alone
- Foundation for ensemble methods
Ensemble methods:
- Random forests: robust, parallel, good defaults
- Gradient boosting: often best performance, more tuning required
- XGBoost/LightGBM: state of the art for tabular data
Unsupervised learning:
- PCA for visualization and preprocessing
- K-means for patient subtyping
Practical guidance:
- Pipelines prevent data leakage
- Cross-validation for model selection
- Always compare to simple baselines
For tabular clinical data—the majority of EHR-based prediction tasks—these methods often match or exceed deep learning while being faster, more interpretable, and more robust.
6.11 Exercises
1. Using the readmission dataset from this chapter, implement a complete pipeline that handles missing values, encodes categorical variables, and trains an XGBoost model. Report cross-validated AUC.
2. Compare the feature importances from logistic regression (coefficients), random forest (impurity decrease), and XGBoost (gain). Do they agree on the most important features?
3. Create a calibration curve for your best model. Is it well-calibrated? If not, implement Platt scaling or isotonic regression to improve calibration.
4. Implement a threshold selection procedure that achieves at least 80% sensitivity while maximizing PPV. What threshold does this require?
5. Use K-means clustering on the patient features (before adding the outcome). Do the discovered clusters have different readmission rates? What clinical interpretation might explain this?