from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Simulated clinical data: predicting length of stay
np.random.seed(42)
n = 500
age = np.random.normal(65, 15, n)
comorbidity_count = np.random.poisson(2, n)
admission_severity = np.random.uniform(0, 10, n)
# True relationship (with noise)
los = 2 + 0.05 * age + 0.8 * comorbidity_count + 0.3 * admission_severity + np.random.normal(0, 1, n)
X = np.column_stack([age, comorbidity_count, admission_severity])
y = los
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit linear regression
model = LinearRegression()
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
print(f"R-squared: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f} days")
print(f"Coefficients: {model.coef_}")6 Classical Machine Learning
Clinical Context: A hospital wants to predict 30-day readmissions using data from their electronic health records—demographics, vital signs, lab values, diagnoses. Should they train a deep neural network? Probably not. For structured, tabular clinical data, classical machine learning methods often outperform deep learning while being faster to train, easier to interpret, and more robust with limited data. This chapter covers the algorithms that remain workhorses of medical AI.
Deep learning dominates headlines, but classical machine learning remains essential in medical AI. On tabular data—the structured rows and columns that make up most clinical datasets—algorithms like random forests and gradient boosting frequently match or exceed deep learning performance. They train in seconds rather than hours, require less data, and produce models that clinicians can actually interpret.
This chapter covers the foundational algorithms every medical AI practitioner must understand: linear models, decision trees, ensemble methods, and unsupervised techniques. We focus on the key concepts—what each algorithm optimizes and why—while providing practical sklearn implementations you can apply immediately.
6.1 Why Classical ML Still Matters
6.1.1 The Tabular Data Advantage
Clinical data comes in many forms, but structured tables dominate: patient demographics, vital signs, laboratory values, diagnosis codes, medication lists. For this tabular data, classical ML has surprising staying power.
A 2022 benchmarking study found that gradient boosting (XGBoost, LightGBM) matched or outperformed deep learning on most tabular datasets, while training orders of magnitude faster. The reasons:
- Feature engineering matters more than model complexity: Clinical domain knowledge—knowing which lab values to combine, how to represent time since admission—often trumps architectural sophistication
- Sample sizes are often modest: Deep learning shines with millions of examples; clinical datasets often have thousands
- Relationships are often simple: Many clinical predictions involve roughly linear relationships that don’t require deep networks to capture
- Missing data is endemic: Classical methods handle missing values more gracefully than neural networks
6.1.2 Interpretability and Clinical Trust
Clinicians need to understand predictions before acting on them. A random forest can tell you which features drove a prediction. A gradient boosting model can show feature importance rankings. Logistic regression coefficients have direct clinical interpretations as odds ratios.
This interpretability isn’t just nice to have—it’s often essential for:
- Debugging unexpected predictions
- Explaining decisions to patients
- Satisfying regulatory requirements
- Building clinician trust in AI tools
6.1.3 When to Choose Classical ML
As a general framework:
| Scenario | Recommendation |
|---|---|
| Tabular data, <100K samples | Classical ML (usually gradient boosting) |
| Tabular data, >100K samples | Try both, classical often still wins |
| Images, audio, video | Deep learning |
| Free text | Deep learning (transformers) |
| Mixed data | Classical for tabular, deep for unstructured |
| Interpretability required | Classical ML |
| Need a quick baseline | Classical ML |
6.2 Linear Models
6.2.1 Linear Regression: Quick Review
Clinical Context: Predicting a patient’s length of stay or expected lab value—continuous outcomes—starts with linear regression.
Linear regression finds the best linear relationship between features \(X\) and a continuous outcome \(y\):
\[\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p = X\beta\]
The model minimizes the mean squared error:
\[\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]
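Minimizing the MSE has a closed-form solution given by the normal equations, \(\hat{\beta} = (X^\top X)^{-1} X^\top y\). The following minimal sketch (assuming the X_train and y_train arrays from the length-of-stay simulation above are in scope) checks that the closed-form estimate matches sklearn's LinearRegression:
# Normal equations: beta_hat = (X^T X)^(-1) X^T y, with an explicit intercept column.
# Assumes X_train, y_train from the simulated length-of-stay data above.
import numpy as np
from sklearn.linear_model import LinearRegression
X_design = np.column_stack([np.ones(len(X_train)), X_train])
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y_train)
sk_model = LinearRegression().fit(X_train, y_train)
print(f"Closed-form estimates: {np.round(beta_hat, 3)}")
print(f"sklearn estimates: {np.round(np.concatenate([[sk_model.intercept_], sk_model.coef_]), 3)}")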
6.2.2 Regularization: Ridge and Lasso
When you have many features—potentially more features than samples—ordinary linear regression overfits. Regularization adds a penalty for large coefficients.
Ridge regression (L2) penalizes the sum of squared coefficients:
\[\text{Loss} = \text{MSE} + \alpha \sum_j \beta_j^2\]
Lasso regression (L1) penalizes the sum of absolute coefficients:
\[\text{Loss} = \text{MSE} + \alpha \sum_j |\beta_j|\]
Key difference: Lasso can shrink coefficients exactly to zero, performing automatic feature selection. Ridge shrinks but doesn’t zero out.
# Ridge regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
print(f"Ridge R-squared: {r2_score(y_test, ridge.predict(X_test)):.3f}")
# Lasso regression
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
print(f"Lasso R-squared: {r2_score(y_test, lasso.predict(X_test)):.3f}")
print(f"Lasso non-zero coefficients: {np.sum(lasso.coef_ != 0)}")6.2.3 Logistic Regression: The Classification Foundation
Clinical Context: Will this patient be readmitted within 30 days? Does this patient have sepsis? Binary outcomes require classification, and logistic regression is the starting point.
Logistic regression extends linear regression to classification by passing the linear combination through the sigmoid function:
\[P(y=1|X) = \sigma(X\beta) = \frac{1}{1 + e^{-X\beta}}\]
The sigmoid maps any real number to a probability between 0 and 1.
What logistic regression optimizes: The model maximizes the likelihood of the observed data, which is equivalent to minimizing cross-entropy loss (log loss):
\[\text{Loss} = -\frac{1}{n}\sum_i [y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)]\]
This loss heavily penalizes confident wrong predictions—predicting 0.99 probability for a negative case incurs much more loss than predicting 0.6.
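A quick numeric check, using hypothetical predicted probabilities for a single negative case (\(y = 0\)), makes the penalty concrete:
# Cross-entropy loss for one negative case (y = 0) at two predicted probabilities.
import numpy as np
for p in [0.6, 0.99]:
    loss = -np.log(1 - p)  # only the (1 - y) term survives when y = 0
    print(f"Predicted P(y=1) = {p:.2f} -> loss = {loss:.2f}")
# 0.92 vs. 4.61: the confident mistake costs roughly five times more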
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, classification_report
from sklearn.preprocessing import StandardScaler
# Simulated readmission data
np.random.seed(42)
n = 1000
age = np.random.normal(65, 15, n)
prior_admissions = np.random.poisson(1, n)
charlson_score = np.random.poisson(2, n)
discharge_disposition = np.random.binomial(1, 0.3, n) # 1 = SNF, 0 = home
# Generate probability of readmission
log_odds = -3 + 0.02 * age + 0.5 * prior_admissions + 0.3 * charlson_score + 0.8 * discharge_disposition
prob = 1 / (1 + np.exp(-log_odds))
readmitted = np.random.binomial(1, prob)
X = np.column_stack([age, prior_admissions, charlson_score, discharge_disposition])
y = readmitted
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale features (important for regularization)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Fit logistic regression with L2 regularization
model = LogisticRegression(C=1.0, random_state=42) # C = 1/alpha
model.fit(X_train_scaled, y_train)
# Evaluate
y_prob = model.predict_proba(X_test_scaled)[:, 1]
y_pred = model.predict(X_test_scaled)
print(f"AUC: {roc_auc_score(y_test, y_prob):.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
6.2.4 Interpreting Coefficients
Logistic regression coefficients have a direct clinical interpretation. The coefficient \(\beta_j\) represents the change in log-odds for a one-unit increase in feature \(x_j\).
The odds ratio is \(e^{\beta_j}\)—more intuitive for clinicians:
import pandas as pd
# Interpret coefficients
feature_names = ['Age', 'Prior Admissions', 'Charlson Score', 'Discharge to SNF']
coefficients = model.coef_[0]
odds_ratios = np.exp(coefficients)
interpretation = pd.DataFrame({
'Feature': feature_names,
'Coefficient': coefficients,
'Odds Ratio': odds_ratios
})
print(interpretation.to_string(index=False))
# Example interpretation:
# Odds Ratio of 1.5 for "Prior Admissions" means each additional prior admission
# increases the odds of readmission by 50%
6.2.5 Regularization in High Dimensions
With many features (hundreds of diagnosis codes, thousands of medication combinations), regularization becomes essential:
from sklearn.linear_model import LogisticRegressionCV
# Automatically tune regularization strength with cross-validation
model_cv = LogisticRegressionCV(
Cs=10, # Try 10 values of C
cv=5,
scoring='roc_auc',
random_state=42
)
model_cv.fit(X_train_scaled, y_train)
print(f"Best C: {model_cv.C_[0]:.4f}")
print(f"AUC with tuned C: {roc_auc_score(y_test, model_cv.predict_proba(X_test_scaled)[:, 1]):.3f}")6.3 Support Vector Machines
Clinical Context: SVMs were state-of-the-art for many medical classification tasks before deep learning. While less common now, understanding the maximum margin concept provides useful intuition.
6.3.1 The Maximum Margin Idea
Support Vector Machines find the decision boundary that maximizes the margin—the distance to the nearest training points (the “support vectors”).
For linearly separable data, many boundaries correctly classify all points. SVM chooses the one with maximum margin, which tends to generalize better.
The optimization problem:
\[\text{Minimize } \frac{1}{2}||\mathbf{w}||^2 \text{ subject to } y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1\]
For non-separable data, we allow some misclassification with a penalty:
\[\text{Minimize } \frac{1}{2}||\mathbf{w}||^2 + C\sum_i \xi_i\]
The parameter \(C\) controls the tradeoff between margin size and classification errors.
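As a rough illustration of that tradeoff, the sketch below (assuming X_train_scaled and y_train from the readmission example in Section 6.2 are still in scope) counts support vectors at several settings of \(C\); a smaller \(C\) allows more slack and leans on more support vectors:
# Effect of C on the soft margin: count support vectors at several settings.
# Assumes X_train_scaled and y_train from the readmission example above.
from sklearn.svm import SVC
for C in [0.01, 1.0, 100.0]:
    svm = SVC(C=C, kernel='linear')
    svm.fit(X_train_scaled, y_train)
    print(f"C = {C:>6}: {svm.n_support_.sum()} support vectors")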
6.3.2 The Kernel Trick
SVMs can learn non-linear boundaries through the kernel trick—implicitly mapping data to a higher-dimensional space where a linear boundary suffices.
Common kernels:
- Linear: No transformation (for linearly separable problems)
- RBF (Radial Basis Function): Flexible non-linear boundaries
- Polynomial: Captures polynomial interactions
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# SVM with RBF kernel
svm_model = SVC(C=1.0, kernel='rbf', gamma='scale', probability=True, random_state=42)
svm_model.fit(X_train_scaled, y_train)
y_prob_svm = svm_model.predict_proba(X_test_scaled)[:, 1]
print(f"SVM AUC: {roc_auc_score(y_test, y_prob_svm):.3f}")6.3.3 When SVMs Work (and Don’t)
Strengths:
- Effective in high-dimensional spaces
- Memory efficient (only stores support vectors)
- Works well when classes are well-separated
Weaknesses:
- Slow to train on large datasets (scales poorly beyond ~10K samples)
- Requires careful feature scaling
- Less interpretable than linear models
- Often outperformed by gradient boosting on tabular data
For most clinical tabular data today, gradient boosting is preferred over SVMs.
6.4 Decision Trees
Clinical Context: Decision trees mirror clinical reasoning—a series of yes/no questions leading to a diagnosis or risk category. Their interpretability makes them valuable for explaining predictions to clinicians.
6.4.1 Recursive Partitioning
A decision tree splits data recursively based on feature values, creating a tree of decisions:
Age > 65?
├── No  → Low Risk
└── Yes → Charlson > 2?
          ├── No  → Medium Risk
          └── Yes → High Risk
At each node, the algorithm chooses the split that best separates the outcome classes.
6.4.2 Splitting Criteria
How do we measure “best separates”? Two common criteria:
Gini Impurity: Probability of misclassifying a randomly chosen sample
\[\text{Gini} = 1 - \sum_k p_k^2\]
Entropy: Information-theoretic measure of disorder
\[\text{Entropy} = -\sum_k p_k \log_2(p_k)\]
Both are minimized when a node is “pure” (all samples in one class).
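A small sketch computing both criteria for a few hypothetical two-class distributions makes this behavior concrete:
# Gini impurity and entropy for hypothetical two-class distributions.
import numpy as np

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # treat 0 * log2(0) as 0
    return -np.sum(p * np.log2(p))

for dist in [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5]]:
    print(f"p = {dist}: Gini = {gini(dist):.3f}, Entropy = {entropy(dist):.3f}")
# A pure node scores 0 on both; a 50/50 node maximizes both (0.5 and 1.0)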
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt
# Fit a decision tree
tree_model = DecisionTreeClassifier(
max_depth=3, # Limit depth for interpretability
min_samples_leaf=20, # Require at least 20 samples per leaf
random_state=42
)
tree_model.fit(X_train, y_train)
# Evaluate
y_prob_tree = tree_model.predict_proba(X_test)[:, 1]
print(f"Decision Tree AUC: {roc_auc_score(y_test, y_prob_tree):.3f}")
# Visualize the tree
plt.figure(figsize=(20, 10))
plot_tree(tree_model,
feature_names=['Age', 'Prior Admissions', 'Charlson', 'Discharge SNF'],
class_names=['No Readmit', 'Readmit'],
filled=True,
rounded=True)
plt.title('Decision Tree for 30-Day Readmission')
plt.tight_layout()
plt.show()
6.4.3 Controlling Complexity
Decision trees easily overfit—a fully grown tree can memorize the training data. Key hyperparameters to control complexity:
| Parameter | Effect |
|---|---|
| `max_depth` | Maximum tree depth (most important) |
| `min_samples_split` | Minimum samples to split a node |
| `min_samples_leaf` | Minimum samples required in a leaf |
| `max_features` | Number of features to consider per split |
# Compare trees of different depths
for depth in [2, 5, 10, None]:
tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
tree.fit(X_train, y_train)
train_auc = roc_auc_score(y_train, tree.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1])
depth_str = str(depth) if depth else "None"
print(f"Depth {depth_str:>4}: Train AUC = {train_auc:.3f}, Test AUC = {test_auc:.3f}")6.4.4 Strengths and Weaknesses
Strengths:
- Highly interpretable (can visualize the tree)
- Handles mixed feature types (numeric and categorical)
- No feature scaling required
- Captures non-linear relationships and interactions
Weaknesses:
- High variance: small data changes can produce very different trees
- Prone to overfitting
- Generally lower accuracy than ensembles
- Biased toward features with many values
The instability of single trees motivates ensemble methods.
6.5 Ensemble Methods
Clinical Context: Why trust one model when you can combine many? Ensemble methods aggregate multiple models to produce more robust predictions. Random forests and gradient boosting are the workhorses of clinical prediction.
6.5.1 Why Ensembles Work
Ensemble methods improve predictions by combining multiple “weak learners.” The key insight is bias-variance decomposition:
\[\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Noise}\]
- High bias: Model is too simple, misses patterns
- High variance: Model is too sensitive to training data
Different ensemble strategies address these differently:
- Bagging (e.g., Random Forest): Reduces variance by averaging many high-variance models
- Boosting (e.g., XGBoost): Reduces bias by sequentially correcting errors
6.5.2 Random Forests
Random forests combine two ideas:
- Bagging: Train many trees on bootstrap samples (random samples with replacement)
- Feature randomness: At each split, consider only a random subset of features
This decorrelates the trees, so their errors don’t compound when averaged.
from sklearn.ensemble import RandomForestClassifier
# Fit random forest
rf_model = RandomForestClassifier(
n_estimators=100, # Number of trees
max_depth=10, # Maximum depth per tree
min_samples_leaf=5, # Minimum samples per leaf
max_features='sqrt', # Features per split = sqrt(total features)
random_state=42,
n_jobs=-1 # Use all CPU cores
)
rf_model.fit(X_train, y_train)
# Evaluate
y_prob_rf = rf_model.predict_proba(X_test)[:, 1]
print(f"Random Forest AUC: {roc_auc_score(y_test, y_prob_rf):.3f}")Out-of-bag evaluation: Each tree is trained on ~63% of samples (due to bootstrap). The remaining ~37% can evaluate that tree, providing a free validation estimate:
# Enable out-of-bag scoring
rf_oob = RandomForestClassifier(
n_estimators=100,
oob_score=True, # Enable OOB evaluation
random_state=42,
n_jobs=-1
)
rf_oob.fit(X_train, y_train)
print(f"Out-of-bag accuracy: {rf_oob.oob_score_:.3f}")Feature importance: Random forests provide built-in feature importance based on how much each feature decreases impurity:
# Feature importance
feature_names = ['Age', 'Prior Admissions', 'Charlson Score', 'Discharge to SNF']
importances = rf_model.feature_importances_
importance_df = pd.DataFrame({
'Feature': feature_names,
'Importance': importances
}).sort_values('Importance', ascending=False)
print("Feature Importances:")
print(importance_df.to_string(index=False))
6.5.3 Gradient Boosting
Gradient boosting builds trees sequentially, with each tree correcting the errors of the ensemble so far.
The algorithm:
1. Start with a simple prediction (e.g., the mean)
2. Compute residuals (errors)
3. Fit a tree to predict the residuals
4. Add this tree to the ensemble (with a learning rate)
5. Repeat
XGBoost and LightGBM are optimized implementations that dominate tabular data competitions.
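Before turning to library implementations, a minimal from-scratch sketch of the residual-fitting loop shows how little machinery the core idea requires; it uses its own small simulated regression problem, so the names X_demo and y_demo are illustrative only:
# Minimal boosting loop for regression: each shallow tree is fit to the current
# residuals, and its shrunken predictions are added to the ensemble.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, 300)  # non-linear target

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # step 1: start from the mean
for _ in range(100):
    residuals = y_demo - prediction                                   # step 2
    tree = DecisionTreeRegressor(max_depth=2).fit(X_demo, residuals)  # step 3
    prediction += learning_rate * tree.predict(X_demo)                # step 4
print(f"Training MSE after boosting: {np.mean((y_demo - prediction) ** 2):.3f}")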
from sklearn.ensemble import GradientBoostingClassifier
# Sklearn's gradient boosting
gb_model = GradientBoostingClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=3,
random_state=42
)
gb_model.fit(X_train, y_train)
y_prob_gb = gb_model.predict_proba(X_test)[:, 1]
print(f"Gradient Boosting AUC: {roc_auc_score(y_test, y_prob_gb):.3f}")XGBoost is often preferred for its speed and additional features:
# XGBoost (install with: pip install xgboost)
from xgboost import XGBClassifier
xgb_model = XGBClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=3,
subsample=0.8, # Use 80% of samples per tree
colsample_bytree=0.8, # Use 80% of features per tree
random_state=42,
use_label_encoder=False,
eval_metric='auc'
)
xgb_model.fit(X_train, y_train)
y_prob_xgb = xgb_model.predict_proba(X_test)[:, 1]
print(f"XGBoost AUC: {roc_auc_score(y_test, y_prob_xgb):.3f}")6.5.4 Key Hyperparameters
For gradient boosting, the most important hyperparameters:
| Parameter | Effect | Typical Range |
|---|---|---|
| `n_estimators` | Number of trees | 100-1000 |
| `learning_rate` | Step size | 0.01-0.3 |
| `max_depth` | Tree depth | 3-10 |
| `subsample` | Row sampling | 0.5-1.0 |
| `colsample_bytree` | Column sampling | 0.5-1.0 |
Lower learning rate + more trees = better but slower. Start with defaults, then tune.
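A small sketch of that tradeoff on the readmission data used in this section (assuming X_train, X_test, y_train, y_test are still in scope); the exact AUCs and timings will vary:
# Learning rate vs. number of trees: a lower rate usually needs more trees
# (and more time) to reach comparable performance.
import time
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

for lr, n_trees in [(0.3, 50), (0.1, 100), (0.01, 1000)]:
    start = time.time()
    gb = GradientBoostingClassifier(n_estimators=n_trees, learning_rate=lr,
                                    max_depth=3, random_state=42)
    gb.fit(X_train, y_train)
    auc = roc_auc_score(y_test, gb.predict_proba(X_test)[:, 1])
    print(f"lr={lr:<5} trees={n_trees:<5} AUC={auc:.3f} time={time.time() - start:.1f}s")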
6.5.5 Random Forest vs. Gradient Boosting
| Aspect | Random Forest | Gradient Boosting |
|---|---|---|
| Training | Parallelizable, fast | Sequential, slower |
| Overfitting | Resistant | More prone |
| Tuning | Few hyperparameters | Many hyperparameters |
| Performance | Very good | Often slightly better |
| Default choice | Great starting point | For squeezing out performance |
6.6 Unsupervised Learning Essentials
Clinical Context: Not all learning involves predicting outcomes. Sometimes we want to find structure in data—patient subtypes, latent variables, or reduced-dimension representations for visualization.
6.6.1 Principal Component Analysis (PCA)
PCA finds the directions of maximum variance in high-dimensional data, enabling:
- Dimensionality reduction: Compress many features into fewer
- Visualization: Project high-dimensional data to 2D or 3D
- Preprocessing: Decorrelate features before modeling
PCA finds orthogonal axes (principal components) ordered by variance explained:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Create higher-dimensional data for demonstration
np.random.seed(42)
n = 500
# 10 features with some correlation structure
X_highdim = np.random.randn(n, 10)
X_highdim[:, 1] = X_highdim[:, 0] + np.random.randn(n) * 0.5 # Correlated
X_highdim[:, 2] = X_highdim[:, 0] - X_highdim[:, 1] + np.random.randn(n) * 0.3
# Standardize (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_highdim)
# Fit PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
# Variance explained
print("Variance explained by each component:")
for i, var in enumerate(pca.explained_variance_ratio_[:5]):
print(f" PC{i+1}: {var:.3f} ({var*100:.1f}%)")
print(f"\nTotal variance in first 3 components: {sum(pca.explained_variance_ratio_[:3]):.3f}")Using PCA for visualization:
import matplotlib.pyplot as plt
# Reduce to 2 dimensions for visualization
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)
# Create synthetic labels for visualization
labels = (X_highdim[:, 0] > 0).astype(int)
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='coolwarm', alpha=0.6)
plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]*100:.1f}% variance)')
plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]*100:.1f}% variance)')
plt.title('PCA Visualization')
plt.colorbar(scatter)
plt.show()
6.6.2 K-Means Clustering
K-means partitions data into K clusters by minimizing within-cluster variance. Clinical applications include:
- Patient subtyping (phenotyping)
- Identifying treatment response groups
- Anomaly detection
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Fit K-means with different numbers of clusters
for k in [2, 3, 4, 5]:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_scaled)
# Silhouette score: measure of cluster quality (-1 to 1, higher is better)
silhouette = silhouette_score(X_scaled, cluster_labels)
inertia = kmeans.inertia_ # Within-cluster sum of squares
print(f"K={k}: Silhouette={silhouette:.3f}, Inertia={inertia:.1f}")Choosing K: Use the elbow method (plot inertia vs. K) or silhouette score. In clinical contexts, domain knowledge often guides the choice.
# Visualize clusters in PCA space
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X_scaled)
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=cluster_labels, cmap='viridis', alpha=0.6)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('K-Means Clusters (K=3)')
plt.colorbar(scatter, label='Cluster')
plt.show()
6.6.3 Limitations and Alternatives
K-means limitations:
- Assumes spherical clusters of similar size
- Sensitive to initialization (use n_init > 1)
- Must specify K in advance
- Sensitive to outliers
Alternatives:
- Hierarchical clustering: No need to specify K, produces dendrogram
- DBSCAN: Finds clusters of arbitrary shape, identifies outliers
- Gaussian Mixture Models: Soft clustering with probability assignments
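A quick sketch of the latter two on the same standardized features (X_scaled); the DBSCAN eps value here is a hypothetical starting point and is data-dependent:
# DBSCAN labels outliers as -1; a Gaussian mixture gives soft cluster probabilities.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture

db = DBSCAN(eps=1.5, min_samples=10).fit(X_scaled)  # eps must be tuned per dataset
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(f"DBSCAN: {n_clusters} clusters, {np.sum(db.labels_ == -1)} points flagged as outliers")

gmm = GaussianMixture(n_components=3, random_state=42).fit(X_scaled)
probs = gmm.predict_proba(X_scaled)
print(f"GMM: first sample's cluster membership probabilities = {np.round(probs[0], 2)}")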
6.7 Model Selection and Hyperparameter Tuning
Clinical Context: You’ve tried logistic regression, random forest, and XGBoost. How do you choose? And once you’ve chosen, how do you find the best hyperparameters?
6.7.1 The Hyperparameter Tuning Problem
Model hyperparameters (learning rate, tree depth, regularization strength) control model complexity. Finding good values is essential but tricky:
- Too simple (high regularization, shallow trees): Underfitting
- Too complex (low regularization, deep trees): Overfitting
We can’t use test data for tuning—that would leak information. We need a validation strategy.
6.7.2 Grid Search vs. Random Search
Grid search tries all combinations of specified hyperparameter values:
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [3, 5, 7],
'learning_rate': [0.01, 0.1, 0.2]
}
# Grid search with cross-validation
grid_search = GridSearchCV(
XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='auc'),
param_grid,
cv=5,
scoring='roc_auc',
n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV AUC: {grid_search.best_score_:.3f}")Random search samples random combinations, often finding good solutions faster:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint
# Define parameter distributions
param_distributions = {
'n_estimators': randint(50, 300),
'max_depth': randint(2, 10),
'learning_rate': uniform(0.01, 0.3),
'subsample': uniform(0.6, 0.4)
}
# Random search
random_search = RandomizedSearchCV(
XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='auc'),
param_distributions,
n_iter=20, # Try 20 random combinations
cv=5,
scoring='roc_auc',
random_state=42,
n_jobs=-1
)
random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV AUC: {random_search.best_score_:.3f}")6.7.3 Sklearn Pipelines
Pipelines chain preprocessing and modeling steps, ensuring proper handling of train/test splits:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
# Create pipeline: impute missing → scale → model
pipeline = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler()),
('classifier', RandomForestClassifier(random_state=42))
])
# Now cross-validation applies each step correctly
# (imputer and scaler fit only on training folds)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='roc_auc')
print(f"Pipeline CV AUC: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")Tuning hyperparameters within a pipeline:
# Parameter names use step__parameter format
pipe_param_grid = {
'classifier__n_estimators': [50, 100],
'classifier__max_depth': [5, 10],
'imputer__strategy': ['mean', 'median']
}
pipe_search = GridSearchCV(pipeline, pipe_param_grid, cv=5, scoring='roc_auc')
pipe_search.fit(X_train, y_train)
print(f"Best pipeline parameters: {pipe_search.best_params_}")6.7.4 Avoiding Data Leakage
Common leakage mistakes and fixes, with a demonstration of the feature-selection case after the table:
| Mistake | Fix |
|---|---|
| Scaling before splitting | Use pipeline or scale after split |
| Imputing with global statistics | Impute within pipeline |
| Feature selection on all data | Feature selection within pipeline |
| Tuning on test set | Use CV for tuning, hold out test |
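To see why this matters, here is a small self-contained demonstration on pure noise; the names X_noise and y_noise are illustrative, and the exact numbers will vary:
# With purely random features, selecting the "best" features on ALL the data before
# cross-validation inflates the AUC, while doing the selection inside a pipeline
# (refit on each training fold only) correctly reports roughly chance performance.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)
X_noise = rng.normal(size=(200, 2000))   # random features, no real signal
y_noise = rng.integers(0, 2, size=200)

# Wrong: feature selection sees the labels of every sample, including future test folds
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_auc = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y_noise,
                            cv=5, scoring='roc_auc').mean()

# Right: selection happens inside the pipeline, so it is refit within each fold
pipe = Pipeline([('select', SelectKBest(f_classif, k=20)),
                 ('clf', LogisticRegression(max_iter=1000))])
clean_auc = cross_val_score(pipe, X_noise, y_noise, cv=5, scoring='roc_auc').mean()

print(f"Leaky CV AUC: {leaky_auc:.2f}")   # typically inflated well above 0.5
print(f"Clean CV AUC: {clean_auc:.2f}")   # typically near chance, as it should be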
6.8 Clinical Applications
Clinical Context: Let’s put it all together with a realistic example—predicting 30-day hospital readmissions using EHR data.
6.8.1 Feature Engineering for Clinical Data
Clinical prediction often involves substantial feature engineering:
import pandas as pd
import numpy as np
# Simulate realistic EHR data
np.random.seed(42)
n_patients = 2000
ehr_data = pd.DataFrame({
'patient_id': range(n_patients),
'age': np.random.normal(68, 15, n_patients).clip(18, 100),
'gender': np.random.choice(['M', 'F'], n_patients),
'race': np.random.choice(['White', 'Black', 'Asian', 'Hispanic', 'Other'],
n_patients, p=[0.6, 0.2, 0.1, 0.08, 0.02]),
'insurance': np.random.choice(['Medicare', 'Medicaid', 'Commercial', 'Self'],
n_patients, p=[0.5, 0.2, 0.25, 0.05]),
'length_of_stay': np.random.exponential(5, n_patients).clip(1, 30),
'num_prior_admissions_12mo': np.random.poisson(0.8, n_patients),
'num_ed_visits_12mo': np.random.poisson(1.2, n_patients),
'charlson_comorbidity_index': np.random.poisson(2, n_patients).clip(0, 10),
'num_medications': np.random.poisson(8, n_patients),
'discharge_disposition': np.random.choice(['Home', 'SNF', 'Rehab', 'Home Health'],
n_patients, p=[0.5, 0.25, 0.15, 0.1]),
'hemoglobin_discharge': np.random.normal(11, 2, n_patients).clip(5, 18),
'creatinine_discharge': np.random.exponential(1.2, n_patients).clip(0.5, 10),
'sodium_discharge': np.random.normal(138, 4, n_patients),
})
# Add some missing values (realistic)
ehr_data.loc[np.random.choice(n_patients, 100, replace=False), 'hemoglobin_discharge'] = np.nan
ehr_data.loc[np.random.choice(n_patients, 50, replace=False), 'creatinine_discharge'] = np.nan
# Generate outcome (30-day readmission)
readmit_prob = 1 / (1 + np.exp(-(
-3
+ 0.02 * ehr_data['age']
+ 0.4 * ehr_data['num_prior_admissions_12mo']
+ 0.2 * ehr_data['num_ed_visits_12mo']
+ 0.15 * ehr_data['charlson_comorbidity_index']
+ 0.05 * ehr_data['length_of_stay']
+ 0.5 * (ehr_data['discharge_disposition'] == 'SNF')
+ 0.3 * (ehr_data['creatinine_discharge'].fillna(1.2) > 1.5)
- 0.2 * (ehr_data['hemoglobin_discharge'].fillna(11) > 10)
)))
ehr_data['readmitted_30d'] = np.random.binomial(1, readmit_prob)
print(f"Dataset shape: {ehr_data.shape}")
print(f"Readmission rate: {ehr_data['readmitted_30d'].mean():.1%}")
print(f"\nMissing values:\n{ehr_data.isnull().sum()[ehr_data.isnull().sum() > 0]}")6.8.2 Preprocessing Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# Define feature groups
numeric_features = ['age', 'length_of_stay', 'num_prior_admissions_12mo',
'num_ed_visits_12mo', 'charlson_comorbidity_index',
'num_medications', 'hemoglobin_discharge',
'creatinine_discharge', 'sodium_discharge']
categorical_features = ['gender', 'race', 'insurance', 'discharge_disposition']
# Create preprocessing pipeline
preprocessor = ColumnTransformer(
transformers=[
('num', Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
]), numeric_features),
('cat', Pipeline([
('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
('onehot', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))
]), categorical_features)
])
# Prepare data
X = ehr_data[numeric_features + categorical_features]
y = ehr_data['readmitted_30d']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set: {X_train.shape[0]} patients")
print(f"Test set: {X_test.shape[0]} patients")6.8.3 Model Comparison
from sklearn.metrics import roc_auc_score, average_precision_score
# Define models to compare
models = {
'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'XGBoost': XGBClassifier(n_estimators=100, random_state=42,
use_label_encoder=False, eval_metric='auc')
}
# Evaluate each model
results = []
for name, model in models.items():
# Create full pipeline
pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', model)
])
# Cross-validation
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='roc_auc')
# Fit on full training set and evaluate on test
pipeline.fit(X_train, y_train)
y_prob = pipeline.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, y_prob)
test_ap = average_precision_score(y_test, y_prob)
results.append({
'Model': name,
'CV AUC': f"{cv_scores.mean():.3f} (+/- {cv_scores.std()*2:.3f})",
'Test AUC': test_auc,
'Test AP': test_ap
})
results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))
6.8.4 Feature Importance Analysis
# Get feature names after preprocessing
pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
pipeline.fit(X_train, y_train)
# Extract feature names
cat_encoder = pipeline.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot']
cat_feature_names = cat_encoder.get_feature_names_out(categorical_features)
all_feature_names = numeric_features + list(cat_feature_names)
# Get importances
importances = pipeline.named_steps['classifier'].feature_importances_
# Display top features
importance_df = pd.DataFrame({
'Feature': all_feature_names,
'Importance': importances
}).sort_values('Importance', ascending=False)
print("Top 10 Most Important Features:")
print(importance_df.head(10).to_string(index=False))
6.8.5 Handling Class Imbalance
Readmission rates are often 10-15%—moderately imbalanced. Strategies:
from sklearn.utils.class_weight import compute_class_weight
# Option 1: Class weights
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
weight_dict = dict(zip(np.unique(y_train), class_weights))
print(f"Class weights: {weight_dict}")
# Use in model
rf_weighted = RandomForestClassifier(
n_estimators=100,
class_weight='balanced', # Or pass weight_dict
random_state=42
)
# Option 2: Threshold adjustment
# Instead of default 0.5 threshold, choose based on desired sensitivity
pipeline.fit(X_train, y_train)
y_prob = pipeline.predict_proba(X_test)[:, 1]
for threshold in [0.3, 0.4, 0.5]:
y_pred = (y_prob >= threshold).astype(int)
tp = np.sum((y_test == 1) & (y_pred == 1))
fn = np.sum((y_test == 1) & (y_pred == 0))
fp = np.sum((y_test == 0) & (y_pred == 1))
sensitivity = tp / (tp + fn)
ppv = tp / (tp + fp) if (tp + fp) > 0 else 0
print(f"Threshold {threshold}: Sensitivity={sensitivity:.2f}, PPV={ppv:.2f}")6.9 When to Use What: A Decision Framework
6.9.1 Algorithm Selection Guide
Is your data tabular (structured rows/columns)?
├── No  → Consider deep learning (images, text, sequences)
└── Yes → Classical ML is a strong choice
    │
    ├── Need interpretability?
    │   ├── Yes → Logistic regression or decision tree
    │   └── No  → Continue below
    │
    ├── Sample size?
    │   ├── < 1,000 → Logistic regression or Random Forest
    │   ├── 1,000 - 100,000 → Random Forest or Gradient Boosting
    │   └── > 100,000 → Gradient Boosting (XGBoost/LightGBM)
    │
    └── Need a quick baseline?
        └── Random Forest with defaults
6.9.2 Classical ML vs. Deep Learning
| Factor | Classical ML | Deep Learning |
|---|---|---|
| Data type | Tabular | Images, text, sequences |
| Sample size | Works with 100s-1000s | Needs 10,000s+ |
| Training time | Minutes | Hours to days |
| Interpretability | Good (especially logistic, trees) | Poor (black box) |
| Feature engineering | Required | Learned automatically |
| Compute requirements | CPU sufficient | Often needs GPU |
6.9.3 Practical Recommendations
- Always start with a baseline: Logistic regression or random forest with defaults
- Try gradient boosting: Often the best for tabular data with minimal tuning
- Use cross-validation: Never trust single train/test split
- Consider interpretability needs: Sometimes a slightly worse but interpretable model is better
- Profile your errors: Look at which patients are misclassified—is there a pattern? (A quick sketch of this follows.)
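As one way to profile errors, the sketch below (assuming the fitted readmission pipeline, X_test, and y_test from Section 6.8 are still in scope) compares the characteristics of missed readmissions to everyone else in the test set:
# Error profiling sketch: compare feature averages for missed readmissions
# (false negatives) versus all other test patients.
import pandas as pd

y_prob = pipeline.predict_proba(X_test)[:, 1]
y_pred = (y_prob >= 0.5).astype(int)

profile = X_test.copy()
profile['false_negative'] = (y_test.values == 1) & (y_pred == 0)

numeric_cols = profile.select_dtypes('number').columns.tolist()
print(profile.groupby('false_negative')[numeric_cols].mean().round(2).T)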
6.10 Chapter Summary
Classical machine learning remains essential for medical AI:
Linear models:
- Logistic regression provides interpretable baselines
- Coefficients translate to odds ratios
- Regularization handles high-dimensional data
Decision trees:
- Mirror clinical reasoning
- Highly interpretable but unstable alone
- Foundation for ensemble methods
Ensemble methods:
- Random forests: robust, parallel, good defaults
- Gradient boosting: often best performance, more tuning required
- XGBoost/LightGBM: state of the art for tabular data
Unsupervised learning:
- PCA for visualization and preprocessing
- K-means for patient subtyping
Practical guidance:
- Pipelines prevent data leakage
- Cross-validation for model selection
- Always compare to simple baselines
For tabular clinical data—the majority of EHR-based prediction tasks—these methods often match or exceed deep learning while being faster, more interpretable, and more robust.
6.11 Exercises
1. Using the readmission dataset from this chapter, implement a complete pipeline that handles missing values, encodes categorical variables, and trains an XGBoost model. Report cross-validated AUC.
2. Compare the feature importances from logistic regression (coefficients), random forest (impurity decrease), and XGBoost (gain). Do they agree on the most important features?
3. Create a calibration curve for your best model. Is it well-calibrated? If not, implement Platt scaling or isotonic regression to improve calibration.
4. Implement a threshold selection procedure that achieves at least 80% sensitivity while maximizing PPV. What threshold does this require?
5. Use K-means clustering on the patient features (before adding the outcome). Do the discovered clusters have different readmission rates? What clinical interpretation might explain this?