2  Framing AI & Machine Learning Problems

Before diving into algorithms and code, we need a shared mental model. This chapter establishes the vocabulary and conceptual framework that underlies everything that follows. Whether you’re building a classical logistic regression model or fine-tuning a large language model, the same fundamental structure applies: data goes in, a learned function transforms it, and predictions come out.

2.1 The Terminology Landscape

Clinical Context: A hospital administrator asks whether the new “AI system” for radiology uses “machine learning” or “deep learning” or is “like ChatGPT.” These terms get used interchangeably in the press, but they have specific meanings. Understanding the hierarchy helps you evaluate claims and choose appropriate tools.

2.1.1 AI, ML, Deep Learning, and Beyond

These terms nest inside each other like Russian dolls:

Artificial Intelligence (AI) is the broadest term—the goal of creating systems that exhibit intelligent behavior. This includes everything from 1980s expert systems (hand-coded rules) to modern neural networks. AI is the aspiration; the methods vary.

Machine Learning (ML) is a subset of AI where systems learn from data rather than being explicitly programmed. Instead of a human writing rules like “if temperature > 38°C and WBC > 12,000, consider infection,” the algorithm discovers patterns from labeled examples. ML is defined by learning from data.

Deep Learning is a subset of ML using neural networks with many layers. These “deep” networks can learn hierarchical representations—edges combine into textures, textures into shapes, shapes into objects. Deep learning dominates modern computer vision and natural language processing.

Generative AI refers to models that generate new content—text, images, audio—rather than just classifying or predicting. Large language models (LLMs) like GPT-4 and Claude are generative AI: given a prompt, they generate a response token by token.

Large Language Models (LLMs) are a specific type of generative AI trained on massive text corpora. They’re deep learning models (specifically, transformers) trained to predict the next token. This simple objective, at sufficient scale, produces systems that can write, summarize, translate, and reason.

┌───────────────────────────────────────────────────┐
│              Artificial Intelligence              │
│  ┌─────────────────────────────────────────────┐  │
│  │              Machine Learning               │  │
│  │  ┌───────────────────────────────────────┐  │  │
│  │  │             Deep Learning             │  │  │
│  │  │  ┌─────────────────────────────────┐  │  │  │
│  │  │  │      Generative AI / LLMs       │  │  │  │
│  │  │  └─────────────────────────────────┘  │  │  │
│  │  └───────────────────────────────────────┘  │  │
│  └─────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────┘

2.1.2 Where Data Science Fits

Data Science is the practice of extracting insights from data. It encompasses statistics, ML, visualization, and domain expertise. A data scientist might use ML models, but also simple descriptive statistics, SQL queries, and well-designed charts. Data science is the job; ML is one tool in the toolbox.

In clinical settings, most impactful work combines all of these: exploratory analysis to understand your data, classical statistics to quantify uncertainty, ML models for prediction, and domain expertise to ensure clinical relevance.

2.1.3 The Historical Arc

Understanding where we are historically helps contextualize the tools:

1950s–1980s: Symbolic AI — Hand-coded rules and expert systems. “If symptom A and symptom B, then diagnosis C.” Required humans to encode all knowledge explicitly.

1990s–2010s: Classical ML — Algorithms that learn patterns from data. Support vector machines, random forests, gradient boosting. Required humans to engineer features (what to measure), but learned the decision boundaries automatically.

2012–2020: Deep Learning Revolution — Neural networks that learn features directly from raw data. Enabled by GPUs and large datasets. Transformed computer vision (ImageNet moment in 2012) and then NLP.

2020–Present: Foundation Models — Massive models pretrained on internet-scale data, then adapted to specific tasks. GPT, BERT, and domain-specific variants. The “pretrain then fine-tune” paradigm democratizes access to powerful models.

We’re in the foundation model era, but all the earlier approaches remain valuable. Classical ML often outperforms deep learning on small tabular datasets. The right tool depends on your data and problem.

2.2 The Anatomy of Every ML Problem

Clinical Context: Whether you’re predicting sepsis from vital signs or generating radiology reports from images, every ML system has the same fundamental structure. Understanding this structure helps you frame new problems and evaluate existing solutions.

2.2.1 Inputs, Models, Outputs

Every machine learning system can be understood as:

        ┌─────────────────────┐
Input → │   Learned Function  │ → Output
  X     │      f(X) = Y       │     Y
        └─────────────────────┘

Inputs (X): The data you feed to the model. This could be:

- A chest X-ray image (matrix of pixel values)
- A clinical note (sequence of words)
- Lab values and vitals (table of numbers)
- A combination of all three (multimodal)

The Model (f): A learned function that maps inputs to outputs. During training, the model adjusts its internal parameters to minimize prediction errors on training data. The model is essentially a “fancy nonlinear function approximator”—it learns complex, nonlinear relationships between inputs and outputs that would be impossible to specify by hand.

Outputs (Y): What the model produces:

- A classification: “pneumonia” vs. “normal”
- A probability: “73% probability of readmission”
- A continuous value: “predicted length of stay: 4.2 days”
- A segmentation: pixel-by-pixel tumor boundaries
- Generated text: a draft clinical note
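
To make the pattern concrete, here is a minimal sketch using scikit-learn and synthetic data: two numeric features go in, a logistic regression is fit as the learned function f, and both a class label and a probability come out.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Inputs X: 200 synthetic "patients", each with two numeric features
X = rng.normal(size=(200, 2))
# Outputs Y: a synthetic binary label loosely driven by the features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression()       # the function f, before learning
model.fit(X, y)                    # learn parameters from (input, output) pairs

x_new = np.array([[1.2, -0.3]])    # a new, unseen input
print(model.predict(x_new))        # class output, e.g. [1]
print(model.predict_proba(x_new))  # probability output for each class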

2.2.2 Multi-Channel and Multimodal Inputs

Inputs are often richer than a single data source:

Multi-channel: A single modality with multiple components. A color image has three channels (red, green, blue). A CT scan is a 3D volume. An ECG has 12 leads.
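
As a quick illustration of multi-channel shapes, here is how these inputs might look as PyTorch tensors (the exact dimensions are illustrative; conventions vary by library and dataset):

import torch

rgb_image = torch.zeros(3, 224, 224)      # color image: 3 channels x height x width
ct_volume = torch.zeros(1, 64, 512, 512)  # CT: 1 channel x depth x height x width
ecg = torch.zeros(12, 5000)               # ECG: 12 leads x time samples

print(rgb_image.shape, ct_volume.shape, ecg.shape)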

Multimodal: Multiple distinct data types combined. An image plus the patient’s age and sex. A clinical note plus lab values. Modern clinical AI increasingly fuses multiple modalities because clinicians don’t make decisions from single data sources—neither should models.

# Conceptual example: multimodal input
# (ResNet18 and ClinicalBERT stand in for real pretrained encoders;
# the feature sizes 64, 768, and 64 are illustrative)
import torch
import torch.nn as nn

class MultimodalModel(nn.Module):
    def __init__(self):
        super().__init__()  # required before registering submodules
        self.image_encoder = ResNet18()       # Process chest X-ray
        self.text_encoder = ClinicalBERT()    # Process clinical notes
        self.tabular_encoder = nn.Linear(10, 64)  # Process lab values
        self.fusion = nn.Linear(64 + 768 + 64, 1)  # Combine and predict

    def forward(self, image, text, labs):
        img_features = self.image_encoder(image)
        text_features = self.text_encoder(text)
        lab_features = self.tabular_encoder(labs)
        # Concatenate along the feature dimension for batched inputs
        combined = torch.cat([img_features, text_features, lab_features], dim=1)
        return self.fusion(combined)

2.2.3 The Model as Function Approximator

A key insight: neural networks are universal function approximators. Given enough capacity and data, they can learn arbitrarily complex functions. You don’t need to specify the mathematical form—the network discovers it from examples.
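
A small demonstration, using scikit-learn’s MLPRegressor on synthetic data: the network learns to approximate sin(x) purely from examples, with no formula specified (the hyperparameters here are arbitrary illustrative choices).

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X).ravel()   # the "unknown" function the network must discover

mlp = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
mlp.fit(X, y)

print(mlp.predict([[1.0]]))  # approximately sin(1.0) ≈ 0.84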

This is both powerful and dangerous:

- Powerful: Can capture relationships too complex for humans to specify
- Dangerous: Will learn any pattern that predicts the outcome, including spurious correlations and shortcuts

A chest X-ray model might learn that portable X-rays (taken at bedside) predict worse outcomes—not because the image quality matters, but because sicker patients get portable studies. The model found a shortcut. This is why understanding your data matters as much as understanding your model.

2.3 Learning Paradigms

Clinical Context: You have 10,000 chest X-rays. Some are labeled with diagnoses, some aren’t. How you use these labels—or lack thereof—determines your learning paradigm. Different paradigms suit different situations.

2.3.1 Supervised Learning

The most common paradigm: learn from labeled examples.

Setup: Training data consists of (input, output) pairs. The model learns to map inputs to outputs by minimizing prediction error on training examples.

Examples:

- Image → Diagnosis (classification)
- Clinical variables → Mortality risk (regression)
- ECG signal → Arrhythmia type (classification)

Requirements: Labeled data. In medicine, this often means expert annotation, which is expensive and time-consuming.

Training data:  (X₁, Y₁), (X₂, Y₂), ..., (Xₙ, Yₙ)
                   ↓
Learn function f such that f(Xᵢ) ≈ Yᵢ
                   ↓
Apply to new data: f(X_new) → Y_predicted

2.3.2 Unsupervised Learning

Learn structure without labels.

Setup: Training data is just inputs—no labels. The model finds patterns, clusters, or representations in the data.

Examples:

- Clustering patients by disease phenotype
- Dimensionality reduction for visualization
- Anomaly detection (normal vs. unusual patterns)

Use in medicine: Patient stratification, discovering disease subtypes, detecting outliers in quality monitoring.
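
A minimal clustering sketch with scikit-learn’s KMeans, on synthetic data (the two “phenotypes” are simulated for illustration):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two simulated patient groups with different feature profiles
patients = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[3.0, 3.0], scale=0.5, size=(100, 2)),
])

# No labels are given; k-means recovers the two groups from structure alone
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(patients)
print(kmeans.labels_[:5], kmeans.labels_[-5:])  # cluster assignments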

2.3.3 Self-Supervised Learning

Create labels from the data itself.

Setup: Design a “pretext task” where labels can be generated automatically. Train on this task; the model learns useful representations that transfer to downstream tasks.

Examples:

- Masked language modeling: hide words, predict them from context (how BERT learns)
- Next token prediction: predict what comes next (how GPT learns)
- Contrastive learning: learn that two views of the same image are similar

Why it matters: Self-supervised learning enables pretraining on massive unlabeled datasets. The pretrained model can then be fine-tuned on small labeled datasets. This is how modern LLMs and vision models achieve strong performance—they learn general representations from billions of unlabeled examples first.
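
A toy illustration of the pretext-task idea: given an unlabeled time series, (input, target) pairs can be generated mechanically by predicting the next value from a sliding window, with no human labeling required.

import numpy as np

series = np.sin(np.linspace(0, 20, 500))  # stands in for an unlabeled signal

window = 10
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]   # each "label" is simply the next value in the series

print(X.shape, y.shape)  # (490, 10) (490,): supervision created from raw data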

2.3.4 Where LLMs Fit

Large language models combine paradigms:

  1. Self-supervised pretraining: Learn to predict next tokens on internet text (trillions of tokens, no human labels)
  2. Supervised fine-tuning: Train on (prompt, response) pairs curated by humans
  3. RLHF: Reinforcement learning from human feedback to align with preferences

The self-supervised pretraining is what makes LLMs possible—you could never manually label trillions of tokens. The model learns language structure, world knowledge, and reasoning patterns just from predicting what comes next.

2.4 The Train/Test Commandments

Clinical Context: Your model achieves 95% accuracy. Is it any good? That depends entirely on whether you evaluated on data the model has seen before. The train/test split isn’t a bureaucratic requirement—it’s the foundation of honest evaluation.

2.4.1 Why Splitting Matters

Machine learning models are powerful pattern matchers. Given enough capacity, they can memorize training data perfectly—achieving 100% accuracy by simply remembering every example. This is overfitting: the model has learned the training data specifically, not generalizable patterns.

To estimate how the model will perform on new data, you must evaluate on data it hasn’t seen during training. This is the core principle.

2.4.2 The Three-Way Split

In practice, we split data three ways:

Training set (~60-80%): What the model learns from. The model sees these examples repeatedly, adjusting parameters to minimize error.

Validation set (~10-20%): For tuning decisions. Which hyperparameters work best? When to stop training? The validation set guides these choices without contaminating the final evaluation.

Test set (~10-20%): The final exam. Touched only once, at the very end, to report performance. If you repeatedly check test set performance and adjust your approach, you’re effectively training on it.

┌─────────────────────────────────────────────────────────────┐
│                           All Data                           │
├──────────────────────┬──────────────┬───────────────────────┤
│     Training Set     │  Validation  │       Test Set        │
│   (learn from this)  │   (tune on)  │  (report this, once)  │
└──────────────────────┴──────────────┴───────────────────────┘
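
One common way to produce this three-way split is two successive calls to scikit-learn’s train_test_split (the proportions and placeholder data below are illustrative):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)  # placeholder features
y = np.arange(100) % 2             # placeholder labels

# First carve off the test set (20%), then split the rest into train/validation
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)  # 0.25 x 0.8 = 20%

print(len(X_train), len(X_val), len(X_test))  # 60, 20, 20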

2.4.3 The Cardinal Sin: Data Leakage

Data leakage occurs when information from outside the training set improperly influences the model. Forms include:

Future information: Using data that wouldn’t be available at prediction time. Predicting sepsis using labs drawn after sepsis was diagnosed. Predicting mortality using discharge disposition.

Test set contamination: Any use of test data during training or model selection. Normalizing features using statistics from the whole dataset. Selecting features based on correlation with the outcome in all data.

Duplicate patients: The same patient appearing in both training and test sets. The model learns patient-specific patterns rather than generalizable features.

Leakage produces models that look excellent in development but fail in deployment. A sepsis model with 99% AUC that uses future labs is worthless clinically—you don’t have those labs when you need to make the prediction.
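
The normalization example is worth spelling out in code, because it is easy to get wrong. A sketch of the leaky pattern and the correct pattern, on synthetic data:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.default_rng(0).normal(size=(80, 5))
X_test = np.random.default_rng(1).normal(size=(20, 5))

# Leaky: statistics computed on ALL data let test-set information in
# scaler = StandardScaler().fit(np.vstack([X_train, X_test]))

# Correct: fit on the training set only, then apply to both
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)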

2.4.4 Clinical Considerations for Splitting

Standard random splits often aren’t appropriate for clinical data:

Temporal splits: Train on earlier data, test on later data. This simulates deployment conditions where you’re predicting the future. It also catches temporal drift—changes in coding practices, patient populations, or treatment patterns over time.

Site-based splits: Train on some hospitals, test on others. This tests generalization across institutions—different EHR systems, documentation practices, and patient populations.

Patient-based splits: Ensure all data from one patient stays together. If a patient has five visits, all five should be in training or all five in test—not split across both.
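
For patient-based splits, scikit-learn’s GroupShuffleSplit keeps all rows sharing a patient identifier on the same side of the split. A minimal sketch (the visit data here is synthetic):

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(20).reshape(-1, 1)          # one row per visit
patient_ids = np.repeat(np.arange(5), 4)  # 5 patients, 4 visits each

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, groups=patient_ids))

# No patient's visits appear on both sides of the split
assert not set(patient_ids[train_idx]) & set(patient_ids[test_idx])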

# Temporal split example
# (df is assumed to be a pandas DataFrame with an 'admission_date' column)
from sklearn.model_selection import train_test_split

# Sort by date so that position in the frame reflects time
df = df.sort_values('admission_date')

# Use the last 20% as the test set (future data)
split_point = int(len(df) * 0.8)
train_val = df.iloc[:split_point]
test = df.iloc[split_point:]

# Further split train_val into train and validation
# (this inner split is random; slice by date instead if validation
# should also be strictly later than training)
train, val = train_test_split(train_val, test_size=0.2, random_state=42)

print(f"Train: {train['admission_date'].min()} to {train['admission_date'].max()}")
print(f"Test:  {test['admission_date'].min()} to {test['admission_date'].max()}")

2.5 Framing Your Clinical Problem

Clinical Context: A colleague asks “Can ML predict which patients will be readmitted?” Before discussing algorithms, you need to answer several questions. The framing determines whether the project succeeds more than the choice of model.

2.5.1 The Five Questions

Before writing any code, answer these:

1. What is your input?

What data will be available at prediction time? Be specific:

- Just vital signs? All labs? Clinical notes?
- At what time point? Admission? 24 hours in? Discharge?
- What’s the data format? Images? Structured tables? Free text?

2. What is your output?

What exactly are you predicting?

- Binary classification: yes/no, present/absent
- Multi-class: which of several categories
- Continuous value: risk score, length of stay
- Time-to-event: when will something happen
- Generated content: text, explanation

3. What labels exist?

How is the ground truth defined?

- Expert annotation (expensive, often limited)
- Outcomes from EHR (mortality, readmission—easy to get but may be noisy)
- Billing codes (available but often inaccurate)
- Consensus labels from multiple experts

Label quality limits model quality. Garbage labels in, garbage model out.

4. Who is in your data?

What population do your training examples represent?

- Single institution or multi-site?
- What time period? (Practice patterns change)
- What patient demographics?
- What’s excluded? (Missing data, consent requirements)

Your model will perform best on patients similar to training data.

5. How will this be used?

What decision does this support?

- Screening (high sensitivity, catch all cases)
- Diagnosis (high specificity, minimize false positives)
- Triage (prioritization, rank ordering)
- Prognosis (risk communication, shared decision-making)

The use case determines which errors matter most, which metrics to optimize, and what performance is “good enough.”
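
A small numeric illustration: the same predicted probabilities, cut at two different thresholds, trade sensitivity against specificity (all numbers here are synthetic).

import numpy as np

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])                        # ground truth
y_prob = np.array([0.9, 0.7, 0.4, 0.2, 0.6, 0.3, 0.2, 0.1, 0.1, 0.05])   # model output

for threshold in (0.5, 0.15):  # diagnosis-like vs. screening-like operating point
    y_pred = (y_prob >= threshold).astype(int)
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    tn = ((y_pred == 0) & (y_true == 0)).sum()
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    print(f"threshold={threshold}: "
          f"sensitivity={tp / (tp + fn):.2f}, specificity={tn / (tn + fp):.2f}")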

2.5.2 A Worked Example

Problem: Predict 30-day readmission for heart failure patients.

Input:

- Structured data: demographics, labs, vitals at discharge
- NOT post-discharge data (not available at prediction time)
- Decision: include or exclude clinical notes? (more signal but more complexity)

Output:

- Probability of readmission (0-100%)
- Allows flexible threshold based on intervention capacity

Labels:

- 30-day all-cause readmission from EHR
- Challenge: What about patients who died? Transferred? Lost to follow-up?

Population:

- Heart failure patients at one academic medical center, 2018-2022
- Question: Will this generalize to community hospitals?

Use case:

- Identify high-risk patients for transitional care program
- Limited slots, so need to prioritize
- Accept some false positives (extra follow-up calls) to catch true positives
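
As a sketch of how the label might be derived, the snippet below computes 30-day readmission from a toy admissions table. The column names (patient_id, admit_date, discharge_date) are hypothetical, and it deliberately ignores the death/transfer complications noted above.

import pandas as pd

adm = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 3],
    "admit_date": pd.to_datetime(
        ["2021-01-01", "2021-01-20", "2021-02-01", "2021-03-01", "2021-06-01"]),
    "discharge_date": pd.to_datetime(
        ["2021-01-05", "2021-01-25", "2021-02-04", "2021-03-05", "2021-06-07"]),
}).sort_values(["patient_id", "admit_date"])

# Days from each discharge to the same patient's next admission
next_admit = adm.groupby("patient_id")["admit_date"].shift(-1)
days_to_next = (next_admit - adm["discharge_date"]).dt.days

# Label is 1 if readmitted within 30 days; a missing next admission counts as 0
adm["readmit_30d"] = (days_to_next <= 30).astype(int)
print(adm)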

This framing shapes everything: what features to include, how to handle missing data, what baseline to compare against, and what performance is acceptable.

2.6 Summary: The Mental Model

Every ML system follows the same pattern:

  1. Define the problem: What are you predicting (Y) from what inputs (X)?
  2. Choose a paradigm: Supervised (labels), unsupervised (structure), self-supervised (pretraining)?
  3. Split your data: Train/validation/test, with appropriate clinical considerations
  4. Train the model: Learn the function f(X) → Y from training data
  5. Evaluate honestly: Test set, once, with metrics appropriate to your use case
  6. Consider deployment: Will it generalize? Is the performance clinically useful?

The terminology (AI, ML, deep learning, LLMs) describes nested categories of techniques. The problem framing (inputs, outputs, labels, population, use case) determines whether any technique will succeed.

With this mental model in place, you’re ready to dive into the technical details—whether that’s the classical ML algorithms in the chapters ahead or the deep learning architectures that follow.