3  Programming Fundamentals

This chapter provides the programming foundation for everything that follows. If you’ve programmed in R, MATLAB, Java, or another language, you’ll find Python familiar—the concepts transfer, and we’ll focus on what makes Python distinctive. If Python is your first language, this chapter moves quickly; consider supplementing with additional tutorials as needed.

We’ll cover four essential tools: Python itself, NumPy for numerical computing, Pandas for data manipulation, and PyTorch for deep learning. By the end, you’ll be able to load clinical data, explore it visually, and prepare it for machine learning—skills you’ll use in every subsequent chapter.

3.1 Getting Started

Clinical Context: A radiology resident wants to build a chest X-ray classifier. They have MATLAB experience from undergrad engineering courses, but the deep learning community uses Python. This chapter bridges that gap.

3.1.1 Why Python for Medical AI?

Python dominates machine learning and medical AI for several reasons:

Ecosystem: The major deep learning frameworks (PyTorch, TensorFlow) are Python-first. Libraries for data manipulation (Pandas), visualization (Matplotlib), and scientific computing (NumPy, SciPy) are mature and well-documented.

Readability: Python’s clean syntax makes code easier to read, write, and share. In collaborative research, readability matters.

Community: When you hit a problem, someone has probably solved it. Stack Overflow answers, GitHub repositories, and tutorials abound.

Integration: Python connects easily to databases, web services, and other languages. Building end-to-end pipelines is straightforward.

3.1.2 Environment Options

You need somewhere to run Python code. Two main options:

Google Colab (recommended for starting out): A free, cloud-based Jupyter notebook environment. No installation required—just open a browser. Includes free GPU access for deep learning. Perfect for learning and prototyping.

To start: Go to colab.research.google.com, sign in with a Google account, and create a new notebook.

Local installation: Install Python via Anaconda or Miniconda, then use Jupyter notebooks or an IDE like VS Code. Required for working with protected health information (PHI) that can’t go to cloud services. More setup, but more control.

For this chapter, Colab works well. Later chapters involving PHI will require local environments with appropriate security.

3.1.3 The Jupyter Notebook Workflow

Jupyter notebooks mix code, output, and text in a single document. Each code block (called a “cell”) can be run independently. This interactive style is ideal for data exploration and learning.

Key concepts:

- Cells: Individual blocks of code or text
- Run a cell: Shift+Enter (runs and moves to the next cell) or Ctrl+Enter (runs and stays)
- Kernel: The Python process running your code; restart it if things get stuck
- Output: Appears directly below the cell that produced it

# This is a code cell. Run it with Shift+Enter.
print("Hello, medical AI!")

# Variables persist across cells
patient_count = 1000

3.2 Python Essentials

Clinical Context: You’re reviewing a colleague’s code for a sepsis prediction model. You need to understand Python syntax well enough to follow the logic and suggest improvements.

This section covers Python’s core features. If you’ve programmed before, much will feel familiar—focus on Python’s specific syntax and idioms.

3.2.1 Variables and Data Types

Python is dynamically typed—you don’t declare types, and variables can change type.

# Numbers
age = 45                    # integer
temperature = 98.6          # float
complex_num = 3 + 4j        # complex (rarely used in clinical work)

# Strings
patient_name = "Jane Doe"
diagnosis = 'Pneumonia'     # single or double quotes work
note = """This is a
multi-line string"""        # triple quotes for multi-line

# Booleans
is_admitted = True
has_allergies = False

# None (like NULL in SQL, NA in R)
discharge_date = None       # patient still admitted

# Check types
print(type(age))            # <class 'int'>
print(type(temperature))    # <class 'float'>
print(type(patient_name))   # <class 'str'>
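
Dynamic typing also means a name can be rebound to a different type later, which is convenient but a common source of subtle bugs:

# The same name can hold different types over time
result = 120              # int (an SBP reading)
print(type(result))       # <class 'int'>
result = "120 mmHg"       # now a string
print(type(result))       # <class 'str'>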

3.2.2 Collections

Python has several built-in collection types:

# Lists: ordered, mutable, allows duplicates
vitals = [98.6, 72, 120, 80]  # temperature, HR, SBP, DBP
vitals.append(16)              # add respiratory rate
vitals[0] = 99.1               # modify first element
print(vitals)                  # [99.1, 72, 120, 80, 16]

# List indexing (0-based!)
print(vitals[0])               # first element: 99.1
print(vitals[-1])              # last element: 16
print(vitals[1:3])             # slice: [72, 120] (end index excluded)

# Dictionaries: key-value pairs (like hash maps)
patient = {
    "name": "Jane Doe",
    "age": 45,
    "diagnoses": ["Pneumonia", "Hypertension"],
    "admitted": True
}
print(patient["name"])         # Jane Doe
patient["room"] = 401          # add new key
print(patient.get("allergies", "None documented"))  # safe access with default

# Tuples: ordered, immutable (can't change after creation)
blood_pressure = (120, 80)     # SBP, DBP
sbp, dbp = blood_pressure      # tuple unpacking

# Sets: unordered, no duplicates
medications = {"aspirin", "lisinopril", "metformin"}
medications.add("aspirin")     # no effect, already present
print(len(medications))        # still 3

3.2.3 Control Flow

Python uses indentation (not braces) to define code blocks. This is non-negotiable—inconsistent indentation causes errors.

# If/elif/else
heart_rate = 95

if heart_rate < 60:
    status = "Bradycardia"
elif heart_rate > 100:
    status = "Tachycardia"
else:
    status = "Normal"

# f-strings (f"...") embed expressions inside {braces}
print(f"Heart rate {heart_rate}: {status}")

# For loops
diagnoses = ["Pneumonia", "CHF", "COPD"]
for diagnosis in diagnoses:
    print(f"Coding: {diagnosis}")

# Loop with index
for i, diagnosis in enumerate(diagnoses):
    print(f"{i+1}. {diagnosis}")

# While loops
attempts = 0
max_attempts = 3
while attempts < max_attempts:
    print(f"Attempt {attempts + 1}")
    attempts += 1

# List comprehensions: concise way to create lists
# Traditional loop:
lengths = []
for d in diagnoses:
    lengths.append(len(d))

# List comprehension (preferred):
lengths = [len(d) for d in diagnoses]

# With condition
long_diagnoses = [d for d in diagnoses if len(d) > 4]
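
The same comprehension syntax extends to dictionaries and sets:

# Dictionary comprehension: map each diagnosis to its name length
name_lengths = {d: len(d) for d in diagnoses}

# Set comprehension: unique first letters
first_letters = {d[0] for d in diagnoses}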

3.2.4 Functions

Functions encapsulate reusable logic:

# Basic function
def calculate_bmi(weight_kg, height_m):
    """Calculate body mass index.

    Args:
        weight_kg: Weight in kilograms
        height_m: Height in meters

    Returns:
        BMI as a float
    """
    return weight_kg / (height_m ** 2)

bmi = calculate_bmi(70, 1.75)
print(f"BMI: {bmi:.1f}")

# Default arguments
def classify_bmi(bmi, thresholds=None):
    if thresholds is None:
        thresholds = {"underweight": 18.5, "overweight": 25, "obese": 30}

    if bmi < thresholds["underweight"]:
        return "Underweight"
    elif bmi < thresholds["overweight"]:
        return "Normal"
    elif bmi < thresholds["obese"]:
        return "Overweight"
    else:
        return "Obese"

print(classify_bmi(22.9))  # Normal

# Keyword arguments
def create_patient(name, age, diagnoses=None, **kwargs):
    """Create patient dictionary with flexible additional fields."""
    patient = {"name": name, "age": age, "diagnoses": diagnoses or []}
    patient.update(kwargs)  # add any additional keyword arguments
    return patient

patient = create_patient("John Smith", 65, diagnoses=["DM2"], room=302, attending="Dr. Jones")

3.2.5 Classes

Classes bundle data and behavior. You’ll mostly read PyTorch classes rather than write your own, but understanding the basics helps:

class Patient:
    """Represents a patient in the hospital."""

    def __init__(self, name, age, mrn):
        """Initialize a new patient.

        Args:
            name: Patient's full name
            age: Age in years
            mrn: Medical record number
        """
        self.name = name
        self.age = age
        self.mrn = mrn
        self.diagnoses = []
        self.vitals_history = []

    def add_diagnosis(self, diagnosis):
        """Add a diagnosis to the patient's problem list."""
        if diagnosis not in self.diagnoses:
            self.diagnoses.append(diagnosis)

    def record_vitals(self, hr, sbp, dbp, temp):
        """Record a set of vital signs."""
        from datetime import datetime
        self.vitals_history.append({
            "timestamp": datetime.now(),
            "hr": hr, "sbp": sbp, "dbp": dbp, "temp": temp
        })

    def __repr__(self):
        """String representation for debugging."""
        return f"Patient({self.name}, {self.age}yo, MRN: {self.mrn})"

# Using the class
pt = Patient("Jane Doe", 45, "MRN001234")
pt.add_diagnosis("Community-acquired pneumonia")
pt.record_vitals(hr=88, sbp=118, dbp=72, temp=101.2)
print(pt)
print(pt.diagnoses)

3.2.6 Importing Modules

Python’s power comes from its ecosystem. Import what you need:

# Import entire module
import math
print(math.sqrt(16))

# Import with alias (conventional aliases shown)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import specific items
from datetime import datetime, timedelta
from pathlib import Path

# Import everything (generally discouraged, but sometimes used)
from math import *

3.3 NumPy: The Foundation of Numerical Computing

Clinical Context: Your sepsis prediction model takes vital signs as input—hundreds of measurements per patient. NumPy arrays make this data fast to load, transform, and feed to machine learning models.

NumPy (Numerical Python) provides the array operations that underlie all scientific computing in Python. Understanding NumPy is essential—Pandas and PyTorch are built on similar concepts.

3.3.1 Why Arrays?

Python lists are flexible but slow for numerical work. NumPy arrays are:

- Fast: Operations are implemented in C
- Memory-efficient: Homogeneous data types, no per-element overhead
- Convenient: Vectorized operations eliminate explicit loops

import numpy as np

# Python list approach (slow)
vitals_list = [98.6, 99.1, 98.9, 100.2, 99.8]
celsius_list = [(f - 32) * 5/9 for f in vitals_list]  # requires loop

# NumPy approach (fast, readable)
vitals_array = np.array([98.6, 99.1, 98.9, 100.2, 99.8])
celsius_array = (vitals_array - 32) * 5/9  # operates on entire array

print(celsius_array)

3.3.2 Creating Arrays

import numpy as np

# From Python lists
heart_rates = np.array([72, 68, 75, 80, 77])
print(heart_rates.shape)  # (5,) — 1D array with 5 elements
print(heart_rates.dtype)  # int64

# 2D array (matrix): rows are patients, columns are features
# Columns: [age, HR, SBP, DBP, temp]
patient_data = np.array([
    [45, 72, 120, 80, 98.6],
    [67, 88, 145, 92, 99.1],
    [34, 65, 110, 70, 98.2]
])
print(patient_data.shape)  # (3, 5) — 3 patients, 5 features

# Common creation functions
zeros = np.zeros((100, 10))        # 100x10 array of zeros
ones = np.ones((50, 5))            # 50x5 array of ones
empty = np.empty((10, 10))         # uninitialized (fast, but contents are arbitrary memory!)
identity = np.eye(4)               # 4x4 identity matrix

# Ranges
indices = np.arange(0, 100, 5)     # [0, 5, 10, ..., 95]
linspace = np.linspace(0, 1, 11)   # 11 evenly spaced points from 0 to 1

# Random arrays (useful for initialization, simulation)
np.random.seed(42)  # for reproducibility
random_uniform = np.random.rand(10, 5)      # uniform [0, 1)
random_normal = np.random.randn(10, 5)      # standard normal
random_integers = np.random.randint(60, 100, size=(10,))  # heart rates 60-99
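
Newer NumPy code increasingly uses the Generator API instead of the legacy global seed; a minimal sketch:

# Modern random number generation (NumPy >= 1.17)
rng = np.random.default_rng(42)               # independent, seedable generator
uniform = rng.random((10, 5))                 # uniform [0, 1)
normal = rng.normal(0, 1, size=(10, 5))       # standard normal
integers = rng.integers(60, 100, size=(10,))  # heart rates 60-99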

3.3.3 Indexing and Slicing

import numpy as np

# Sample data: 5 patients, 4 features [age, HR, SBP, DBP]
data = np.array([
    [45, 72, 120, 80],
    [67, 88, 145, 92],
    [34, 65, 110, 70],
    [52, 78, 130, 85],
    [71, 92, 155, 88]
])

# Single element
print(data[0, 1])      # first patient's heart rate: 72

# Entire row (one patient)
print(data[2])         # [34, 65, 110, 70]

# Entire column (one feature across all patients)
print(data[:, 0])      # all ages: [45, 67, 34, 52, 71]

# Slice: first 3 patients, last 2 features
print(data[:3, 2:])    # SBP and DBP for patients 0, 1, 2

# Boolean indexing (very powerful!)
elderly = data[:, 0] > 65  # boolean array: [False, True, False, False, True]
print(data[elderly])       # only patients over 65

# Combining conditions
high_risk = (data[:, 0] > 60) & (data[:, 2] > 140)  # elderly AND high BP
print(data[high_risk])

3.3.4 Broadcasting

Broadcasting is how NumPy handles operations between arrays of different shapes. It’s powerful but can be confusing at first:

import numpy as np

# Scalar broadcast: operation applied to every element
temps_f = np.array([98.6, 99.1, 100.4, 98.9])
temps_c = (temps_f - 32) * 5/9  # scalar operations broadcast

# Vector broadcast: normalizing features (z-score)
# data shape: (5, 4) — 5 patients, 4 features
data = np.array([
    [45, 72, 120, 80],
    [67, 88, 145, 92],
    [34, 65, 110, 70],
    [52, 78, 130, 85],
    [71, 92, 155, 88]
], dtype=float)

# Mean and std of each feature (column)
means = data.mean(axis=0)  # shape: (4,) — one mean per feature
stds = data.std(axis=0)    # shape: (4,)

# Normalize: (data - mean) / std
# NumPy broadcasts (4,) to match (5, 4)
normalized = (data - means) / stds
print(normalized.mean(axis=0))  # approximately [0, 0, 0, 0]
print(normalized.std(axis=0))   # approximately [1, 1, 1, 1]
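
Broadcasting aligns shapes from the trailing dimension, which is why a (4,) vector of per-feature statistics pairs cleanly with a (5, 4) array. Normalizing within each row instead requires keeping the reduced axis so the shapes still align:

# Per-patient (row-wise) normalization needs keepdims=True
row_means = data.mean(axis=1, keepdims=True)  # shape (5, 1), not (5,)
row_stds = data.std(axis=1, keepdims=True)    # shape (5, 1)
per_patient = (data - row_means) / row_stds   # (5, 1) broadcasts against (5, 4)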

3.3.5 Common Operations

import numpy as np

data = np.array([
    [45, 72, 120, 80],
    [67, 88, 145, 92],
    [34, 65, 110, 70],
    [52, 78, 130, 85],
    [71, 92, 155, 88]
], dtype=float)

# Aggregations
print(data.sum())              # total of all elements
print(data.mean())             # mean of all elements
print(data.mean(axis=0))       # mean of each column (feature)
print(data.mean(axis=1))       # mean of each row (patient)
print(data.max(axis=0))        # max of each feature
print(data.argmax(axis=0))     # index of max in each column

# Reshaping
flat = data.flatten()          # 1D array with all elements
reshaped = data.reshape(2, 10) # reshape to 2x10 (must have same total elements)
transposed = data.T            # transpose: (5, 4) -> (4, 5)

# Linear algebra
matrix = np.array([[1, 2], [3, 4]])
print(np.dot(matrix, matrix))  # matrix multiplication
print(matrix @ matrix)         # same thing, cleaner syntax
print(np.linalg.inv(matrix))   # matrix inverse
eigenvalues, eigenvectors = np.linalg.eig(matrix)

# Stacking arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.vstack([a, b]))       # vertical stack: (2, 3)
print(np.hstack([a, b]))       # horizontal stack: (6,)
print(np.concatenate([a, b]))  # same as hstack for 1D

3.4 Pandas: Working with Clinical Data

Clinical Context: You’ve received an EHR data extract with patient demographics, vital signs, and diagnoses in separate CSV files. Pandas makes it straightforward to load, clean, merge, and explore this data.

Pandas is built on NumPy and provides DataFrames—two-dimensional tables with labeled rows and columns. It’s the go-to tool for tabular clinical data.

3.4.1 DataFrames: The Workhorse

import pandas as pd
import numpy as np

# Creating a DataFrame from a dictionary
patients = pd.DataFrame({
    "mrn": ["MRN001", "MRN002", "MRN003", "MRN004", "MRN005"],
    "age": [45, 67, 34, 52, 71],
    "sex": ["F", "M", "F", "M", "F"],
    "diagnosis": ["Pneumonia", "CHF", "Appendicitis", "COPD", "MI"],
    "los_days": [5, 8, 3, 6, 10]  # length of stay
})

print(patients)
print(patients.shape)      # (5, 5) — 5 rows, 5 columns
print(patients.columns)    # column names
print(patients.dtypes)     # data type of each column
patients.info()            # prints a comprehensive summary (returns None)
print(patients.describe()) # statistics for numeric columns

3.4.2 Reading Data

import pandas as pd

# CSV (most common)
# df = pd.read_csv("patients.csv")
# df = pd.read_csv("patients.csv", index_col="mrn")  # use mrn as index

# Excel
# df = pd.read_excel("patients.xlsx", sheet_name="Demographics")

# Other formats
# df = pd.read_json("patients.json")
# df = pd.read_parquet("patients.parquet")  # efficient for large data

# From SQL database
# import sqlite3
# conn = sqlite3.connect("hospital.db")
# df = pd.read_sql("SELECT * FROM patients WHERE age > 60", conn)

# For this chapter, we'll create sample data
patients = pd.DataFrame({
    "mrn": ["MRN001", "MRN002", "MRN003", "MRN004", "MRN005"],
    "age": [45, 67, 34, 52, 71],
    "sex": ["F", "M", "F", "M", "F"],
    "diagnosis": ["Pneumonia", "CHF", "Appendicitis", "COPD", "MI"],
    "los_days": [5, 8, 3, 6, 10]
})

3.4.3 Selecting and Filtering Data

import pandas as pd

patients = pd.DataFrame({
    "mrn": ["MRN001", "MRN002", "MRN003", "MRN004", "MRN005"],
    "age": [45, 67, 34, 52, 71],
    "sex": ["F", "M", "F", "M", "F"],
    "diagnosis": ["Pneumonia", "CHF", "Appendicitis", "COPD", "MI"],
    "los_days": [5, 8, 3, 6, 10]
})

# Select columns
print(patients["age"])              # single column (Series)
print(patients[["age", "sex"]])     # multiple columns (DataFrame)

# Select rows by position
print(patients.iloc[0])             # first row
print(patients.iloc[0:3])           # first three rows
print(patients.iloc[0:3, 1:3])      # rows 0-2, columns 1-2

# Select rows by label (if you have meaningful index)
patients_indexed = patients.set_index("mrn")
print(patients_indexed.loc["MRN002"])

# Boolean filtering (most common in practice)
elderly = patients[patients["age"] > 60]
print(elderly)

# Multiple conditions
female_elderly = patients[(patients["age"] > 60) & (patients["sex"] == "F")]
long_stay = patients[patients["los_days"] >= 7]

# Using query() for readability
result = patients.query("age > 60 and sex == 'F'")

# isin() for categorical matching
cardiac = patients[patients["diagnosis"].isin(["CHF", "MI"])]

3.4.4 Handling Missing Values

Missing data is ubiquitous in clinical datasets. Pandas uses NaN (Not a Number) for missing values:

import pandas as pd
import numpy as np

# Create data with missing values
labs = pd.DataFrame({
    "mrn": ["MRN001", "MRN002", "MRN003", "MRN004"],
    "hemoglobin": [12.5, np.nan, 14.2, 11.8],
    "wbc": [8.5, 12.1, np.nan, 15.3],
    "platelets": [250, 180, 320, np.nan]
})

# Detecting missing values
print(labs.isnull())           # boolean DataFrame
print(labs.isnull().sum())     # count missing per column
print(labs.isnull().any())     # any missing in each column?

# Removing missing values
labs_complete = labs.dropna()           # drop rows with ANY missing
labs_some = labs.dropna(thresh=3)       # keep rows with at least 3 non-null values

# Filling missing values
labs_filled = labs.fillna(0)            # fill with constant
labs_mean = labs.fillna(labs.mean(numeric_only=True))  # fill with column means
labs_ffill = labs.ffill()               # forward fill (useful for time series)

# More sophisticated: fill with group means
# labs["hemoglobin"] = labs.groupby("department")["hemoglobin"].transform(
#     lambda x: x.fillna(x.mean())
# )
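
To make the group-wise idea concrete, here is a runnable sketch that adds a hypothetical department column (not part of the labs data above):

labs_dept = labs.copy()
labs_dept["department"] = ["ICU", "ICU", "Medicine", "Medicine"]  # hypothetical grouping
labs_dept["hemoglobin"] = labs_dept.groupby("department")["hemoglobin"].transform(
    lambda x: x.fillna(x.mean())
)
print(labs_dept)  # MRN002's missing hemoglobin is filled with the ICU group mean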

3.4.5 Grouping and Aggregation

import pandas as pd

# Sample data with multiple patients
encounters = pd.DataFrame({
    "mrn": ["MRN001", "MRN001", "MRN002", "MRN002", "MRN002", "MRN003"],
    "encounter_type": ["ED", "Inpatient", "ED", "ED", "Inpatient", "Outpatient"],
    "charges": [1500, 25000, 1200, 1800, 32000, 200],
    "los_hours": [4, 120, 3, 5, 168, 1]
})

# Group by single column
by_patient = encounters.groupby("mrn")
print(by_patient["charges"].sum())      # total charges per patient
print(by_patient["charges"].mean())     # average charge per patient
print(by_patient.size())                # count encounters per patient

# Multiple aggregations
summary = by_patient.agg({
    "charges": ["sum", "mean", "max"],
    "los_hours": ["sum", "mean"]
})
print(summary)

# Group by multiple columns
by_patient_type = encounters.groupby(["mrn", "encounter_type"]).agg({
    "charges": "sum",
    "los_hours": "mean"
})
print(by_patient_type)

3.4.6 Merging Datasets

Clinical data often comes in multiple tables that need to be joined:

import pandas as pd

# Demographics table
demographics = pd.DataFrame({
    "mrn": ["MRN001", "MRN002", "MRN003"],
    "age": [45, 67, 34],
    "sex": ["F", "M", "F"]
})

# Labs table
labs = pd.DataFrame({
    "mrn": ["MRN001", "MRN001", "MRN002", "MRN004"],
    "test": ["WBC", "Hemoglobin", "WBC", "WBC"],
    "value": [8.5, 12.1, 15.3, 9.2]
})

# Inner join: only patients in both tables
inner = pd.merge(demographics, labs, on="mrn", how="inner")
print(inner)

# Left join: all patients from demographics, matching labs where available
left = pd.merge(demographics, labs, on="mrn", how="left")
print(left)

# Outer join: all patients from both tables
outer = pd.merge(demographics, labs, on="mrn", how="outer")
print(outer)

# Join on different column names
# pd.merge(df1, df2, left_on="patient_id", right_on="mrn")
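
One pitfall: merging on non-unique keys multiplies rows (note MRN001 appears twice in the joins above). The validate argument catches unexpected cardinality early:

# Raises MergeError if demographics ever contains duplicate MRNs
checked = pd.merge(demographics, labs, on="mrn", how="left", validate="one_to_many")
print(checked)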

3.4.7 Time Series Basics

Clinical data often has timestamps—vital signs, medications, lab results over time:

import pandas as pd
import numpy as np

# Create time series data
dates = pd.date_range("2024-01-01", periods=24, freq="h")  # hourly; "H" is deprecated in newer pandas
vitals = pd.DataFrame({
    "timestamp": dates,
    "heart_rate": np.random.normal(75, 10, 24).astype(int),
    "sbp": np.random.normal(120, 15, 24).astype(int),
    "temp": np.random.normal(98.6, 0.5, 24).round(1)
})

# Set timestamp as index for time series operations
vitals = vitals.set_index("timestamp")

# Resample: aggregate to different time frequency
hourly_mean = vitals.resample("1h").mean()  # already hourly, so same
four_hour_mean = vitals.resample("4h").mean()
daily_mean = vitals.resample("D").mean()

# Rolling calculations (moving averages)
vitals["hr_rolling_4h"] = vitals["heart_rate"].rolling(window=4).mean()

# Time-based selection
morning = vitals.between_time("06:00", "12:00")
first_day = vitals.loc["2024-01-01"]

3.5 Data Visualization Essentials

Clinical Context: Before building models, you need to understand your data. Visualization reveals patterns, outliers, and data quality issues that summary statistics miss.

Matplotlib is Python’s foundational plotting library. Seaborn builds on it with cleaner defaults and statistical plots. Both are essential.

3.5.1 Matplotlib Basics

import matplotlib.pyplot as plt
import numpy as np

# Simple line plot
hours = np.arange(0, 24)
heart_rates = np.random.normal(75, 8, 24)

plt.figure(figsize=(10, 4))
plt.plot(hours, heart_rates, marker='o', linestyle='-', color='blue')
plt.xlabel("Hour")
plt.ylabel("Heart Rate (bpm)")
plt.title("24-Hour Heart Rate Trend")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Multiple lines
np.random.seed(42)
patient1 = np.random.normal(75, 8, 24)
patient2 = np.random.normal(85, 12, 24)

plt.figure(figsize=(10, 4))
plt.plot(hours, patient1, label="Patient 1", marker='o')
plt.plot(hours, patient2, label="Patient 2", marker='s')
plt.xlabel("Hour")
plt.ylabel("Heart Rate (bpm)")
plt.title("Heart Rate Comparison")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

3.5.2 Histograms and Distributions

import matplotlib.pyplot as plt
import numpy as np

# Simulated age distribution
np.random.seed(42)
ages = np.concatenate([
    np.random.normal(45, 15, 500),   # working age
    np.random.normal(75, 10, 300)    # elderly
])
ages = ages[(ages > 18) & (ages < 100)]  # realistic range

plt.figure(figsize=(10, 4))
plt.hist(ages, bins=30, edgecolor='black', alpha=0.7)
plt.xlabel("Age (years)")
plt.ylabel("Count")
plt.title("Patient Age Distribution")
plt.axvline(np.median(ages), color='red', linestyle='--', label=f'Median: {np.median(ages):.1f}')
plt.legend()
plt.tight_layout()
plt.show()

3.5.3 Scatter Plots

import matplotlib.pyplot as plt
import numpy as np

# Simulated relationship: age vs systolic BP
np.random.seed(42)
n = 200
ages = np.random.uniform(30, 80, n)
sbp = 90 + 0.5 * ages + np.random.normal(0, 10, n)

plt.figure(figsize=(8, 6))
plt.scatter(ages, sbp, alpha=0.6, c='blue', edgecolors='none')
plt.xlabel("Age (years)")
plt.ylabel("Systolic BP (mmHg)")
plt.title("Age vs Systolic Blood Pressure")

# Add trend line
z = np.polyfit(ages, sbp, 1)
p = np.poly1d(z)
plt.plot(sorted(ages), p(sorted(ages)), "r--", linewidth=2, label="Trend")
plt.legend()
plt.tight_layout()
plt.show()

3.5.4 Subplots

import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)

# Create a 2x2 grid of plots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Top left: vital signs over time
hours = np.arange(24)
axes[0, 0].plot(hours, np.random.normal(75, 8, 24), label="HR")
axes[0, 0].set_xlabel("Hour")
axes[0, 0].set_ylabel("Heart Rate")
axes[0, 0].set_title("Heart Rate Trend")
axes[0, 0].grid(True, alpha=0.3)

# Top right: histogram of ages
ages = np.random.normal(55, 18, 500)
axes[0, 1].hist(ages, bins=25, edgecolor='black', alpha=0.7)
axes[0, 1].set_xlabel("Age")
axes[0, 1].set_ylabel("Count")
axes[0, 1].set_title("Age Distribution")

# Bottom left: scatter plot
ages_sample = np.random.uniform(30, 80, 100)
sbp = 90 + 0.5 * ages_sample + np.random.normal(0, 10, 100)
axes[1, 0].scatter(ages_sample, sbp, alpha=0.6)
axes[1, 0].set_xlabel("Age")
axes[1, 0].set_ylabel("Systolic BP")
axes[1, 0].set_title("Age vs BP")

# Bottom right: bar chart
diagnoses = ["Pneumonia", "CHF", "COPD", "MI", "Sepsis"]
counts = [45, 38, 32, 28, 22]
axes[1, 1].bar(diagnoses, counts, color='steelblue', edgecolor='black')
axes[1, 1].set_xlabel("Diagnosis")
axes[1, 1].set_ylabel("Count")
axes[1, 1].set_title("Top Diagnoses")
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

3.5.5 Seaborn for Statistical Plots

Seaborn provides higher-level plotting with better defaults:

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Create sample clinical data
np.random.seed(42)
n = 300
df = pd.DataFrame({
    "age": np.random.normal(55, 18, n),
    "sbp": np.random.normal(125, 20, n),
    "los_days": np.random.exponential(5, n),
    "sex": np.random.choice(["Male", "Female"], n),
    "outcome": np.random.choice(["Survived", "Died"], n, p=[0.85, 0.15])
})

# Set style
sns.set_style("whitegrid")

# Distribution plot
plt.figure(figsize=(10, 4))
sns.histplot(data=df, x="age", hue="outcome", kde=True)
plt.title("Age Distribution by Outcome")
plt.show()

# Box plot: compare groups
plt.figure(figsize=(8, 5))
sns.boxplot(data=df, x="outcome", y="los_days")
plt.title("Length of Stay by Outcome")
plt.ylabel("Length of Stay (days)")
plt.show()

# Correlation heatmap
numeric_cols = df[["age", "sbp", "los_days"]]
plt.figure(figsize=(6, 5))
sns.heatmap(numeric_cols.corr(), annot=True, cmap="coolwarm", center=0)
plt.title("Feature Correlations")
plt.show()
Note for Clinical Readers

The following PyTorch section covers deep learning implementation details. If you’re reading for conceptual understanding rather than coding, you can skip to Chapter 7 and return here when you’re ready to build models. The concepts in Chapters 7-9 will make sense without this material.

3.6 PyTorch Fundamentals

Clinical Context: You’re ready to build a neural network to predict hospital readmission. PyTorch provides the building blocks—tensors, automatic differentiation, and neural network modules—that make this possible.

PyTorch is a deep learning framework that combines NumPy-like tensor operations with automatic differentiation. It’s the most popular framework in medical AI research.

3.6.1 Tensors

Tensors are PyTorch’s fundamental data structure—essentially NumPy arrays that can run on GPUs and track gradients:

import torch
import numpy as np

# Creating tensors
x = torch.tensor([1, 2, 3, 4, 5])
print(x)
print(x.dtype)  # torch.int64
print(x.shape)  # torch.Size([5])

# From NumPy (common pattern)
np_array = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)
tensor_from_np = torch.from_numpy(np_array)
print(tensor_from_np)

# Back to NumPy
back_to_np = tensor_from_np.numpy()

# Creation functions (similar to NumPy)
zeros = torch.zeros(3, 4)
ones = torch.ones(2, 3)
rand = torch.rand(3, 3)        # uniform [0, 1)
randn = torch.randn(3, 3)      # standard normal

# Specify dtype
float_tensor = torch.tensor([1, 2, 3], dtype=torch.float32)

# Move to GPU (if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
gpu_tensor = float_tensor.to(device)

3.6.2 Tensor Operations

import torch

# Create sample tensors
a = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)
b = torch.tensor([[5, 6], [7, 8]], dtype=torch.float32)

# Element-wise operations
print(a + b)
print(a * b)  # element-wise multiplication
print(a ** 2)

# Matrix multiplication
print(a @ b)           # preferred syntax
print(torch.mm(a, b))  # equivalent
print(torch.matmul(a, b))  # also equivalent

# Broadcasting (same rules as NumPy)
c = torch.tensor([1, 2])
print(a + c)  # broadcasts c across rows

# Aggregations
x = torch.randn(3, 4)
print(x.sum())
print(x.mean())
print(x.mean(dim=0))  # mean of each column (result shape: 4)
print(x.mean(dim=1))  # mean of each row (result shape: 3)

# Reshaping
print(x.view(4, 3))    # reshape to 4x3
print(x.view(-1))      # flatten (-1 infers dimension)
print(x.unsqueeze(0))  # add dimension: (3, 4) -> (1, 3, 4)
print(x.squeeze())     # remove dimensions of size 1
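
One difference from NumPy worth knowing: view() requires the tensor's memory to be contiguous, while reshape() copies when necessary:

t = torch.randn(3, 4).T    # transposing makes the tensor non-contiguous
# t.view(12)               # would raise a RuntimeError here
flat = t.reshape(12)       # reshape() handles non-contiguous tensors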

3.6.3 Automatic Differentiation

The key feature that makes neural network training possible: PyTorch automatically computes gradients.

import torch

# Create tensor that tracks gradients
x = torch.tensor([2.0, 3.0], requires_grad=True)
print(x)

# Perform computations
y = x ** 2
z = y.sum()  # scalar output
print(f"z = {z}")

# Compute gradients (backpropagation)
z.backward()

# Access gradients
print(f"dz/dx = {x.grad}")  # Should be [4.0, 6.0] = 2*x

# More complex example: linear regression gradient
# Model: y = wx + b
# Loss: MSE = (y_pred - y_true)^2

# Data
x_data = torch.tensor([1.0, 2.0, 3.0, 4.0])
y_true = torch.tensor([2.0, 4.0, 6.0, 8.0])

# Parameters (learnable)
w = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

# Forward pass
y_pred = w * x_data + b
loss = ((y_pred - y_true) ** 2).mean()
print(f"Loss: {loss.item()}")

# Backward pass
loss.backward()
print(f"Gradient of w: {w.grad}")
print(f"Gradient of b: {b.grad}")
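
One pitfall: gradients accumulate across backward() calls rather than being overwritten, which is why the training loop in Section 3.6.5 calls zero_grad() every iteration:

# A second forward/backward pass ADDS to the existing .grad
y_pred = w * x_data + b
loss = ((y_pred - y_true) ** 2).mean()
loss.backward()
print(f"Accumulated gradient of w: {w.grad}")  # roughly double the value above

# Clear manually when managing parameters by hand
w.grad.zero_()
b.grad.zero_()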

3.6.4 The nn Module

torch.nn provides building blocks for neural networks:

import torch
import torch.nn as nn

# Linear layer (fully connected)
linear = nn.Linear(in_features=10, out_features=5)
x = torch.randn(32, 10)  # batch of 32, each with 10 features
output = linear(x)
print(output.shape)  # torch.Size([32, 5])

# Common layers
conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3)
pool = nn.MaxPool2d(kernel_size=2)
dropout = nn.Dropout(p=0.5)
batchnorm = nn.BatchNorm1d(num_features=64)

# Activation functions
relu = nn.ReLU()
sigmoid = nn.Sigmoid()
softmax = nn.Softmax(dim=1)

# Building a simple network
class SimpleClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.layer1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = self.layer1(x)
        x = self.relu(x)
        x = self.layer2(x)
        return x

# Create model
model = SimpleClassifier(input_size=10, hidden_size=32, num_classes=2)
print(model)

# Forward pass
x = torch.randn(16, 10)  # batch of 16
logits = model(x)
print(logits.shape)  # torch.Size([16, 2])

3.6.5 The Training Loop

The fundamental pattern for training neural networks:

import torch
import torch.nn as nn
import torch.optim as optim

# Create synthetic data: predict if patient will be readmitted
# Features: age, num_prior_admits, length_of_stay, num_diagnoses, num_medications
torch.manual_seed(42)
n_samples = 1000
X = torch.randn(n_samples, 5)
# True relationship: risk increases with age, prior admits, diagnoses
true_weights = torch.tensor([0.3, 0.5, -0.1, 0.4, 0.1])
y = (X @ true_weights + torch.randn(n_samples) * 0.5 > 0).float()

# Split data
train_X, test_X = X[:800], X[800:]
train_y, test_y = y[:800], y[800:]

# Define model
model = nn.Sequential(
    nn.Linear(5, 32),
    nn.ReLU(),
    nn.Linear(32, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
    nn.Sigmoid()
)

# Loss function and optimizer
criterion = nn.BCELoss()  # Binary Cross Entropy for classification
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Training loop
n_epochs = 100
for epoch in range(n_epochs):
    # Forward pass
    predictions = model(train_X).squeeze()
    loss = criterion(predictions, train_y)

    # Backward pass
    optimizer.zero_grad()  # clear previous gradients
    loss.backward()        # compute gradients
    optimizer.step()       # update parameters

    # Print progress
    if (epoch + 1) % 20 == 0:
        with torch.no_grad():  # disable gradient tracking for evaluation
            train_acc = ((predictions > 0.5) == train_y).float().mean()
            test_pred = model(test_X).squeeze()
            test_acc = ((test_pred > 0.5) == test_y).float().mean()
        print(f"Epoch {epoch+1}/{n_epochs}, Loss: {loss.item():.4f}, "
              f"Train Acc: {train_acc:.3f}, Test Acc: {test_acc:.3f}")

3.6.6 Moving Between NumPy and PyTorch

You’ll frequently convert between NumPy (for data manipulation) and PyTorch (for modeling):

import numpy as np
import torch

# NumPy to PyTorch
np_array = np.array([1.0, 2.0, 3.0])
torch_tensor = torch.from_numpy(np_array)  # shares memory!
torch_tensor_copy = torch.tensor(np_array)  # creates copy

# PyTorch to NumPy
tensor = torch.tensor([1.0, 2.0, 3.0])
np_from_tensor = tensor.numpy()  # shares memory if on CPU

# If tensor requires grad or is on GPU
tensor_grad = torch.tensor([1.0, 2.0], requires_grad=True)
np_from_grad = tensor_grad.detach().numpy()  # detach from computation graph

# If on GPU
# gpu_tensor = tensor.cuda()
# np_from_gpu = gpu_tensor.cpu().numpy()  # move to CPU first

3.7 Putting It Together: A Clinical Data Pipeline

Clinical Context: You have CSV files with patient demographics, vital signs, and outcomes. You need to load, clean, merge, and prepare this data for a readmission prediction model.

Let’s build a complete pipeline from raw data to model-ready tensors:

import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# === Step 1: Create sample clinical data ===
np.random.seed(42)
n_patients = 1000

# Demographics
demographics = pd.DataFrame({
    "patient_id": range(n_patients),
    "age": np.random.normal(55, 18, n_patients).clip(18, 95),
    "sex": np.random.choice(["M", "F"], n_patients),
    "insurance": np.random.choice(["Medicare", "Medicaid", "Private", "Self-pay"],
                                   n_patients, p=[0.4, 0.2, 0.35, 0.05])
})

# Encounter data
encounters = pd.DataFrame({
    "patient_id": range(n_patients),
    "length_of_stay": np.random.exponential(4, n_patients).clip(1, 30),
    "num_diagnoses": np.random.poisson(3, n_patients) + 1,
    "num_procedures": np.random.poisson(1.5, n_patients),
    "icu_stay": np.random.choice([0, 1], n_patients, p=[0.85, 0.15])
})

# Outcome (30-day readmission)
# Higher risk for: older, longer stays, more diagnoses, ICU
risk_score = (
    0.02 * demographics["age"] +
    0.1 * encounters["length_of_stay"] +
    0.15 * encounters["num_diagnoses"] +
    0.3 * encounters["icu_stay"]
)
readmit_prob = 1 / (1 + np.exp(-risk_score + 3))
outcomes = pd.DataFrame({
    "patient_id": range(n_patients),
    "readmitted_30d": (np.random.random(n_patients) < readmit_prob).astype(int)
})

print(f"Readmission rate: {outcomes['readmitted_30d'].mean():.1%}")

# === Step 2: Merge and clean ===
df = demographics.merge(encounters, on="patient_id").merge(outcomes, on="patient_id")
print(f"Dataset shape: {df.shape}")
print(df.head())

# Handle categorical variables
df["sex_encoded"] = (df["sex"] == "M").astype(int)
df = pd.get_dummies(df, columns=["insurance"], prefix="ins", dtype=int)  # 0/1 columns, not booleans

# Check for missing values
print(f"\nMissing values:\n{df.isnull().sum()}")

# === Step 3: Feature selection and preparation ===
feature_columns = [
    "age", "sex_encoded", "length_of_stay", "num_diagnoses",
    "num_procedures", "icu_stay", "ins_Medicaid", "ins_Medicare",
    "ins_Private", "ins_Self-pay"
]
target_column = "readmitted_30d"

X = df[feature_columns].values
y = df[target_column].values

print(f"\nFeature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")

# === Step 4: Train/test split ===
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"\nTrain set: {X_train.shape[0]} patients")
print(f"Test set: {X_test.shape[0]} patients")
print(f"Train readmission rate: {y_train.mean():.1%}")
print(f"Test readmission rate: {y_test.mean():.1%}")

# === Step 5: Normalize features ===
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train only
X_test_scaled = scaler.transform(X_test)        # apply same transformation

# === Step 6: Convert to PyTorch ===
X_train_tensor = torch.tensor(X_train_scaled, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test_scaled, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32)

print(f"\nPyTorch tensors ready:")
print(f"X_train: {X_train_tensor.shape}, dtype: {X_train_tensor.dtype}")
print(f"y_train: {y_train_tensor.shape}, dtype: {y_train_tensor.dtype}")

# === Step 7: Create DataLoader for batching ===
class ClinicalDataset(Dataset):
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

train_dataset = ClinicalDataset(X_train_tensor, y_train_tensor)
test_dataset = ClinicalDataset(X_test_tensor, y_test_tensor)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

# Example: iterate through one batch
for batch_X, batch_y in train_loader:
    print(f"\nBatch shapes:")
    print(f"Features: {batch_X.shape}")
    print(f"Labels: {batch_y.shape}")
    break

print("\n✓ Data pipeline complete! Ready for model training.")
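
As a final check that the tensors and loaders fit together, here is a minimal batched training sketch (mirroring Section 3.6.5; the small architecture here is an arbitrary choice for illustration):

import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(len(feature_columns), 32),
    nn.ReLU(),
    nn.Linear(32, 1),
    nn.Sigmoid()
)
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

for epoch in range(5):
    for batch_X, batch_y in train_loader:
        optimizer.zero_grad()                                # clear previous gradients
        loss = criterion(model(batch_X).squeeze(), batch_y)  # forward pass + loss
        loss.backward()                                      # compute gradients
        optimizer.step()                                     # update parameters
    print(f"Epoch {epoch+1}, last batch loss: {loss.item():.4f}")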

This pipeline demonstrates the complete workflow:

  1. Load data from multiple sources
  2. Merge related tables
  3. Clean and handle missing values
  4. Encode categorical variables
  5. Split into training and test sets
  6. Normalize features
  7. Convert to PyTorch tensors
  8. Batch using DataLoaders

This pattern—with variations for specific data types—underlies virtually all medical AI work. Master it, and you’re ready for the chapters ahead.