5 Exploratory Data Analysis for Clinical AI
Clinical Context: A team downloads a public EHR dataset and immediately trains a sepsis prediction model. The model achieves 0.95 AUC. They’re excited—until deployment, where performance drops to 0.68. The problem wasn’t the model. The problem was the data: 30% of “sepsis” labels were assigned retrospectively based on death certificates, timestamps were corrupted for weekend admissions, and lab values had impossible outliers from data entry errors. No amount of sophisticated modeling can fix garbage data. This chapter teaches you to understand your data before you model it.
Machine learning textbooks often jump from “here’s your dataset” to “here’s how to train a model.” But in clinical AI, the data is the hard part. Electronic health records were designed for billing and documentation, not machine learning. Research datasets have their own quirks. Understanding your data—its structure, its limitations, its biases—is the most important step in any clinical AI project.
This chapter covers exploratory data analysis (EDA) for clinical applications: systematic workflows for understanding data, handling missing values, identifying quality issues, and engineering features that capture clinical meaning.
5.1 The EDA Mindset
5.1.1 Why EDA Matters More in Healthcare
In many ML applications, data is relatively clean: images are images, text is text. Clinical data is different:
- EHR data is a byproduct: It captures what was documented for patient care, not what’s ideal for research
- Missingness is informative: A missing lab value often means “not suspected”
- Timestamps lie: Orders backdated, results posted late, timezones inconsistent
- Labels are imperfect: Diagnoses are billing codes, not ground truth
The consequence: models learn from data artifacts as much as clinical signal. A model might learn that patients with missing potassium values are healthier (because potassium wasn’t checked) rather than learning anything about potassium itself.
5.1.2 The EDA Workflow
Before any modeling, systematically answer:
- What do I have? — Rows, columns, data types, memory footprint
- What’s missing? — Patterns, mechanisms, implications
- What does it look like? — Distributions, outliers, relationships
- What’s wrong? — Impossible values, temporal inconsistencies, coding errors
- What do I need to create? — Feature engineering, transformations
This isn’t a one-time step. Return to EDA whenever model performance surprises you.
5.2 Systematic Data Profiling
Clinical Context: You’ve received a dataset of 50,000 ICU admissions with 200 variables. Where do you start?
5.2.1 First Look: Structure and Size
import pandas as pd
import numpy as np

# Load data
df = pd.read_csv("icu_admissions.csv")

# Basic structure
print(f"Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"Memory: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
print("\nColumn types:")
print(df.dtypes.value_counts())

# First few rows (check for obvious issues)
df.head()
5.2.2 Summary Statistics by Data Type
Different variable types need different summaries:
def profile_dataframe(df):
"""Generate comprehensive data profile."""
profile = []
for col in df.columns:
stats = {
'column': col,
'dtype': str(df[col].dtype),
'n_missing': df[col].isna().sum(),
'pct_missing': df[col].isna().mean() * 100,
'n_unique': df[col].nunique()
}
        if pd.api.types.is_numeric_dtype(df[col]):  # covers int32/64, float32/64, etc.
# Numeric summaries
stats.update({
'mean': df[col].mean(),
'std': df[col].std(),
'min': df[col].min(),
'p25': df[col].quantile(0.25),
'median': df[col].median(),
'p75': df[col].quantile(0.75),
'max': df[col].max()
})
elif df[col].dtype == 'object':
# Categorical summaries
top_val = df[col].value_counts().index[0] if len(df[col].dropna()) > 0 else None
stats.update({
'top_value': top_val,
                'top_freq': df[col].value_counts().iloc[0] if top_val is not None else 0
})
profile.append(stats)
return pd.DataFrame(profile)
# Generate profile
profile = profile_dataframe(df)
print(profile.to_string())

5.2.3 Red Flags to Check Immediately
def check_red_flags(df):
"""Check for common data quality issues."""
issues = []
# Duplicate rows
n_dupes = df.duplicated().sum()
if n_dupes > 0:
issues.append(f"⚠️ {n_dupes:,} duplicate rows ({n_dupes/len(df)*100:.1f}%)")
# Columns with single value
for col in df.columns:
if df[col].nunique() == 1:
issues.append(f"⚠️ Column '{col}' has only one unique value")
# High cardinality categoricals (might be IDs)
for col in df.select_dtypes(include='object').columns:
if df[col].nunique() > len(df) * 0.5:
issues.append(f"⚠️ Column '{col}' has {df[col].nunique():,} unique values (possible ID?)")
# Columns that are mostly missing
for col in df.columns:
pct_missing = df[col].isna().mean()
if pct_missing > 0.9:
issues.append(f"⚠️ Column '{col}' is {pct_missing*100:.1f}% missing")
# Numeric columns with suspicious ranges
    for col in df.select_dtypes(include=[np.number]).columns:
if df[col].min() < -1e10 or df[col].max() > 1e10:
issues.append(f"⚠️ Column '{col}' has extreme values (min={df[col].min():.2e}, max={df[col].max():.2e})")
return issues
issues = check_red_flags(df)
for issue in issues:
    print(issue)

5.3 Missing Data in Clinical Settings
Clinical Context: Your sepsis dataset has 40% missing lactate values. Should you drop those patients? Impute the values? The answer depends on WHY the data is missing—and in clinical data, missingness is rarely random.
5.3.1 The Missing Data Taxonomy
Statistical theory distinguishes three mechanisms:
MCAR (Missing Completely at Random): Missingness is independent of both observed and unobserved data.
Clinical example: A lab machine randomly malfunctions, losing 5% of results regardless of patient characteristics.
Reality check: True MCAR is rare in clinical data.
MAR (Missing at Random): Missingness depends on observed data but not on the missing values themselves.
Clinical example: Sicker patients (observable via vital signs) get more labs. Missingness depends on severity, but given severity, the actual lab value doesn’t affect whether it’s measured.
Reality check: Often assumed but hard to verify.
MNAR (Missing Not at Random): Missingness depends on the unobserved values themselves.
Clinical example: HbA1c is only ordered when diabetes is suspected. Patients without diabetes suspicion don’t get tested—their (likely normal) HbA1c values are missing because they’re normal. The missing values are systematically different from observed values.
Reality check: This is the default assumption for clinical data.
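These mechanisms cannot be distinguished from the observed data alone, but a small simulation makes the definitions concrete. A minimal sketch on synthetic data (not the chapter's dataset): MCAR leaves the observed mean unbiased, while MAR and MNAR both shift it; only the MAR shift can be corrected by conditioning on observed severity.

rng = np.random.default_rng(42)
n = 10_000
severity = rng.normal(0, 1, n)  # observable illness-severity proxy
lactate = 2.0 + 1.5 * severity + rng.normal(0, 0.5, n)  # true values

# MCAR: missingness independent of everything
mcar = rng.random(n) < 0.30
# MAR: probability of missingness depends only on observed severity
mar = rng.random(n) < 1 / (1 + np.exp(2 * severity))
# MNAR: low (unobserved) lactate itself tends to go unmeasured
mnar = rng.random(n) < 1 / (1 + np.exp(2 * (lactate - 2)))

print(f"True mean lactate: {lactate.mean():.2f}")
for name, mask in [('MCAR', mcar), ('MAR', mar), ('MNAR', mnar)]:
    print(f"{name}: observed mean = {lactate[~mask].mean():.2f}")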
5.3.2 Informative Missingness in EHR Data
In clinical practice, the decision to measure something is itself clinical information:
| Variable | Why It Might Be Missing | What Missingness Signals |
|---|---|---|
| Lactate | Not suspected of sepsis | Lower acuity |
| Troponin | No chest pain | Lower cardiac risk |
| Blood culture | No infection concern | Lower infection probability |
| PT/INR | Not on anticoagulation | Different medication profile |
| HIV test | Not indicated | Different risk profile |
The missingness pattern IS a feature. A model that imputes lactate to the population mean loses this signal.
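A quick check for informative missingness is to compare the outcome rate between patients with and without a measurement. A minimal sketch, assuming a binary `mortality` column (an assumption about this dataset's schema):

def missingness_outcome_association(df, feature_cols, outcome_col):
    """Compare outcome rate when a feature is missing vs. observed."""
    rows = []
    for col in feature_cols:
        missing = df[col].isna()
        rows.append({
            'feature': col,
            'pct_missing': missing.mean() * 100,
            'outcome_rate_missing': df.loc[missing, outcome_col].mean(),
            'outcome_rate_observed': df.loc[~missing, outcome_col].mean(),
        })
    return pd.DataFrame(rows)

print(missingness_outcome_association(df, ['lactate', 'troponin'], 'mortality'))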
5.3.3 Visualizing Missing Data Patterns
import matplotlib.pyplot as plt
import seaborn as sns
def plot_missing_matrix(df, max_cols=30):
"""Visualize missing data patterns."""
# Select columns with some missingness
missing_cols = df.columns[df.isna().any()].tolist()[:max_cols]
if not missing_cols:
print("No missing data!")
return
# Create missing indicator matrix
missing_matrix = df[missing_cols].isna().astype(int)
# Sort by total missingness
col_order = missing_matrix.sum().sort_values(ascending=False).index
missing_matrix = missing_matrix[col_order]
# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Bar plot of missingness by column
missing_pct = missing_matrix.mean() * 100
axes[0].barh(range(len(missing_pct)), missing_pct.values)
axes[0].set_yticks(range(len(missing_pct)))
axes[0].set_yticklabels(missing_pct.index)
axes[0].set_xlabel('Percent Missing')
axes[0].set_title('Missing Data by Column')
# Heatmap of co-missingness
    co_missing = missing_matrix.corr()  # correlation between columns' missingness
sns.heatmap(co_missing, ax=axes[1], cmap='RdBu_r', center=0,
xticklabels=True, yticklabels=True)
axes[1].set_title('Co-Missingness Correlation')
plt.tight_layout()
plt.show()
plot_missing_matrix(df)

5.3.4 Strategies for Missing Clinical Data
1. Missingness Indicators (Recommended for clinical ML)
Create binary features indicating whether each variable was observed:
def add_missing_indicators(df, columns):
"""Add binary missing indicators for specified columns."""
df_out = df.copy()
for col in columns:
df_out[f'{col}_missing'] = df[col].isna().astype(int)
return df_out
# Add indicators for key labs
lab_cols = ['lactate', 'troponin', 'wbc', 'creatinine']
df = add_missing_indicators(df, lab_cols)

2. Simple Imputation with Missing Indicators
Impute to allow models to use observed values, but keep indicators:
from sklearn.impute import SimpleImputer
def impute_with_indicators(df, columns, strategy='median'):
"""Impute missing values and add indicators."""
df_out = df.copy()
for col in columns:
# Add indicator first
df_out[f'{col}_missing'] = df[col].isna().astype(int)
# Then impute
imputer = SimpleImputer(strategy=strategy)
df_out[col] = imputer.fit_transform(df_out[[col]])
    return df_out

3. Imputation Within Clinical Subgroups
If you must impute, do it within clinically meaningful groups:
def impute_by_group(df, col, group_col, strategy='median'):
"""Impute missing values within groups."""
df_out = df.copy()
# Group-specific imputation
for group in df[group_col].unique():
mask = df[group_col] == group
group_median = df.loc[mask, col].median()
df_out.loc[mask & df[col].isna(), col] = group_median
# Fill remaining with global median
global_median = df[col].median()
df_out[col] = df_out[col].fillna(global_median)
return df_out
# Impute creatinine within age groups
df = impute_by_group(df, 'creatinine', 'age_group')

4. Multiple Imputation (For Inference)
For statistical inference (not prediction), multiple imputation properly accounts for uncertainty:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Multiple imputation (sample_posterior=True makes each run a distinct draw)
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
n_imputations = 5
imputed_datasets = []
for i in range(n_imputations):
    imputer = IterativeImputer(random_state=i, max_iter=10, sample_posterior=True)
imputed = imputer.fit_transform(df[numeric_cols])
imputed_datasets.append(pd.DataFrame(imputed, columns=numeric_cols))
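# Analysis should be run on each imputed dataset, then pooled. A hedged
# sketch of pooling with Rubin's rules, assuming the quantity of interest
# is the mean of a hypothetical 'lactate' column:
estimates = np.array([d['lactate'].mean() for d in imputed_datasets])
within_var = np.mean([d['lactate'].var(ddof=1) / len(d) for d in imputed_datasets])
between_var = estimates.var(ddof=1)
total_var = within_var + (1 + 1 / n_imputations) * between_var
print(f"Pooled mean: {estimates.mean():.2f} (SE {np.sqrt(total_var):.2f})")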
5.3.5 What NOT to Do
- Don’t drop patients with missing data (unless MCAR, which is rare)
- Don’t impute without indicators (loses informative missingness signal)
- Don’t assume MAR without justification (MNAR is common in clinical data)
- Don’t impute the outcome variable (that’s the thing you’re trying to predict)
5.4 Visualization Techniques for Clinical EDA
Clinical Context: A table tells you 15% of creatinine values are missing. A visualization shows you they’re missing specifically for patients admitted on weekends, pointing to a staffing or workflow issue. Effective visualization reveals patterns that summary statistics hide.
5.4.1 Distribution Plots
Understanding variable distributions is essential before modeling. Clinical variables often have non-normal distributions with meaningful patterns.
Histograms and Density Plots
import matplotlib.pyplot as plt
import seaborn as sns
def plot_distributions(df, columns, ncols=3, figsize=(15, 4)):
"""Plot distributions for multiple numeric columns."""
nrows = (len(columns) + ncols - 1) // ncols
fig, axes = plt.subplots(nrows, ncols, figsize=(figsize[0], figsize[1] * nrows))
    axes = np.atleast_1d(axes).ravel()  # uniform handling of single and multiple Axes
for i, col in enumerate(columns):
ax = axes[i]
data = df[col].dropna()
# Histogram with KDE overlay
sns.histplot(data, kde=True, ax=ax, color='steelblue', alpha=0.7)
ax.axvline(data.mean(), color='red', linestyle='--', label=f'Mean: {data.mean():.1f}')
ax.axvline(data.median(), color='green', linestyle='--', label=f'Median: {data.median():.1f}')
ax.set_title(f'{col}\n(n={len(data):,}, missing={df[col].isna().sum():,})')
ax.legend(fontsize=8)
# Hide unused subplots
for j in range(i + 1, len(axes)):
axes[j].set_visible(False)
plt.tight_layout()
plt.show()
# Plot key clinical variables
vital_cols = ['heart_rate', 'systolic_bp', 'temperature_c', 'respiratory_rate']
plot_distributions(df, vital_cols)

Identifying Distribution Shapes
Clinical variables often show characteristic patterns:
- Bimodal: May indicate two patient populations (e.g., temperature with fever vs. normal)
- Right-skewed: Common for lab values, lengths of stay, costs
- Truncated: May indicate data collection limits or physiological bounds
- Spike at specific value: Often indicates default or imputed values
def detect_distribution_anomalies(series, name="Variable"):
"""Flag potential distribution issues."""
data = series.dropna()
issues = []
# Check for extreme skewness
skewness = data.skew()
if abs(skewness) > 2:
direction = "right" if skewness > 0 else "left"
issues.append(f"Highly {direction}-skewed (skew={skewness:.2f})")
# Check for spikes (possible default values)
value_counts = data.value_counts(normalize=True)
if value_counts.iloc[0] > 0.2:
issues.append(f"Mode appears in {value_counts.iloc[0]*100:.1f}% of values (possible default?)")
# Check for outlier percentage (IQR method)
Q1, Q3 = data.quantile([0.25, 0.75])
IQR = Q3 - Q1
outlier_mask = (data < Q1 - 1.5*IQR) | (data > Q3 + 1.5*IQR)
outlier_pct = outlier_mask.mean() * 100
if outlier_pct > 5:
issues.append(f"{outlier_pct:.1f}% outliers by IQR method")
if issues:
print(f"\n{name}:")
for issue in issues:
            print(f" ⚠️ {issue}")

5.4.2 Comparing Groups Visually
Comparing distributions across patient groups (outcome, demographics, time periods) reveals heterogeneity and potential bias.
Box Plots by Group
def plot_by_group(df, value_col, group_col, figsize=(10, 5)):
"""Compare distributions across groups."""
fig, axes = plt.subplots(1, 2, figsize=figsize)
# Box plot
sns.boxplot(data=df, x=group_col, y=value_col, ax=axes[0])
axes[0].set_title(f'{value_col} by {group_col}')
# Overlapping density plots
for group in df[group_col].unique():
group_data = df[df[group_col] == group][value_col].dropna()
sns.kdeplot(group_data, ax=axes[1], label=f'{group} (n={len(group_data):,})')
axes[1].set_title(f'{value_col} distribution by {group_col}')
axes[1].legend()
plt.tight_layout()
plt.show()
# Compare lactate between survivors and non-survivors
plot_by_group(df, 'lactate', 'mortality')

Violin Plots for Richer Comparison
def plot_violin_comparison(df, value_cols, group_col, figsize=(12, 4)):
"""Violin plots showing full distribution shape by group."""
fig, axes = plt.subplots(1, len(value_cols), figsize=figsize)
if len(value_cols) == 1:
axes = [axes]
for ax, col in zip(axes, value_cols):
sns.violinplot(data=df, x=group_col, y=col, ax=ax, inner='quartile')
ax.set_title(col)
plt.tight_layout()
plt.show()
# Compare multiple labs between outcome groups
plot_violin_comparison(df, ['creatinine', 'lactate', 'bilirubin'], 'mortality')

5.4.3 Correlation Heatmaps
Understanding variable relationships helps identify redundancy and potential multicollinearity.
def plot_correlation_heatmap(df, columns=None, method='pearson', figsize=(10, 8)):
"""Create correlation heatmap with hierarchical clustering."""
if columns is None:
columns = df.select_dtypes(include=[np.number]).columns.tolist()
# Calculate correlation matrix
corr = df[columns].corr(method=method)
# Create clustered heatmap
g = sns.clustermap(corr, cmap='RdBu_r', center=0,
figsize=figsize, annot=True, fmt='.2f',
linewidths=0.5, vmin=-1, vmax=1)
g.fig.suptitle(f'{method.title()} Correlation Matrix (Clustered)', y=1.02)
plt.show()
return corr
# Correlation among lab values
lab_cols = ['wbc', 'hemoglobin', 'platelets', 'creatinine', 'bilirubin', 'lactate']
corr_matrix = plot_correlation_heatmap(df, lab_cols)

Identifying Highly Correlated Features
def find_high_correlations(corr_matrix, threshold=0.7):
"""Find pairs of features with correlation above threshold."""
high_corr = []
for i in range(len(corr_matrix.columns)):
for j in range(i + 1, len(corr_matrix.columns)):
if abs(corr_matrix.iloc[i, j]) > threshold:
high_corr.append({
'feature_1': corr_matrix.columns[i],
'feature_2': corr_matrix.columns[j],
'correlation': corr_matrix.iloc[i, j]
})
    result = pd.DataFrame(high_corr)
    if result.empty:
        return result
    return result.sort_values('correlation', key=abs, ascending=False)
# Find highly correlated pairs
high_corr_pairs = find_high_correlations(corr_matrix, threshold=0.7)
print("Highly correlated feature pairs:")
print(high_corr_pairs)

5.4.4 Scatter Plots for Clinical Relationships
Scatter plots reveal non-linear relationships and subgroup patterns that correlation coefficients miss.
def plot_clinical_scatter(df, x_col, y_col, hue_col=None, figsize=(8, 6)):
"""Scatter plot with regression line and optional grouping."""
fig, ax = plt.subplots(figsize=figsize)
if hue_col:
sns.scatterplot(data=df, x=x_col, y=y_col, hue=hue_col,
alpha=0.6, ax=ax)
else:
sns.scatterplot(data=df, x=x_col, y=y_col, alpha=0.6, ax=ax)
# Add regression line
sns.regplot(data=df, x=x_col, y=y_col, scatter=False,
ax=ax, color='red', line_kws={'linestyle': '--'})
ax.set_title(f'{y_col} vs {x_col}')
plt.tight_layout()
plt.show()
# Creatinine vs BUN (should correlate in renal function)
plot_clinical_scatter(df, 'bun', 'creatinine', hue_col='aki_stage')

Pair Plots for Multiple Variables
def plot_pairs(df, columns, hue_col=None, sample_n=1000):
"""Pair plot for exploring multiple variable relationships."""
# Sample for performance if dataset is large
if len(df) > sample_n:
df_sample = df.sample(n=sample_n, random_state=42)
else:
df_sample = df
g = sns.pairplot(df_sample[columns + ([hue_col] if hue_col else [])],
hue=hue_col, diag_kind='kde',
plot_kws={'alpha': 0.5, 's': 20})
g.fig.suptitle('Variable Relationships', y=1.02)
plt.show()
# Explore relationships among vital signs by outcome
vital_cols = ['heart_rate', 'systolic_bp', 'respiratory_rate', 'oxygen_saturation']
plot_pairs(df, vital_cols, hue_col='mortality')

5.4.5 Temporal Visualizations
Clinical data has a time dimension. Visualizing temporal patterns reveals trends, seasonality, and data artifacts.
Outcome Rates Over Time
def plot_outcome_over_time(df, date_col, outcome_col, freq='M', figsize=(12, 5)):
"""Plot outcome rate and volume over time."""
df_copy = df.copy()
df_copy['period'] = pd.to_datetime(df_copy[date_col]).dt.to_period(freq)
trends = df_copy.groupby('period').agg({
outcome_col: ['mean', 'sum', 'count']
})
trends.columns = ['rate', 'events', 'total']
trends = trends.reset_index()
trends['period_str'] = trends['period'].astype(str)
fig, ax1 = plt.subplots(figsize=figsize)
# Outcome rate (line)
color1 = 'steelblue'
ax1.plot(trends['period_str'], trends['rate'], 'o-', color=color1, linewidth=2)
ax1.fill_between(trends['period_str'], trends['rate'], alpha=0.2, color=color1)
ax1.set_xlabel('Time Period')
ax1.set_ylabel('Outcome Rate', color=color1)
ax1.tick_params(axis='y', labelcolor=color1)
ax1.tick_params(axis='x', rotation=45)
# Volume (bars)
ax2 = ax1.twinx()
ax2.bar(trends['period_str'], trends['total'], alpha=0.3, color='gray', label='Volume')
ax2.set_ylabel('Patient Volume', color='gray')
plt.title(f'{outcome_col} Rate Over Time')
plt.tight_layout()
plt.show()
# Flag suspicious jumps
rate_changes = trends['rate'].pct_change().abs()
if rate_changes.max() > 0.5:
print(f"⚠️ Large rate change detected ({rate_changes.max()*100:.0f}%) - investigate for data artifacts")
plot_outcome_over_time(df, 'admission_date', 'sepsis', freq='Q')

Feature Distributions Over Time
def plot_feature_drift(df, date_col, feature_col, freq='Q', figsize=(12, 5)):
"""Visualize how a feature's distribution changes over time."""
df_copy = df.copy()
df_copy['period'] = pd.to_datetime(df_copy[date_col]).dt.to_period(freq)
periods = df_copy['period'].unique()
fig, axes = plt.subplots(1, 2, figsize=figsize)
# Box plots over time
df_copy['period_str'] = df_copy['period'].astype(str)
sns.boxplot(data=df_copy, x='period_str', y=feature_col, ax=axes[0])
axes[0].tick_params(axis='x', rotation=45)
axes[0].set_title(f'{feature_col} Distribution Over Time')
# Summary statistics
summary = df_copy.groupby('period')[feature_col].agg(['mean', 'median', 'std'])
summary = summary.reset_index()
summary['period_str'] = summary['period'].astype(str)
axes[1].plot(summary['period_str'], summary['mean'], 'o-', label='Mean')
axes[1].plot(summary['period_str'], summary['median'], 's--', label='Median')
axes[1].fill_between(summary['period_str'],
summary['mean'] - summary['std'],
summary['mean'] + summary['std'],
alpha=0.2)
axes[1].tick_params(axis='x', rotation=45)
axes[1].set_title(f'{feature_col} Summary Statistics')
axes[1].legend()
plt.tight_layout()
plt.show()
# Check if lab values drift over time
plot_feature_drift(df, 'admission_date', 'creatinine', freq='Q')

5.4.6 Visualization Best Practices for Clinical Data
1. Always show sample sizes: A difference between groups is meaningless without knowing n.
2. Use clinically meaningful scales: Plot lab values on their clinically relevant range, not auto-scaled.
def plot_with_clinical_range(df, col, normal_range, figsize=(8, 5)):
"""Plot distribution with clinical reference range."""
fig, ax = plt.subplots(figsize=figsize)
data = df[col].dropna()
sns.histplot(data, kde=True, ax=ax, alpha=0.7)
# Add normal range
ax.axvline(normal_range[0], color='green', linestyle='--', label='Normal range')
ax.axvline(normal_range[1], color='green', linestyle='--')
ax.axvspan(normal_range[0], normal_range[1], alpha=0.1, color='green')
# Calculate % outside normal
below = (data < normal_range[0]).mean() * 100
above = (data > normal_range[1]).mean() * 100
ax.set_title(f'{col}\n{below:.1f}% below, {above:.1f}% above normal range')
ax.legend()
plt.tight_layout()
plt.show()
# Plot creatinine with normal range
plot_with_clinical_range(df, 'creatinine', normal_range=(0.7, 1.3))

3. Consider colorblind-friendly palettes: About 8% of men have a color vision deficiency. Avoid encoding meaning in red-green contrast alone.
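One low-effort option is seaborn's built-in colorblind-safe palette:

sns.set_palette('colorblind')  # applies to all subsequent seaborn plots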
4. Save plots for documentation: Every EDA plot should be saved for your analysis report.
def save_eda_plot(fig, name, output_dir='eda_figures'):
"""Save plot with consistent naming and quality."""
import os
os.makedirs(output_dir, exist_ok=True)
filepath = os.path.join(output_dir, f'{name}.png')
fig.savefig(filepath, dpi=150, bbox_inches='tight')
    print(f"Saved: {filepath}")

5.5 Clinical Data Quality Issues
Clinical Context: Your mortality prediction model has an AUC of 0.99. Suspicious. Investigation reveals that “discharge disposition = expired” was included as a feature—information that’s only available after the patient dies. This is data leakage. It’s obvious in hindsight, but these issues are pervasive in clinical datasets.
5.5.1 Temporal Issues
Problem 1: Future Information Leakage
Variables that aren’t available at prediction time but appear in the dataset:
def check_temporal_leakage(df, outcome_time_col, feature_time_cols):
"""Check for features that occur after the outcome."""
issues = []
for feat_col in feature_time_cols:
        # Find cases where the feature timestamp is after the outcome
        feat_times = pd.to_datetime(df[feat_col], errors='coerce')
        outcome_times = pd.to_datetime(df[outcome_time_col], errors='coerce')
        n_leaks = (feat_times > outcome_times).sum()
if n_leaks > 0:
pct = n_leaks / len(df) * 100
issues.append(f"⚠️ {feat_col}: {n_leaks:,} rows ({pct:.1f}%) have feature AFTER outcome")
return issues
# Check feature timestamps against outcome time
outcome_time = 'sepsis_onset_time'
feature_times = ['last_vitals_time', 'last_lab_time', 'last_med_time']
issues = check_temporal_leakage(df, outcome_time, feature_times)

Problem 2: Backdated Documentation
Clinical notes and orders are sometimes documented hours after the clinical event:
- Admission orders entered after initial workup complete
- Notes dictated at shift end, timestamped then
- Verbal orders documented retroactively
Mitigation: Use the earlier of documentation time and clinical event time, or add a safety buffer (e.g., only use data from 4+ hours before prediction time).
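A minimal sketch of the buffer approach (the timestamp column names here are assumptions, not the dataset's actual schema):

def apply_safety_buffer(df, event_time_col, prediction_time_col, buffer_hours=4):
    """Keep rows whose event precedes prediction time by at least the buffer."""
    event = pd.to_datetime(df[event_time_col])
    prediction = pd.to_datetime(df[prediction_time_col])
    gap_hours = (prediction - event).dt.total_seconds() / 3600
    return df[gap_hours >= buffer_hours]

# Only use measurements documented 4+ hours before prediction time
df_safe = apply_safety_buffer(df, 'measurement_time', 'prediction_time')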
Problem 3: Timestamp Inconsistencies
def check_timestamp_quality(df, time_col):
"""Check timestamp column for common issues."""
issues = []
# Parse timestamps
times = pd.to_datetime(df[time_col], errors='coerce')
# Check for unparseable timestamps
n_invalid = times.isna().sum() - df[time_col].isna().sum()
if n_invalid > 0:
issues.append(f"⚠️ {n_invalid} timestamps couldn't be parsed")
# Check for suspicious patterns
hours = times.dt.hour
# Excess midnight timestamps (often default values)
midnight_pct = (hours == 0).mean() * 100
if midnight_pct > 20:
issues.append(f"⚠️ {midnight_pct:.1f}% of timestamps are midnight (default value?)")
# Check date range
min_date, max_date = times.min(), times.max()
issues.append(f"Date range: {min_date} to {max_date}")
# Future dates
n_future = (times > pd.Timestamp.now()).sum()
if n_future > 0:
issues.append(f"⚠️ {n_future} timestamps are in the future")
    return issues

5.5.2 Impossible Values
Clinical variables have physiologically plausible ranges. Values outside these ranges indicate data quality issues:
# Physiologically plausible ranges for common clinical variables
CLINICAL_RANGES = {
'heart_rate': (20, 300), # bpm
'systolic_bp': (40, 300), # mmHg
'diastolic_bp': (20, 200), # mmHg
'temperature_c': (25, 45), # Celsius
'respiratory_rate': (4, 60), # breaths/min
'oxygen_saturation': (50, 100), # percent
'sodium': (100, 180), # mEq/L
'potassium': (1.5, 9.0), # mEq/L
'creatinine': (0.1, 30), # mg/dL
'glucose': (20, 1500), # mg/dL
'hemoglobin': (2, 25), # g/dL
'wbc': (0.1, 200), # 10^9/L
'platelets': (5, 1500), # 10^9/L
'age': (0, 120), # years
'weight_kg': (1, 400), # kg
'height_cm': (40, 250), # cm
}
def check_clinical_ranges(df, ranges=CLINICAL_RANGES):
"""Flag values outside physiologically plausible ranges."""
issues = []
for col, (min_val, max_val) in ranges.items():
if col in df.columns:
below = (df[col] < min_val).sum()
above = (df[col] > max_val).sum()
if below > 0 or above > 0:
issues.append({
'column': col,
'valid_range': f"[{min_val}, {max_val}]",
'below_range': below,
'above_range': above,
'pct_invalid': (below + above) / len(df) * 100
})
return pd.DataFrame(issues)
# Check ranges
range_issues = check_clinical_ranges(df)
print(range_issues)

5.5.3 Coding Artifacts
ICD Code Changes
Diagnosis codes change over time (ICD-9 to ICD-10 transition in 2015). Models trained on one coding system may fail on another.
def detect_icd_version(codes):
"""Detect whether codes appear to be ICD-9 or ICD-10."""
    # ICD-10 codes start with a letter (caveat: ICD-9 E- and V-codes do too,
    # so treat this as a rough heuristic)
    icd10_pattern = codes.str.match(r'^[A-Z]', na=False)
pct_icd10 = icd10_pattern.mean() * 100
if pct_icd10 > 90:
return "ICD-10"
elif pct_icd10 < 10:
return "ICD-9"
else:
return f"Mixed ({pct_icd10:.0f}% ICD-10)"
# Check diagnosis code format
print(f"Primary diagnosis format: {detect_icd_version(df['primary_dx'])}")

Documentation Practice Changes
Changes in incentives, guidelines, or EHR systems alter what gets documented:
- Sepsis coding increased dramatically after SEP-1 quality measure introduction
- Problem list completeness varies by EHR implementation
- Note templates change over time
Mitigation: Plot outcome rates and feature distributions over time. Sudden changes suggest artifacts, not clinical reality.
def plot_temporal_trends(df, date_col, outcome_col, freq='M'):
"""Plot outcome rate over time to detect artifacts."""
df_copy = df.copy()
df_copy['period'] = pd.to_datetime(df_copy[date_col]).dt.to_period(freq)
trends = df_copy.groupby('period')[outcome_col].agg(['mean', 'count'])
trends.columns = ['rate', 'n']
fig, ax1 = plt.subplots(figsize=(12, 5))
ax1.plot(trends.index.astype(str), trends['rate'], 'b-', linewidth=2)
ax1.set_xlabel('Time Period')
ax1.set_ylabel('Outcome Rate', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')
ax1.tick_params(axis='x', rotation=45)
ax2 = ax1.twinx()
ax2.bar(trends.index.astype(str), trends['n'], alpha=0.3, color='gray')
ax2.set_ylabel('Sample Size', color='gray')
plt.title('Outcome Rate and Sample Size Over Time')
plt.tight_layout()
plt.show()5.5.4 Label Quality
The outcome variable is data too. Common label problems:
Retrospective Labeling: Outcomes assigned using information not available at the time (death certificates, follow-up visits).
Billing-Based Labels: ICD codes reflect billing incentives, not clinical reality. “Sepsis” coding increased 30% after financial penalties for sepsis mortality were introduced.
Annotation Disagreement: When labels come from human annotators, inter-rater reliability matters.
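A chance-corrected agreement statistic such as Cohen's kappa quantifies inter-rater reliability. A minimal sketch, assuming two hypothetical annotator columns on a double-annotated subset:

from sklearn.metrics import cohen_kappa_score

# Assumed columns: two annotators labeled an overlapping subset of charts
overlap = df.dropna(subset=['label_annotator_a', 'label_annotator_b'])
kappa = cohen_kappa_score(overlap['label_annotator_a'], overlap['label_annotator_b'])
print(f"Cohen's kappa: {kappa:.2f} (n={len(overlap):,} double-annotated charts)")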
def check_label_quality(df, label_col):
"""Basic label quality checks."""
issues = []
# Class distribution
dist = df[label_col].value_counts(normalize=True)
issues.append(f"Label distribution:\n{dist}")
# Missing labels
n_missing = df[label_col].isna().sum()
if n_missing > 0:
issues.append(f"⚠️ {n_missing} missing labels ({n_missing/len(df)*100:.1f}%)")
# Unexpected values
expected = {0, 1} if df[label_col].dtype in ['int64', 'float64'] else None
if expected:
unexpected = set(df[label_col].dropna().unique()) - expected
if unexpected:
issues.append(f"⚠️ Unexpected label values: {unexpected}")
    return issues

5.6 Feature Engineering from Clinical Variables
Clinical Context: Raw lab values tell part of the story. But a creatinine of 1.5 means something different if the patient’s baseline is 1.4 versus 0.8. Clinical feature engineering captures domain knowledge that raw variables miss.
5.6.1 Temporal Features
Change from Baseline
def calculate_delta_from_baseline(df, value_col, patient_col, time_col,
baseline_window_hours=24):
"""Calculate change from baseline for each measurement."""
    df = df.copy()
    df[time_col] = pd.to_datetime(df[time_col])
    df = df.sort_values([patient_col, time_col])
    results = []
    for patient_id, group in df.groupby(patient_col):
        # Baseline = median of values in the first `baseline_window_hours` of data
        t0 = group[time_col].iloc[0]
        in_window = (group[time_col] - t0).dt.total_seconds() / 3600 <= baseline_window_hours
        baseline = group.loc[in_window, value_col].median()
group_result = group.copy()
group_result[f'{value_col}_delta'] = group[value_col] - baseline
group_result[f'{value_col}_pct_change'] = (group[value_col] - baseline) / baseline * 100
results.append(group_result)
    return pd.concat(results)

Trends and Slopes
from scipy import stats
def calculate_trend(df, value_col, patient_col, time_col, window_hours=6):
    """Calculate trend (slope, units per hour) over a trailing time window."""
    df = df.copy()
    df[time_col] = pd.to_datetime(df[time_col])
    df = df.sort_values([patient_col, time_col])

    def slope_of_window(window):
        # `window` is a Series of values indexed by timestamp
        if len(window) < 2:
            return np.nan
        hours = (window.index - window.index[0]).total_seconds() / 3600
        if hours.max() == 0:
            return np.nan  # all measurements at the same instant
        slope, _, _, _, _ = stats.linregress(hours, window.values)
        return slope

    def per_patient(group):
        # Time-based rolling windows require a monotonic datetime index
        series = group.set_index(time_col)[value_col]
        return series.rolling(f'{window_hours}h').apply(slope_of_window, raw=False)

    return df.groupby(patient_col, group_keys=False).apply(per_patient)

Time Since Event
def time_since_event(df, event_col, current_time_col):
"""Calculate hours since event occurred."""
event_time = pd.to_datetime(df[event_col])
current_time = pd.to_datetime(df[current_time_col])
return (current_time - event_time).dt.total_seconds() / 3600
# Example: hours since admission, last medication, etc.
df['hours_since_admission'] = time_since_event(df, 'admission_time', 'current_time')
df['hours_since_last_vasopressor'] = time_since_event(df, 'last_vasopressor_time', 'current_time')

5.6.2 Aggregation Features
When multiple measurements exist, summarize meaningfully:
def aggregate_measurements(df, value_col, patient_col, time_col, window_hours=24):
    """Create aggregated features from repeated measurements.

    Note: restrict df to the desired window (e.g., the first `window_hours`
    of the stay) before calling this function.
    """
    df = df.sort_values([patient_col, time_col])
    aggregated = df.groupby(patient_col)[value_col].agg(
        ['min', 'max', 'mean', 'std', 'first', 'last', 'count'])
    # Range captures the worst within-window deviation
    aggregated['range'] = aggregated['max'] - aggregated['min']
    aggregated.columns = [f'{value_col}_{c}' for c in aggregated.columns]
    return aggregated

5.6.3 Clinical Severity Scores
Combine variables into validated clinical scores:
def calculate_sofa_score(row):
"""Calculate SOFA (Sequential Organ Failure Assessment) score.
Simplified version - real implementation needs more variables.
"""
score = 0
# Respiratory (PaO2/FiO2)
pf_ratio = row.get('pao2_fio2_ratio', np.nan)
if pd.notna(pf_ratio):
if pf_ratio < 100:
score += 4
elif pf_ratio < 200:
score += 3
elif pf_ratio < 300:
score += 2
elif pf_ratio < 400:
score += 1
# Renal (Creatinine)
creat = row.get('creatinine', np.nan)
if pd.notna(creat):
if creat >= 5.0:
score += 4
elif creat >= 3.5:
score += 3
elif creat >= 2.0:
score += 2
elif creat >= 1.2:
score += 1
    # Coagulation (platelets); named to avoid shadowing matplotlib's plt
    platelets = row.get('platelets', np.nan)
    if pd.notna(platelets):
        if platelets < 20:
            score += 4
        elif platelets < 50:
            score += 3
        elif platelets < 100:
            score += 2
        elif platelets < 150:
            score += 1
# Add liver, cardiovascular, neurological components...
return score
df['sofa_score'] = df.apply(calculate_sofa_score, axis=1)

5.6.4 Interaction Features
Some clinical relationships are multiplicative:
# Shock Index: HR / SBP (elevated suggests shock)
df['shock_index'] = df['heart_rate'] / df['systolic_bp']
# BMI from height and weight
df['bmi'] = df['weight_kg'] / (df['height_cm'] / 100) ** 2
# Creatinine clearance (Cockcroft-Gault), often used as a rough GFR proxy
def estimate_gfr(row):
    """Cockcroft-Gault equation: estimated creatinine clearance (mL/min)."""
age = row['age']
weight = row['weight_kg']
creat = row['creatinine']
is_female = row['sex'] == 'F'
gfr = ((140 - age) * weight) / (72 * creat)
if is_female:
gfr *= 0.85
return gfr
df['egfr'] = df.apply(estimate_gfr, axis=1)

5.6.5 Categorical Encoding for Clinical Variables
# Age groups (clinically meaningful)
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 18, 45, 65, 80, 120],
                         labels=['pediatric', 'young_adult', 'middle_aged', 'elderly', 'very_elderly'],
                         include_lowest=True)  # include_lowest keeps age 0
# BMI categories
df['bmi_category'] = pd.cut(df['bmi'],
bins=[0, 18.5, 25, 30, 35, 100],
labels=['underweight', 'normal', 'overweight', 'obese', 'severely_obese'])
# Time of day (captures staffing, workflow patterns)
df['hour_of_day'] = pd.to_datetime(df['admission_time']).dt.hour
df['time_of_day'] = pd.cut(df['hour_of_day'],
bins=[0, 6, 12, 18, 24],
labels=['night', 'morning', 'afternoon', 'evening'],
include_lowest=True)
# Weekend vs weekday
df['is_weekend'] = pd.to_datetime(df['admission_time']).dt.dayofweek >= 5

5.7 Handling Class Imbalance
Clinical Context: In your sepsis dataset, only 3% of patients develop sepsis. If your model predicts “no sepsis” for everyone, it achieves 97% accuracy. This is useless. Class imbalance requires special handling.
5.7.1 Quantifying Imbalance
def describe_imbalance(y):
"""Describe class distribution and imbalance ratio."""
counts = pd.Series(y).value_counts()
props = counts / len(y) * 100
print("Class distribution:")
for cls in counts.index:
print(f" Class {cls}: {counts[cls]:,} ({props[cls]:.1f}%)")
imbalance_ratio = counts.max() / counts.min()
print(f"\nImbalance ratio: {imbalance_ratio:.1f}:1")
return imbalance_ratio
imbalance_ratio = describe_imbalance(df['outcome'])

5.7.2 Strategy 1: Choose Appropriate Metrics
Accuracy is meaningless for imbalanced data. Use:
- AUROC: Overall ranking ability
- AUPRC: Precision-recall (emphasizes minority class)
- Sensitivity at fixed specificity: Clinically relevant operating points
- F1 score: Harmonic mean of precision and recall
from sklearn.metrics import (roc_auc_score, average_precision_score,
f1_score, recall_score, precision_score)
def evaluate_imbalanced(y_true, y_prob, threshold=0.5):
"""Evaluation metrics for imbalanced classification."""
y_pred = (y_prob >= threshold).astype(int)
metrics = {
'auroc': roc_auc_score(y_true, y_prob),
'auprc': average_precision_score(y_true, y_prob),
'f1': f1_score(y_true, y_pred),
'sensitivity': recall_score(y_true, y_pred),
'specificity': recall_score(y_true, y_pred, pos_label=0),
'precision': precision_score(y_true, y_pred),
}
    return metrics

5.7.3 Strategy 2: Resampling (Use Carefully)
Undersampling: Reduce majority class. Simple but loses information.
Oversampling: Increase minority class. Risk of overfitting.
SMOTE: Generate synthetic minority examples. Better than naive oversampling, but can create unrealistic examples.
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline
# IMPORTANT: Only apply resampling to training data, never test data
# Use within cross-validation, not before splitting
from sklearn.model_selection import StratifiedKFold
def evaluate_with_resampling(X, y, model, resampler=None, n_splits=5):
"""Cross-validation with resampling applied correctly."""
cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in cv.split(X, y):
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
# Apply resampling only to training data
if resampler:
X_train_resampled, y_train_resampled = resampler.fit_resample(X_train, y_train)
else:
X_train_resampled, y_train_resampled = X_train, y_train
# Train and evaluate
model.fit(X_train_resampled, y_train_resampled)
y_prob = model.predict_proba(X_test)[:, 1]
scores.append(roc_auc_score(y_test, y_prob))
    return np.mean(scores), np.std(scores)
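For example, pairing the helper with SMOTE (a sketch; X and y are assumed to be a pandas feature matrix and binary label Series defined elsewhere):

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200, random_state=42)
auc_mean, auc_std = evaluate_with_resampling(X, y, model, resampler=SMOTE(random_state=42))
print(f"AUROC with SMOTE: {auc_mean:.3f} ± {auc_std:.3f}")

5.7.4 Strategy 3: Cost-Sensitive Learning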
Penalize errors on the minority class more heavily:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
# Option 1: Class weights
# 'balanced' automatically weights inversely proportional to frequency
model = LogisticRegression(class_weight='balanced')
# Option 2: Custom weights
# Weight the minority class more heavily
weight_ratio = (y == 0).sum() / (y == 1).sum()
model = LogisticRegression(class_weight={0: 1, 1: weight_ratio})
# Option 3: Sample weights during fit
sample_weights = np.where(y == 1, weight_ratio, 1.0)
model.fit(X, y, sample_weight=sample_weights)

5.7.5 Strategy 4: Threshold Optimization
The default 0.5 threshold is arbitrary. Choose based on clinical requirements:
def find_threshold_for_sensitivity(y_true, y_prob, target_sensitivity=0.90):
"""Find threshold that achieves target sensitivity."""
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
# Find threshold where sensitivity (tpr) >= target
valid_idx = np.where(tpr >= target_sensitivity)[0]
    if len(valid_idx) == 0:
        return thresholds[-1]  # lowest threshold (maximum sensitivity) if target unreachable
# Among valid thresholds, find one with best specificity
best_idx = valid_idx[np.argmin(fpr[valid_idx])]
return thresholds[best_idx]
# Find threshold for 90% sensitivity
threshold_90sens = find_threshold_for_sensitivity(y_test, y_prob, 0.90)
print(f"Threshold for 90% sensitivity: {threshold_90sens:.3f}")

5.8 Putting It All Together: EDA Checklist
Before modeling any clinical dataset:
5.8.1 Data Structure
- Rows, columns, dtypes, and memory footprint documented
- Duplicate rows and constant columns checked
- High-cardinality columns flagged as possible identifiers
5.8.2 Missing Data
- Missingness rate quantified per column
- Co-missingness patterns visualized
- Likely mechanism (MCAR/MAR/MNAR) considered; indicators added for key variables
5.8.3 Data Quality
- Numeric values checked against physiologically plausible ranges
- Timestamps validated (parseable, no future dates, no midnight spikes)
- Features checked for temporal leakage relative to the outcome time
5.8.4 Labels
- Label definition and provenance understood (billing codes? retrospective?)
- Class distribution and missing labels quantified
- Inter-rater reliability assessed where labels are annotated
5.8.5 Features
- Distributions inspected for skew, spikes, and bimodality
- Highly correlated feature pairs identified
- Clinically meaningful derived features engineered
5.8.6 Generalization
- Outcome rates and feature distributions plotted over time
- Coding-system transitions (e.g., ICD-9 to ICD-10) checked
- Distributions compared across key patient subgroups
5.9 Chapter Summary
Exploratory data analysis is not optional in clinical AI—it’s where you discover the problems that would otherwise sink your model in deployment.
Key principles:
- Garbage in, garbage out: No model can compensate for data quality issues
- Missingness is information: In clinical data, what wasn’t measured tells you something
- Timestamps lie: Documentation time ≠ clinical event time
- Labels are imperfect: Billing codes and retrospective annotations have limitations
- Class imbalance is the norm: Most clinical outcomes are rare; plan accordingly
Critical practices:
- Profile every dataset before modeling
- Visualize distributions and missingness patterns
- Check for impossible values and temporal inconsistencies
- Create missing indicators rather than naive imputation
- Engineer features that capture clinical meaning
- Choose metrics appropriate for imbalanced data
Red flags that signal data problems:
- Implausibly high model performance (AUC > 0.95 on hard problems)
- Features with unexplainably high importance
- Performance that doesn’t generalize to new time periods
- Results that contradict clinical intuition
The time spent on EDA is never wasted. Every hour invested in understanding your data saves days of debugging models that learned the wrong things.
5.10 Exercises
1. Missing Data Investigation: Load a clinical dataset (MIMIC-III or similar). For five key lab variables, calculate the missingness rate overall and stratified by patient outcome. Is missingness informative?
2. Temporal Leakage Hunt: Using the same dataset, identify three features that might contain information from after the prediction time. How would you modify them to prevent leakage?
3. Value Range Audit: Write a function that checks all numeric columns against physiologically plausible ranges. Apply it to a clinical dataset and report what you find.
4. Feature Engineering: Starting with raw vital signs (heart rate, blood pressure, temperature), create at least five derived features that capture clinical meaning (trends, ratios, deviations from normal).
5. Imbalance Simulation: Create a synthetic dataset with 2% positive rate. Train a classifier with and without class weighting. Compare accuracy, AUROC, and AUPRC. Which metrics are useful?
6. EDA Report: Conduct a full exploratory data analysis on a clinical dataset of your choice. Document your findings in a 3-5 page report covering data quality, missingness, distributions, and potential modeling challenges.