5 Exploratory Data Analysis for Clinical AI
Clinical Context: A team downloads a public EHR dataset and immediately trains a sepsis prediction model. The model achieves 0.95 AUC. They’re excited—until deployment, where performance drops to 0.68. The problem wasn’t the model. The problem was the data: 30% of “sepsis” labels were assigned retrospectively based on death certificates, timestamps were corrupted for weekend admissions, and lab values had impossible outliers from data entry errors. No amount of sophisticated modeling can fix garbage data. This chapter teaches you to understand your data before you model it.
Machine learning textbooks often jump from “here’s your dataset” to “here’s how to train a model.” But in clinical AI, the data is the hard part. Electronic health records were designed for billing and documentation, not machine learning. Research datasets have their own quirks. Understanding your data—its structure, its limitations, its biases—is the most important step in any clinical AI project.
This chapter covers exploratory data analysis (EDA) for clinical applications: systematic workflows for understanding data, handling missing values, identifying quality issues, and engineering features that capture clinical meaning.
5.1 The EDA Mindset
5.1.1 Why EDA Matters More in Healthcare
In many ML applications, data is relatively clean: images are images, text is text. Clinical data is different:
- EHR data is a byproduct: It captures what was documented for patient care, not what’s ideal for research
- Missingness is informative: A missing lab value often means “not suspected”
- Timestamps lie: Orders backdated, results posted late, timezones inconsistent
- Labels are imperfect: Diagnoses are billing codes, not ground truth
The consequence: models learn from data artifacts as much as clinical signal. A model might learn that patients with missing potassium values are healthier (because potassium wasn’t checked) rather than learning anything about potassium itself.
5.1.2 The EDA Workflow
Before any modeling, systematically answer:
- What do I have? — Rows, columns, data types, memory footprint
- What’s missing? — Patterns, mechanisms, implications
- What does it look like? — Distributions, outliers, relationships
- What’s wrong? — Impossible values, temporal inconsistencies, coding errors
- What do I need to create? — Feature engineering, transformations
This isn’t a one-time step. Return to EDA whenever model performance surprises you.
5.2 Systematic Data Profiling
Clinical Context: You’ve received a dataset of 50,000 ICU admissions with 200 variables. Where do you start?
5.2.1 First Look: Structure and Size
import pandas as pd
import numpy as np

# Load data
df = pd.read_csv("icu_admissions.csv")

# Basic structure
print(f"Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"Memory: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
print("\nColumn types:")
print(df.dtypes.value_counts())

# First few rows (check for obvious issues)
df.head()
5.2.2 Summary Statistics by Data Type
Different variable types need different summaries:
def profile_dataframe(df):
"""Generate comprehensive data profile."""
profile = []
for col in df.columns:
stats = {
'column': col,
'dtype': str(df[col].dtype),
'n_missing': df[col].isna().sum(),
'pct_missing': df[col].isna().mean() * 100,
'n_unique': df[col].nunique()
}
        if pd.api.types.is_numeric_dtype(df[col]):  # covers int32/64, float32/64, etc.
# Numeric summaries
stats.update({
'mean': df[col].mean(),
'std': df[col].std(),
'min': df[col].min(),
'p25': df[col].quantile(0.25),
'median': df[col].median(),
'p75': df[col].quantile(0.75),
'max': df[col].max()
})
elif df[col].dtype == 'object':
# Categorical summaries
top_val = df[col].value_counts().index[0] if len(df[col].dropna()) > 0 else None
stats.update({
'top_value': top_val,
                'top_freq': df[col].value_counts().iloc[0] if top_val is not None else 0
})
profile.append(stats)
return pd.DataFrame(profile)
# Generate profile
profile = profile_dataframe(df)
print(profile.to_string())

5.2.3 Red Flags to Check Immediately
def check_red_flags(df):
"""Check for common data quality issues."""
issues = []
# Duplicate rows
n_dupes = df.duplicated().sum()
if n_dupes > 0:
issues.append(f"⚠️ {n_dupes:,} duplicate rows ({n_dupes/len(df)*100:.1f}%)")
# Columns with single value
for col in df.columns:
if df[col].nunique() == 1:
issues.append(f"⚠️ Column '{col}' has only one unique value")
# High cardinality categoricals (might be IDs)
for col in df.select_dtypes(include='object').columns:
if df[col].nunique() > len(df) * 0.5:
issues.append(f"⚠️ Column '{col}' has {df[col].nunique():,} unique values (possible ID?)")
# Columns that are mostly missing
for col in df.columns:
pct_missing = df[col].isna().mean()
if pct_missing > 0.9:
issues.append(f"⚠️ Column '{col}' is {pct_missing*100:.1f}% missing")
# Numeric columns with suspicious ranges
    for col in df.select_dtypes(include=[np.number]).columns:
if df[col].min() < -1e10 or df[col].max() > 1e10:
issues.append(f"⚠️ Column '{col}' has extreme values (min={df[col].min():.2e}, max={df[col].max():.2e})")
return issues
issues = check_red_flags(df)
for issue in issues:
    print(issue)

5.3 Missing Data in Clinical Settings
Clinical Context: Your sepsis dataset has 40% missing lactate values. Should you drop those patients? Impute the values? The answer depends on WHY the data is missing—and in clinical data, missingness is rarely random.
5.3.1 The Missing Data Taxonomy
Statistical theory distinguishes three mechanisms:
MCAR (Missing Completely at Random): Missingness is independent of both observed and unobserved data.
Clinical example: A lab machine randomly malfunctions, losing 5% of results regardless of patient characteristics.
Reality check: True MCAR is rare in clinical data.
MAR (Missing at Random): Missingness depends on observed data but not on the missing values themselves.
Clinical example: Sicker patients (observable via vital signs) get more labs. Missingness depends on severity, but given severity, the actual lab value doesn’t affect whether it’s measured.
Reality check: Often assumed but hard to verify.
MNAR (Missing Not at Random): Missingness depends on the unobserved values themselves.
Clinical example: HbA1c is only ordered when diabetes is suspected. Patients without diabetes suspicion don’t get tested—their (likely normal) HbA1c values are missing because they’re normal. The missing values are systematically different from observed values.
Reality check: This is the default assumption for clinical data.
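These mechanisms cannot be distinguished from the observed data alone, but a small simulation makes the definitions concrete. A minimal sketch on synthetic data (not the chapter's dataset): MCAR leaves the observed mean unbiased, while MAR and MNAR both shift it; only the MAR shift can be corrected by conditioning on observed severity.

rng = np.random.default_rng(42)
n = 10_000
severity = rng.normal(0, 1, n)  # observable illness-severity proxy
lactate = 2.0 + 1.5 * severity + rng.normal(0, 0.5, n)  # true values

# MCAR: missingness independent of everything
mcar = rng.random(n) < 0.30
# MAR: probability of missingness depends only on observed severity
mar = rng.random(n) < 1 / (1 + np.exp(2 * severity))
# MNAR: low (unobserved) lactate itself tends to go unmeasured
mnar = rng.random(n) < 1 / (1 + np.exp(2 * (lactate - 2)))

print(f"True mean lactate: {lactate.mean():.2f}")
for name, mask in [('MCAR', mcar), ('MAR', mar), ('MNAR', mnar)]:
    print(f"{name}: observed mean = {lactate[~mask].mean():.2f}")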
5.3.2 Informative Missingness in EHR Data
In clinical practice, the decision to measure something is itself clinical information:
| Variable | Why It Might Be Missing | What Missingness Signals |
|---|---|---|
| Lactate | Not suspected of sepsis | Lower acuity |
| Troponin | No chest pain | Lower cardiac risk |
| Blood culture | No infection concern | Lower infection probability |
| PT/INR | Not on anticoagulation | Different medication profile |
| HIV test | Not indicated | Different risk profile |
The missingness pattern IS a feature. A model that imputes lactate to the population mean loses this signal.
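A quick check for informative missingness is to compare the outcome rate between patients with and without a measurement. A minimal sketch, assuming a binary `mortality` column (an assumption about this dataset's schema):

def missingness_outcome_association(df, feature_cols, outcome_col):
    """Compare outcome rate when a feature is missing vs. observed."""
    rows = []
    for col in feature_cols:
        missing = df[col].isna()
        rows.append({
            'feature': col,
            'pct_missing': missing.mean() * 100,
            'outcome_rate_missing': df.loc[missing, outcome_col].mean(),
            'outcome_rate_observed': df.loc[~missing, outcome_col].mean(),
        })
    return pd.DataFrame(rows)

print(missingness_outcome_association(df, ['lactate', 'troponin'], 'mortality'))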
5.3.3 Visualizing Missing Data Patterns
import matplotlib.pyplot as plt
import seaborn as sns
def plot_missing_matrix(df, max_cols=30):
"""Visualize missing data patterns."""
# Select columns with some missingness
missing_cols = df.columns[df.isna().any()].tolist()[:max_cols]
if not missing_cols:
print("No missing data!")
return
# Create missing indicator matrix
missing_matrix = df[missing_cols].isna().astype(int)
# Sort by total missingness
col_order = missing_matrix.sum().sort_values(ascending=False).index
missing_matrix = missing_matrix[col_order]
# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Bar plot of missingness by column
missing_pct = missing_matrix.mean() * 100
axes[0].barh(range(len(missing_pct)), missing_pct.values)
axes[0].set_yticks(range(len(missing_pct)))
axes[0].set_yticklabels(missing_pct.index)
axes[0].set_xlabel('Percent Missing')
axes[0].set_title('Missing Data by Column')
# Heatmap of co-missingness
    co_missing = missing_matrix.corr()  # correlation between columns' missingness
sns.heatmap(co_missing, ax=axes[1], cmap='RdBu_r', center=0,
xticklabels=True, yticklabels=True)
axes[1].set_title('Co-Missingness Correlation')
plt.tight_layout()
plt.show()
plot_missing_matrix(df)

5.3.4 Strategies for Missing Clinical Data
1. Missingness Indicators (Recommended for clinical ML)
Create binary features indicating whether each variable was observed:
def add_missing_indicators(df, columns):
"""Add binary missing indicators for specified columns."""
df_out = df.copy()
for col in columns:
df_out[f'{col}_missing'] = df[col].isna().astype(int)
return df_out
# Add indicators for key labs
lab_cols = ['lactate', 'troponin', 'wbc', 'creatinine']
df = add_missing_indicators(df, lab_cols)

2. Simple Imputation with Missing Indicators
Impute to allow models to use observed values, but keep indicators:
from sklearn.impute import SimpleImputer
def impute_with_indicators(df, columns, strategy='median'):
"""Impute missing values and add indicators."""
df_out = df.copy()
for col in columns:
# Add indicator first
df_out[f'{col}_missing'] = df[col].isna().astype(int)
# Then impute
imputer = SimpleImputer(strategy=strategy)
df_out[col] = imputer.fit_transform(df_out[[col]])
    return df_out

3. Imputation Within Clinical Subgroups
If you must impute, do it within clinically meaningful groups:
def impute_by_group(df, col, group_col, strategy='median'):
"""Impute missing values within groups."""
df_out = df.copy()
# Group-specific imputation
for group in df[group_col].unique():
mask = df[group_col] == group
group_median = df.loc[mask, col].median()
df_out.loc[mask & df[col].isna(), col] = group_median
# Fill remaining with global median
global_median = df[col].median()
df_out[col] = df_out[col].fillna(global_median)
return df_out
# Impute creatinine within age groups
df = impute_by_group(df, 'creatinine', 'age_group')

4. Multiple Imputation (For Inference)
For statistical inference (not prediction), multiple imputation properly accounts for uncertainty:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Multiple imputation (sample_posterior=True makes each run a distinct draw)
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
n_imputations = 5
imputed_datasets = []
for i in range(n_imputations):
    imputer = IterativeImputer(random_state=i, max_iter=10, sample_posterior=True)
imputed = imputer.fit_transform(df[numeric_cols])
imputed_datasets.append(pd.DataFrame(imputed, columns=numeric_cols))
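# Analysis should be run on each imputed dataset, then pooled. A hedged
# sketch of pooling with Rubin's rules, assuming the quantity of interest
# is the mean of a hypothetical 'lactate' column:
estimates = np.array([d['lactate'].mean() for d in imputed_datasets])
within_var = np.mean([d['lactate'].var(ddof=1) / len(d) for d in imputed_datasets])
between_var = estimates.var(ddof=1)
total_var = within_var + (1 + 1 / n_imputations) * between_var
print(f"Pooled mean: {estimates.mean():.2f} (SE {np.sqrt(total_var):.2f})")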
5.3.5 What NOT to Do
- Don’t drop patients with missing data (unless MCAR, which is rare)
- Don’t impute without indicators (loses informative missingness signal)
- Don’t assume MAR without justification (MNAR is common in clinical data)
- Don’t impute the outcome variable (that’s the thing you’re trying to predict)
5.4 Visualization Techniques for Clinical EDA
Clinical Context: A table tells you 15% of creatinine values are missing. A visualization shows you they’re missing specifically for patients admitted on weekends, pointing to a staffing or workflow issue. Effective visualization reveals patterns that summary statistics hide.
5.4.1 Distribution Plots
Understanding variable distributions is essential before modeling. Clinical variables often have non-normal distributions with meaningful patterns.
Histograms and Density Plots
import matplotlib.pyplot as plt
import seaborn as sns
def plot_distributions(df, columns, ncols=3, figsize=(15, 4)):
"""Plot distributions for multiple numeric columns."""
nrows = (len(columns) + ncols - 1) // ncols
fig, axes = plt.subplots(nrows, ncols, figsize=(figsize[0], figsize[1] * nrows))
    axes = np.atleast_1d(axes).ravel()  # uniform handling of single and multiple Axes
for i, col in enumerate(columns):
ax = axes[i]
data = df[col].dropna()
# Histogram with KDE overlay
sns.histplot(data, kde=True, ax=ax, color='steelblue', alpha=0.7)
ax.axvline(data.mean(), color='red', linestyle='--', label=f'Mean: {data.mean():.1f}')
ax.axvline(data.median(), color='green', linestyle='--', label=f'Median: {data.median():.1f}')
ax.set_title(f'{col}\n(n={len(data):,}, missing={df[col].isna().sum():,})')
ax.legend(fontsize=8)
# Hide unused subplots
for j in range(i + 1, len(axes)):
axes[j].set_visible(False)
plt.tight_layout()
plt.show()
# Plot key clinical variables
vital_cols = ['heart_rate', 'systolic_bp', 'temperature_c', 'respiratory_rate']
plot_distributions(df, vital_cols)

Identifying Distribution Shapes
Clinical variables often show characteristic patterns:
- Bimodal: May indicate two patient populations (e.g., temperature with fever vs. normal)
- Right-skewed: Common for lab values, lengths of stay, costs
- Truncated: May indicate data collection limits or physiological bounds
- Spike at specific value: Often indicates default or imputed values
def detect_distribution_anomalies(series, name="Variable"):
"""Flag potential distribution issues."""
data = series.dropna()
issues = []
# Check for extreme skewness
skewness = data.skew()
if abs(skewness) > 2:
direction = "right" if skewness > 0 else "left"
issues.append(f"Highly {direction}-skewed (skew={skewness:.2f})")
# Check for spikes (possible default values)
value_counts = data.value_counts(normalize=True)
if value_counts.iloc[0] > 0.2:
issues.append(f"Mode appears in {value_counts.iloc[0]*100:.1f}% of values (possible default?)")
# Check for outlier percentage (IQR method)
Q1, Q3 = data.quantile([0.25, 0.75])
IQR = Q3 - Q1
outlier_mask = (data < Q1 - 1.5*IQR) | (data > Q3 + 1.5*IQR)
outlier_pct = outlier_mask.mean() * 100
if outlier_pct > 5:
issues.append(f"{outlier_pct:.1f}% outliers by IQR method")
if issues:
print(f"\n{name}:")
for issue in issues:
            print(f" ⚠️ {issue}")

5.4.2 Comparing Groups Visually
Comparing distributions across patient groups (outcome, demographics, time periods) reveals heterogeneity and potential bias.
Box Plots by Group
def plot_by_group(df, value_col, group_col, figsize=(10, 5)):
"""Compare distributions across groups."""
fig, axes = plt.subplots(1, 2, figsize=figsize)
# Box plot
sns.boxplot(data=df, x=group_col, y=value_col, ax=axes[0])
axes[0].set_title(f'{value_col} by {group_col}')
# Overlapping density plots
for group in df[group_col].unique():
group_data = df[df[group_col] == group][value_col].dropna()
sns.kdeplot(group_data, ax=axes[1], label=f'{group} (n={len(group_data):,})')
axes[1].set_title(f'{value_col} distribution by {group_col}')
axes[1].legend()
plt.tight_layout()
plt.show()
# Compare lactate between survivors and non-survivors
plot_by_group(df, 'lactate', 'mortality')

Violin Plots for Richer Comparison
def plot_violin_comparison(df, value_cols, group_col, figsize=(12, 4)):
"""Violin plots showing full distribution shape by group."""
fig, axes = plt.subplots(1, len(value_cols), figsize=figsize)
if len(value_cols) == 1:
axes = [axes]
for ax, col in zip(axes, value_cols):
sns.violinplot(data=df, x=group_col, y=col, ax=ax, inner='quartile')
ax.set_title(col)
plt.tight_layout()
plt.show()
# Compare multiple labs between outcome groups
plot_violin_comparison(df, ['creatinine', 'lactate', 'bilirubin'], 'mortality')

5.4.3 Correlation Heatmaps
Understanding variable relationships helps identify redundancy and potential multicollinearity.
def plot_correlation_heatmap(df, columns=None, method='pearson', figsize=(10, 8)):
"""Create correlation heatmap with hierarchical clustering."""
if columns is None:
columns = df.select_dtypes(include=[np.number]).columns.tolist()
# Calculate correlation matrix
corr = df[columns].corr(method=method)
# Create clustered heatmap
g = sns.clustermap(corr, cmap='RdBu_r', center=0,
figsize=figsize, annot=True, fmt='.2f',
linewidths=0.5, vmin=-1, vmax=1)
g.fig.suptitle(f'{method.title()} Correlation Matrix (Clustered)', y=1.02)
plt.show()
return corr
# Correlation among lab values
lab_cols = ['wbc', 'hemoglobin', 'platelets', 'creatinine', 'bilirubin', 'lactate']
corr_matrix = plot_correlation_heatmap(df, lab_cols)

Identifying Highly Correlated Features
def find_high_correlations(corr_matrix, threshold=0.7):
"""Find pairs of features with correlation above threshold."""
high_corr = []
for i in range(len(corr_matrix.columns)):
for j in range(i + 1, len(corr_matrix.columns)):
if abs(corr_matrix.iloc[i, j]) > threshold:
high_corr.append({
'feature_1': corr_matrix.columns[i],
'feature_2': corr_matrix.columns[j],
'correlation': corr_matrix.iloc[i, j]
})
    result = pd.DataFrame(high_corr)
    if result.empty:
        return result
    return result.sort_values('correlation', key=abs, ascending=False)
# Find highly correlated pairs
high_corr_pairs = find_high_correlations(corr_matrix, threshold=0.7)
print("Highly correlated feature pairs:")
print(high_corr_pairs)

5.4.4 Scatter Plots for Clinical Relationships
Scatter plots reveal non-linear relationships and subgroup patterns that correlation coefficients miss.
def plot_clinical_scatter(df, x_col, y_col, hue_col=None, figsize=(8, 6)):
"""Scatter plot with regression line and optional grouping."""
fig, ax = plt.subplots(figsize=figsize)
if hue_col:
sns.scatterplot(data=df, x=x_col, y=y_col, hue=hue_col,
alpha=0.6, ax=ax)
else:
sns.scatterplot(data=df, x=x_col, y=y_col, alpha=0.6, ax=ax)
# Add regression line
sns.regplot(data=df, x=x_col, y=y_col, scatter=False,
ax=ax, color='red', line_kws={'linestyle': '--'})
ax.set_title(f'{y_col} vs {x_col}')
plt.tight_layout()
plt.show()
# Creatinine vs BUN (should correlate in renal function)
plot_clinical_scatter(df, 'bun', 'creatinine', hue_col='aki_stage')

Pair Plots for Multiple Variables
def plot_pairs(df, columns, hue_col=None, sample_n=1000):
"""Pair plot for exploring multiple variable relationships."""
# Sample for performance if dataset is large
if len(df) > sample_n:
df_sample = df.sample(n=sample_n, random_state=42)
else:
df_sample = df
g = sns.pairplot(df_sample[columns + ([hue_col] if hue_col else [])],
hue=hue_col, diag_kind='kde',
plot_kws={'alpha': 0.5, 's': 20})
g.fig.suptitle('Variable Relationships', y=1.02)
plt.show()
# Explore relationships among vital signs by outcome
vital_cols = ['heart_rate', 'systolic_bp', 'respiratory_rate', 'oxygen_saturation']
plot_pairs(df, vital_cols, hue_col='mortality')

5.4.5 Temporal Visualizations
Clinical data has a time dimension. Visualizing temporal patterns reveals trends, seasonality, and data artifacts.
Outcome Rates Over Time
def plot_outcome_over_time(df, date_col, outcome_col, freq='M', figsize=(12, 5)):
"""Plot outcome rate and volume over time."""
df_copy = df.copy()
df_copy['period'] = pd.to_datetime(df_copy[date_col]).dt.to_period(freq)
trends = df_copy.groupby('period').agg({
outcome_col: ['mean', 'sum', 'count']
})
trends.columns = ['rate', 'events', 'total']
trends = trends.reset_index()
trends['period_str'] = trends['period'].astype(str)
fig, ax1 = plt.subplots(figsize=figsize)
# Outcome rate (line)
color1 = 'steelblue'
ax1.plot(trends['period_str'], trends['rate'], 'o-', color=color1, linewidth=2)
ax1.fill_between(trends['period_str'], trends['rate'], alpha=0.2, color=color1)
ax1.set_xlabel('Time Period')
ax1.set_ylabel('Outcome Rate', color=color1)
ax1.tick_params(axis='y', labelcolor=color1)
ax1.tick_params(axis='x', rotation=45)
# Volume (bars)
ax2 = ax1.twinx()
ax2.bar(trends['period_str'], trends['total'], alpha=0.3, color='gray', label='Volume')
ax2.set_ylabel('Patient Volume', color='gray')
plt.title(f'{outcome_col} Rate Over Time')
plt.tight_layout()
plt.show()
# Flag suspicious jumps
rate_changes = trends['rate'].pct_change().abs()
if rate_changes.max() > 0.5:
print(f"⚠️ Large rate change detected ({rate_changes.max()*100:.0f}%) - investigate for data artifacts")
plot_outcome_over_time(df, 'admission_date', 'sepsis', freq='Q')

Feature Distributions Over Time
def plot_feature_drift(df, date_col, feature_col, freq='Q', figsize=(12, 5)):
"""Visualize how a feature's distribution changes over time."""
df_copy = df.copy()
df_copy['period'] = pd.to_datetime(df_copy[date_col]).dt.to_period(freq)
periods = df_copy['period'].unique()
fig, axes = plt.subplots(1, 2, figsize=figsize)
# Box plots over time
df_copy['period_str'] = df_copy['period'].astype(str)
sns.boxplot(data=df_copy, x='period_str', y=feature_col, ax=axes[0])
axes[0].tick_params(axis='x', rotation=45)
axes[0].set_title(f'{feature_col} Distribution Over Time')
# Summary statistics
summary = df_copy.groupby('period')[feature_col].agg(['mean', 'median', 'std'])
summary = summary.reset_index()
summary['period_str'] = summary['period'].astype(str)
axes[1].plot(summary['period_str'], summary['mean'], 'o-', label='Mean')
axes[1].plot(summary['period_str'], summary['median'], 's--', label='Median')
axes[1].fill_between(summary['period_str'],
summary['mean'] - summary['std'],
summary['mean'] + summary['std'],
alpha=0.2)
axes[1].tick_params(axis='x', rotation=45)
axes[1].set_title(f'{feature_col} Summary Statistics')
axes[1].legend()
plt.tight_layout()
plt.show()
# Check if lab values drift over time
plot_feature_drift(df, 'admission_date', 'creatinine', freq='Q')

5.4.6 Visualization Best Practices for Clinical Data
1. Always show sample sizes: A difference between groups is meaningless without knowing n.
2. Use clinically meaningful scales: Plot lab values on their clinically relevant range, not auto-scaled.
def plot_with_clinical_range(df, col, normal_range, figsize=(8, 5)):
"""Plot distribution with clinical reference range."""
fig, ax = plt.subplots(figsize=figsize)
data = df[col].dropna()
sns.histplot(data, kde=True, ax=ax, alpha=0.7)
# Add normal range
ax.axvline(normal_range[0], color='green', linestyle='--', label='Normal range')
ax.axvline(normal_range[1], color='green', linestyle='--')
ax.axvspan(normal_range[0], normal_range[1], alpha=0.1, color='green')
# Calculate % outside normal
below = (data < normal_range[0]).mean() * 100
above = (data > normal_range[1]).mean() * 100
ax.set_title(f'{col}\n{below:.1f}% below, {above:.1f}% above normal range')
ax.legend()
plt.tight_layout()
plt.show()
# Plot creatinine with normal range
plot_with_clinical_range(df, 'creatinine', normal_range=(0.7, 1.3))

3. Consider colorblind-friendly palettes: About 8% of men have a color vision deficiency. Avoid encoding meaning in red-green contrast alone.
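One low-effort option is seaborn's built-in colorblind-safe palette:

sns.set_palette('colorblind')  # applies to all subsequent seaborn plots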
4. Save plots for documentation: Every EDA plot should be saved for your analysis report.
def save_eda_plot(fig, name, output_dir='eda_figures'):
"""Save plot with consistent naming and quality."""
import os
os.makedirs(output_dir, exist_ok=True)
filepath = os.path.join(output_dir, f'{name}.png')
fig.savefig(filepath, dpi=150, bbox_inches='tight')
    print(f"Saved: {filepath}")

5.5 Clinical Data Quality Issues
Clinical Context: Your mortality prediction model has an AUC of 0.99. Suspicious. Investigation reveals that “discharge disposition = expired” was included as a feature—information that’s only available after the patient dies. This is data leakage. It’s obvious in hindsight, but these issues are pervasive in clinical datasets.
5.5.1 Temporal Issues
Problem 1: Future Information Leakage
Variables that aren’t available at prediction time but appear in the dataset:
def check_temporal_leakage(df, outcome_time_col, feature_time_cols):
"""Check for features that occur after the outcome."""
issues = []
for feat_col in feature_time_cols:
        # Find cases where the feature timestamp is after the outcome
        feat_times = pd.to_datetime(df[feat_col], errors='coerce')
        outcome_times = pd.to_datetime(df[outcome_time_col], errors='coerce')
        n_leaks = (feat_times > outcome_times).sum()
if n_leaks > 0:
pct = n_leaks / len(df) * 100
issues.append(f"⚠️ {feat_col}: {n_leaks:,} rows ({pct:.1f}%) have feature AFTER outcome")
return issues
# Check feature timestamps against outcome time
outcome_time = 'sepsis_onset_time'
feature_times = ['last_vitals_time', 'last_lab_time', 'last_med_time']
issues = check_temporal_leakage(df, outcome_time, feature_times)

Problem 2: Backdated Documentation
Clinical notes and orders are sometimes documented hours after the clinical event:
- Admission orders entered after initial workup complete
- Notes dictated at shift end, timestamped then
- Verbal orders documented retroactively
Mitigation: Use the earlier of documentation time and clinical event time, or add a safety buffer (e.g., only use data from 4+ hours before prediction time).
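A minimal sketch of the buffer approach (the timestamp column names here are assumptions, not the dataset's actual schema):

def apply_safety_buffer(df, event_time_col, prediction_time_col, buffer_hours=4):
    """Keep rows whose event precedes prediction time by at least the buffer."""
    event = pd.to_datetime(df[event_time_col])
    prediction = pd.to_datetime(df[prediction_time_col])
    gap_hours = (prediction - event).dt.total_seconds() / 3600
    return df[gap_hours >= buffer_hours]

# Only use measurements documented 4+ hours before prediction time
df_safe = apply_safety_buffer(df, 'measurement_time', 'prediction_time')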
Problem 3: Timestamp Inconsistencies
def check_timestamp_quality(df, time_col):
"""Check timestamp column for common issues."""
issues = []
# Parse timestamps
times = pd.to_datetime(df[time_col], errors='coerce')
# Check for unparseable timestamps
n_invalid = times.isna().sum() - df[time_col].isna().sum()
if n_invalid > 0:
issues.append(f"⚠️ {n_invalid} timestamps couldn't be parsed")
# Check for suspicious patterns
hours = times.dt.hour
# Excess midnight timestamps (often default values)
midnight_pct = (hours == 0).mean() * 100
if midnight_pct > 20:
issues.append(f"⚠️ {midnight_pct:.1f}% of timestamps are midnight (default value?)")
# Check date range
min_date, max_date = times.min(), times.max()
issues.append(f"Date range: {min_date} to {max_date}")
# Future dates
n_future = (times > pd.Timestamp.now()).sum()
if n_future > 0:
issues.append(f"⚠️ {n_future} timestamps are in the future")
    return issues

5.5.2 Impossible Values
Clinical variables have physiologically plausible ranges. Values outside these ranges indicate data quality issues:
# Physiologically plausible ranges for common clinical variables
CLINICAL_RANGES = {
'heart_rate': (20, 300), # bpm
'systolic_bp': (40, 300), # mmHg
'diastolic_bp': (20, 200), # mmHg
'temperature_c': (25, 45), # Celsius
'respiratory_rate': (4, 60), # breaths/min
'oxygen_saturation': (50, 100), # percent
'sodium': (100, 180), # mEq/L
'potassium': (1.5, 9.0), # mEq/L
'creatinine': (0.1, 30), # mg/dL
'glucose': (20, 1500), # mg/dL
'hemoglobin': (2, 25), # g/dL
'wbc': (0.1, 200), # 10^9/L
'platelets': (5, 1500), # 10^9/L
'age': (0, 120), # years
'weight_kg': (1, 400), # kg
'height_cm': (40, 250), # cm
}
def check_clinical_ranges(df, ranges=CLINICAL_RANGES):
"""Flag values outside physiologically plausible ranges."""
issues = []
for col, (min_val, max_val) in ranges.items():
if col in df.columns:
below = (df[col] < min_val).sum()
above = (df[col] > max_val).sum()
if below > 0 or above > 0:
issues.append({
'column': col,
'valid_range': f"[{min_val}, {max_val}]",
'below_range': below,
'above_range': above,
'pct_invalid': (below + above) / len(df) * 100
})
return pd.DataFrame(issues)
# Check ranges
range_issues = check_clinical_ranges(df)
print(range_issues)

5.5.3 Coding Artifacts
ICD Code Changes
Diagnosis codes change over time (ICD-9 to ICD-10 transition in 2015). Models trained on one coding system may fail on another.
def detect_icd_version(codes):
"""Detect whether codes appear to be ICD-9 or ICD-10."""
    # ICD-10 codes start with a letter (caveat: ICD-9 E- and V-codes do too,
    # so treat this as a rough heuristic)
    icd10_pattern = codes.str.match(r'^[A-Z]', na=False)
pct_icd10 = icd10_pattern.mean() * 100
if pct_icd10 > 90:
return "ICD-10"
elif pct_icd10 < 10:
return "ICD-9"
else:
return f"Mixed ({pct_icd10:.0f}% ICD-10)"
# Check diagnosis code format
print(f"Primary diagnosis format: {detect_icd_version(df['primary_dx'])}")

Documentation Practice Changes
Changes in incentives, guidelines, or EHR systems alter what gets documented:
- Sepsis coding increased dramatically after SEP-1 quality measure introduction
- Problem list completeness varies by EHR implementation
- Note templates change over time
Mitigation: Plot outcome rates and feature distributions over time. Sudden changes suggest artifacts, not clinical reality.
def plot_temporal_trends(df, date_col, outcome_col, freq='M'):
"""Plot outcome rate over time to detect artifacts."""
df_copy = df.copy()
df_copy['period'] = pd.to_datetime(df_copy[date_col]).dt.to_period(freq)
trends = df_copy.groupby('period')[outcome_col].agg(['mean', 'count'])
trends.columns = ['rate', 'n']
fig, ax1 = plt.subplots(figsize=(12, 5))
ax1.plot(trends.index.astype(str), trends['rate'], 'b-', linewidth=2)
ax1.set_xlabel('Time Period')
ax1.set_ylabel('Outcome Rate', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')
ax1.tick_params(axis='x', rotation=45)
ax2 = ax1.twinx()
ax2.bar(trends.index.astype(str), trends['n'], alpha=0.3, color='gray')
ax2.set_ylabel('Sample Size', color='gray')
plt.title('Outcome Rate and Sample Size Over Time')
plt.tight_layout()
plt.show()5.5.4 Label Quality
The outcome variable is data too. Common label problems:
Retrospective Labeling: Outcomes assigned using information not available at the time (death certificates, follow-up visits).
Billing-Based Labels: ICD codes reflect billing incentives, not clinical reality. “Sepsis” coding increased 30% after financial penalties for sepsis mortality were introduced.
Annotation Disagreement: When labels come from human annotators, inter-rater reliability matters.
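A chance-corrected agreement statistic such as Cohen's kappa quantifies inter-rater reliability. A minimal sketch, assuming two hypothetical annotator columns on a double-annotated subset:

from sklearn.metrics import cohen_kappa_score

# Assumed columns: two annotators labeled an overlapping subset of charts
overlap = df.dropna(subset=['label_annotator_a', 'label_annotator_b'])
kappa = cohen_kappa_score(overlap['label_annotator_a'], overlap['label_annotator_b'])
print(f"Cohen's kappa: {kappa:.2f} (n={len(overlap):,} double-annotated charts)")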
def check_label_quality(df, label_col):
"""Basic label quality checks."""
issues = []
# Class distribution
dist = df[label_col].value_counts(normalize=True)
issues.append(f"Label distribution:\n{dist}")
# Missing labels
n_missing = df[label_col].isna().sum()
if n_missing > 0:
issues.append(f"⚠️ {n_missing} missing labels ({n_missing/len(df)*100:.1f}%)")
# Unexpected values
expected = {0, 1} if df[label_col].dtype in ['int64', 'float64'] else None
if expected:
unexpected = set(df[label_col].dropna().unique()) - expected
if unexpected:
issues.append(f"⚠️ Unexpected label values: {unexpected}")
    return issues

5.6 Feature Engineering from Clinical Variables
Clinical Context: Raw lab values tell part of the story. But a creatinine of 1.5 means something different if the patient’s baseline is 1.4 versus 0.8. Clinical feature engineering captures domain knowledge that raw variables miss.
5.6.1 Temporal Features
Change from Baseline
def calculate_delta_from_baseline(df, value_col, patient_col, time_col,
baseline_window_hours=24):
"""Calculate change from baseline for each measurement."""
    df = df.copy()
    df[time_col] = pd.to_datetime(df[time_col])
    df = df.sort_values([patient_col, time_col])
    results = []
    for patient_id, group in df.groupby(patient_col):
        # Baseline = median of values in the first `baseline_window_hours` of data
        t0 = group[time_col].iloc[0]
        in_window = (group[time_col] - t0).dt.total_seconds() / 3600 <= baseline_window_hours
        baseline = group.loc[in_window, value_col].median()
group_result = group.copy()
group_result[f'{value_col}_delta'] = group[value_col] - baseline
group_result[f'{value_col}_pct_change'] = (group[value_col] - baseline) / baseline * 100
results.append(group_result)
    return pd.concat(results)

Trends and Slopes
from scipy import stats
def calculate_trend(df, value_col, patient_col, time_col, window_hours=6):
    """Calculate trend (slope, units per hour) over a trailing time window."""
    df = df.copy()
    df[time_col] = pd.to_datetime(df[time_col])
    df = df.sort_values([patient_col, time_col])

    def slope_of_window(window):
        # `window` is a Series of values indexed by timestamp
        if len(window) < 2:
            return np.nan
        hours = (window.index - window.index[0]).total_seconds() / 3600
        if hours.max() == 0:
            return np.nan  # all measurements at the same instant
        slope, _, _, _, _ = stats.linregress(hours, window.values)
        return slope

    def per_patient(group):
        # Time-based rolling windows require a monotonic datetime index
        series = group.set_index(time_col)[value_col]
        return series.rolling(f'{window_hours}h').apply(slope_of_window, raw=False)

    return df.groupby(patient_col, group_keys=False).apply(per_patient)

Time Since Event
def time_since_event(df, event_col, current_time_col):
"""Calculate hours since event occurred."""
event_time = pd.to_datetime(df[event_col])
current_time = pd.to_datetime(df[current_time_col])
return (current_time - event_time).dt.total_seconds() / 3600
# Example: hours since admission, last medication, etc.
df['hours_since_admission'] = time_since_event(df, 'admission_time', 'current_time')
df['hours_since_last_vasopressor'] = time_since_event(df, 'last_vasopressor_time', 'current_time')

5.6.2 Aggregation Features
When multiple measurements exist, summarize meaningfully:
def aggregate_measurements(df, value_col, patient_col, time_col, window_hours=24):
    """Create aggregated features from repeated measurements.

    Note: restrict df to the desired window (e.g., the first `window_hours`
    of the stay) before calling this function.
    """
    df = df.sort_values([patient_col, time_col])
    aggregated = df.groupby(patient_col)[value_col].agg(
        ['min', 'max', 'mean', 'std', 'first', 'last', 'count'])
    # Range captures the worst within-window deviation
    aggregated['range'] = aggregated['max'] - aggregated['min']
    aggregated.columns = [f'{value_col}_{c}' for c in aggregated.columns]
    return aggregated

5.6.3 Clinical Severity Scores
Combine variables into validated clinical scores:
def calculate_sofa_score(row):
"""Calculate SOFA (Sequential Organ Failure Assessment) score.
Simplified version - real implementation needs more variables.
"""
score = 0
# Respiratory (PaO2/FiO2)
pf_ratio = row.get('pao2_fio2_ratio', np.nan)
if pd.notna(pf_ratio):
if pf_ratio < 100:
score += 4
elif pf_ratio < 200:
score += 3
elif pf_ratio < 300:
score += 2
elif pf_ratio < 400:
score += 1
# Renal (Creatinine)
creat = row.get('creatinine', np.nan)
if pd.notna(creat):
if creat >= 5.0:
score += 4
elif creat >= 3.5:
score += 3
elif creat >= 2.0:
score += 2
elif creat >= 1.2:
score += 1
    # Coagulation (platelets); named to avoid shadowing matplotlib's plt
    platelets = row.get('platelets', np.nan)
    if pd.notna(platelets):
        if platelets < 20:
            score += 4
        elif platelets < 50:
            score += 3
        elif platelets < 100:
            score += 2
        elif platelets < 150:
            score += 1
# Add liver, cardiovascular, neurological components...
return score
df['sofa_score'] = df.apply(calculate_sofa_score, axis=1)

5.6.4 Interaction Features
Some clinical relationships are multiplicative:
# Shock Index: HR / SBP (elevated suggests shock)
df['shock_index'] = df['heart_rate'] / df['systolic_bp']
# BMI from height and weight
df['bmi'] = df['weight_kg'] / (df['height_cm'] / 100) ** 2
# Creatinine clearance (Cockcroft-Gault), often used as a rough GFR proxy
def estimate_gfr(row):
    """Cockcroft-Gault equation: estimated creatinine clearance (mL/min)."""
age = row['age']
weight = row['weight_kg']
creat = row['creatinine']
is_female = row['sex'] == 'F'
gfr = ((140 - age) * weight) / (72 * creat)
if is_female:
gfr *= 0.85
return gfr
df['egfr'] = df.apply(estimate_gfr, axis=1)

5.6.5 Categorical Encoding for Clinical Variables
# Age groups (clinically meaningful)
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 18, 45, 65, 80, 120],
                         labels=['pediatric', 'young_adult', 'middle_aged', 'elderly', 'very_elderly'],
                         include_lowest=True)  # include_lowest keeps age 0
# BMI categories
df['bmi_category'] = pd.cut(df['bmi'],
bins=[0, 18.5, 25, 30, 35, 100],
labels=['underweight', 'normal', 'overweight', 'obese', 'severely_obese'])
# Time of day (captures staffing, workflow patterns)
df['hour_of_day'] = pd.to_datetime(df['admission_time']).dt.hour
df['time_of_day'] = pd.cut(df['hour_of_day'],
bins=[0, 6, 12, 18, 24],
labels=['night', 'morning', 'afternoon', 'evening'],
include_lowest=True)
# Weekend vs weekday
df['is_weekend'] = pd.to_datetime(df['admission_time']).dt.dayofweek >= 5

5.7 Handling Class Imbalance
Clinical Context: In your sepsis dataset, only 3% of patients develop sepsis. If your model predicts “no sepsis” for everyone, it achieves 97% accuracy. This is useless. Class imbalance requires special handling.
5.7.1 Quantifying Imbalance
def describe_imbalance(y):
"""Describe class distribution and imbalance ratio."""
counts = pd.Series(y).value_counts()
props = counts / len(y) * 100
print("Class distribution:")
for cls in counts.index:
print(f" Class {cls}: {counts[cls]:,} ({props[cls]:.1f}%)")
imbalance_ratio = counts.max() / counts.min()
print(f"\nImbalance ratio: {imbalance_ratio:.1f}:1")
return imbalance_ratio
imbalance_ratio = describe_imbalance(df['outcome'])

5.7.2 Strategy 1: Choose Appropriate Metrics
Accuracy is meaningless for imbalanced data. Use:
- AUROC: Overall ranking ability
- AUPRC: Precision-recall (emphasizes minority class)
- Sensitivity at fixed specificity: Clinically relevant operating points
- F1 score: Harmonic mean of precision and recall
from sklearn.metrics import (roc_auc_score, average_precision_score,
f1_score, recall_score, precision_score)
def evaluate_imbalanced(y_true, y_prob, threshold=0.5):
"""Evaluation metrics for imbalanced classification."""
y_pred = (y_prob >= threshold).astype(int)
metrics = {
'auroc': roc_auc_score(y_true, y_prob),
'auprc': average_precision_score(y_true, y_prob),
'f1': f1_score(y_true, y_pred),
'sensitivity': recall_score(y_true, y_pred),
'specificity': recall_score(y_true, y_pred, pos_label=0),
'precision': precision_score(y_true, y_pred),
}
    return metrics

5.7.3 Strategy 2: Resampling (Use Carefully)
Undersampling: Reduce majority class. Simple but loses information.
Oversampling: Increase minority class. Risk of overfitting.
SMOTE: Generate synthetic minority examples. Better than naive oversampling, but can create unrealistic examples.
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline
# IMPORTANT: Only apply resampling to training data, never test data
# Use within cross-validation, not before splitting
from sklearn.model_selection import StratifiedKFold
def evaluate_with_resampling(X, y, model, resampler=None, n_splits=5):
"""Cross-validation with resampling applied correctly."""
cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in cv.split(X, y):
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
# Apply resampling only to training data
if resampler:
X_train_resampled, y_train_resampled = resampler.fit_resample(X_train, y_train)
else:
X_train_resampled, y_train_resampled = X_train, y_train
# Train and evaluate
model.fit(X_train_resampled, y_train_resampled)
y_prob = model.predict_proba(X_test)[:, 1]
scores.append(roc_auc_score(y_test, y_prob))
    return np.mean(scores), np.std(scores)
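For example, pairing the helper with SMOTE (a sketch; X and y are assumed to be a pandas feature matrix and binary label Series defined elsewhere):

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200, random_state=42)
auc_mean, auc_std = evaluate_with_resampling(X, y, model, resampler=SMOTE(random_state=42))
print(f"AUROC with SMOTE: {auc_mean:.3f} ± {auc_std:.3f}")

5.7.4 Strategy 3: Cost-Sensitive Learning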
Penalize errors on the minority class more heavily:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
# Option 1: Class weights
# 'balanced' automatically weights inversely proportional to frequency
model = LogisticRegression(class_weight='balanced')
# Option 2: Custom weights
# Weight the minority class more heavily
weight_ratio = (y == 0).sum() / (y == 1).sum()
model = LogisticRegression(class_weight={0: 1, 1: weight_ratio})
# Option 3: Sample weights during fit
sample_weights = np.where(y == 1, weight_ratio, 1.0)
model.fit(X, y, sample_weight=sample_weights)

5.7.5 Strategy 4: Threshold Optimization
The default 0.5 threshold is arbitrary. Choose based on clinical requirements:
def find_threshold_for_sensitivity(y_true, y_prob, target_sensitivity=0.90):
"""Find threshold that achieves target sensitivity."""
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
# Find threshold where sensitivity (tpr) >= target
valid_idx = np.where(tpr >= target_sensitivity)[0]
    if len(valid_idx) == 0:
        return thresholds[-1]  # lowest threshold (maximum sensitivity) if target unreachable
# Among valid thresholds, find one with best specificity
best_idx = valid_idx[np.argmin(fpr[valid_idx])]
return thresholds[best_idx]
# Find threshold for 90% sensitivity
threshold_90sens = find_threshold_for_sensitivity(y_test, y_prob, 0.90)
print(f"Threshold for 90% sensitivity: {threshold_90sens:.3f}")

5.8 Putting It All Together: EDA Checklist
Before modeling any clinical dataset:
5.8.1 Data Structure
- Rows, columns, dtypes, and memory footprint documented
- Duplicate rows and constant columns checked
- High-cardinality columns flagged as possible identifiers
5.8.2 Missing Data
- Missingness rate quantified per column
- Co-missingness patterns visualized
- Likely mechanism (MCAR/MAR/MNAR) considered; indicators added for key variables
5.8.3 Data Quality
- Numeric values checked against physiologically plausible ranges
- Timestamps validated (parseable, no future dates, no midnight spikes)
- Features checked for temporal leakage relative to the outcome time
5.8.4 Labels
- Label definition and provenance understood (billing codes? retrospective?)
- Class distribution and missing labels quantified
- Inter-rater reliability assessed where labels are annotated
5.8.5 Features
- Distributions inspected for skew, spikes, and bimodality
- Highly correlated feature pairs identified
- Clinically meaningful derived features engineered
5.8.6 Generalization
- Outcome rates and feature distributions plotted over time
- Coding-system transitions (e.g., ICD-9 to ICD-10) checked
- Distributions compared across key patient subgroups
5.9 Chapter Summary
Exploratory data analysis is not optional in clinical AI—it’s where you discover the problems that would otherwise sink your model in deployment.
Key principles:
- Garbage in, garbage out: No model can compensate for data quality issues
- Missingness is information: In clinical data, what wasn’t measured tells you something
- Timestamps lie: Documentation time ≠ clinical event time
- Labels are imperfect: Billing codes and retrospective annotations have limitations
- Class imbalance is the norm: Most clinical outcomes are rare; plan accordingly
Critical practices:
- Profile every dataset before modeling
- Visualize distributions and missingness patterns
- Check for impossible values and temporal inconsistencies
- Create missing indicators rather than naive imputation
- Engineer features that capture clinical meaning
- Choose metrics appropriate for imbalanced data
Red flags that signal data problems:
- Implausibly high model performance (AUC > 0.95 on hard problems)
- Features with unexplainably high importance
- Performance that doesn’t generalize to new time periods
- Results that contradict clinical intuition
The time spent on EDA is never wasted. Every hour invested in understanding your data saves days of debugging models that learned the wrong things.
5.10 Exercises
1. Missing Data Investigation: Load a clinical dataset (MIMIC-III or similar). For five key lab variables, calculate the missingness rate overall and stratified by patient outcome. Is missingness informative?
2. Temporal Leakage Hunt: Using the same dataset, identify three features that might contain information from after the prediction time. How would you modify them to prevent leakage?
3. Value Range Audit: Write a function that checks all numeric columns against physiologically plausible ranges. Apply it to a clinical dataset and report what you find.
4. Feature Engineering: Starting with raw vital signs (heart rate, blood pressure, temperature), create at least five derived features that capture clinical meaning (trends, ratios, deviations from normal).
5. Imbalance Simulation: Create a synthetic dataset with 2% positive rate. Train a classifier with and without class weighting. Compare accuracy, AUROC, and AUPRC. Which metrics are useful?
6. EDA Report: Conduct a full exploratory data analysis on a clinical dataset of your choice. Document your findings in a 3-5 page report covering data quality, missingness, distributions, and potential modeling challenges.