16  AI Ops & Deployment

Developing a model is only half the battle. This chapter covers the operational infrastructure needed to deploy, serve, monitor, and maintain AI systems in clinical environments.

16.1 Containerization with Docker

Clinical Context: Your pneumonia classifier works perfectly on your laptop. But deploying to hospital servers with different Python versions, operating systems, and library configurations creates a nightmare of “it works on my machine.” Containers solve this by packaging your model with its entire runtime environment.

16.1.1 Why Containers?

Docker containers provide:

  • Reproducibility: Same environment everywhere—development, testing, production
  • Isolation: Model dependencies don’t conflict with hospital IT systems
  • Portability: Deploy to any infrastructure that runs Docker
  • Scalability: Spin up multiple containers to handle load

16.1.2 Basic Dockerfile for Model Serving

A Dockerfile defines the container’s environment:

# Start from Python base image
FROM python:3.10-slim

# Set working directory
WORKDIR /app

# Install dependencies first (for caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model and application code
COPY model.pt .
COPY app.py .

# Expose the port your API runs on
EXPOSE 8000

# Command to run when container starts
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Build and run:

# Build the image
docker build -t pneumonia-classifier:v1.0 .

# Run the container
docker run -p 8000:8000 pneumonia-classifier:v1.0

# Test the endpoint
curl http://localhost:8000/health

16.1.3 Best Practices for Medical AI Containers

  • Pin versions: Specify exact library versions in requirements.txt
  • Minimize image size: Use slim base images, multi-stage builds
  • No secrets in images: Use environment variables or secret managers
  • Health checks: Include endpoints to verify the model is functioning
  • Logging: Configure structured logging for audit trails

16.2 Model Serving with REST APIs

Clinical Context: The radiology PACS system needs to send images to your classifier and receive predictions. A REST API provides a standardized interface that any system can call, regardless of programming language.

16.2.1 FastAPI for Model Serving

FastAPI is a modern Python framework ideal for ML serving:

from fastapi import FastAPI, File, UploadFile
from pydantic import BaseModel
import torch
from torchvision import transforms
from PIL import Image
import io

app = FastAPI(title="Pneumonia Classifier API")

# Load model at startup
model = torch.load("model.pt", map_location="cpu")
model.eval()

# Define response schema
class PredictionResponse(BaseModel):
    prediction: str
    confidence: float
    model_version: str

# Preprocessing pipeline
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485], [0.229])
])

@app.get("/health")
def health_check():
    return {"status": "healthy", "model_loaded": True}

@app.post("/predict", response_model=PredictionResponse)
async def predict(file: UploadFile = File(...)):
    # Read and preprocess image
    image_bytes = await file.read()
    image = Image.open(io.BytesIO(image_bytes)).convert("L")
    tensor = preprocess(image).unsqueeze(0)

    # Run inference
    with torch.no_grad():
        logits = model(tensor)
        probs = torch.softmax(logits, dim=1)
        confidence, predicted = probs.max(dim=1)

    labels = ["Normal", "Pneumonia"]
    return PredictionResponse(
        prediction=labels[predicted.item()],
        confidence=confidence.item(),
        model_version="1.0.0"
    )

16.2.2 API Design Considerations

For clinical deployment:

  • Versioning: Include API version in URL (e.g., /v1/predict)
  • Input validation: Reject malformed requests before inference
  • Timeout handling: Set reasonable limits for inference time
  • Batch endpoints: Support multiple images per request for efficiency (see the sketch after this list)
  • Metadata: Return model version, timestamp, and confidence with every prediction
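
Several of these considerations can be combined in a single route. Below is a minimal sketch of a versioned batch endpoint that validates inputs before inference; it reuses the model, preprocess, labels, and imports from the serving example above, and the route name, batch limit, and accepted content types are illustrative choices, not fixed conventions.

from typing import List
from fastapi import HTTPException

MAX_BATCH_SIZE = 16  # illustrative limit; tune to your hardware and latency budget

@app.post("/v1/predict_batch")
async def predict_batch(files: List[UploadFile] = File(...)):
    # Reject malformed or oversized requests before running inference
    if len(files) > MAX_BATCH_SIZE:
        raise HTTPException(status_code=413, detail="Batch too large")

    tensors = []
    for f in files:
        if f.content_type not in ("image/png", "image/jpeg"):
            raise HTTPException(status_code=422, detail=f"Unsupported type: {f.content_type}")
        image_bytes = await f.read()
        image = Image.open(io.BytesIO(image_bytes)).convert("L")
        tensors.append(preprocess(image))

    # Run inference on the whole batch at once
    batch = torch.stack(tensors)
    with torch.no_grad():
        probs = torch.softmax(model(batch), dim=1)
    confidences, predicted = probs.max(dim=1)

    labels = ["Normal", "Pneumonia"]
    return {
        "model_version": "1.0.0",
        "results": [
            {"prediction": labels[p.item()], "confidence": c.item()}
            for p, c in zip(predicted, confidences)
        ],
    }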

16.2.3 Authentication and Security

Medical AI APIs must be secured:

from fastapi import Depends, HTTPException, Security
from fastapi.security import APIKeyHeader

API_KEY = "your-secret-key"  # In production, use env vars
api_key_header = APIKeyHeader(name="X-API-Key")

def verify_api_key(api_key: str = Security(api_key_header)):
    if api_key != API_KEY:
        raise HTTPException(status_code=403, detail="Invalid API key")
    return api_key

@app.post("/predict")
async def predict(
    file: UploadFile = File(...),
    api_key: str = Depends(verify_api_key)
):
    # ... prediction logic
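
Rather than hard-coding API_KEY as above, a common pattern is to read it from an environment variable or secret manager at startup. A minimal sketch; the variable name MODEL_API_KEY is illustrative:

import os

# Read the key injected by the deployment environment (e.g., an orchestrator-managed
# environment variable or mounted secret); never commit it to the image.
API_KEY = os.environ.get("MODEL_API_KEY")
if API_KEY is None:
    raise RuntimeError("MODEL_API_KEY environment variable is not set")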

16.3 Monitoring and Drift Detection

Clinical Context: Your model performed well during validation, but three months into deployment, accuracy has dropped. The patient population shifted—flu season brought different chest X-ray presentations. Without monitoring, you wouldn’t know until patient outcomes suffered.

16.3.1 What to Monitor

Model monitoring tracks three categories (a lightweight tracking sketch follows these lists):

1. System metrics (infrastructure health):

  • Request latency (p50, p95, p99)
  • Throughput (requests per second)
  • Error rates and types
  • CPU/memory/GPU utilization

2. Data metrics (input distribution):

  • Feature statistics (mean, variance, range)
  • Missing value rates
  • Input data quality scores

3. Model metrics (prediction behavior):

  • Prediction distribution (class frequencies)
  • Confidence score distribution
  • Outcome metrics when ground truth available
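
In production you would typically export these metrics to your monitoring stack (Prometheus, OpenTelemetry, or whatever hospital IT runs), but a lightweight in-process sketch makes the idea concrete. The record_request and metrics_summary helpers below are illustrative, not part of any library.

import time
from collections import Counter, deque

import numpy as np

# Rolling windows of recent observations (window sizes are illustrative)
latencies_ms = deque(maxlen=10_000)
confidences = deque(maxlen=10_000)
prediction_counts = Counter()

def record_request(label, confidence, start_time):
    """Record system and model metrics for a single prediction request."""
    latencies_ms.append((time.perf_counter() - start_time) * 1000)
    confidences.append(confidence)
    prediction_counts[label] += 1

def metrics_summary():
    """Summarize recent metrics for a dashboard or periodic report."""
    if not latencies_ms:
        return {}
    lat = np.array(latencies_ms)
    return {
        "latency_ms": {
            "p50": float(np.percentile(lat, 50)),
            "p95": float(np.percentile(lat, 95)),
            "p99": float(np.percentile(lat, 99)),
        },
        "prediction_distribution": dict(prediction_counts),
        "mean_confidence": float(np.mean(confidences)),
    }

Inside the /predict handler you would call record_request with the predicted label, its confidence, and the request start time just before returning the response.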

16.3.2 Detecting Data Drift

Data drift occurs when the distribution of incoming data shifts away from the distribution the model was trained on. Common detection methods:

Population Stability Index (PSI):

\[ \text{PSI} = \sum_i (A_i - E_i) \times \ln\left(\frac{A_i}{E_i}\right) \]

where \(A_i\) is actual proportion in bin \(i\) and \(E_i\) is expected (training) proportion.

  • PSI < 0.1: No significant drift
  • PSI 0.1–0.2: Moderate drift, investigate
  • PSI > 0.2: Significant drift, action required
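
PSI is straightforward to compute from binned histograms. A small sketch; the function name, bin count, and smoothing epsilon are illustrative choices:

import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """Compute PSI between a reference (training) sample and a current sample."""
    # Bin edges come from the reference distribution
    edges = np.histogram_bin_edges(expected, bins=n_bins)
    expected_counts, _ = np.histogram(expected, bins=edges)
    actual_counts, _ = np.histogram(actual, bins=edges)

    # Convert counts to proportions; a small epsilon avoids division by zero / log(0)
    eps = 1e-6
    e = expected_counts / expected_counts.sum() + eps
    a = actual_counts / actual_counts.sum() + eps

    return float(np.sum((a - e) * np.log(a / e)))

The resulting value can be compared directly against the thresholds above, for example using training-time confidences as the reference sample and the last week of production confidences as the current sample.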

Kolmogorov-Smirnov Test: Statistical test for whether two distributions differ.

import numpy as np
from scipy import stats

def detect_drift(reference_data, current_data, alpha=0.05):
    """Detect distribution drift with a two-sample KS test at significance level alpha."""
    statistic, p_value = stats.ks_2samp(reference_data, current_data)
    return {
        "drift_detected": p_value < alpha,
        "ks_statistic": statistic,
        "p_value": p_value
    }

# Monitor prediction confidence distribution
# (get_recent_confidences and alert_team stand in for your own data-access
# and alerting hooks)
reference_confidences = np.load("training_confidences.npy")
current_confidences = get_recent_confidences(last_n_days=7)

drift_result = detect_drift(reference_confidences, current_confidences)
if drift_result["drift_detected"]:
    alert_team("Confidence distribution drift detected")

16.3.3 Logging for Audit Trails

Healthcare regulations require comprehensive logging:

import logging
import json
from datetime import datetime

# Structured logging
logging.basicConfig(
    format='%(message)s',
    level=logging.INFO
)
logger = logging.getLogger(__name__)

def log_prediction(request_id, input_hash, prediction, confidence):
    log_entry = {
        "timestamp": datetime.utcnow().isoformat(),
        "request_id": request_id,
        "input_hash": input_hash,  # Don't log PHI
        "prediction": prediction,
        "confidence": confidence,
        "model_version": "1.0.0"
    }
    logger.info(json.dumps(log_entry))

16.4 Acceptance Testing and Validation

Clinical Context: Before deploying a model trained at Stanford to your community hospital, you need to verify it works for your patient population. Acceptance testing is the gatekeeper between development and deployment.

16.4.1 Local Validation Protocol

HW7 asks you to design acceptance testing. Key components:

1. Holdout test set: Reserve local data never seen during training

  • Minimum sample size for statistical power (often 200+ per class)
  • Representative of your patient demographics
  • Recent data (within last 6-12 months)

2. Performance thresholds: Define minimum acceptable metrics

  • AUROC ≥ 0.85 (or match published performance)
  • Sensitivity ≥ 0.90 for screening applications
  • Calibration error < 0.05

3. Subgroup analysis: Verify performance across:

  • Age groups
  • Sex
  • Disease severity
  • Scanner/equipment type

Putting items 2 and 3 into code (assuming scikit-learn metrics and an analyze_subgroups helper defined elsewhere):

from sklearn.metrics import recall_score, roc_auc_score

def run_acceptance_tests(model, test_data, config):
    """Run the acceptance test suite against a local holdout set."""
    results = {}

    # Overall performance (assumes a scikit-learn-style model interface)
    y_pred = model.predict(test_data.X)
    y_prob = model.predict_proba(test_data.X)[:, 1]

    results["auroc"] = roc_auc_score(test_data.y, y_prob)
    results["sensitivity"] = recall_score(test_data.y, y_pred)
    results["specificity"] = recall_score(test_data.y, y_pred, pos_label=0)

    # Check against thresholds
    results["auroc_pass"] = results["auroc"] >= config["min_auroc"]
    results["sensitivity_pass"] = results["sensitivity"] >= config["min_sens"]

    # Subgroup analysis (analyze_subgroups computes per-group metrics)
    for group_col in config["subgroup_columns"]:
        results[f"subgroup_{group_col}"] = analyze_subgroups(
            test_data, y_prob, group_col
        )

    return results

# Run and report
config = {
    "min_auroc": 0.85,
    "min_sens": 0.90,
    "subgroup_columns": ["age_group", "sex", "scanner_type"]
}
results = run_acceptance_tests(model, local_test_data, config)

16.4.2 Continuous Validation

Acceptance testing isn’t one-time. Establish ongoing validation:

  • Weekly/monthly performance reports
  • Automatic alerts when metrics drop below thresholds
  • Quarterly review with clinical stakeholders
  • Re-validation after any model update

16.5 Governance and Documentation

Clinical Context: A year from now, who knows why this model was deployed, what its limitations are, or who to contact when issues arise? Governance documentation ensures institutional knowledge persists.

16.5.1 Model Cards

A model card documents essential information:

# Pneumonia Classifier v1.0 - Model Card

## Model Details
- **Developer**: AI Team, University Hospital
- **Date**: December 2024
- **Version**: 1.0.0
- **Type**: Binary image classifier (ResNet-18)

## Intended Use
- **Primary use**: Chest X-ray triage for pneumonia
- **Users**: Radiologists, ED physicians
- **Out of scope**: Pediatric patients, CT images

## Training Data
- **Source**: ChestX-ray14, PneumoniaMNIST
- **Size**: 50,000 images
- **Demographics**: 55% male, mean age 52

## Performance
- **AUROC**: 0.92 (95% CI: 0.90-0.94)
- **Sensitivity**: 0.88 at 0.5 threshold
- **Specificity**: 0.85 at 0.5 threshold

## Limitations
- Lower performance on portable X-rays
- Not validated for immunocompromised patients
- Requires PA or AP view chest radiograph

## Ethical Considerations
- Subgroup analysis showed 3% lower AUROC for age >80
- Model should support, not replace, clinical judgment

16.5.2 Escalation Protocols

Define what happens when things go wrong:

  • Level 1: Performance drop of 5-10% from baseline → engineering review within 48 hours
  • Level 2: Performance drop of more than 10% → pause deployment, clinical review
  • Level 3: Patient safety event → immediate shutdown and incident report
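
The metric-driven levels can be encoded directly; Level 3 is triggered by a reported safety event rather than a metric. A minimal sketch, assuming the drop is measured as a relative change in AUROC against the validated baseline:

def escalation_level(baseline_auroc, current_auroc):
    """Map a relative AUROC drop to escalation levels 0-2 (Level 3 is event-driven)."""
    drop = (baseline_auroc - current_auroc) / baseline_auroc
    if drop > 0.10:
        return 2   # pause deployment, clinical review
    if drop >= 0.05:
        return 1   # engineering review within 48 hours
    return 0       # within normal variation, keep monitoring

# Example: baseline AUROC 0.92, current 0.85 -> ~7.6% relative drop -> Level 1
level = escalation_level(0.92, 0.85)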

16.5.3 Review Cadence

Establish regular governance meetings:

  • Monthly: Technical performance review
  • Quarterly: Clinical outcomes review with stakeholders
  • Annually: Full model re-evaluation and retraining decision

Document all decisions, including decisions not to act. If drift is detected but deemed acceptable, record why.

16.6 Quick Reference: Pre-Deployment Checklist

Before any clinical AI system goes live, verify:

16.6.1 Performance Validation

  • Acceptance tests passed on a local holdout set (AUROC, sensitivity, specificity against agreed thresholds)
  • Subgroup analysis completed (age, sex, disease severity, scanner/equipment type)
  • Calibration assessed

16.6.2 Infrastructure

  • Container builds reproducibly with pinned dependency versions
  • Health-check endpoint responds correctly
  • API secured (authentication, input validation, timeouts)

16.6.3 Monitoring

  • System, data, and model metrics instrumented
  • Drift detection and alert thresholds configured
  • Structured logging enabled for audit trails (no PHI in logs)

16.6.4 Governance

  • Model card completed and reviewed
  • Escalation protocol and points of contact documented
  • Review cadence scheduled with clinical stakeholders

16.6.5 Compliance

  • Logging and audit trails meet applicable healthcare regulations
  • No secrets or PHI baked into images, code, or logs
  • Local validation documented for the deployed patient population

16.6.6 Go/No-Go Decision

  • All items above verified and documented
  • Sign-off recorded from technical and clinical leads

This checklist should be completed and signed before every production deployment. Store completed checklists for audit purposes.