16  AI Ops & Deployment

Developing a model is only half the battle. This chapter covers the operational infrastructure needed to deploy, serve, monitor, and maintain AI systems in clinical environments.

16.1 Containerization with Docker

Clinical Context: Your pneumonia classifier works perfectly on your laptop. But deploying to hospital servers with different Python versions, operating systems, and library configurations creates a nightmare of “it works on my machine.” Containers solve this by packaging your model with its entire runtime environment.

16.1.1 Why Containers?

Docker containers provide:

  • Reproducibility: Same environment everywhere—development, testing, production
  • Isolation: Model dependencies don’t conflict with hospital IT systems
  • Portability: Deploy to any infrastructure that runs Docker
  • Scalability: Spin up multiple containers to handle load

16.1.2 Basic Dockerfile for Model Serving

A Dockerfile defines the container’s environment:

# Start from Python base image
FROM python:3.10-slim

# Set working directory
WORKDIR /app

# Install dependencies first (for caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model and application code
COPY model.pt .
COPY app.py .

# Expose the port your API runs on
EXPOSE 8000

# Command to run when container starts
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Build and run:

# Build the image
docker build -t pneumonia-classifier:v1.0 .

# Run the container
docker run -p 8000:8000 pneumonia-classifier:v1.0

# Test the endpoint
curl http://localhost:8000/health

16.1.3 Best Practices for Medical AI Containers

  • Pin versions: Specify exact library versions in requirements.txt
  • Minimize image size: Use slim base images, multi-stage builds
  • No secrets in images: Use environment variables or secret managers
  • Health checks: Include endpoints to verify the model is functioning
  • Logging: Configure structured logging for audit trails

16.2 Model Serving with REST APIs

Clinical Context: The radiology PACS system needs to send images to your classifier and receive predictions. A REST API provides a standardized interface that any system can call, regardless of programming language.

16.2.1 FastAPI for Model Serving

FastAPI is a modern Python framework ideal for ML serving:

from fastapi import FastAPI, File, UploadFile
from pydantic import BaseModel
import torch
from torchvision import transforms
from PIL import Image
import io

app = FastAPI(title="Pneumonia Classifier API")

# Load model at startup
model = torch.load("model.pt", map_location="cpu")
model.eval()

# Define response schema
class PredictionResponse(BaseModel):
    prediction: str
    confidence: float
    model_version: str

# Preprocessing pipeline
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485], [0.229])
])

@app.get("/health")
def health_check():
    return {"status": "healthy", "model_loaded": True}

@app.post("/predict", response_model=PredictionResponse)
async def predict(file: UploadFile = File(...)):
    # Read and preprocess image
    image_bytes = await file.read()
    image = Image.open(io.BytesIO(image_bytes)).convert("L")
    tensor = preprocess(image).unsqueeze(0)

    # Run inference
    with torch.no_grad():
        logits = model(tensor)
        probs = torch.softmax(logits, dim=1)
        confidence, predicted = probs.max(dim=1)

    labels = ["Normal", "Pneumonia"]
    return PredictionResponse(
        prediction=labels[predicted.item()],
        confidence=confidence.item(),
        model_version="1.0.0"
    )

16.2.2 API Design Considerations

For clinical deployment:

  • Versioning: Include API version in URL (e.g., /v1/predict)
  • Input validation: Reject malformed requests before inference
  • Timeout handling: Set reasonable limits for inference time
  • Batch endpoints: Support multiple images per request for efficiency (see the sketch after this list)
  • Metadata: Return model version, timestamp, and confidence with every prediction
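
Several of these considerations can be combined in a single route. Below is a minimal sketch of a versioned batch endpoint that validates inputs before inference; it reuses the model, preprocess, labels, and imports from the serving example above, and the route name, batch limit, and accepted content types are illustrative choices, not fixed conventions.

from typing import List
from fastapi import HTTPException

MAX_BATCH_SIZE = 16  # illustrative limit; tune to your hardware and latency budget

@app.post("/v1/predict_batch")
async def predict_batch(files: List[UploadFile] = File(...)):
    # Reject malformed or oversized requests before running inference
    if len(files) > MAX_BATCH_SIZE:
        raise HTTPException(status_code=413, detail="Batch too large")

    tensors = []
    for f in files:
        if f.content_type not in ("image/png", "image/jpeg"):
            raise HTTPException(status_code=422, detail=f"Unsupported type: {f.content_type}")
        image_bytes = await f.read()
        image = Image.open(io.BytesIO(image_bytes)).convert("L")
        tensors.append(preprocess(image))

    # Run inference on the whole batch at once
    batch = torch.stack(tensors)
    with torch.no_grad():
        probs = torch.softmax(model(batch), dim=1)
    confidences, predicted = probs.max(dim=1)

    labels = ["Normal", "Pneumonia"]
    return {
        "model_version": "1.0.0",
        "results": [
            {"prediction": labels[p.item()], "confidence": c.item()}
            for p, c in zip(predicted, confidences)
        ],
    }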

16.2.3 Authentication and Security

Medical AI APIs must be secured:

from fastapi import Depends, HTTPException, Security
from fastapi.security import APIKeyHeader

API_KEY = "your-secret-key"  # In production, use env vars
api_key_header = APIKeyHeader(name="X-API-Key")

def verify_api_key(api_key: str = Security(api_key_header)):
    if api_key != API_KEY:
        raise HTTPException(status_code=403, detail="Invalid API key")
    return api_key

@app.post("/predict")
async def predict(
    file: UploadFile = File(...),
    api_key: str = Depends(verify_api_key)
):
    # ... prediction logic
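
Rather than hard-coding API_KEY as above, a common pattern is to read it from an environment variable or secret manager at startup. A minimal sketch; the variable name MODEL_API_KEY is illustrative:

import os

# Read the key injected by the deployment environment (e.g., an orchestrator-managed
# environment variable or mounted secret); never commit it to the image.
API_KEY = os.environ.get("MODEL_API_KEY")
if API_KEY is None:
    raise RuntimeError("MODEL_API_KEY environment variable is not set")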

16.3 Monitoring and Drift Detection

Clinical Context: Your model performed well during validation, but three months into deployment, accuracy has dropped. The patient population shifted—flu season brought different chest X-ray presentations. Without monitoring, you wouldn’t know until patient outcomes suffered.

16.3.1 What to Monitor

Model monitoring tracks three categories (a lightweight tracking sketch follows these lists):

1. System metrics (infrastructure health):

  • Request latency (p50, p95, p99)
  • Throughput (requests per second)
  • Error rates and types
  • CPU/memory/GPU utilization

2. Data metrics (input distribution):

  • Feature statistics (mean, variance, range)
  • Missing value rates
  • Input data quality scores

3. Model metrics (prediction behavior):

  • Prediction distribution (class frequencies)
  • Confidence score distribution
  • Outcome metrics when ground truth available
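
In production you would typically export these metrics to your monitoring stack (Prometheus, OpenTelemetry, or whatever hospital IT runs), but a lightweight in-process sketch makes the idea concrete. The record_request and metrics_summary helpers below are illustrative, not part of any library.

import time
from collections import Counter, deque

import numpy as np

# Rolling windows of recent observations (window sizes are illustrative)
latencies_ms = deque(maxlen=10_000)
confidences = deque(maxlen=10_000)
prediction_counts = Counter()

def record_request(label, confidence, start_time):
    """Record system and model metrics for a single prediction request."""
    latencies_ms.append((time.perf_counter() - start_time) * 1000)
    confidences.append(confidence)
    prediction_counts[label] += 1

def metrics_summary():
    """Summarize recent metrics for a dashboard or periodic report."""
    if not latencies_ms:
        return {}
    lat = np.array(latencies_ms)
    return {
        "latency_ms": {
            "p50": float(np.percentile(lat, 50)),
            "p95": float(np.percentile(lat, 95)),
            "p99": float(np.percentile(lat, 99)),
        },
        "prediction_distribution": dict(prediction_counts),
        "mean_confidence": float(np.mean(confidences)),
    }

Inside the /predict handler you would call record_request with the predicted label, its confidence, and the request start time just before returning the response.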

16.3.2 Detecting Data Drift

Data drift occurs when the distribution of incoming data shifts away from the distribution the model was trained on. Common detection methods:

Population Stability Index (PSI):

\[ \text{PSI} = \sum_i (A_i - E_i) \times \ln\left(\frac{A_i}{E_i}\right) \]

where \(A_i\) is actual proportion in bin \(i\) and \(E_i\) is expected (training) proportion.

  • PSI < 0.1: No significant drift
  • PSI 0.1–0.2: Moderate drift, investigate
  • PSI > 0.2: Significant drift, action required
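
PSI is straightforward to compute from binned histograms. A small sketch; the function name, bin count, and smoothing epsilon are illustrative choices:

import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """Compute PSI between a reference (training) sample and a current sample."""
    # Bin edges come from the reference distribution
    edges = np.histogram_bin_edges(expected, bins=n_bins)
    expected_counts, _ = np.histogram(expected, bins=edges)
    actual_counts, _ = np.histogram(actual, bins=edges)

    # Convert counts to proportions; a small epsilon avoids division by zero / log(0)
    eps = 1e-6
    e = expected_counts / expected_counts.sum() + eps
    a = actual_counts / actual_counts.sum() + eps

    return float(np.sum((a - e) * np.log(a / e)))

The resulting value can be compared directly against the thresholds above, for example using training-time confidences as the reference sample and the last week of production confidences as the current sample.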

Kolmogorov-Smirnov Test: Statistical test for whether two distributions differ.

import numpy as np
from scipy import stats

def detect_drift(reference_data, current_data, alpha=0.05):
    """Detect distribution drift with a two-sample KS test at significance level alpha."""
    statistic, p_value = stats.ks_2samp(reference_data, current_data)
    return {
        "drift_detected": p_value < alpha,
        "ks_statistic": statistic,
        "p_value": p_value
    }

# Monitor prediction confidence distribution
# (get_recent_confidences and alert_team stand in for your own data-access
# and alerting hooks)
reference_confidences = np.load("training_confidences.npy")
current_confidences = get_recent_confidences(last_n_days=7)

drift_result = detect_drift(reference_confidences, current_confidences)
if drift_result["drift_detected"]:
    alert_team("Confidence distribution drift detected")

16.3.3 Logging for Audit Trails

Healthcare regulations require comprehensive logging:

import logging
import json
from datetime import datetime

# Structured logging
logging.basicConfig(
    format='%(message)s',
    level=logging.INFO
)
logger = logging.getLogger(__name__)

def log_prediction(request_id, input_hash, prediction, confidence):
    log_entry = {
        "timestamp": datetime.utcnow().isoformat(),
        "request_id": request_id,
        "input_hash": input_hash,  # Don't log PHI
        "prediction": prediction,
        "confidence": confidence,
        "model_version": "1.0.0"
    }
    logger.info(json.dumps(log_entry))

16.4 Acceptance Testing and Validation

Clinical Context: Before deploying a model trained at Stanford to your community hospital, you need to verify it works for your patient population. Acceptance testing is the gatekeeper between development and deployment.

16.4.1 Local Validation Protocol

HW7 asks you to design acceptance testing. Key components:

1. Holdout test set: Reserve local data never seen during training

  • Minimum sample size for statistical power (often 200+ per class)
  • Representative of your patient demographics
  • Recent data (within last 6-12 months)

2. Performance thresholds: Define minimum acceptable metrics

  • AUROC ≥ 0.85 (or match published performance)
  • Sensitivity ≥ 0.90 for screening applications
  • Calibration error < 0.05

3. Subgroup analysis: Verify performance across:

  • Age groups
  • Sex
  • Disease severity
  • Scanner/equipment type

Putting items 2 and 3 into code (assuming scikit-learn metrics and an analyze_subgroups helper defined elsewhere):

from sklearn.metrics import recall_score, roc_auc_score

def run_acceptance_tests(model, test_data, config):
    """Run the acceptance test suite against a local holdout set."""
    results = {}

    # Overall performance (assumes a scikit-learn-style model interface)
    y_pred = model.predict(test_data.X)
    y_prob = model.predict_proba(test_data.X)[:, 1]

    results["auroc"] = roc_auc_score(test_data.y, y_prob)
    results["sensitivity"] = recall_score(test_data.y, y_pred)
    results["specificity"] = recall_score(test_data.y, y_pred, pos_label=0)

    # Check against thresholds
    results["auroc_pass"] = results["auroc"] >= config["min_auroc"]
    results["sensitivity_pass"] = results["sensitivity"] >= config["min_sens"]

    # Subgroup analysis (analyze_subgroups computes per-group metrics)
    for group_col in config["subgroup_columns"]:
        results[f"subgroup_{group_col}"] = analyze_subgroups(
            test_data, y_prob, group_col
        )

    return results

# Run and report
config = {
    "min_auroc": 0.85,
    "min_sens": 0.90,
    "subgroup_columns": ["age_group", "sex", "scanner_type"]
}
results = run_acceptance_tests(model, local_test_data, config)

16.4.2 Continuous Validation

Acceptance testing isn’t one-time. Establish ongoing validation:

  • Weekly/monthly performance reports
  • Automatic alerts when metrics drop below thresholds
  • Quarterly review with clinical stakeholders
  • Re-validation after any model update

16.5 Governance and Documentation

Clinical Context: A year from now, who knows why this model was deployed, what its limitations are, or who to contact when issues arise? Governance documentation ensures institutional knowledge persists.

16.5.1 Model Cards

A model card documents essential information:

# Pneumonia Classifier v1.0 - Model Card

## Model Details
- **Developer**: AI Team, University Hospital
- **Date**: December 2024
- **Version**: 1.0.0
- **Type**: Binary image classifier (ResNet-18)

## Intended Use
- **Primary use**: Chest X-ray triage for pneumonia
- **Users**: Radiologists, ED physicians
- **Out of scope**: Pediatric patients, CT images

## Training Data
- **Source**: ChestX-ray14, PneumoniaMNIST
- **Size**: 50,000 images
- **Demographics**: 55% male, mean age 52

## Performance
- **AUROC**: 0.92 (95% CI: 0.90-0.94)
- **Sensitivity**: 0.88 at 0.5 threshold
- **Specificity**: 0.85 at 0.5 threshold

## Limitations
- Lower performance on portable X-rays
- Not validated for immunocompromised patients
- Requires PA or AP view chest radiograph

## Ethical Considerations
- Subgroup analysis showed 3% lower AUROC for age >80
- Model should support, not replace, clinical judgment

16.5.2 Escalation Protocols

Define what happens when things go wrong:

  • Level 1: Performance drop of 5-10% from baseline → engineering review within 48 hours
  • Level 2: Performance drop of more than 10% → pause deployment, clinical review
  • Level 3: Patient safety event → immediate shutdown and incident report
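
The metric-driven levels can be encoded directly; Level 3 is triggered by a reported safety event rather than a metric. A minimal sketch, assuming the drop is measured as a relative change in AUROC against the validated baseline:

def escalation_level(baseline_auroc, current_auroc):
    """Map a relative AUROC drop to escalation levels 0-2 (Level 3 is event-driven)."""
    drop = (baseline_auroc - current_auroc) / baseline_auroc
    if drop > 0.10:
        return 2   # pause deployment, clinical review
    if drop >= 0.05:
        return 1   # engineering review within 48 hours
    return 0       # within normal variation, keep monitoring

# Example: baseline AUROC 0.92, current 0.85 -> ~7.6% relative drop -> Level 1
level = escalation_level(0.92, 0.85)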

16.5.3 Review Cadence

Establish regular governance meetings:

  • Monthly: Technical performance review
  • Quarterly: Clinical outcomes review with stakeholders
  • Annually: Full model re-evaluation and retraining decision

Document all decisions, including decisions not to act. If drift is detected but deemed acceptable, record why.

16.6 Quick Reference: Pre-Deployment Checklist

Before any clinical AI system goes live, verify:

16.6.1 Performance Validation

  • Acceptance tests passed on a local holdout set (AUROC, sensitivity, specificity against agreed thresholds)
  • Subgroup analysis completed (age, sex, disease severity, scanner/equipment type)
  • Calibration assessed

16.6.2 Infrastructure

  • Container builds reproducibly with pinned dependency versions
  • Health-check endpoint responds correctly
  • API secured (authentication, input validation, timeouts)

16.6.3 Monitoring

  • System, data, and model metrics instrumented
  • Drift detection and alert thresholds configured
  • Structured logging enabled for audit trails (no PHI in logs)

16.6.4 Governance

  • Model card completed and reviewed
  • Escalation protocol and points of contact documented
  • Review cadence scheduled with clinical stakeholders

16.6.5 Compliance

  • Logging and audit trails meet applicable healthcare regulations
  • No secrets or PHI baked into images, code, or logs
  • Local validation documented for the deployed patient population

16.6.6 Go/No-Go Decision

  • All items above verified and documented
  • Sign-off recorded from technical and clinical leads

This checklist should be completed and signed before every production deployment. Store completed checklists for audit purposes.