22  Writing the Field Guide

Clinical Context: Your AI system is built, validated, and approved. Now someone has to explain it to the radiologists, nurses, and physicians who will use it daily. The technical documentation that satisfied regulators won’t help a clinician at 3 AM decide whether to trust an AI recommendation. This chapter teaches you to write the documentation that bridges that gap—the field guide.

A field guide is a practical reference designed for use in the field. In clinical AI, it’s the document that helps healthcare professionals understand, use, and appropriately question AI tools in their practice. Writing effective field guides is a skill distinct from building models or satisfying regulators—and it’s essential for responsible deployment.

22.1 What Is a Field Guide Document?

22.1.1 Purpose

A field guide document serves a specific purpose: help clinicians use an AI tool appropriately. It answers the questions a clinician has when facing a real patient:

  • What does this tool do?
  • When should I use it?
  • When should I not use it?
  • How do I interpret the output?
  • When should I override it?
  • What do I do when it fails?

These are not the questions regulators ask; regulatory review focuses on safety and efficacy data. Field guide questions are operational and immediate.

22.1.2 Audience

Your audience is clinicians who will use the tool, not build it. This means:

  • They don’t need to know the architecture (ResNet vs. DenseNet)
  • They do need to know when the model struggles (portable X-rays, pediatric patients)
  • They don’t need training curves or loss functions
  • They do need calibration information and confidence interpretation
  • They don’t need code
  • They do need workflows and decision points

Assume clinical expertise and no machine learning background.

22.1.3 Length and Format

A field guide for a focused clinical AI tool should be 3-5 pages. Longer documents don’t get read in clinical settings. If you can’t explain appropriate use in 5 pages, the tool may be too complex for safe deployment.

Format for quick reference:

- Clear headings and subheadings
- Bullet points for key information
- Tables for thresholds and decision rules
- Boxed warnings for critical limitations
- Minimal prose, maximum information density

22.2 The Seven Components

Every clinical AI field guide should include these seven components.

22.2.1 1. Tool Summary

A single paragraph explaining what the tool does in plain language. This should answer: If a clinician has 30 seconds to understand this tool, what must they know?

Example:

> The Chest X-ray Pneumonia Detector analyzes PA chest radiographs to identify findings suggestive of pneumonia. It produces a probability score (0-100%) and highlights suspicious regions on the image. The tool is intended to assist—not replace—radiologist interpretation. It does not detect other pathologies.

Avoid jargon. “Convolutional neural network” becomes “AI system trained on chest X-rays.” Technical accuracy matters less than clinical clarity.

22.2.2 2. Intended Use and Scope

Explicitly state what the tool is designed for and what it is not designed for.

In scope:

- Patient populations (adults over 18, emergency department presentations)
- Image types (PA chest radiographs, standard equipment)
- Clinical contexts (triage support, second-reader confirmation)

Out of scope:

- Patient populations not validated (pediatrics, pregnant patients)
- Image types not tested (portable X-rays, CT scans)
- Clinical contexts not intended (autonomous diagnosis, ICU monitoring)

Be specific. “Adults” is less useful than “Adults 18-85 years, non-pregnant, without known immunocompromise.” The more precisely you define scope, the easier it is for clinicians to recognize when they’re outside it.

22.2.3 3. How It Works (Plain Language)

Explain the mechanism at a conceptual level—enough to understand behavior, not enough to implement.

Example:

> The system was trained on 150,000 chest X-rays from three academic medical centers, each reviewed by board-certified radiologists. It learns patterns in the images that correlate with pneumonia findings. When you submit a new image, the system compares it to these learned patterns and estimates the probability that pneumonia-consistent findings are present.
>
> The system works best on images similar to its training data: standard PA views from adult patients on stationary equipment. Performance may degrade on images that differ substantially (portable X-rays, unusual patient positioning, pediatric patients).

Avoid anthropomorphizing (“the AI thinks…”) but do explain enough that users understand why certain inputs might cause problems.

22.2.4 4. Performance Summary

Present key performance metrics with context that makes them interpretable.

Essential metrics:

- Sensitivity (true positive rate)
- Specificity (true negative rate)
- AUROC (overall discrimination)
- Positive predictive value (if prevalence is known)
- Negative predictive value (if prevalence is known)

Context that matters:

- What population were these measured on?
- What is the confidence interval?
- How does this compare to expert performance?
- Are there subgroups with different performance?

Example table:

| Metric | Value (95% CI) | Context |
|---|---|---|
| Sensitivity | 91% (88-94%) | At default threshold; comparable to expert radiologists |
| Specificity | 85% (82-88%) | False positive rate ~15% |
| AUROC | 0.94 (0.92-0.96) | Measured on internal test set; may differ in your population |
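
If you need to produce numbers like these from a held-out validation set, the sketch below shows one way to bootstrap the confidence intervals. It assumes NumPy and scikit-learn; the names `y_true`, `y_score`, and the 0.5 threshold are illustrative, not from any specific pipeline.

```python
# A minimal sketch of computing field-guide metrics with bootstrap 95% CIs.
# Assumes y_true (0/1 labels) and y_score (predicted probabilities) are
# NumPy arrays; names and the default threshold are illustrative.
import numpy as np
from sklearn.metrics import roc_auc_score

def metrics_with_ci(y_true, y_score, threshold=0.5, n_boot=2000, seed=0):
    """Print sensitivity, specificity, and AUROC with bootstrap 95% CIs."""
    rng = np.random.default_rng(seed)

    def compute(yt, ys):
        yp = (ys >= threshold).astype(int)
        sens = (yp[yt == 1] == 1).mean()  # true positive rate
        spec = (yp[yt == 0] == 0).mean()  # true negative rate
        return sens, spec, roc_auc_score(yt, ys)

    point = compute(y_true, y_score)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if y_true[idx].min() == y_true[idx].max():
            continue  # resample lost a class; AUROC is undefined there
        boots.append(compute(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(boots, [2.5, 97.5], axis=0)
    for name, p, l, h in zip(("Sensitivity", "Specificity", "AUROC"), point, lo, hi):
        print(f"{name}: {p:.2f} (95% CI {l:.2f}-{h:.2f})")
```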

Subgroup performance:

> Sensitivity drops to 82% on portable X-rays. Maintain a higher index of suspicion before accepting negative results from portable images.

22.2.5 5. Limitations and Failure Modes

This is the most important section for safe use. Clinicians must know when not to trust the tool.

Known limitations:

- Input types that degrade performance
- Patient populations with different accuracy
- Clinical scenarios where the tool may mislead

Failure modes:

- How does the tool fail? (False negatives vs. false positives)
- What does failure look like? (Confident wrong predictions vs. uncertain correct ones)
- Are there patterns to failure? (Misses subtle interstitial patterns)

Example:

> The tool may miss:
>
> - Early or subtle pneumonia (small infiltrates, interstitial patterns)
> - Pneumonia obscured by pleural effusion
> - Atypical presentations in immunocompromised patients
>
> The tool may over-call:
>
> - Atelectasis (may be flagged as consolidation)
> - Prior scarring
> - Motion artifact

Be honest. Hiding limitations doesn’t protect anyone—it creates the conditions for harm.

22.2.6 6. Human Oversight Rules

Specify when and how humans should override the AI.

When to override:

- Clinical presentation inconsistent with AI output
- Patient outside validated population
- Image quality concerns
- High-stakes decision points

Decision rules:

- AI positive, clinical suspicion low: Describe workflow (additional imaging? clinical observation?)
- AI negative, clinical suspicion high: Never let AI override clinical judgment for high-risk patients
- Uncertain cases: Escalation pathway

Example:

> If AI negative but clinical suspicion high:
> Do not discharge based on negative AI result alone. The tool misses 9% of pneumonia cases. If the patient has fever, productive cough, and respiratory distress, treat the clinical picture.
>
> If AI positive but clinical suspicion low:
> Review the highlighted regions. If they correspond to known chronic findings or artifact, your clinical judgment takes precedence. Document your reasoning.
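
Rules this crisp can also be surfaced by the ordering or reporting software at the moment of disagreement. A sketch, with hypothetical categories and messages paraphrasing the example above; this is an illustration, not a deployed clinical decision system:

```python
# Illustrative encoding of the override rules above. Categories, keys,
# and messages are hypothetical.
OVERRIDE_RULES = {
    ("negative", "high"): "Do not discharge on the AI result alone; the tool "
                          "misses ~9% of cases. Treat the clinical picture.",
    ("positive", "low"): "Review the heatmap. If highlights match chronic "
                         "findings or artifact, clinical judgment takes "
                         "precedence. Document your reasoning.",
}

def override_guidance(ai_result: str, clinical_suspicion: str) -> str:
    """Return the field-guide action for an AI/clinician disagreement."""
    return OVERRIDE_RULES.get(
        (ai_result, clinical_suspicion),
        "Uncertain case: escalate per the attending pathway.",
    )

print(override_guidance("negative", "high"))
```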

22.2.7 7. Monitoring and Feedback

Explain how to report problems and how the system is monitored.

Reporting issues:

- Who to contact for suspected errors
- How to document disagreements
- Feedback mechanism for improving the system

Ongoing monitoring:

- How is performance tracked?
- Who reviews alerts and reports?
- What would trigger system suspension?

Example:

> To report a concern:
> Submit a Safety Report through the EHR (Help → AI Tool Feedback). Include the patient MRN, the AI output, and your clinical assessment. All reports are reviewed within 48 hours.
>
> System monitoring:
> AI performance is tracked weekly by the Clinical AI Committee. If sensitivity drops below 85% on rolling 30-day data, the system is suspended pending investigation.
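
A suspension rule this concrete is straightforward to automate. Below is a sketch of the rolling-sensitivity check, assuming pandas and an adjudicated case log; the column names (`date`, `ai_positive`, `confirmed_pneumonia`) are illustrative, not from any specific EHR schema.

```python
# Sketch of the rolling 30-day sensitivity check behind the suspension
# rule above. Column names are illustrative.
import pandas as pd

def rolling_sensitivity(log: pd.DataFrame, window_days: int = 30) -> pd.Series:
    """Fraction of confirmed pneumonia cases the AI flagged, per rolling window."""
    cases = log[log["confirmed_pneumonia"]].set_index("date").sort_index()
    return cases["ai_positive"].astype(float).rolling(f"{window_days}D").mean()

# Toy adjudicated log: three confirmed cases, one missed by the AI.
log = pd.DataFrame({
    "date": pd.to_datetime(["2024-11-01", "2024-11-10", "2024-11-20"]),
    "ai_positive": [True, False, True],
    "confirmed_pneumonia": [True, True, True],
})
if (rolling_sensitivity(log) < 0.85).any():
    print("Rolling 30-day sensitivity below 85% -- suspend pending review.")
```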

22.3 Worked Example: Chest X-ray Classifier

Here is a complete field guide for the pneumonia classifier developed throughout this book.


22.3.1 Pneumonia Classifier Field Guide

Version 1.0 | December 2024

22.3.1.1 What This Tool Does

The Pneumonia Classifier analyzes PA chest radiographs to identify findings suggestive of pneumonia. It produces a probability score (0-100%) and a heatmap highlighting suspicious regions. The tool supports—but does not replace—radiologist interpretation.

22.3.1.2 When to Use

✓ Adult patients (18-85 years) presenting with respiratory symptoms
✓ Standard PA chest radiographs from stationary equipment
✓ Triage support and second-reader confirmation

22.3.1.3 When NOT to Use

✗ Pediatric patients (not validated)
✗ Portable X-rays (reduced sensitivity)
✗ CT scans or other imaging modalities
✗ Patients with known extensive lung disease (may confound)
✗ As the sole basis for treatment decisions

22.3.1.4 Performance

| Metric | Value | Notes |
|---|---|---|
| Sensitivity | 91% | Misses ~1 in 11 cases |
| Specificity | 85% | ~15% false positive rate |
| AUROC | 0.94 | Comparable to expert radiologists |

Subgroup considerations:

- Portable X-rays: Sensitivity drops to 82%
- Age >80: Sensitivity drops to 88%
- Immunocompromised: Not validated—use with caution

22.3.1.5 Interpreting Output

Probability score:

- 0-30%: Low probability—consider clinical picture
- 30-70%: Intermediate—clinical judgment critical
- 70-100%: High probability—correlate with presentation

Heatmap: Review highlighted regions. Do they correspond to:

- Anatomically plausible locations for pneumonia?
- Or known chronic findings, artifact, or atelectasis?

22.3.1.6 Known Failure Modes

The tool may miss:

- Subtle or early infiltrates
- Interstitial patterns
- Retrocardiac pneumonia

The tool may over-call:

- Atelectasis
- Chronic scarring
- Motion artifact

22.3.1.7 Override Rules

| Situation | Action |
|---|---|
| AI negative, clinical suspicion high | Do NOT rely on negative result. Treat the patient. |
| AI positive, clinical suspicion low | Review heatmap. Override if artifact or chronic finding. Document reasoning. |
| Any uncertainty | Escalate to attending radiologist |

22.3.1.8 Reporting Problems

- Submit feedback via EHR → Help → AI Tool Feedback
- Include: Patient MRN, AI output, your clinical assessment
- All reports reviewed within 48 hours


22.4 Common Mistakes

When reviewing field guides, these errors appear frequently:

22.4.1 Too Technical

Problem: Including architecture details, training curves, or code snippets.

Why it fails: Clinicians don’t need to know you used DenseNet-121. They need to know the tool struggles with portable X-rays.

Fix: Every technical detail should answer a clinical question. If it doesn’t, remove it.

22.4.2 Missing Failure Modes

Problem: Emphasizing performance without discussing when the tool fails.

Why it fails: Creates false confidence. Clinicians assume the tool works everywhere if limitations aren’t stated.

Fix: Dedicate at least 20% of the document to limitations and failure modes. Be specific about what kinds of errors occur.

22.4.3 No Escalation Path

Problem: Instructions for normal use without guidance for problems.

Why it fails: When something goes wrong at 3 AM, the clinician needs to know who to call and what to do.

Fix: Include explicit escalation paths, contact information, and fallback procedures.

22.4.4 Performance Without Context

Problem: Reporting metrics without prevalence, population, or comparison.

Why it fails: 91% sensitivity means very different things at different prevalences; the positive predictive value can swing from poor to strong as the disease becomes more common (the worked computation below makes this concrete). Without context, metrics are uninterpretable.

Fix: Always report:

- What population was tested
- What the confidence interval is
- How performance compares to current standard of care
- Subgroup variations
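
To see the prevalence effect directly, here is a short worked computation via Bayes' rule, reusing the 91% sensitivity and 85% specificity from the running example:

```python
# Worked example for "performance without context": the same 91% sensitivity
# and 85% specificity yield very different PPVs as prevalence changes.
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

for prev in (0.02, 0.10, 0.30):
    print(f"prevalence {prev:4.0%}: PPV = {ppv(0.91, 0.85, prev):.0%}")
# prevalence   2%: PPV = 11%
# prevalence  10%: PPV = 40%
# prevalence  30%: PPV = 72%
```

At 2% prevalence, nearly nine out of ten positive flags are false alarms; at 30%, most are real. The headline metrics are identical in both settings.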

22.4.5 Burying Critical Information

Problem: Hiding limitations in dense paragraphs at the end.

Why it fails: Critical safety information must be visible, not discovered through careful reading.

Fix: Use visual hierarchy. Box critical warnings. Put limitations near the top, not the bottom.

22.5 Templates

22.5.1 Minimal Field Guide Template

# [Tool Name] Field Guide
Version [X.X] | [Date]

## What This Tool Does
[One paragraph: What it does, what it outputs, what it's for]

## When to Use / When NOT to Use
[Bulleted lists for each]

## Performance Summary
[Table with key metrics, confidence intervals, and context]

## Limitations and Failure Modes
[Specific list of known weaknesses]

## Human Override Rules
[Table or decision tree for AI-human disagreement]

## Reporting Issues
[Contact information and process]

22.5.2 Extended Field Guide Template

For complex tools, add:

## How It Works (Plain Language)
[Conceptual explanation without technical jargon]

## Interpreting Output
[Detailed guidance on probability scores, visualizations, etc.]

## Integration with Clinical Workflow
[Where this fits in existing processes]

## Training Requirements
[What users need to know before use]

## Version History
[Changes from previous versions]

22.6 From Model Card to Field Guide

Model cards (Chapter 16, Chapter 21) and field guides serve different purposes:

| Aspect | Model Card | Field Guide |
|---|---|---|
| Audience | Technical reviewers, regulators | Clinical users |
| Purpose | Documentation for compliance | Guidance for appropriate use |
| Content | Training data, architecture, metrics | Workflows, decision rules, override guidance |
| Length | Comprehensive | Concise (3-5 pages) |
| Language | Technical acceptable | Plain language required |

A model card documents what the system is. A field guide explains how to use it. Both are necessary; neither replaces the other.

Workflow:

1. Complete model card during development (technical documentation)
2. Translate relevant portions into field guide for deployment (clinical documentation)
3. Update both when the model changes
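
One way to keep the two documents consistent through step 3 is to drive the field guide's performance section from the same structured record that feeds the model card. A sketch, with a hypothetical schema and field names:

```python
# Sketch: rendering the field-guide performance table from the structured
# record that also feeds the model card. The schema is hypothetical.
MODEL_RECORD = {
    "name": "Pneumonia Classifier",
    "version": "1.0",
    "metrics": {
        "Sensitivity": ("91%", "Misses ~1 in 11 cases"),
        "Specificity": ("85%", "~15% false positive rate"),
        "AUROC": ("0.94", "Comparable to expert radiologists"),
    },
}

def performance_section(record: dict) -> str:
    """Emit the field guide's performance table as Markdown."""
    lines = ["| Metric | Value | Notes |", "|---|---|---|"]
    for metric, (value, note) in record["metrics"].items():
        lines.append(f"| {metric} | {value} | {note} |")
    return "\n".join(lines)

print(performance_section(MODEL_RECORD))
```

When the model changes, updating the record regenerates the field guide table, so the two documents cannot silently drift apart.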

22.7 Exercises

  1. Write a field guide for an AI tool you’ve encountered or read about. Use the minimal template. Can you complete it from available information? What’s missing?

  2. Critique an existing document. Find documentation for a commercial clinical AI tool. Evaluate it against the seven components. What’s present? What’s absent? What would you add?

  3. Translate technical to clinical. Take a model card from this book (Chapter 16 or Chapter 21) and write the corresponding field guide. What technical details did you keep? What did you remove? What did you add?

  4. Test with a clinician. If you have access to a clinician (or are one), show them your field guide. Ask:

    • After reading this, would you know when to use this tool?
    • Would you know when NOT to use it?
    • Would you know what to do if you disagreed with its output?

    Revise based on their feedback.

  5. Design for failure. For a tool of your choice, write the “Limitations and Failure Modes” section. Be as specific as possible. Now ask: How would you design the user interface to make these limitations visible during use?

22.8 Chapter Summary

Field guides bridge the gap between technical documentation and clinical practice.

Key principles:

- Write for clinicians, not engineers
- 3-5 pages maximum—longer doesn't get read
- Seven essential components: summary, scope, mechanism, performance, limitations, override rules, monitoring
- Be specific about failure modes—this is the most important section
- Include escalation paths for when things go wrong

Common mistakes:

- Too technical (architecture instead of limitations)
- Missing failure modes (creates false confidence)
- No escalation path (leaves clinicians stranded)
- Performance without context (uninterpretable metrics)
- Buried critical information (must be visible)

Relationship to other documentation:

- Model cards document what the system is (technical)
- Field guides explain how to use it (clinical)
- Both are necessary; neither replaces the other

The best AI system in the world is useless if clinicians don’t understand when to trust it and when to question it. Writing that understanding into a clear, practical document is the final—and essential—step in responsible deployment.