11  Large Language Models in Medicine

Large language models have rapidly moved from research curiosity to clinical deployment (Thirunavukarasu et al. 2023). In just a few years, these systems have demonstrated remarkable capabilities—passing medical licensing exams, generating clinical documentation, and engaging in diagnostic reasoning. This chapter explores what makes modern LLMs different from the BERT-era models we covered in Chapter 10, surveys the landscape of medical LLMs, and introduces retrieval-augmented generation as the key pattern for clinical deployment.

11.1 From BERT to GPT: The Generative Turn

Clinical Context: A hospitalist receives an alert that a patient’s clinical note has been auto-generated from their conversation. The system didn’t just extract information—it synthesized a coherent narrative, organized by problem, with appropriate medical terminology. This generative capability represents a fundamental shift from earlier NLP systems that could only classify or extract.

Chapter 10 introduced transformer architectures and BERT-style models that excel at understanding text—classifying notes, extracting entities, predicting outcomes. These encoder models process text bidirectionally, building rich representations useful for downstream tasks.

Modern large language models take a different approach. Rather than encoding text for classification, they generate text token by token. This generative capability, combined with massive scale, has produced systems with remarkably flexible capabilities.

The key architectural difference is simple: while BERT uses the transformer encoder to build representations, models like GPT use the transformer decoder to predict the next token. Given “The patient presents with chest pain and,” the model predicts “shortness” as a likely next token, then “of,” then “breath.” This autoregressive generation, trained on internet-scale text, produces models that can write, summarize, translate, reason, and more—all from the same basic capability.

What surprised researchers was what happened at scale. Models with billions of parameters, trained on trillions of tokens, developed capabilities that smaller models lacked entirely. A 1-billion parameter model might generate grammatical text; a 100-billion parameter model can engage in multi-step reasoning, follow complex instructions, and adapt to new tasks from a few examples. These emergent abilities—capabilities that appear suddenly as models scale—have transformed what’s possible with clinical AI.

# The shift from classification to generation
# BERT-style: encode text, classify output
from transformers import AutoModelForSequenceClassification

# Classify a clinical note (the classification head added here is newly
# initialized and must be fine-tuned on labeled notes before use)
bert_model = AutoModelForSequenceClassification.from_pretrained(
    "emilyalsentzer/Bio_ClinicalBERT"
)
# Output: probability of each class

# GPT-style: generate text token by token
from openai import OpenAI
client = OpenAI()

# Generate a response
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Summarize this clinical note: ..."}]
)
# Output: generated text summary

For medicine, this generative capability opens new possibilities. Instead of training separate models for summarization, question answering, and translation, a single LLM can perform all these tasks—and many others—through natural language instructions.

11.2 Anatomy of a Modern LLM

Clinical Context: When a health system evaluates LLMs for clinical use, they encounter unfamiliar terminology: “instruction-tuned,” “RLHF,” “context window.” Understanding these concepts is essential for evaluating which models suit which clinical needs and anticipating their limitations.

Modern LLMs are built through a multi-stage training process, each stage shaping the model’s capabilities and behavior.

11.2.1 Pretraining: Learning Language at Scale

The foundation is pretraining—training a transformer to predict the next token on massive text corpora. This corpus typically includes web pages, books, code, scientific papers, and more. The model sees trillions of tokens and learns statistical patterns: grammar, facts, reasoning patterns, even some medical knowledge from included textbooks and papers.

Pretraining is enormously expensive—millions of dollars in compute for frontier models. This cost means only a few organizations can train foundation models from scratch. Most medical AI development uses pretrained models as starting points.

The scaling laws discovered by researchers show predictable relationships: model performance improves smoothly as you increase model size, training data, and compute. This predictability has driven the push toward ever-larger models, though returns eventually diminish.
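In equation form, these laws take a power-law shape. The sketch below is illustrative only: the constants are placeholders in the spirit of published fits, not values from any specific study.

def estimated_loss(n_params: float, n_tokens: float) -> float:
    """Illustrative power-law scaling: loss falls smoothly with scale."""
    irreducible = 1.7        # loss floor (placeholder value)
    a, alpha = 400.0, 0.34   # parameter-count term (placeholder values)
    b, beta = 410.0, 0.28    # training-data term (placeholder values)
    return irreducible + a / n_params**alpha + b / n_tokens**beta

for n_params in (1e9, 1e10, 1e11):
    print(f"{n_params:.0e} params: estimated loss {estimated_loss(n_params, 1e12):.2f}")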

11.2.2 Instruction Tuning: Learning to Follow Directions

A pretrained model is a powerful but unwieldy tool. Ask it a question, and it might continue the text in unexpected ways rather than answering directly. Instruction tuning transforms this text completion engine into an assistant that follows directions.

The process is straightforward: train the model on thousands of (instruction, response) pairs. “Summarize this article:” paired with a good summary. “What are the side effects of metformin?” paired with an accurate, helpful answer. After instruction tuning, the model learns the pattern of responding helpfully to requests.

# Instruction tuning examples (conceptual)
instruction_examples = [
    {
        "instruction": "List the contraindications for aspirin use.",
        "response": "Aspirin is contraindicated in patients with: 1) Known hypersensitivity to NSAIDs, 2) Active peptic ulcer disease, 3) Bleeding disorders, 4) Third trimester of pregnancy, 5) Children with viral infections (risk of Reye's syndrome)..."
    },
    {
        "instruction": "Explain this lab result to a patient: Hemoglobin A1c: 8.2%",
        "response": "Your hemoglobin A1c test measures your average blood sugar over the past 2-3 months. A result of 8.2% indicates that your blood sugar has been higher than the target range..."
    }
]

11.2.3 RLHF: Aligning with Human Preferences

Instruction tuning teaches the model to respond, but not necessarily to respond well. Reinforcement Learning from Human Feedback (RLHF) addresses this by training the model to produce outputs that humans prefer.

The process works in stages:

  1. Generate multiple responses to each prompt
  2. Have humans rank responses by quality (helpfulness, accuracy, safety)
  3. Train a “reward model” to predict human preferences
  4. Fine-tune the LLM to maximize reward model scores

RLHF is how models learn to be helpful rather than technically correct but unhelpful, to refuse harmful requests, and to acknowledge uncertainty. For medical applications, RLHF can encode preferences for accuracy, appropriate hedging, and safety-conscious responses.
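To make step 3 concrete, here is a minimal PyTorch sketch of the reward-modeling objective: the reward model is trained so that responses humans preferred score higher than rejected ones (the standard pairwise formulation). The scores themselves would come from a reward network built on the LLM, which is omitted here.

import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor,
                      reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: preferred responses should outscore rejected ones."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scores for three (chosen, rejected) response pairs
chosen = torch.tensor([2.1, 0.3, 1.5])
rejected = torch.tensor([1.0, 0.8, -0.2])
print(reward_model_loss(chosen, rejected))  # lower loss = better separation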

11.2.4 Context Windows: Working Memory

A critical practical consideration is the context window—how much text the model can process at once. Early GPT models had 2,048 tokens (~1,500 words); current models handle 100,000+ tokens (entire textbooks).

For clinical applications, context window determines what’s possible:

  • Short context (4K tokens): Single clinical notes, brief conversations
  • Medium context (32K tokens): Multiple notes, longer conversations with history
  • Long context (100K+ tokens): Entire patient charts, full guidelines, extensive retrieval

# Context window usage for clinical tasks
# Approximate token counts for clinical documents

document_tokens = {
    "Brief clinic note": 200,
    "Discharge summary": 1500,
    "Operative report": 800,
    "Full H&P": 2000,
    "Radiology report": 300,
    "Week of ICU notes": 10000,
    "Year of outpatient records": 50000,
}

# With a 32K context window, you can include:
# - Current encounter notes
# - Recent relevant history
# - Pertinent guidelines
# - Your specific question

Longer contexts enable richer clinical reasoning but increase computational cost and latency. Choosing the right context length involves balancing completeness against speed and cost.
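Before assembling a prompt, it helps to check whether the pieces will actually fit. A minimal sketch using the tiktoken tokenizer (one option among several; exact counts vary by model):

import tiktoken

def fits_in_context(texts: list[str], context_limit: int = 32_000,
                    reserve_for_output: int = 1_000) -> bool:
    """Estimate whether a set of documents plus an output budget fits the window."""
    enc = tiktoken.encoding_for_model("gpt-4")
    total_tokens = sum(len(enc.encode(t)) for t in texts)
    return total_tokens + reserve_for_output <= context_limit

# Example (hypothetical variables): current note + recent history + a guideline excerpt
# fits_in_context([current_note, recent_history, guideline_excerpt])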

11.3 The Medical LLM Landscape

Clinical Context: A health system’s AI governance committee must evaluate LLM options. Do they use a general-purpose model like GPT-4, a medical-specific model like Med-PaLM, or an open model they can run locally? Each choice has implications for performance, privacy, cost, and control.

The landscape of LLMs relevant to medicine includes both general-purpose models applied to healthcare and models specifically developed for medical applications.

11.3.1 General-Purpose Models in Medicine

The most capable LLMs are general-purpose models trained on broad internet data:

GPT-4 and GPT-4o (OpenAI): Currently among the most capable models, GPT-4 has demonstrated strong performance on medical benchmarks, passing USMLE-style exams with scores above 80% (Nori et al. 2023). Many health systems pilot GPT-4 for documentation, patient messaging, and clinical decision support. GPT-4o adds multimodal capabilities for processing images alongside text.

Claude (Anthropic): Known for strong reasoning capabilities and longer context windows (up to 200K tokens). Claude’s emphasis on helpfulness and honesty makes it appealing for clinical applications where nuanced, accurate responses matter.

Gemini (Google): Google’s multimodal model family, with variants ranging from efficient to highly capable. Gemini’s integration with Google’s infrastructure appeals to organizations in that ecosystem.

These general-purpose models weren’t trained specifically for medicine, yet their scale and broad training produce surprisingly strong medical performance. They’ve absorbed medical knowledge from textbooks, papers, and clinical discussions in their training data.

11.3.2 Medical-Specific Models

Some models are specifically developed for medical applications:

Med-PaLM and Med-PaLM 2 (Google): Built on Google’s PaLM architecture with medical-specific training (Singhal, Azizi, et al. 2023). Med-PaLM 2 achieved 85%+ on USMLE-style questions and was the first AI system to reach “expert” level on several medical benchmarks (Singhal, Tu, et al. 2023). However, it’s not publicly available.

Meditron (EPFL): An open-weights model based on Llama 2, further trained on medical literature including PubMed papers and clinical guidelines. Available in 7B and 70B parameter versions.

BioMistral: Mistral fine-tuned on biomedical literature. Smaller and more efficient than some alternatives while maintaining reasonable medical performance.

OpenBioLLM: Open-source medical LLM family with various size options, designed to be fine-tuned on institutional data.

11.3.3 Open vs. Closed: Tradeoffs for Healthcare

The choice between open and closed models involves significant tradeoffs:

Factor         Closed (GPT-4, Claude)                Open (Llama, Mistral)
Performance    Currently highest                     Rapidly improving
Data privacy   Data sent to external API             Can run locally
Cost           Per-token pricing                     Infrastructure costs
Control        Limited customization                 Full control
Compliance     BAA available from major providers    Self-managed compliance
Updates        Automatic (may change behavior)       You control versions

For many clinical applications, the highest-performing option is a closed commercial API with a Business Associate Agreement (BAA). For applications involving sensitive data or requiring local control, open models offer an alternative—with the tradeoff of managing infrastructure and potentially lower baseline performance.

11.3.4 Decision Framework

When selecting an LLM for a clinical application:

  1. What’s the task? Documentation, summarization, and patient communication are well-served by current models. Novel clinical reasoning requires more caution.

  2. What’s the data sensitivity? PHI requirements may mandate local deployment or BAA-covered APIs.

  3. What’s the latency requirement? Real-time clinical workflows need fast inference; batch processing can tolerate slower, more capable models.

  4. What’s the accuracy requirement? High-stakes decisions need the most capable models with human oversight; administrative tasks can use lighter models.

  5. What’s the budget? API costs scale with usage; local deployment has fixed infrastructure costs.

11.4 Retrieval-Augmented Generation

Clinical Context: A clinician asks an LLM about drug interactions for a patient on multiple medications. The model gives a plausible but outdated answer—it doesn’t know about a recent FDA warning. Retrieval-augmented generation solves this by grounding LLM responses in current, authoritative sources.

LLMs have a fundamental limitation: their knowledge is frozen at training time. They may confidently state outdated information, hallucinate drug names that don’t exist, or miss recent guideline changes. Retrieval-Augmented Generation (RAG) addresses this by giving the model access to external knowledge sources at inference time.

11.4.1 The RAG Architecture

RAG combines two systems:

  1. Retriever: Finds relevant documents from a knowledge base
  2. Generator: Uses retrieved documents to produce grounded responses

The workflow:

User query → Retriever finds relevant documents →
Documents + query sent to LLM → LLM generates grounded response

This architecture has several advantages for clinical AI:

  • Current information: The knowledge base can be updated without retraining the model
  • Verifiable sources: Responses can cite specific documents
  • Institutional knowledge: RAG can incorporate local guidelines, formularies, and protocols
  • Reduced hallucination: Grounding in retrieved text constrains model outputs

11.4.2 Building a Clinical RAG System

Let’s build a RAG system for drug information. The key components are:

1. Document Processing: Split knowledge sources into chunks suitable for retrieval.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Sample drug information (in practice, this would be a comprehensive database)
drug_documents = [
    """
    METFORMIN HYDROCHLORIDE

    Indications: Type 2 diabetes mellitus as monotherapy or in combination with other agents.

    Contraindications: Severe renal impairment (eGFR <30 mL/min/1.73m²), acute or chronic
    metabolic acidosis including diabetic ketoacidosis.

    Warnings: Lactic acidosis is a rare but serious complication. Risk increases with renal
    impairment, age >65, radiologic studies with contrast, surgery, excessive alcohol intake.
    Hold metformin before iodinated contrast procedures in patients with eGFR 30-60.

    Drug Interactions: Carbonic anhydrase inhibitors may increase risk of lactic acidosis.
    Alcohol potentiates effect on lactate metabolism.

    Dosing: Initial 500mg twice daily or 850mg once daily. Maximum 2550mg/day in divided doses.
    """,
    """
    LISINOPRIL

    Indications: Hypertension, heart failure, acute MI (within 24 hours in hemodynamically
    stable patients), diabetic nephropathy.

    Contraindications: History of angioedema related to previous ACE inhibitor therapy,
    hereditary or idiopathic angioedema, concomitant use with aliskiren in diabetic patients.

    Warnings: Angioedema can occur at any time during treatment. Higher risk in Black patients.
    Can cause hyperkalemia, especially with renal impairment or potassium supplements.
    Fetal toxicity - discontinue as soon as pregnancy detected.

    Drug Interactions: NSAIDs may reduce antihypertensive effect and worsen renal function.
    Potassium-sparing diuretics increase hyperkalemia risk. Lithium levels may increase.

    Dosing: Hypertension: Initial 10mg once daily. Usual range 20-40mg/day.
    """,
    """
    WARFARIN SODIUM

    Indications: Prophylaxis and treatment of venous thromboembolism, prophylaxis and
    treatment of thromboembolic complications associated with atrial fibrillation and/or
    cardiac valve replacement, reduction in risk of death and thromboembolic events post-MI.

    Contraindications: Pregnancy (except in women with mechanical heart valves), hemorrhagic
    tendencies, recent or contemplated surgery of CNS or eye, major regional lumbar block
    anesthesia, malignant hypertension.

    Drug Interactions: EXTENSIVE interaction profile. CYP2C9 inhibitors increase effect
    (fluconazole, amiodarone, metronidazole). CYP2C9 inducers decrease effect (rifampin,
    carbamazepine). Vitamin K-containing foods affect response. Many antibiotics alter
    INR - monitor closely with any antibiotic initiation or discontinuation.

    Monitoring: INR target typically 2.0-3.0 for most indications. More frequent monitoring
    needed when starting, stopping, or changing interacting medications.
    """
]

# Split into chunks for retrieval
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "]
)

chunks = []
for doc in drug_documents:
    chunks.extend(text_splitter.split_text(doc))

print(f"Created {len(chunks)} chunks from {len(drug_documents)} documents")

2. Embedding and Indexing: Convert chunks to vectors for similarity search.

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Create embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Build vector store
vectorstore = FAISS.from_texts(chunks, embeddings)

# Test retrieval
query = "What are the drug interactions for metformin?"
relevant_docs = vectorstore.similarity_search(query, k=3)

print("Retrieved documents:")
for i, doc in enumerate(relevant_docs):
    print(f"\n--- Document {i+1} ---")
    print(doc.page_content[:200] + "...")

3. Generation with Context: Send retrieved documents to the LLM with the user query.

from openai import OpenAI

client = OpenAI()

def rag_query(question: str, vectorstore, k: int = 3) -> str:
    """Answer a question using RAG."""

    # Retrieve relevant documents
    docs = vectorstore.similarity_search(question, k=k)
    context = "\n\n".join([doc.page_content for doc in docs])

    # Generate response grounded in retrieved context
    prompt = f"""You are a clinical pharmacist assistant. Answer the question based ONLY on
the provided drug information. If the information needed is not in the context, say so.
Cite which drug's information you're using.

DRUG INFORMATION:
{context}

QUESTION: {question}

ANSWER:"""

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    return response.choices[0].message.content

# Example queries
print(rag_query("Can I give metformin to a patient getting a CT with contrast?", vectorstore))
print("\n" + "="*50 + "\n")
print(rag_query("What should I monitor when starting warfarin with an antibiotic?", vectorstore))

11.4.3 Production RAG Considerations

Moving from prototype to production requires addressing several challenges:

Chunking Strategy: How documents are split affects retrieval quality. Semantic chunking (by topic) often outperforms fixed-size chunking. For clinical documents, splitting by section headers preserves context.
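To illustrate, a minimal sketch of section-aware chunking for clinical notes; the header list is an assumption and would need to match your institution’s note templates.

import re

# Illustrative section headers; real templates vary by institution
SECTION_PATTERN = re.compile(
    r"^(CHIEF COMPLAINT|HISTORY OF PRESENT ILLNESS|MEDICATIONS|ALLERGIES|"
    r"PHYSICAL EXAM|ASSESSMENT AND PLAN)\s*:?",
    re.MULTILINE | re.IGNORECASE
)

def split_by_section(note: str) -> list[str]:
    """Split a clinical note into section-level chunks, keeping headers with their content."""
    starts = [m.start() for m in SECTION_PATTERN.finditer(note)]
    if not starts:
        return [note]  # fall back to the whole note if no headers are found
    starts.append(len(note))
    return [note[s:e].strip() for s, e in zip(starts[:-1], starts[1:])]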

Embedding Model Selection: Medical-specific embedding models (e.g., PubMedBERT embeddings) may outperform general embeddings for clinical content.

Retrieval Enhancement: Techniques like hybrid search (combining dense embeddings with keyword search), reranking retrieved documents, and query expansion improve relevance.
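As a sketch of hybrid search, the snippet below blends BM25 keyword scores with dense scores from the FAISS index built earlier; it assumes the rank_bm25 package, reuses the chunks and vectorstore variables from above, and uses an illustrative weighting.

from rank_bm25 import BM25Okapi

# Keyword index over the same chunks used for the dense index
bm25 = BM25Okapi([chunk.lower().split() for chunk in chunks])

def hybrid_retrieve(query: str, k: int = 3, alpha: float = 0.5) -> list[str]:
    """Blend dense and keyword relevance; alpha weights the dense score."""
    # Dense scores from FAISS (smaller distance = more similar)
    dense_hits = vectorstore.similarity_search_with_score(query, k=len(chunks))
    dense = {doc.page_content: 1.0 / (1.0 + dist) for doc, dist in dense_hits}

    # Keyword scores from BM25, normalized to [0, 1]
    keyword_scores = bm25.get_scores(query.lower().split())
    max_kw = float(max(keyword_scores)) or 1.0

    combined = {
        chunk: alpha * dense.get(chunk, 0.0) + (1 - alpha) * (kw / max_kw)
        for chunk, kw in zip(chunks, keyword_scores)
    }
    return sorted(combined, key=combined.get, reverse=True)[:k]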

Citation and Traceability: Clinical applications need to cite sources. Include document metadata and present sources alongside generated answers.

Knowledge Base Maintenance: Establish processes for updating the knowledge base as guidelines change, new drugs are approved, and warnings are issued.

# Production-grade RAG with citations
def rag_with_citations(question: str, vectorstore, k: int = 3) -> dict:
    """RAG query that returns answer with source citations."""

    docs = vectorstore.similarity_search(question, k=k)

    # Build context with source markers
    context_parts = []
    sources = []
    for i, doc in enumerate(docs):
        context_parts.append(f"[Source {i+1}]: {doc.page_content}")
        sources.append({
            "id": i + 1,
            "content": doc.page_content[:100] + "...",
            "metadata": doc.metadata if hasattr(doc, 'metadata') else {}
        })

    context = "\n\n".join(context_parts)

    prompt = f"""Answer based on the provided sources. Cite sources using [Source N] format.

SOURCES:
{context}

QUESTION: {question}

ANSWER (with citations):"""

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )

    return {
        "answer": response.choices[0].message.content,
        "sources": sources
    }

11.5 Clinical Applications

Clinical Context: Across healthcare, organizations are piloting LLMs for diverse applications. Some are transforming workflows; others remain research prototypes. Understanding where LLMs deliver value—and where they fall short—guides practical deployment decisions.

11.5.1 Documentation and Ambient AI

The most successful clinical LLM applications reduce documentation burden. Ambient AI scribes listen to patient-clinician conversations and generate clinical notes automatically.

The workflow typically involves:

  1. Audio recording of the clinical encounter (with consent)
  2. Speech-to-text conversion
  3. LLM processing to structure content into note format
  4. Clinician review and editing
  5. Final note signed and filed

Products like Nuance DAX, Abridge, and Suki have demonstrated significant time savings—some studies show 50% reduction in documentation time. The key is that clinicians review and sign the generated note; the AI augments rather than replaces the documentation process.

# Conceptual note generation from transcript
def generate_clinical_note(transcript: str, note_type: str = "SOAP") -> str:
    """Generate a structured clinical note from a conversation transcript."""

    prompt = f"""Convert this patient-physician conversation into a {note_type} note.
Follow standard medical documentation conventions. Use appropriate medical terminology.
Include only information explicitly stated or clearly implied in the conversation.

CONVERSATION TRANSCRIPT:
{transcript}

CLINICAL NOTE ({note_type} FORMAT):"""

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3
    )

    return response.choices[0].message.content

# Example usage
transcript = """
Doctor: Good morning, how are you feeling today?
Patient: Not great. This cough has been going on for about two weeks now.
Doctor: Is it a dry cough or are you bringing up any phlegm?
Patient: Mostly dry, but sometimes a little yellow mucus.
Doctor: Any fever, chills, or shortness of breath?
Patient: I had a low fever a few days ago, maybe 99 or 100. No real shortness of breath,
         but I do feel a little winded going up stairs.
Doctor: Are you a smoker?
Patient: I quit about five years ago. Smoked for about 20 years before that.
Doctor: Let me listen to your lungs... I hear some scattered rhonchi bilaterally.
        Given your history and symptoms, I think this is acute bronchitis.
        I'd recommend supportive care - rest, fluids, honey for the cough.
        If it's not better in another week, or if you develop high fever or
        significant shortness of breath, come back and we may need to do a chest X-ray.
Patient: Should I take any antibiotics?
Doctor: For acute bronchitis, antibiotics usually don't help since it's typically viral.
        Let's hold off unless things get worse.
"""

# note = generate_clinical_note(transcript)

11.5.2 Clinical Decision Support

LLMs can assist with diagnostic reasoning, treatment planning, and evidence synthesis. However, these applications require careful implementation:

Differential Diagnosis Assistance: Given patient presentation, LLMs can generate comprehensive differential diagnoses. Useful as a cognitive aid to ensure rare conditions aren’t overlooked.
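A minimal sketch in the style of the earlier examples, reusing the OpenAI client defined above; the prompt wording is illustrative, and the output is a draft for clinician review, not a diagnosis.

def differential_diagnosis_aid(presentation: str, max_items: int = 8) -> str:
    """Draft a differential as a cognitive aid for clinician review."""
    prompt = f"""List up to {max_items} diagnoses to consider for this presentation,
ordered from most to least likely. For each, give one line of supporting or
opposing evidence from the presentation. Flag any can't-miss diagnoses.

PRESENTATION:
{presentation}

DIFFERENTIAL (for clinician review):"""

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2
    )
    return response.choices[0].message.content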

Treatment Suggestions: LLMs can suggest evidence-based treatments, though these should always be verified against current guidelines and patient-specific factors.

Drug Interaction Checking: Combined with RAG over drug databases, LLMs can identify and explain potential interactions in complex medication regimens.

The consistent theme: LLMs augment clinical reasoning rather than replace it. They’re most valuable when they surface information clinicians might miss, not when they make autonomous decisions.

11.5.3 Patient Communication

LLMs excel at transforming clinical information into patient-friendly language:

Portal Message Responses: Draft responses to patient messages for clinician review. Can handle routine questions while flagging urgent concerns.

Patient Education: Generate personalized education materials based on specific diagnoses and treatment plans.

Discharge Instructions: Create clear, understandable discharge instructions tailored to the patient’s conditions and literacy level.

def draft_portal_response(
    patient_message: str,
    patient_context: str,
    response_type: str = "draft"
) -> str:
    """Draft a response to a patient portal message."""

    prompt = f"""You are helping a physician respond to a patient message.
Draft a compassionate, clear response appropriate for a patient portal.
Use plain language (8th grade reading level). Be helpful but don't diagnose.
For urgent symptoms, advise seeking immediate care.

PATIENT CONTEXT:
{patient_context}

PATIENT MESSAGE:
{patient_message}

DRAFT RESPONSE:"""

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.5
    )

    return response.choices[0].message.content

11.5.4 Literature Synthesis

Keeping up with medical literature is impossible—thousands of papers publish daily. LLMs can help:

Search and Summarization: RAG over PubMed to find and summarize relevant papers for a clinical question.

Evidence Synthesis: Summarize findings across multiple studies on a topic.

Guideline Navigation: Answer questions about specific guideline recommendations with citations.

These applications benefit enormously from RAG to ground responses in actual literature rather than potentially outdated or hallucinated citations.
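As a sketch of the retrieval half, the snippet below queries NCBI’s public E-utilities API for PubMed IDs and titles; retrieved abstracts would then be passed to the LLM just as in the RAG examples earlier. Endpoint parameters are kept minimal, and production use should respect NCBI’s rate limits.

import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def search_pubmed(question: str, max_results: int = 5) -> list[dict]:
    """Return PubMed IDs and titles relevant to a clinical question."""
    ids = requests.get(
        f"{EUTILS}/esearch.fcgi",
        params={"db": "pubmed", "term": question, "retmode": "json",
                "retmax": max_results},
    ).json()["esearchresult"]["idlist"]
    if not ids:
        return []
    summaries = requests.get(
        f"{EUTILS}/esummary.fcgi",
        params={"db": "pubmed", "id": ",".join(ids), "retmode": "json"},
    ).json()["result"]
    return [{"pmid": pmid, "title": summaries[pmid]["title"]} for pmid in ids]

# papers = search_pubmed("SGLT2 inhibitors heart failure with preserved ejection fraction")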

11.6 Beyond Prompting: LLM System Architecture

Clinical Context: A health system wants to deploy an LLM for medication reconciliation. A single prompt won’t suffice—the system must query the medication database, check for interactions, verify against the patient’s allergy list, and present findings for pharmacist review. This is system design, not just prompting.

Prompting is necessary but not sufficient for clinical deployment. Production systems require architecture that handles complexity, failures, and human oversight.

11.6.1 Agent Architectures

When a single prompt isn’t enough, agent architectures decompose tasks into verifiable steps:

Tool use: The LLM decides when to query external resources—drug databases, clinical calculators, guideline repositories. Instead of hallucinating a drug interaction, it looks it up.

# Conceptual sketch of a tool-using agent. `llm`, `check_drug_interactions`,
# `calculate_dose`, and `search_guidelines` are placeholders for a real LLM
# client and institutional tool implementations.
def medication_review_agent(patient_meds, query):
    """Agent that uses tools to answer medication questions (conceptual)."""

    # Registry of tools the agent is allowed to call
    tools = {
        "drug_interactions": check_drug_interactions,
        "dosing_calculator": calculate_dose,
        "guidelines": search_guidelines
    }

    # LLM plans which tools to use, and with what parameters
    plan = llm.plan_tool_use(query, available_tools=list(tools.keys()))

    # Execute each planned tool call and collect the results
    results = []
    for step in plan:
        tool_result = tools[step.tool](step.parameters)
        results.append(tool_result)

    # LLM synthesizes tool outputs into a grounded response
    return llm.synthesize(query, tool_results=results)

Multi-step workflows: Complex tasks are broken into verifiable stages. Each stage can be validated before proceeding.

Human-in-loop checkpoints: Critical decision points pause for clinician review before continuing. The system proposes; the human disposes.

11.6.2 Error Handling and Graceful Degradation

Production systems must handle failures gracefully:

Retrieval failures: What happens when RAG returns no relevant documents? Options include:

  • Acknowledge the gap explicitly (“I couldn’t find relevant guidelines for this question”)
  • Fall back to general knowledge with clear caveats
  • Escalate to human review

Confidence thresholds: When should the system refuse to answer? Low-confidence outputs can be flagged for review rather than presented as definitive.

Graceful degradation: When the LLM is unavailable or responding slowly, the system should have fallback behavior—perhaps simpler rule-based logic or direct routing to human review.

# Conceptual sketch: `llm.generate`, `response.confidence`, and `log_error`
# are placeholders for your LLM client, a calibrated confidence estimate,
# and institutional error logging.
def robust_clinical_query(query, context):
    """Handle LLM failures gracefully."""
    try:
        response = llm.generate(query, context, timeout=30)

        # Flag low-confidence outputs for human review instead of
        # presenting them as definitive
        if response.confidence < 0.7:
            return {
                "response": response.text,
                "flag": "LOW_CONFIDENCE",
                "action": "ROUTE_TO_REVIEW"
            }
        return {"response": response.text, "flag": "OK"}

    except TimeoutError:
        # Slow responses break clinical workflows; fall back to a human
        return {
            "response": None,
            "flag": "TIMEOUT",
            "action": "FALLBACK_TO_HUMAN"
        }
    except Exception as e:
        log_error(e)
        return {
            "response": None,
            "flag": "ERROR",
            "action": "FALLBACK_TO_HUMAN"
        }

11.6.3 Monitoring for Production

Deployed LLM systems need ongoing monitoring:

Hallucination rate tracking: Sample outputs for factual verification. If hallucination rates increase, investigate prompt drift or model updates.

Query distribution drift: Are users asking questions outside the system’s intended scope? Monitor query patterns to catch scope creep.

Latency budgets: Clinical workflows have time constraints. A documentation assistant that takes 30 seconds is usable; one that takes 3 minutes is not.

User feedback loops: Clinicians should have easy mechanisms to flag incorrect or unhelpful outputs. This data feeds continuous improvement.
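A minimal sketch of per-query telemetry that supports these checks; the field names, latency budget, and audit rate are illustrative.

import random
import time

def log_llm_interaction(query: str, response: str, latency_s: float,
                        audit_rate: float = 0.05) -> dict:
    """Record per-query telemetry and sample a fraction for human fact-checking."""
    record = {
        "timestamp": time.time(),
        "query_length": len(query),
        "response_length": len(response),
        "latency_s": latency_s,
        "over_latency_budget": latency_s > 10.0,        # illustrative budget
        "sampled_for_review": random.random() < audit_rate,
    }
    # In production: write to a monitoring store and route sampled records
    # to a clinician review queue for factual verification.
    return record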

The difference between a demo and a deployment is everything that happens when things go wrong.

11.7 Evaluation and Benchmarks

Clinical Context: A hospital evaluating an LLM for clinical use asks: “How do we know if it’s good enough?” Benchmark scores provide one answer, but understanding what benchmarks measure—and what they don’t—is essential for deployment decisions.

11.7.1 Medical Exam Benchmarks

The most common LLM benchmarks test performance on medical licensing exam questions:

MedQA: USMLE-style multiple choice questions. GPT-4 achieves ~86%, Med-PaLM 2 achieves ~85%.

MedMCQA: Questions from Indian medical entrance exams. Tests breadth of medical knowledge.

PubMedQA: Questions answerable from PubMed abstracts. Tests scientific reasoning.

MMLU Medical Subsets: Medical portions of the Massive Multitask Language Understanding benchmark.
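A minimal sketch of how such multiple-choice benchmarks are scored; the item format here is a simplification, and real harnesses need more careful answer extraction.

def score_multiple_choice(items: list[dict], answer_fn) -> float:
    """Accuracy on items shaped like {'question', 'options': {letter: text}, 'answer'}."""
    correct = 0
    for item in items:
        prompt = item["question"] + "\n" + "\n".join(
            f"{letter}. {text}" for letter, text in item["options"].items()
        ) + "\nAnswer with a single letter."
        prediction = answer_fn(prompt).strip()[:1].upper()  # crude answer extraction
        correct += prediction == item["answer"]
    return correct / len(items)

# accuracy = score_multiple_choice(medqa_items, answer_fn=my_model)  # hypothetical inputs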

These benchmarks demonstrate that LLMs have absorbed substantial medical knowledge. However, exam performance doesn’t directly translate to clinical utility:

  • Exams test knowledge recall; clinical practice requires judgment
  • Exam questions are unambiguous; clinical scenarios rarely are
  • Exams have single correct answers; treatment decisions involve tradeoffs
  • Exams don’t test communication, empathy, or practical constraints

11.7.2 Clinical Task Evaluation

More relevant evaluations assess performance on actual clinical tasks:

Note Summarization: Can the model accurately summarize a clinical note? Evaluated by clinician review.

Information Extraction: Does the model correctly identify medications, diagnoses, and procedures from text?

Referral Letter Generation: Are generated referral letters accurate, complete, and appropriately formatted?

Clinical Reasoning: Given a case presentation, does the model’s reasoning process make sense?

These evaluations require clinician time, making them expensive but more predictive of real-world value.

11.7.3 Limitations of Benchmarks

Several limitations constrain what benchmarks tell us:

Data Contamination: Training data may include benchmark questions, inflating scores without reflecting true capability.

Distribution Shift: Benchmarks may not represent your institution’s patient population, documentation style, or clinical questions.

Static Evaluation: Benchmarks are fixed; clinical practice evolves with new treatments, guidelines, and evidence.

Missing Dimensions: Benchmarks rarely test safety (does the model refuse harmful requests?), uncertainty (does it know what it doesn’t know?), or fairness (does it perform equally across patient populations?).

For deployment decisions, benchmark scores are a starting point, not a destination. Prospective evaluation on your institution’s data, with your clinicians, remains essential.

11.8 Limitations, Safety, and the Path Forward

Clinical Context: A well-known case involved an LLM providing detailed instructions for a medication regimen that would have been harmful. The model was confident, articulate, and wrong. Understanding LLM limitations isn’t pessimism—it’s essential for safe deployment.

11.8.1 Hallucinations: The Fundamental Challenge

LLMs generate plausible text, but plausibility doesn’t guarantee accuracy. Hallucinations—confident statements of false information—are inherent to how these models work.

In medicine, hallucinations are particularly dangerous:

  • Fabricated drug names that don’t exist
  • Incorrect dosages (decimal point errors can be lethal)
  • Made-up citations to papers that were never published
  • Contraindications stated as indications

No current technique eliminates hallucinations entirely. RAG reduces them by grounding in retrieved sources. Careful prompting can encourage hedging and uncertainty expression. But the risk remains.

Mitigation strategies:

  1. Always have clinician review before acting on LLM outputs
  2. Use RAG for factual claims, especially drug information
  3. Build systems that verify specific facts (dosages, interactions) against authoritative databases, as in the sketch below
  4. Design interfaces that present LLM outputs as suggestions, not decisions
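For strategy 3, a minimal sketch of verifying a stated daily dose against an authoritative limit before it reaches the clinician; the single entry below is taken from the metformin monograph excerpt earlier, and a real system would query a maintained drug database.

# Placeholder limits -- in practice, query an authoritative drug database
MAX_DAILY_DOSE_MG = {"metformin": 2550}  # from the monograph excerpt above

def check_daily_dose(drug: str, daily_dose_mg: float) -> str:
    """Flag LLM-stated doses that exceed the labeled maximum."""
    limit = MAX_DAILY_DOSE_MG.get(drug.lower())
    if limit is None:
        return "UNKNOWN_DRUG: route to pharmacist review"
    if daily_dose_mg > limit:
        return f"EXCEEDS_MAX: {daily_dose_mg}mg > {limit}mg/day"
    return "WITHIN_LABELED_RANGE"

# check_daily_dose("metformin", 3000)  # would flag a hallucinated dose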

11.8.2 Training Data Issues

LLMs inherit biases and limitations from their training data:

Outdated Information: Training cutoffs mean models don’t know about recent FDA approvals, guideline changes, or drug recalls.

Bias Amplification: Training data reflects historical biases in medical literature and documentation. Models may perpetuate disparities.

Memorization: Models may memorize and regurgitate training data, including copyrighted content or potentially patient information if present in training data.

Geographic and Cultural Bias: Training data skews toward English-language, Western medical practice. Recommendations may not apply globally.

11.8.3 Automation Bias

Perhaps the most insidious risk is automation bias—the tendency for clinicians to over-rely on AI suggestions. Studies consistently show that people anchor on automated recommendations, sometimes overriding their own correct judgment.

This risk increases when:

  • AI outputs appear confident and authoritative
  • Time pressure limits careful review
  • The AI has been previously reliable
  • The output confirms existing expectations

Designing systems that encourage critical evaluation rather than rubber-stamping is a UX challenge as much as an AI challenge.

11.8.3.1 Designing Against Automation Bias

Automation bias is insidious because it correlates with system reliability—the better a tool works, the more users trust it blindly. Several system design patterns can reduce this risk:

Require explicit confirmation. Force clinicians to actively approve high-stakes outputs rather than passively accept defaults. Instead of “Accept recommendation?”, require “I have reviewed this recommendation and confirmed it is appropriate for this patient.”

Show uncertainty. Display confidence indicators or highlight low-confidence predictions. “85% confident” invites scrutiny; a definitive “POSITIVE” does not. When the model is uncertain, make that uncertainty impossible to miss.

Randomize presentation. Occasionally withhold AI output for a subset of cases and compare clinician performance. This maintains calibration and provides ongoing validation data.

Make AI suggestions visually distinct. Use different fonts, colors, or layouts so AI output is never mistaken for clinician documentation. The source of every recommendation should be immediately obvious.

Build in friction for high-stakes decisions. Require a pause between seeing AI output and acting on it. Time delays reduce anchoring effects.

Clinical example: A sepsis prediction alert fires for a patient. The nurse, trusting the system’s track record, initiates the sepsis bundle. But the patient has a known chronic condition that mimics sepsis markers—the model was technically correct (elevated risk score) but clinically inappropriate (not acute sepsis). Designing the alert to require explicit confirmation—“Confirm: Patient does not have [known confounders]”—would have introduced the friction needed to catch this case.

11.8.4 Regulatory Landscape

The regulatory framework for clinical LLMs is evolving:

FDA Jurisdiction: Software that diagnoses or recommends treatment may be a regulated medical device. Documentation assistance may not be. The boundaries are still being clarified.

Liability: If an LLM contributes to patient harm, liability may fall on the clinician, health system, or vendor. Legal frameworks are not yet established.

Documentation Requirements: Using AI in clinical care may need to be documented. Standards are institution-specific.

Updates and Drift: When vendors update models, behavior changes. This complicates validation and may require re-evaluation.

11.8.5 The Human-AI Collaboration Model

The path forward isn’t autonomous AI replacing clinicians—it’s thoughtful human-AI collaboration:

  1. AI augments human capability: Handles information synthesis, documentation, and routine tasks
  2. Humans provide judgment: Make final decisions, especially for complex or high-stakes situations
  3. Appropriate trust calibration: Understand where AI is reliable and where it needs verification
  4. Continuous monitoring: Track AI performance and catch degradation

This collaborative model captures the benefits of AI—speed, consistency, tirelessness—while maintaining the judgment, empathy, and accountability that medicine requires.

11.9 Appendix 11A: Running Open Medical LLMs

For applications requiring local deployment—whether for data privacy, cost control, or customization—open-weights models offer an alternative to cloud APIs. This appendix covers practical considerations for running medical LLMs locally.

11.9.1 Hardware Requirements

LLM inference is primarily memory-bound. The key constraint is fitting the model in memory:

Model Size       Memory Required                      Typical Hardware
7B parameters    14GB (FP16), 4-7GB (quantized)       Consumer GPU (RTX 3090/4090)
13B parameters   26GB (FP16), 7-13GB (quantized)      Professional GPU (A6000) or multi-GPU
70B parameters   140GB (FP16), 35-70GB (quantized)    Multi-GPU server or cloud instance

Quantization reduces memory requirements by representing weights with fewer bits. A 4-bit quantized model uses ~1/4 the memory of the original with modest quality degradation.
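The arithmetic behind the table is simple; a quick sketch of the weights-only estimate (actual usage is higher once activations and the KV cache are included):

def weight_memory_gb(n_params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

for bits in (16, 4):
    print(f"7B model at {bits}-bit: ~{weight_memory_gb(7, bits):.1f} GB")
# 16-bit: ~14 GB; 4-bit: ~3.5 GB -- roughly a 4x reduction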

11.9.2 Local Deployment with Ollama

Ollama provides the simplest path to running open models locally:

# Install Ollama (macOS/Linux)
# curl -fsSL https://ollama.com/install.sh | sh

# Pull a medical-capable model
# ollama pull meditron:7b
# ollama pull openbiollm:8b

# Using Ollama from Python
import requests

def query_local_llm(prompt: str, model: str = "meditron:7b") -> str:
    """Query a locally running Ollama model."""

    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False
        }
    )

    return response.json()["response"]

# Example usage
# response = query_local_llm("What are the first-line treatments for type 2 diabetes?")

11.9.3 Using llama.cpp for Efficiency

For maximum efficiency on consumer hardware, llama.cpp provides optimized inference:

# Using llama-cpp-python
from llama_cpp import Llama

# Load a quantized model
llm = Llama(
    model_path="./models/meditron-7b-Q4_K_M.gguf",
    n_ctx=4096,  # Context window
    n_threads=8,  # CPU threads
    n_gpu_layers=35  # Layers to offload to GPU
)

# Generate response
output = llm(
    "Patient presents with chest pain and elevated troponin. Differential diagnosis:",
    max_tokens=256,
    temperature=0.7,
    stop=["Patient:", "\n\n"]
)

print(output["choices"][0]["text"])

11.9.4 When Local Deployment Makes Sense

Local deployment is appropriate when:

  1. Data cannot leave premises: Strict PHI requirements without cloud BAA options
  2. High volume, predictable load: Fixed infrastructure costs beat per-token pricing
  3. Customization needed: Fine-tuning on institutional data
  4. Latency critical: Avoiding network round-trips for real-time applications
  5. Research and experimentation: Testing many models without API costs

Local deployment is less appropriate when:

  1. Maximum capability required: Cloud APIs offer more capable models
  2. Variable or uncertain load: Per-token pricing more efficient
  3. Limited ML infrastructure expertise: Cloud APIs are simpler to integrate
  4. Rapid iteration: New cloud models available immediately; local requires updates

11.9.5 Security Considerations

Running models locally doesn’t automatically ensure security:

  • Models may still leak training data through memorization
  • Logging and access controls need institutional implementation
  • Model files themselves need secure storage
  • Inference servers need network security like any service

Local deployment shifts responsibility from vendor to institution—this requires appropriate expertise and resources.