10  Sequence Models & Transformers

Clinical medicine generates vast amounts of text: admission notes, progress notes, discharge summaries, radiology reports, pathology reports. This unstructured data contains rich clinical information that structured EHR fields miss. This chapter introduces the neural architectures that process sequential data—from early recurrent networks to the transformer architecture that powers modern language models.

10.1 From Images to Sequences

Clinical Context: A discharge summary might span 2,000 words, describing a patient’s hospital course from admission through treatment to discharge planning. Unlike a chest X-ray (fixed 224×224 pixels), this text has variable length, and understanding any sentence may require context from paragraphs earlier. How do we build neural networks for such data?

10.1.1 The Sequence Processing Challenge

Chapter 7’s feedforward networks and Chapter 8’s CNNs assume fixed-size inputs. An image is always 224×224×3; we design the network architecture accordingly. But clinical text varies dramatically in length—a triage note might be 50 words, a comprehensive discharge summary 3,000 words.

Three key properties distinguish sequence data:

Variable length. We need architectures that handle inputs of any length without redesigning the network.

Order matters. “Chest pain without shortness of breath” describes a different presentation than “shortness of breath without chest pain.” The same words in a different order convey a different clinical picture.

Long-range dependencies. A pronoun at word 500 might refer to a concept introduced at word 50. The model must track information across long spans.
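
In practice, the variable-length issue is handled by padding a batch of sequences to a common length and tracking a mask that marks which positions are real tokens. A minimal sketch with made-up token IDs (the same idea reappears later as the tokenizer's attention_mask):

import torch
from torch.nn.utils.rnn import pad_sequence

notes = [
    torch.tensor([12, 407, 9]),             # a 3-token triage note
    torch.tensor([12, 88, 301, 66, 2, 9]),  # a 6-token progress note
]
padded = pad_sequence(notes, batch_first=True, padding_value=0)  # shape (2, 6)
mask = (padded != 0).long()   # 1 = real token, 0 = padding
print(padded)
print(mask)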

10.1.2 Why CNNs Aren’t Enough

You could apply 1D convolutions to text (and some models do), but convolutions have a limited receptive field. A 3-word convolution kernel sees local context but misses dependencies spanning hundreds of words. Stacking many layers expands the receptive field, but it’s inefficient for very long-range dependencies.
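
A rough calculation makes the receptive-field point concrete for stride-1, undilated convolutions: covering a 500-token dependency this way would take roughly 250 stacked layers (dilation helps, but attention, introduced below, connects any two positions in a single step).

def receptive_field(num_layers, kernel_size=3):
    # stride-1, undilated convolutions grow the receptive field by (kernel_size - 1) per layer
    return 1 + num_layers * (kernel_size - 1)

for layers in [1, 4, 16, 64]:
    print(f"{layers:3d} conv layers -> sees {receptive_field(layers):4d} tokens")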

We need architectures designed for sequences from the ground up.

10.2 Recurrent Neural Networks

Clinical Context: Imagine reading a clinical note word by word, maintaining a mental summary as you go. When you encounter “allergic to penicillin” halfway through the note, you update your understanding; this information influences how you interpret all subsequent medication mentions. Recurrent neural networks formalize this process.

10.2.1 The Sequential Processing Idea

A recurrent neural network (RNN) processes sequences one element at a time, maintaining a hidden state that summarizes everything seen so far:

\[ h_t = f(h_{t-1}, x_t) \]

At each timestep \(t\):

  1. Take the previous hidden state \(h_{t-1}\)
  2. Combine it with the current input \(x_t\)
  3. Produce a new hidden state \(h_t\)

The hidden state acts as the network’s “memory”—a compressed representation of the sequence so far. For classification, we typically use the final hidden state \(h_T\) as the sequence representation.

import torch
import torch.nn as nn

# Simple RNN for sequence classification
class SimpleRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, seq_len) token indices
        embedded = self.embedding(x)  # (batch, seq_len, embed_dim)
        output, hidden = self.rnn(embedded)  # hidden: (1, batch, hidden_dim)
        return self.classifier(hidden.squeeze(0))

10.2.2 The Vanishing Gradient Problem

Simple RNNs struggle with long sequences. During backpropagation, gradients must flow backward through every timestep. With hundreds of steps, gradients either:

  • Vanish: Shrink exponentially, making early timesteps unlearnable
  • Explode: Grow exponentially, causing numerical instability

In practice, simple RNNs effectively “forget” information from more than 10-20 timesteps back—useless for a 500-word clinical note where the diagnosis in sentence 3 affects interpretation of medications in sentence 30.
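
A quick, illustrative experiment (random inputs and weights, not a clinical model) shows the effect: backpropagating from the final hidden state of a 200-step sequence, the gradient reaching the first timestep is typically orders of magnitude smaller than the gradient at the last timestep. Exact numbers depend on the random seed.

import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=32, hidden_size=64, batch_first=True)

x = torch.randn(1, 200, 32, requires_grad=True)   # one 200-step sequence
output, hidden = rnn(x)
hidden.sum().backward()                            # gradient of final hidden state w.r.t. inputs

grad_norms = x.grad.norm(dim=-1).squeeze(0)        # per-timestep gradient norm
print(f"gradient norm at t=0:   {grad_norms[0].item():.2e}")
print(f"gradient norm at t=199: {grad_norms[-1].item():.2e}")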

10.2.3 LSTM: Learning What to Remember

Long Short-Term Memory (LSTM) networks address vanishing gradients through gating mechanisms (Hochreiter and Schmidhuber 1997). An LSTM cell maintains two states:

  • Hidden state \(h_t\): Short-term, working memory
  • Cell state \(c_t\): Long-term memory, information can persist unchanged

Three gates control information flow:

  • Forget gate: What to erase from long-term memory
  • Input gate: What new information to store
  • Output gate: What to expose to the next layer

# LSTM for clinical text classification
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.classifier = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)
        output, (hidden, cell) = self.lstm(embedded)
        # Concatenate forward and backward final hidden states
        hidden_cat = torch.cat([hidden[0], hidden[1]], dim=1)
        return self.classifier(hidden_cat)

Bidirectional LSTMs process the sequence in both directions, capturing both preceding and following context. For clinical text, this often improves performance—understanding “chest pain” might depend on both what came before (history) and after (resolved vs. ongoing).

10.2.4 Why RNNs Fell Behind

Despite improvements like LSTM, recurrent architectures have fundamental limitations:

Sequential processing. Each timestep depends on the previous one—no parallelization. Training on long sequences is slow.

Long-range dependencies still hard. Even LSTMs struggle with dependencies spanning hundreds of tokens. Information must pass through every intermediate step.

Fixed hidden state size. The entire sequence history must compress into a fixed-size vector, creating a bottleneck.

These limitations motivated the search for better architectures—leading to attention.

10.3 The Attention Revolution

Clinical Context: When a radiologist reads “consolidation in the right lower lobe consistent with pneumonia,” they don’t give equal weight to every word. “Consolidation,” “right lower lobe,” and “pneumonia” are diagnostic; “in,” “the,” and “with” are structural. Attention mechanisms let neural networks learn which parts of the input to focus on.

10.3.1 Attention as Weighted Combination

The core attention idea: instead of compressing the entire sequence into a single hidden state, let the model look back at all positions and decide which are relevant.

Given a query (what we’re looking for) and a set of key-value pairs (the sequence):

  1. Compare the query to each key (compute similarity scores)
  2. Convert scores to weights (softmax)
  3. Return a weighted combination of values

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V \]

The \(\sqrt{d_k}\) scaling prevents the dot products from growing too large in high dimensions.

10.3.2 Self-Attention: Every Position Attends to Every Other

Self-attention applies attention within a single sequence. Every position can attend to every other position, allowing direct connections between distant tokens.

For a clinical note, the word “pneumonia” at position 200 can directly attend to “cough” at position 10 and “fever” at position 45—no need to pass information through 190 intermediate steps.

import torch
import torch.nn.functional as F

def self_attention(x, d_k):
    """
    x: (batch, seq_len, d_model) - input embeddings
    Returns: (batch, seq_len, d_model) - attended representations
    """
    # In self-attention, Q, K, V all come from the same input
    Q = K = V = x

    # Compute attention scores
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)

    # Convert to probabilities
    attention_weights = F.softmax(scores, dim=-1)

    # Weighted combination of values
    return torch.matmul(attention_weights, V)
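
A quick shape check of this sketch with random embeddings. Note that real transformers first pass the input through learned projections to form Q, K, and V; the multi-head version below adds exactly that.

x = torch.randn(2, 10, 64)            # batch of 2 sequences, 10 tokens each, d_model = 64
attended = self_attention(x, d_k=64)
print(attended.shape)                 # torch.Size([2, 10, 64])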

10.3.3 Why Attention Changes Everything

Parallelization. Unlike RNNs, attention computes all positions simultaneously. Training is dramatically faster on GPUs.

Direct long-range connections. Any two positions connect in one step, regardless of distance. No vanishing gradients across the sequence.

Interpretability. Attention weights show which words the model focuses on, providing some insight into its reasoning.

The attention mechanism is the core innovation enabling transformers and all modern language models.

10.4 The Transformer Architecture

Clinical Context: In 2017, the paper “Attention Is All You Need” introduced the transformer—an architecture built entirely on attention, with no recurrence (Vaswani et al. 2017). Within a few years, transformers dominated NLP. Models like BERT and GPT, both based on transformers, now underpin most clinical NLP applications.

10.4.1 The Encoder-Decoder Structure

The original transformer was designed for translation (English → German). It has two parts:

  • Encoder: Processes the input sequence, producing contextual representations
  • Decoder: Generates the output sequence, attending to both previous outputs and encoder representations

For classification tasks, we typically use only the encoder (BERT-style models). For generation tasks, we use only the decoder (GPT-style models) or the full encoder-decoder (T5, BART).

10.4.2 Multi-Head Attention

Instead of a single attention operation, transformers use multi-head attention—running several attention operations in parallel, each learning different relationships:

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch_size, seq_len, d_model = x.shape

        # Project to Q, K, V
        Q = self.W_q(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        # Attention for each head
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.d_k ** 0.5)
        attention = F.softmax(scores, dim=-1)
        context = torch.matmul(attention, V)

        # Concatenate heads and project
        context = context.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)
        return self.W_o(context)
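
A toy usage example to confirm the shapes: the module maps a (batch, seq_len, d_model) tensor to another tensor of the same shape.

mha = MultiHeadAttention(d_model=256, num_heads=8)
x = torch.randn(4, 128, 256)   # (batch, seq_len, d_model)
out = mha(x)
print(out.shape)               # torch.Size([4, 128, 256])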

Different heads might learn to attend to different things: one head for syntactic relationships (subject-verb), another for semantic relationships (disease-symptom), another for coreference (pronoun-antecedent).

10.4.3 Positional Encoding

Attention treats the input as a set—it has no inherent notion of order. “Patient has fever” and “Fever has patient” would produce identical attention patterns without intervention.

Positional encodings inject position information. The original transformer uses sinusoidal functions:

\[ PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d}) \] \[ PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d}) \]

These are added to the input embeddings, giving each position a unique signature. Learned positional embeddings (used in BERT) work similarly.
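
A minimal sketch of the sinusoidal scheme (assuming an even \(d_{model}\), as in standard configurations); the resulting matrix is simply added to the token embeddings:

import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    # assumes d_model is even, as in standard transformer configurations
    position = torch.arange(max_len).unsqueeze(1)                                   # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)    # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)    # odd dimensions
    return pe                                       # add to embeddings: x + pe[:seq_len]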

10.4.4 The Transformer Block

A transformer encoder layer combines:

  1. Multi-head self-attention
  2. Add & normalize (residual connection + layer normalization)
  3. Feed-forward network (two linear layers with nonlinearity)
  4. Add & normalize

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention with residual
        attended = self.attention(x)
        x = self.norm1(x + self.dropout(attended))

        # Feed-forward with residual
        ff_out = self.ff(x)
        x = self.norm2(x + self.dropout(ff_out))

        return x

Stack 6-24 of these blocks, and you have a transformer encoder.
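
As an illustrative sketch (not a production model), stacking the blocks above with token and learned positional embeddings gives a small BERT-style encoder:

class TinyTransformerEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=256, num_heads=8, d_ff=1024,
                 num_layers=6, max_len=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_embedding = nn.Embedding(max_len, d_model)   # learned positions, BERT-style
        self.blocks = nn.ModuleList(
            [TransformerBlock(d_model, num_heads, d_ff) for _ in range(num_layers)]
        )

    def forward(self, x):
        # x: (batch, seq_len) token indices
        positions = torch.arange(x.size(1), device=x.device)
        h = self.embedding(x) + self.pos_embedding(positions)
        for block in self.blocks:
            h = block(h)
        return h   # (batch, seq_len, d_model) contextual representations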

10.5 BERT and Pretrained Language Models

Clinical Context: Training a transformer from scratch requires enormous data—far more than any single hospital has. The breakthrough insight: pretrain on massive general text, then fine-tune on your specific clinical task. A model pretrained on all of Wikipedia and BookCorpus learns general language understanding; fine-tuning adapts it to predict ICU mortality from clinical notes.

10.5.1 The Pretrain-Then-Finetune Paradigm

Pretraining: Train a large transformer on unlabeled text using self-supervised objectives. The model learns language structure, world knowledge, and reasoning patterns.

Fine-tuning: Take the pretrained model, add a task-specific head (e.g., classification layer), and train on your labeled dataset. The pretrained weights provide a strong starting point.

This paradigm transformed NLP. Instead of training models from scratch on limited clinical data, we leverage knowledge from billions of words of text.

10.5.2 BERT: Bidirectional Encoder Representations

BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only transformer pretrained with two objectives (Devlin et al. 2019):

Masked Language Modeling (MLM): Randomly mask 15% of tokens; train the model to predict them from context. “Patient presented with [MASK] pain” → “chest”

Next Sentence Prediction (NSP): Given two sentences, predict whether the second follows the first in the original text. (Later research showed this is less important.)

BERT processes text bidirectionally—each token sees both left and right context—making it powerful for understanding tasks like classification and extraction.
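
The MLM objective is easy to see in action with the HuggingFace fill-mask pipeline and a general-domain checkpoint; the completions are shown only as an illustration and will vary by model. A general-domain model often produces clinically shallow completions, which motivates the clinical variants described next.

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("Patient presented with [MASK] pain."):
    print(f"{candidate['token_str']:>12}  score={candidate['score']:.3f}")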

10.5.3 Clinical Language Model Variants

General BERT trained on Wikipedia doesn’t know medical terminology. Several clinical variants exist:

| Model | Training Data | Best For |
|-------|---------------|----------|
| BioBERT (Lee et al. 2020) | PubMed abstracts + PMC full text | Biomedical literature, research applications |
| PubMedBERT | PubMed abstracts only | Similar to BioBERT, sometimes better |
| ClinicalBERT (Alsentzer et al. 2019) | MIMIC-III clinical notes | Clinical notes, EHR text |
| Bio+ClinicalBERT | PubMed + MIMIC-III | Hybrid applications |

When to use which:

  • Processing clinical notes (discharge summaries, progress notes): ClinicalBERT
  • Processing biomedical literature (research papers, guidelines): PubMedBERT or BioBERT
  • Mixed content: Bio+ClinicalBERT or experiment with both

from transformers import AutoTokenizer, AutoModel

# Load ClinicalBERT
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

# Tokenize clinical text
text = "Patient presents with acute chest pain radiating to left arm."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Get contextual embeddings
outputs = model(**inputs)
embeddings = outputs.last_hidden_state  # (batch, seq_len, 768)

# Use [CLS] token embedding for classification
cls_embedding = embeddings[:, 0, :]  # (batch, 768)

10.6 Putting It Together: Clinical Text Classification

Clinical Context: You’re tasked with building a model to predict 30-day hospital readmission from discharge summaries. This is a classic clinical NLP task: take unstructured text, extract relevant information, and make a binary prediction. We’ll fine-tune ClinicalBERT for this task.

10.6.1 Data Preparation

Clinical text requires careful preprocessing:

from transformers import AutoTokenizer
import torch
from torch.utils.data import Dataset, DataLoader

class ClinicalTextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=512):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]

        encoding = self.tokenizer(
            text,
            truncation=True,
            max_length=self.max_length,
            padding='max_length',
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            'label': torch.tensor(label, dtype=torch.long)
        }

# Load tokenizer and create datasets
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

train_dataset = ClinicalTextDataset(train_texts, train_labels, tokenizer)
val_dataset = ClinicalTextDataset(val_texts, val_labels, tokenizer)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16)

Handling long documents: BERT’s maximum sequence length is 512 tokens. Discharge summaries often exceed this. Options:

  • Truncation: Keep first 512 tokens (may lose important end information)
  • Chunking: Split into overlapping chunks, aggregate predictions
  • Hierarchical models: Encode chunks separately, then combine
  • Longformer/BigBird: Transformer variants designed for long sequences

For many tasks, truncation works surprisingly well—the beginning of clinical notes often contains the most critical information.
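
When truncation is not acceptable, the chunking option above can be sketched as follows, assuming a fast tokenizer (the AutoTokenizer default) and a fine-tuned sequence-classification model like the one trained in the next subsection. The aggregation rule (max vs. mean over chunks) is a design choice worth validating.

import torch

def predict_long_note(text, tokenizer, model, max_length=512, stride=128):
    # return_overflowing_tokens yields one row per overlapping chunk (fast tokenizers only)
    enc = tokenizer(
        text,
        truncation=True,
        max_length=max_length,
        stride=stride,
        return_overflowing_tokens=True,
        padding="max_length",
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = model(input_ids=enc["input_ids"],
                        attention_mask=enc["attention_mask"])
        probs = torch.softmax(outputs.logits, dim=1)[:, 1]   # per-chunk probability of class 1
    return probs.max().item()   # max over chunks; averaging is another common choice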

10.6.2 Fine-Tuning ClinicalBERT

from transformers import AutoModelForSequenceClassification
import torch.optim as optim
from sklearn.metrics import roc_auc_score
import numpy as np

# Load pretrained model with classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "emilyalsentzer/Bio_ClinicalBERT",
    num_labels=2
)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Optimizer with different learning rates
optimizer = optim.AdamW([
    {'params': model.bert.parameters(), 'lr': 2e-5},      # Pretrained layers
    {'params': model.classifier.parameters(), 'lr': 1e-4}  # New classifier
])

# Training loop
num_epochs = 3
for epoch in range(num_epochs):
    model.train()
    total_loss = 0

    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    # Validation
    model.eval()
    val_preds = []
    val_labels = []

    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)

            outputs = model(input_ids, attention_mask=attention_mask)
            probs = torch.softmax(outputs.logits, dim=1)[:, 1]

            val_preds.extend(probs.cpu().numpy())
            val_labels.extend(batch['label'].numpy())

    auroc = roc_auc_score(val_labels, val_preds)
    print(f"Epoch {epoch+1}: Loss={total_loss/len(train_loader):.4f}, Val AUROC={auroc:.4f}")

10.6.3 Using HuggingFace Trainer

For production use, HuggingFace’s Trainer class handles many details automatically:

from transformers import Trainer, TrainingArguments
import torch
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    probs = torch.softmax(torch.tensor(logits), dim=1)[:, 1].numpy()
    preds = np.argmax(logits, axis=1)
    return {
        'auroc': roc_auc_score(labels, probs),
        'accuracy': accuracy_score(labels, preds)
    }

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

10.6.4 Evaluation and Interpretation

Beyond AUROC, examine model behavior:

# Get predictions on test set
test_results = trainer.predict(test_dataset)
test_probs = torch.softmax(torch.tensor(test_results.predictions), dim=1)[:, 1].numpy()

# Clinical metrics
from sklearn.metrics import confusion_matrix, classification_report

test_labels = test_results.label_ids  # ground-truth labels returned by trainer.predict
test_preds = (test_probs > 0.5).astype(int)
print(classification_report(test_labels, test_preds,
                            target_names=['No Readmit', 'Readmit']))

# Attention visualization (which words matter?)
# See Chapter 18 for interpretation methods

10.7 Limitations and Looking Ahead

Clinical Context: Transformers are powerful but not magic. Understanding their limitations helps you deploy them responsibly and know when simpler methods might suffice.

10.7.1 Context Length Constraints

BERT processes at most 512 tokens. A typical discharge summary contains 1,000-3,000 tokens. Options:

  • Truncate: Loses information but often works
  • Longformer/BigBird: Sparse attention allows 4,096+ tokens
  • Hierarchical approaches: Encode sections separately, combine

Context length is an active research area. Recent models handle 100K+ tokens, but with increased computational cost.
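
As a hedged sketch of the Longformer option, the public allenai/longformer-base-4096 checkpoint can be dropped into the same fine-tuning recipe as Section 10.6.2. Here discharge_summary is a placeholder string, and the classification head is newly initialized until fine-tuned.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

long_tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
long_model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=2
)

inputs = long_tokenizer(discharge_summary,   # placeholder: a long clinical note string
                        return_tensors="pt", truncation=True, max_length=4096)
logits = long_model(**inputs).logits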

10.7.2 Computational Requirements

Transformers are expensive:

  • Training: Fine-tuning BERT takes hours on a GPU; pretraining takes weeks on hundreds of GPUs
  • Inference: ~110M parameters means slower inference than simpler models
  • Memory: Attention is O(n²) in sequence length

For high-throughput clinical applications, consider distilled models (DistilBERT, TinyBERT) that sacrifice some accuracy for speed.

10.7.3 What Transformers Don’t Do

No explicit reasoning. Transformers learn patterns from data; they don’t have symbolic reasoning capabilities. A model might learn “chest pain → cardiology” without understanding anatomy.

Brittle to distribution shift. A model trained on one hospital’s notes may fail on another’s due to different terminology, templates, or patient populations.

No uncertainty quantification. Standard transformers output confidences that aren’t well-calibrated. A model might be confidently wrong.

10.7.4 Looking Ahead: Generative Models

BERT-style encoders are powerful for understanding tasks (classification, extraction). But what about generating text? Chapter 11 introduces decoder-only transformers like GPT, which generate text autoregressively—the foundation of modern large language models and their medical applications.

10.8 Appendix 10A: Transformer Mathematics

This appendix provides formal definitions for readers who want the mathematical foundations.

10.8.1 Scaled Dot-Product Attention

Given queries \(Q \in \mathbb{R}^{n \times d_k}\), keys \(K \in \mathbb{R}^{m \times d_k}\), and values \(V \in \mathbb{R}^{m \times d_v}\):

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V \]

The softmax is applied row-wise, so each query produces a probability distribution over keys.

Why scale by \(\sqrt{d_k}\)? The dot products \(QK^T\) have variance proportional to \(d_k\). Large dot products push softmax into regions with tiny gradients. Scaling stabilizes training.
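
A quick numerical check of this argument, using unit-variance random queries and keys: the standard deviation of the dot product grows as \(\sqrt{d_k}\).

import torch

torch.manual_seed(0)
for d_k in [16, 64, 256, 1024]:
    q = torch.randn(10000, d_k)
    k = torch.randn(10000, d_k)
    dots = (q * k).sum(dim=1)           # 10,000 sample dot products
    print(f"d_k={d_k:5d}  std(q.k) ~ {dots.std().item():7.2f}  sqrt(d_k) = {d_k ** 0.5:6.2f}")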

10.8.2 Multi-Head Attention

Instead of single attention with \(d_{model}\)-dimensional queries/keys/values, use \(h\) parallel attention heads with \(d_k = d_{model}/h\):

\[ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O \]

where each head is:

\[ \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) \]

with learned projections \(W_i^Q, W_i^K \in \mathbb{R}^{d_{model} \times d_k}\), \(W_i^V \in \mathbb{R}^{d_{model} \times d_v}\), and \(W^O \in \mathbb{R}^{hd_v \times d_{model}}\).

10.8.3 Positional Encoding

The sinusoidal positional encoding for position \(pos\) and dimension \(i\):

\[ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \] \[ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \]

Different frequencies allow the model to learn relative positions: for any fixed offset \(k\), \(PE_{pos+k}\) is a linear function of \(PE_{pos}\).
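
Concretely, writing \(\omega_i = 1/10000^{2i/d_{model}}\) for the frequency of dimension pair \(i\), the angle-addition identities give

\[
\begin{pmatrix} \sin(\omega_i (pos+k)) \\ \cos(\omega_i (pos+k)) \end{pmatrix}
=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} \sin(\omega_i \, pos) \\ \cos(\omega_i \, pos) \end{pmatrix}
\]

so the encoding at position \(pos+k\) is a fixed rotation of the encoding at \(pos\), depending only on the offset \(k\).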

10.8.4 Layer Normalization

Applied after each sub-layer:

\[ \text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \]

where \(\mu\) and \(\sigma^2\) are the mean and variance computed across the feature dimension, and \(\gamma\), \(\beta\) are learned parameters.
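
A small sanity check that this formula matches PyTorch's implementation (LayerNorm uses the biased variance):

import torch
import torch.nn as nn

x = torch.randn(2, 5, 768)
ln = nn.LayerNorm(768)

mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)     # biased variance, as LayerNorm uses
manual = ln.weight * (x - mu) / torch.sqrt(var + ln.eps) + ln.bias

print(torch.allclose(ln(x), manual, atol=1e-5))       # True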

10.8.5 BERT Pretraining Objectives

Masked Language Modeling (MLM):

Given input tokens \(x_1, \ldots, x_n\), randomly select 15% of positions. For each selected position \(i\):

  • 80%: Replace \(x_i\) with [MASK]
  • 10%: Replace \(x_i\) with a random token
  • 10%: Keep \(x_i\) unchanged

Train to predict the original token from the corrupted context.
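
A minimal sketch of the 80/10/10 corruption scheme (toy version; real implementations also avoid masking special tokens such as [CLS] and [SEP], and mask_token_id and vocab_size would come from the tokenizer):

import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob     # choose ~15% of positions
    labels[~selected] = -100                              # unselected positions ignored by the MLM loss

    use_mask = selected & (torch.rand(input_ids.shape) < 0.8)                 # 80% -> [MASK]
    use_random = selected & ~use_mask & (torch.rand(input_ids.shape) < 0.5)   # 10% -> random token
    # the remaining ~10% of selected positions keep the original token

    corrupted = input_ids.clone()
    corrupted[use_mask] = mask_token_id
    corrupted[use_random] = torch.randint(vocab_size, (int(use_random.sum()),))
    return corrupted, labels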

Next Sentence Prediction (NSP):

Given a sentence pair (A, B):

  • 50%: B is the actual next sentence (label: IsNext)
  • 50%: B is a random sentence from the corpus (label: NotNext)

Train to predict the relationship. (Note: Later work showed NSP provides minimal benefit; many subsequent models omit it.)

10.8.6 Attention Complexity

For sequence length \(n\) and model dimension \(d\):

  • Time complexity: \(O(n^2 d)\) — computing all pairwise attention scores
  • Space complexity: \(O(n^2 + nd)\) — storing attention matrix and activations

This quadratic scaling in \(n\) limits standard transformers to sequences of a few thousand tokens. Sparse attention variants (Longformer, BigBird) reduce this to \(O(n \sqrt{n})\) or \(O(n \log n)\).
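
A back-of-the-envelope calculation makes the quadratic term concrete (one fp32 attention matrix, per head and per layer, before batching):

for n in [512, 4096, 32768]:
    bytes_needed = n * n * 4                       # n x n scores, 4 bytes each (fp32)
    print(f"n={n:6d}: {bytes_needed / 1e6:9.1f} MB per attention matrix")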

10.8.7 Further Reading

  • Vaswani et al. (2017). “Attention Is All You Need.” The original transformer paper.
  • Devlin et al. (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.”
  • Alsentzer et al. (2019). “Publicly Available Clinical BERT Embeddings.” The ClinicalBERT paper.
  • Gu et al. (2021). “Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing.” The PubMedBERT paper.
  • Beltagy et al. (2020). “Longformer: The Long-Document Transformer.”