10 Sequence Models & Transformers
Clinical medicine generates vast amounts of text: admission notes, progress notes, discharge summaries, radiology reports, pathology reports. This unstructured data contains rich clinical information that structured EHR fields miss. This chapter introduces the neural architectures that process sequential data—from early recurrent networks to the transformer architecture that powers modern language models.
10.1 From Images to Sequences
Clinical Context: A discharge summary might span 2,000 words, describing a patient’s hospital course from admission through treatment to discharge planning. Unlike a chest X-ray (fixed 224×224 pixels), this text has variable length, and understanding any sentence may require context from paragraphs earlier. How do we build neural networks for such data?
10.1.1 The Sequence Processing Challenge
Chapter 7’s feedforward networks and Chapter 8’s CNNs assume fixed-size inputs. An image is always 224×224×3; we design the network architecture accordingly. But clinical text varies dramatically in length—a triage note might be 50 words, a comprehensive discharge summary 3,000 words.
Three key properties distinguish sequence data:
Variable length. We need architectures that handle inputs of any length without redesigning the network.
Order matters. “No history of diabetes; hypertension present” means something very different from “No history of hypertension; diabetes present.” The same words in a different order convey the opposite clinical meaning.
Long-range dependencies. A pronoun at word 500 might refer to a concept introduced at word 50. The model must track information across long spans.
10.1.2 Why CNNs Aren’t Enough
You could apply 1D convolutions to text (and some models do), but convolutions have a limited receptive field. A 3-word convolution kernel sees local context but misses dependencies spanning hundreds of words. Stacking many layers expands the receptive field, but it’s inefficient for very long-range dependencies.
We need architectures designed for sequences from the ground up.
10.2 Recurrent Neural Networks
Clinical Context: Imagine reading a clinical note word by word, maintaining a mental summary as you go. When you encounter “allergic to penicillin” halfway through the note, you update your understanding; this information influences how you interpret all subsequent medication mentions. Recurrent neural networks formalize this process.
10.2.1 The Sequential Processing Idea
A recurrent neural network (RNN) processes sequences one element at a time, maintaining a hidden state that summarizes everything seen so far:
\[ h_t = f(h_{t-1}, x_t) \]
At each timestep \(t\):
1. Take the previous hidden state \(h_{t-1}\)
2. Combine it with the current input \(x_t\)
3. Produce a new hidden state \(h_t\)
The hidden state acts as the network’s “memory”—a compressed representation of the sequence so far. For classification, we typically use the final hidden state \(h_T\) as the sequence representation.
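In PyTorch, a minimal RNN classifier of this form can be written as:

import torch
import torch.nn as nn

# Simple RNN for sequence classification
class SimpleRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, seq_len) token indices
        embedded = self.embedding(x)          # (batch, seq_len, embed_dim)
        output, hidden = self.rnn(embedded)   # hidden: (1, batch, hidden_dim)
        return self.classifier(hidden.squeeze(0))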
10.2.2 The Vanishing Gradient Problem
Simple RNNs struggle with long sequences. During backpropagation, gradients must flow backward through every timestep. With hundreds of steps, gradients either:
- Vanish: Shrink exponentially, making early timesteps unlearnable
- Explode: Grow exponentially, causing numerical instability
In practice, simple RNNs effectively “forget” information from more than 10-20 timesteps back—useless for a 500-word clinical note where the diagnosis in sentence 3 affects interpretation of medications in sentence 30.
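A toy illustration of the compounding effect (not a real gradient computation, just repeated scaling by a constant per-step factor):

# Gradient magnitude after repeatedly multiplying by a per-step factor
for factor in (0.9, 1.1):
    grad = 1.0
    for _ in range(200):   # roughly the length of a 200-word note
        grad *= factor
    print(f"factor={factor}: gradient scale after 200 steps = {grad:.3e}")
# 0.9**200 is about 7e-10 (vanishes); 1.1**200 is about 2e+8 (explodes)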
10.2.3 LSTM: Learning What to Remember
Long Short-Term Memory (LSTM) networks address vanishing gradients through gating mechanisms (Hochreiter and Schmidhuber 1997). An LSTM cell maintains two states:
- Hidden state \(h_t\): Short-term, working memory
- Cell state \(c_t\): Long-term memory, information can persist unchanged
Three gates control information flow:
- Forget gate: What to erase from long-term memory
- Input gate: What new information to store
- Output gate: What to expose to the next layer
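In equations (one standard formulation, with \(\sigma\) the logistic sigmoid, \(\odot\) elementwise multiplication, and \([h_{t-1}, x_t]\) the concatenation of the previous hidden state and current input):

\[ f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \]
\[ i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \]
\[ o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \]
\[ c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c [h_{t-1}, x_t] + b_c) \]
\[ h_t = o_t \odot \tanh(c_t) \]

Because the cell state is updated additively (a gated copy of the old state plus a gated write of new information) rather than being pushed through a nonlinearity at every step, gradients can flow across many timesteps without vanishing.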
# LSTM for clinical text classification
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.classifier = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)
        output, (hidden, cell) = self.lstm(embedded)
        # Concatenate forward and backward final hidden states
        hidden_cat = torch.cat([hidden[0], hidden[1]], dim=1)
        return self.classifier(hidden_cat)

Bidirectional LSTMs process the sequence in both directions, capturing both preceding and following context. For clinical text, this often improves performance—understanding “chest pain” might depend on both what came before (history) and after (resolved vs. ongoing).
10.2.4 Why RNNs Fell Behind
Despite improvements like LSTM, recurrent architectures have fundamental limitations:
Sequential processing. Each timestep depends on the previous one—no parallelization. Training on long sequences is slow.
Long-range dependencies still hard. Even LSTMs struggle with dependencies spanning hundreds of tokens. Information must pass through every intermediate step.
Fixed hidden state size. The entire sequence history must compress into a fixed-size vector, creating a bottleneck.
These limitations motivated the search for better architectures—leading to attention.
10.3 The Attention Revolution
Clinical Context: When a radiologist reads “consolidation in the right lower lobe consistent with pneumonia,” they don’t give equal weight to every word. “Consolidation,” “right lower lobe,” and “pneumonia” are diagnostic; “in,” “the,” and “with” are structural. Attention mechanisms let neural networks learn which parts of the input to focus on.
10.3.1 Attention as Weighted Combination
The core attention idea: instead of compressing the entire sequence into a single hidden state, let the model look back at all positions and decide which are relevant.
Given a query (what we’re looking for) and a set of key-value pairs (the sequence):
- Compare the query to each key (compute similarity scores)
- Convert scores to weights (softmax)
- Return a weighted combination of values
\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V \]
The \(\sqrt{d_k}\) scaling prevents the dot products from growing too large in high dimensions.
10.3.2 Self-Attention: Every Position Attends to Every Other
Self-attention applies attention within a single sequence. Every position can attend to every other position, allowing direct connections between distant tokens.
For a clinical note, the word “pneumonia” at position 200 can directly attend to “cough” at position 10 and “fever” at position 45—no need to pass information through 190 intermediate steps.
import torch
import torch.nn.functional as F

def self_attention(x, d_k):
    """
    x: (batch, seq_len, d_model) - input embeddings
    d_k: key dimension (here equal to d_model, since no projections are applied)
    Returns: (batch, seq_len, d_model) - attended representations
    """
    # In self-attention, Q, K, V all come from the same input
    Q = K = V = x
    # Compute attention scores
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    # Convert to probabilities
    attention_weights = F.softmax(scores, dim=-1)
    # Weighted combination of values
    return torch.matmul(attention_weights, V)

10.3.3 Why Attention Changes Everything
Parallelization. Unlike RNNs, attention computes all positions simultaneously. Training is dramatically faster on GPUs.
Direct long-range connections. Any two positions connect in one step, regardless of distance. No vanishing gradients across the sequence.
Interpretability. Attention weights show which words the model focuses on, providing some insight into its reasoning.
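As a sketch of that last point, the self_attention function above can be extended to also return its weight matrix for inspection (function and variable names here are illustrative):

def self_attention_with_weights(x, d_k):
    # Identical to self_attention, but also returns the (seq_len, seq_len) weights
    scores = torch.matmul(x, x.transpose(-2, -1)) / (d_k ** 0.5)
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, x), weights

# Example: which positions does the token at position 5 attend to most?
x = torch.randn(1, 10, 64)                  # (batch=1, seq_len=10, d_model=64)
_, weights = self_attention_with_weights(x, d_k=64)
print(weights[0, 5].topk(3))                # top-3 attention weights and their positions

In a trained transformer, the analogous per-head weights are what attention-visualization tools display.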
The attention mechanism is the core innovation enabling transformers and all modern language models.
10.4 The Transformer Architecture
Clinical Context: In 2017, the paper “Attention Is All You Need” introduced the transformer—an architecture built entirely on attention, with no recurrence (Vaswani et al. 2017). Within a few years, transformers dominated NLP. Models like BERT and GPT, both based on transformers, now underpin most clinical NLP applications.
10.4.1 The Encoder-Decoder Structure
The original transformer was designed for translation (English → German). It has two parts:
- Encoder: Processes the input sequence, producing contextual representations
- Decoder: Generates the output sequence, attending to both previous outputs and encoder representations
For classification tasks, we typically use only the encoder (BERT-style models). For generation tasks, we use only the decoder (GPT-style models) or the full encoder-decoder (T5, BART).
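As a quick illustration of the three flavors using the HuggingFace transformers library (the checkpoint names below are standard general-domain models, shown only for orientation):

from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

encoder_only = AutoModel.from_pretrained("bert-base-uncased")         # encoder-only (BERT-style)
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")           # decoder-only (GPT-style)
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")   # encoder-decoder (T5-style)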
10.4.2 Multi-Head Attention
Instead of a single attention operation, transformers use multi-head attention—running several attention operations in parallel, each learning different relationships:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch_size, seq_len, d_model = x.shape
        # Project to Q, K, V
        Q = self.W_q(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        # Attention for each head
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.d_k ** 0.5)
        attention = F.softmax(scores, dim=-1)
        context = torch.matmul(attention, V)
        # Concatenate heads and project
        context = context.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)
        return self.W_o(context)

Different heads might learn to attend to different things: one head for syntactic relationships (subject-verb), another for semantic relationships (disease-symptom), another for coreference (pronoun-antecedent).
10.4.3 Positional Encoding
Attention treats the input as a set—it has no inherent notion of order. “Patient has fever” and “Fever has patient” would produce identical attention patterns without intervention.
Positional encodings inject position information. The original transformer uses sinusoidal functions:
\[ PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d}) \] \[ PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d}) \]
These are added to the input embeddings, giving each position a unique signature. Learned positional embeddings (used in BERT) work similarly.
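A small sketch of how these sinusoidal encodings can be generated, following the formula above (the function name is illustrative; d_model is assumed even):

def sinusoidal_positional_encoding(max_len, d_model):
    # pe[pos, 2i] = sin(pos / 10000^(2i/d_model)); pe[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    position = torch.arange(max_len).unsqueeze(1)                # (max_len, 1)
    div_term = 10000 ** (torch.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)
    pe[:, 1::2] = torch.cos(position / div_term)
    return pe                                                    # added to the token embeddings

# e.g., pe = sinusoidal_positional_encoding(512, 768); x = token_embeddings + pe[:seq_len]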
10.4.4 The Transformer Block
A transformer encoder layer combines:
- Multi-head self-attention
- Add & normalize (residual connection + layer normalization)
- Feed-forward network (two linear layers with nonlinearity)
- Add & normalize
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention with residual
        attended = self.attention(x)
        x = self.norm1(x + self.dropout(attended))
        # Feed-forward with residual
        ff_out = self.ff(x)
        x = self.norm2(x + self.dropout(ff_out))
        return x

Stack 6-24 of these blocks, and you have a transformer encoder.
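A minimal sketch of that stacking, reusing the TransformerBlock above with learned positional embeddings (the class name and hyperparameter defaults here are illustrative, chosen to mirror BERT-base):

class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=768, num_heads=12, d_ff=3072,
                 num_layers=12, max_len=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_embedding = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList(
            [TransformerBlock(d_model, num_heads, d_ff) for _ in range(num_layers)]
        )

    def forward(self, x):                       # x: (batch, seq_len) token indices
        positions = torch.arange(x.size(1), device=x.device)
        h = self.embedding(x) + self.pos_embedding(positions)
        for block in self.blocks:
            h = block(h)
        return h                                # (batch, seq_len, d_model) contextual embeddings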
10.5 BERT and Pretrained Language Models
Clinical Context: Training a transformer from scratch requires enormous data—far more than any single hospital has. The breakthrough insight: pretrain on massive general text, then fine-tune on your specific clinical task. A model pretrained on all of Wikipedia and BookCorpus learns general language understanding; fine-tuning adapts it to predict ICU mortality from clinical notes.
10.5.1 The Pretrain-Then-Finetune Paradigm
Pretraining: Train a large transformer on unlabeled text using self-supervised objectives. The model learns language structure, world knowledge, and reasoning patterns.
Fine-tuning: Take the pretrained model, add a task-specific head (e.g., classification layer), and train on your labeled dataset. The pretrained weights provide a strong starting point.
This paradigm transformed NLP. Instead of training models from scratch on limited clinical data, we leverage knowledge from billions of words of text.
10.5.2 BERT: Bidirectional Encoder Representations
BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only transformer pretrained with two objectives (Devlin et al. 2018):
Masked Language Modeling (MLM): Randomly mask 15% of tokens; train the model to predict them from context. “Patient presented with [MASK] pain” → “chest”
Next Sentence Prediction (NSP): Given two sentences, predict whether the second follows the first in the original text. (Later research showed this is less important.)
BERT processes text bidirectionally—each token sees both left and right context—making it powerful for understanding tasks like classification and extraction.
10.5.3 Clinical Language Model Variants
General BERT trained on Wikipedia doesn’t know medical terminology. Several clinical variants exist:
| Model | Training Data | Best For |
|---|---|---|
| BioBERT (Lee et al. 2020) | PubMed abstracts + PMC full text | Biomedical literature, research applications |
| PubMedBERT | PubMed abstracts only | Similar to BioBERT, sometimes better |
| ClinicalBERT (Alsentzer et al. 2019) | MIMIC-III clinical notes | Clinical notes, EHR text |
| Bio+ClinicalBERT | PubMed + MIMIC-III | Hybrid applications |
When to use which:
- Processing clinical notes (discharge summaries, progress notes): ClinicalBERT
- Processing biomedical literature (research papers, guidelines): PubMedBERT or BioBERT
- Mixed content: Bio+ClinicalBERT or experiment with both
from transformers import AutoTokenizer, AutoModel
# Load ClinicalBERT
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
# Tokenize clinical text
text = "Patient presents with acute chest pain radiating to left arm."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
# Get contextual embeddings
outputs = model(**inputs)
embeddings = outputs.last_hidden_state # (batch, seq_len, 768)
# Use [CLS] token embedding for classification
cls_embedding = embeddings[:, 0, :] # (batch, 768)

10.6 Putting It Together: Clinical Text Classification
Clinical Context: You’re tasked with building a model to predict 30-day hospital readmission from discharge summaries. This is a classic clinical NLP task: take unstructured text, extract relevant information, and make a binary prediction. We’ll fine-tune ClinicalBERT for this task.
10.6.1 Data Preparation
Clinical text requires careful preprocessing:
from transformers import AutoTokenizer
import torch
from torch.utils.data import Dataset, DataLoader
class ClinicalTextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=512):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer(
            text,
            truncation=True,
            max_length=self.max_length,
            padding='max_length',
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            'label': torch.tensor(label, dtype=torch.long)
        }

# Load tokenizer and create datasets
# (train_texts/val_texts are lists of note strings; train_labels/val_labels are 0/1 readmission labels)
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
train_dataset = ClinicalTextDataset(train_texts, train_labels, tokenizer)
val_dataset = ClinicalTextDataset(val_texts, val_labels, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16)

Handling long documents: BERT’s maximum sequence length is 512 tokens. Discharge summaries often exceed this. Options:
- Truncation: Keep first 512 tokens (may lose important end information)
- Chunking: Split into overlapping chunks, aggregate predictions
- Hierarchical models: Encode chunks separately, then combine
- Longformer/BigBird: Transformer variants designed for long sequences
For many tasks, truncation works surprisingly well—the beginning of clinical notes often contains the most critical information.
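One possible sketch of the chunking option, relying on the tokenizer's overflowing-tokens support to split a long note into overlapping 512-token windows and averaging the per-chunk probabilities (mean aggregation is one simple choice; taking the maximum is another):

def predict_long_note(text, tokenizer, model, max_length=512, stride=128):
    # Split into overlapping chunks; each chunk becomes one row of input_ids
    enc = tokenizer(text, truncation=True, max_length=max_length, stride=stride,
                    return_overflowing_tokens=True, padding="max_length",
                    return_tensors="pt")
    with torch.no_grad():
        outputs = model(input_ids=enc["input_ids"].to(model.device),
                        attention_mask=enc["attention_mask"].to(model.device))
    probs = torch.softmax(outputs.logits, dim=1)[:, 1]   # readmission probability per chunk
    return probs.mean().item()                           # aggregate across chunks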
10.6.2 Fine-Tuning ClinicalBERT
from transformers import AutoModelForSequenceClassification
import torch.optim as optim
from sklearn.metrics import roc_auc_score
import numpy as np
# Load pretrained model with classification head
model = AutoModelForSequenceClassification.from_pretrained(
"emilyalsentzer/Bio_ClinicalBERT",
num_labels=2
)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
# Optimizer with different learning rates
optimizer = optim.AdamW([
{'params': model.bert.parameters(), 'lr': 2e-5}, # Pretrained layers
{'params': model.classifier.parameters(), 'lr': 1e-4} # New classifier
])
# Training loop
num_epochs = 3
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    # Validation
    model.eval()
    val_preds = []
    val_labels = []
    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask)
            probs = torch.softmax(outputs.logits, dim=1)[:, 1]
            val_preds.extend(probs.cpu().numpy())
            val_labels.extend(batch['label'].numpy())

    auroc = roc_auc_score(val_labels, val_preds)
    print(f"Epoch {epoch+1}: Loss={total_loss/len(train_loader):.4f}, Val AUROC={auroc:.4f}")

10.6.3 Using HuggingFace Trainer
For production use, HuggingFace’s Trainer class handles many details automatically:
from transformers import Trainer, TrainingArguments
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    probs = torch.softmax(torch.tensor(logits), dim=1)[:, 1].numpy()
    preds = np.argmax(logits, axis=1)
    return {
        'auroc': roc_auc_score(labels, probs),
        'accuracy': accuracy_score(labels, preds)
    }

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    evaluation_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

10.6.4 Evaluation and Interpretation
Beyond AUROC, examine model behavior:
# Get predictions on test set
test_results = trainer.predict(test_dataset)
test_probs = torch.softmax(torch.tensor(test_results.predictions), dim=1)[:, 1].numpy()
test_labels = test_results.label_ids

# Clinical metrics
from sklearn.metrics import confusion_matrix, classification_report
test_preds = (test_probs > 0.5).astype(int)
print(classification_report(test_labels, test_preds,
                            target_names=['No Readmit', 'Readmit']))

# Attention visualization (which words matter?)
# See Chapter 18 for interpretation methods

10.7 Limitations and Looking Ahead
Clinical Context: Transformers are powerful but not magic. Understanding their limitations helps you deploy them responsibly and know when simpler methods might suffice.
10.7.1 Context Length Constraints
BERT processes at most 512 tokens. A typical discharge summary contains 1,000-3,000 tokens. Options:
- Truncate: Loses information but often works
- Longformer/BigBird: Sparse attention allows 4,096+ tokens
- Hierarchical approaches: Encode sections separately, combine
Context length is an active research area. Recent models handle 100K+ tokens, but with increased computational cost.
10.7.2 Computational Requirements
Transformers are expensive:
- Training: Fine-tuning BERT takes hours on a GPU; pretraining takes weeks on hundreds of GPUs
- Inference: ~110M parameters means slower inference than simpler models
- Memory: Attention is O(n²) in sequence length
For high-throughput clinical applications, consider distilled models (DistilBERT, TinyBERT) that sacrifice some accuracy for speed.
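For example, parameter counts can be compared directly (a sketch; distilbert-base-uncased is a general-domain checkpoint, so any speed gain must be weighed against domain mismatch on clinical text):

from transformers import AutoModelForSequenceClassification

def count_parameters(m):
    return sum(p.numel() for p in m.parameters())

bert = AutoModelForSequenceClassification.from_pretrained(
    "emilyalsentzer/Bio_ClinicalBERT", num_labels=2)
distil = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)
print(f"ClinicalBERT: {count_parameters(bert)/1e6:.0f}M parameters")
print(f"DistilBERT:   {count_parameters(distil)/1e6:.0f}M parameters")  # roughly 60% the size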
10.7.3 What Transformers Don’t Do
No explicit reasoning. Transformers learn patterns from data; they don’t have symbolic reasoning capabilities. A model might learn “chest pain → cardiology” without understanding anatomy.
Brittle to distribution shift. A model trained on one hospital’s notes may fail on another’s due to different terminology, templates, or patient populations.
No uncertainty quantification. Standard transformers output confidences that aren’t well-calibrated. A model might be confidently wrong.
10.7.4 Looking Ahead: Generative Models
BERT-style encoders are powerful for understanding tasks (classification, extraction). But what about generating text? Chapter 11 introduces decoder-only transformers like GPT, which generate text autoregressively—the foundation of modern large language models and their medical applications.
10.8 Appendix 10A: Transformer Mathematics
This appendix provides formal definitions for readers who want the mathematical foundations.
10.8.1 Scaled Dot-Product Attention
Given queries \(Q \in \mathbb{R}^{n \times d_k}\), keys \(K \in \mathbb{R}^{m \times d_k}\), and values \(V \in \mathbb{R}^{m \times d_v}\):
\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V \]
The softmax is applied row-wise, so each query produces a probability distribution over keys.
Why scale by \(\sqrt{d_k}\)? The dot products \(QK^T\) have variance proportional to \(d_k\). Large dot products push softmax into regions with tiny gradients. Scaling stabilizes training.
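A quick empirical check of that variance claim (a toy experiment, not part of any model):

import torch

d_k = 64
q = torch.randn(10000, d_k)   # unit-variance query components
k = torch.randn(10000, d_k)   # unit-variance key components
dots = (q * k).sum(dim=1)     # 10,000 sample dot products
print(dots.var())                       # approximately d_k = 64
print((dots / d_k ** 0.5).var())        # approximately 1 after scaling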
10.8.2 Multi-Head Attention
Instead of single attention with \(d_{model}\)-dimensional queries/keys/values, use \(h\) parallel attention heads with \(d_k = d_{model}/h\):
\[ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O \]
where each head is:
\[ \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) \]
with learned projections \(W_i^Q, W_i^K \in \mathbb{R}^{d_{model} \times d_k}\), \(W_i^V \in \mathbb{R}^{d_{model} \times d_v}\), and \(W^O \in \mathbb{R}^{hd_v \times d_{model}}\).
10.8.3 Positional Encoding
The sinusoidal positional encoding for position \(pos\) and dimension \(i\):
\[ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \] \[ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \]
Different frequencies allow the model to learn relative positions: for any fixed offset \(k\), \(PE_{pos+k}\) is a linear function of \(PE_{pos}\).
10.8.4 Layer Normalization
Applied after each sub-layer:
\[ \text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sigma + \epsilon} + \beta \]
where \(\mu\) and \(\sigma\) are the mean and standard deviation computed across the feature dimension, and \(\gamma\), \(\beta\) are learned parameters.
10.8.5 BERT Pretraining Objectives
Masked Language Modeling (MLM):
Given input tokens \(x_1, \ldots, x_n\), randomly select 15% of positions. For each selected position \(i\):
- 80%: Replace \(x_i\) with [MASK]
- 10%: Replace \(x_i\) with a random token
- 10%: Keep \(x_i\) unchanged
Train to predict the original token from the corrupted context.
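A sketch of this corruption procedure on a batch of token ids (simplified: a real implementation also avoids masking special tokens such as [CLS], [SEP], and padding; -100 is the conventional ignore index for the loss):

def mlm_mask(input_ids, mask_token_id, vocab_size, mask_prob=0.15):
    # input_ids: (batch, seq_len) integer token ids
    labels = input_ids.clone()
    corrupted = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mask_prob      # ~15% of positions
    labels[~selected] = -100                                 # loss ignores unselected positions
    roll = torch.rand(input_ids.shape)
    corrupted[selected & (roll < 0.8)] = mask_token_id       # 80%: [MASK]
    replace = selected & (roll >= 0.8) & (roll < 0.9)        # 10%: random token
    corrupted[replace] = torch.randint(vocab_size, input_ids.shape)[replace]
    return corrupted, labels                                 # remaining 10%: left unchanged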
Next Sentence Prediction (NSP):
Given a sentence pair (A, B):
- 50%: B is the actual next sentence (label: IsNext)
- 50%: B is a random sentence (label: NotNext)
Train to predict the relationship. (Note: Later work showed NSP provides minimal benefit; many subsequent models omit it.)
10.8.6 Attention Complexity
For sequence length \(n\) and model dimension \(d\):
- Time complexity: \(O(n^2 d)\) — computing all pairwise attention scores
- Space complexity: \(O(n^2 + nd)\) — storing attention matrix and activations
This quadratic scaling in \(n\) limits standard transformers to sequences of a few thousand tokens. Sparse attention variants reduce the cost to subquadratic: Longformer and BigBird scale roughly linearly in \(n\) by combining a fixed local window with a small number of global tokens, while other approaches achieve \(O(n \sqrt{n})\) or \(O(n \log n)\).
10.8.7 Further Reading
- Vaswani et al. (2017). “Attention Is All You Need.” The original transformer paper.
- Devlin et al. (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.”
- Alsentzer et al. (2019). “Publicly Available Clinical BERT Embeddings.” The ClinicalBERT paper.
- Gu et al. (2021). “Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing.” The PubMedBERT paper.
- Beltagy et al. (2020). “Longformer: The Long-Document Transformer.”