17 Genomics & Drug Discovery
The transformer architectures you learned for clinical text in Chapter 10 have revolutionized biological sequence analysis. DNA, RNA, and proteins are sequences—just with different alphabets. This chapter shows how the same deep learning concepts apply to genomics and drug discovery, connecting these fields to techniques you already understand.
17.1 Biological Sequences as Language
Clinical Context: A genetic counselor reviews a patient’s sequencing results and finds a variant of uncertain significance (VUS) in BRCA2. Is this mutation pathogenic? Traditional methods rely on databases of known variants, but most variants are rare or novel. Machine learning models trained on millions of sequences can predict pathogenicity even for never-before-seen variants.
17.1.1 The Sequence Analogy
Consider the parallels between clinical text and biological sequences:
| Property | Clinical Text | DNA | Protein |
|---|---|---|---|
| Alphabet | ~50,000 words | 4 letters (A, T, G, C) | 20 amino acids |
| Sequence length | 100–5,000 tokens | 100–100,000+ bases | 100–2,000 residues |
| Structure | Sentences, paragraphs | Genes, regulatory regions | Secondary/tertiary folds |
| Meaning | Semantics from word order | Function from base sequence | Function from 3D shape |
The key insight: the same architectures work across all these domains. Tokenization, embeddings, attention, transformers—all transfer directly.
17.1.2 From Words to Bases
In Chapter 10, we tokenized clinical notes into subwords. For DNA, we tokenize into k-mers—subsequences of length k:
def tokenize_dna(sequence: str, k: int = 6) -> list:
    """Tokenize DNA sequence into k-mers."""
    sequence = sequence.upper()
    kmers = []
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if all(base in 'ATGC' for base in kmer):
            kmers.append(kmer)
    return kmers
# Example
dna = "ATGCGATCGATCG"
print(tokenize_dna(dna, k=3))
# ['ATG', 'TGC', 'GCG', 'CGA', 'GAT', 'ATC', 'TCG', 'CGA', 'GAT', 'ATC', 'TCG']
With k=6, there are \(4^6 = 4,096\) possible tokens—comparable to a small vocabulary. Larger k captures more context but increases sparsity.
Newer models such as DNABERT-2 replace fixed k-mers with learned byte-pair-encoding-style tokenization, letting the model discover meaningful subsequences.
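As a quick look at what learned tokenization produces, the sketch below loads the DNABERT-2 tokenizer from the Hugging Face Hub and prints the tokens for a short sequence; the repository name and the trust_remote_code flag are assumptions about how that checkpoint is published, not something used elsewhere in this chapter.
# Minimal sketch: inspect a learned DNA tokenizer (repository ID is an assumption)
from transformers import AutoTokenizer

# DNABERT-2 ships custom code, so trust_remote_code is typically required
tokenizer = AutoTokenizer.from_pretrained(
    "zhihan1996/DNABERT-2-117M", trust_remote_code=True
)
print(tokenizer.tokenize("ATGCGATCGATCGATCGATCG"))
# Tokens are variable-length subsequences discovered during pre-training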
17.1.3 Why This Matters Clinically
Genomic AI has direct clinical applications:
- Variant interpretation: Classifying genetic variants as pathogenic or benign
- Gene expression: Predicting how mutations affect protein production
- Pharmacogenomics: Predicting drug response based on genetic variants
- Cancer genomics: Identifying driver mutations from tumor sequencing
These are classification and regression problems—exactly what you’ve been doing with other data types.
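To make that framing concrete, here is a toy sketch that casts variant pathogenicity as binary classification on k-mer count features; the sequences and labels are synthetic, and scikit-learn is assumed to be available.
# Toy sketch: variant pathogenicity as binary classification on 3-mer counts.
# Sequences and labels are synthetic and exist only to illustrate the framing.
from itertools import product
import numpy as np
from sklearn.linear_model import LogisticRegression

KMERS = ["".join(p) for p in product("ATGC", repeat=3)]  # all 64 possible 3-mers

def kmer_counts(seq: str, k: int = 3) -> np.ndarray:
    """Represent a sequence as a fixed-length vector of k-mer counts."""
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
    return np.array([counts[kmer] for kmer in KMERS], dtype=float)

rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list("ATGC"), size=50)) for _ in range(200)]
labels = rng.integers(0, 2, size=200)  # 0 = benign, 1 = pathogenic (synthetic)

X = np.stack([kmer_counts(s) for s in seqs])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict_proba(X[:3])[:, 1])  # predicted pathogenicity probabilities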
17.2 DNA Language Models
Clinical Context: Whole genome sequencing generates 3 billion base pairs per patient. Interpreting this data requires models that understand DNA “grammar”—the patterns that distinguish functional regions from noise, pathogenic variants from benign polymorphisms.
17.2.1 DNABERT: Transformers for Genomes
DNABERT applies the BERT architecture to DNA sequences. Pre-trained on the human genome through masked language modeling (predicting masked k-mers from context), it learns representations useful for downstream tasks.
from transformers import AutoTokenizer, AutoModel
import torch
# Load DNABERT
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-6")
model = AutoModel.from_pretrained("zhihan1996/DNABERT-6")
# Tokenize a DNA sequence
sequence = "ATGCGATCGATCGATCGATCGATCGATCGATCG"
# DNABERT expects k-mer tokenized input with spaces
kmers = " ".join([sequence[i:i+6] for i in range(len(sequence) - 5)])
inputs = tokenizer(kmers, return_tensors="pt")
# Get embeddings
with torch.no_grad():
outputs = model(**inputs)
embeddings = outputs.last_hidden_state # (1, seq_len, 768)
print(f"Embedding shape: {embeddings.shape}")17.2.2 Nucleotide Transformer: Scaling Up
Nucleotide Transformer models scale to billions of parameters and are trained on thousands of human genomes plus reference genomes from hundreds of other species. They capture evolutionary patterns that inform function:
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
# Load Nucleotide Transformer
model_name = "InstaDeepAI/nucleotide-transformer-500m-human-ref"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
# The tokenizer splits sequences into non-overlapping 6-mers (with single bases as fallback)
sequence = "ATGCGATCGATCGATCG"
inputs = tokenizer(sequence, return_tensors="pt")
# Get embeddings for downstream tasks
with torch.no_grad():
outputs = model(**inputs, output_hidden_states=True)
    embeddings = outputs.hidden_states[-1]  # Last layer
17.2.3 Variant Effect Prediction
A key clinical task: given a genetic variant, predict its effect. This is a classification problem using sequence context:
import torch
import torch.nn as nn
class VariantClassifier(nn.Module):
"""Classify variants using DNA embeddings."""
def __init__(self, dna_model, hidden_dim=256, num_classes=2):
super().__init__()
self.dna_model = dna_model
# Freeze pre-trained weights (optional)
for param in self.dna_model.parameters():
param.requires_grad = False
self.classifier = nn.Sequential(
nn.Linear(768, hidden_dim), # DNABERT hidden size
nn.ReLU(),
nn.Dropout(0.1),
nn.Linear(hidden_dim, num_classes)
)
def forward(self, input_ids, attention_mask):
# Get DNA embeddings
outputs = self.dna_model(input_ids=input_ids, attention_mask=attention_mask)
# Use [CLS] token representation
cls_embedding = outputs.last_hidden_state[:, 0, :]
return self.classifier(cls_embedding)
# Training follows standard PyTorch patterns from earlier chapters
The workflow mirrors clinical NLP: pre-trained model → fine-tune on labeled variants → deploy for classification.
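A minimal fine-tuning loop for the classifier might look like the sketch below; the train_loader yielding tokenized variant windows with binary labels is assumed and not constructed here.
# Sketch: fine-tune VariantClassifier on labeled variants.
# Assumes `train_loader` yields batches with input_ids, attention_mask, labels;
# building that DataLoader from a labeled variant set is not shown here.
import torch
import torch.nn as nn

def train_variant_classifier(model, train_loader, epochs=3, lr=1e-4, device="cpu"):
    model.to(device)
    model.train()
    model.dna_model.eval()  # keep the frozen DNA encoder's dropout disabled
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=lr
    )
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        total_loss = 0.0
        for batch in train_loader:
            optimizer.zero_grad()
            logits = model(
                input_ids=batch["input_ids"].to(device),
                attention_mask=batch["attention_mask"].to(device),
            )
            loss = loss_fn(logits, batch["labels"].to(device))
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"Epoch {epoch + 1}: mean loss {total_loss / len(train_loader):.4f}")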
17.3 Protein Sequence Analysis
Clinical Context: A patient has a novel missense mutation in a cardiac ion channel gene. The variant changes one amino acid in the protein sequence. Will this disrupt protein function and cause arrhythmia? Protein language models predict functional effects from sequence alone.
17.3.1 Proteins as Sequences
Proteins are chains of amino acids, each represented by a single letter (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y). A protein sequence might look like:
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH
This is the start of hemoglobin’s alpha subunit, the protein that carries oxygen in blood. The sequence determines the protein’s 3D structure, which in turn determines its function.
17.3.2 ESM: Evolutionary Scale Modeling
ESM (Evolutionary Scale Modeling) from Meta AI is a family of protein language models trained on millions of protein sequences. Like BERT for text, ESM learns contextualized embeddings that capture functional and structural properties.
import torch
import esm
# Load ESM-2 model
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()
# Prepare protein sequence
data = [("protein1", "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH")]
batch_labels, batch_strs, batch_tokens = batch_converter(data)
# Get embeddings
with torch.no_grad():
results = model(batch_tokens, repr_layers=[33], return_contacts=True)
# Token embeddings (per-residue)
token_embeddings = results["representations"][33] # (1, seq_len, 1280)
# Mean pooling for sequence-level embedding
sequence_embedding = token_embeddings[0, 1:-1].mean(0) # Exclude start/end tokens
print(f"Sequence embedding shape: {sequence_embedding.shape}") # (1280,)17.3.3 Predicting Mutation Effects
A simple zero-shot proxy for a mutation’s effect is to compare ESM embeddings of the wild-type and mutant sequences:
def predict_mutation_effect(model, alphabet, wild_type_seq, position, mutant_aa):
"""Predict effect of single amino acid mutation using ESM."""
batch_converter = alphabet.get_batch_converter()
# Create mutant sequence
mutant_seq = wild_type_seq[:position] + mutant_aa + wild_type_seq[position+1:]
# Get embeddings for both
data = [("wild_type", wild_type_seq), ("mutant", mutant_seq)]
_, _, batch_tokens = batch_converter(data)
with torch.no_grad():
results = model(batch_tokens, repr_layers=[33])
wt_embedding = results["representations"][33][0, position+1] # +1 for start token
mut_embedding = results["representations"][33][1, position+1]
# Cosine similarity as effect proxy
similarity = torch.cosine_similarity(wt_embedding, mut_embedding, dim=0)
# Lower similarity = more disruptive mutation
return similarity.item()
# Example: predict effect of mutation at position 10
effect = predict_mutation_effect(model, alphabet, "MVLSPADKTNVKAAW...", 10, "A")
print(f"Mutation effect score: {effect:.3f}")17.3.4 Connecting to Structure: AlphaFold
While this chapter focuses on sequence models, it’s worth noting that AlphaFold revolutionized protein structure prediction (Jumper et al. 2021). AlphaFold takes a sequence and predicts its 3D structure—enabling analysis of how mutations affect protein shape and function.
For clinical applications, you can:
1. Use ESM for sequence-level predictions (fast, scalable)
2. Use AlphaFold for structure-level analysis (slower, more detailed)
# AlphaFold predictions are often accessed via databases
# rather than run locally (requires significant compute)
# Example: fetch predicted structure from AlphaFold DB
import requests
def get_alphafold_structure(uniprot_id):
"""Fetch AlphaFold predicted structure for a protein."""
url = f"https://alphafold.ebi.ac.uk/files/AF-{uniprot_id}-F1-model_v4.pdb"
response = requests.get(url)
if response.status_code == 200:
return response.text # PDB format structure
return None
# Hemoglobin subunit alpha
pdb_data = get_alphafold_structure("P69905")
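One detail worth knowing: AlphaFold DB stores its per-residue confidence score (pLDDT, 0–100) in the B-factor column of these PDB files, so a rough quality check on the fetched structure takes only a few lines. A minimal sketch operating on the pdb_data string fetched above:
# Average pLDDT over C-alpha atoms (AlphaFold writes pLDDT into the B-factor column)
def mean_plddt(pdb_text: str) -> float:
    scores = []
    for line in pdb_text.splitlines():
        # ATOM records: atom name is columns 13-16, B-factor is columns 61-66 (1-based)
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return sum(scores) / len(scores) if scores else float("nan")

if pdb_data is not None:
    print(f"Mean pLDDT: {mean_plddt(pdb_data):.1f}")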
17.4 Molecules as Sequences: SMILES
Clinical Context: Drug discovery involves screening millions of compounds for potential therapeutic activity. Representing molecules as text strings enables the same sequence models used for language and genomics to predict molecular properties—toxicity, solubility, binding affinity.
17.4.1 SMILES Notation
SMILES (Simplified Molecular Input Line Entry System) represents molecules as text strings:
| Molecule | SMILES |
|---|---|
| Water | O |
| Ethanol | CCO |
| Aspirin | CC(=O)OC1=CC=CC=C1C(=O)O |
| Caffeine | CN1C=NC2=C1C(=O)N(C(=O)N2C)C |
This text representation means we can apply NLP techniques directly:
from rdkit import Chem
from rdkit.Chem import Draw, rdMolDescriptors
# Parse SMILES to molecule object
aspirin_smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
mol = Chem.MolFromSmiles(aspirin_smiles)
# Basic properties
print(f"Molecular formula: {rdMolDescriptors.CalcMolFormula(mol)}")
print(f"Exact molecular weight: {rdMolDescriptors.CalcExactMolWt(mol):.2f}")
print(f"Number of heavy atoms: {mol.GetNumAtoms()}")
# Visualize (returns a PIL image)
Draw.MolToImage(mol)
17.4.2 Molecular Property Prediction
Given a SMILES string, predict properties like toxicity or solubility. This is text classification:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# ChemBERTa: BERT for chemistry
model_name = "seyonec/ChemBERTa-zinc-base-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
# Note: the classification head is randomly initialized here; it must be
# fine-tuned on labeled property data before its predictions are meaningful
def predict_property(smiles: str) -> float:
"""Predict molecular property from SMILES."""
inputs = tokenizer(smiles, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
outputs = model(**inputs)
probs = torch.softmax(outputs.logits, dim=-1)
return probs[0, 1].item() # Probability of positive class
# Example: predict property
score = predict_property("CC(=O)OC1=CC=CC=C1C(=O)O")
print(f"Property prediction: {score:.3f}")17.4.3 Molecular Fingerprints
An alternative to sequence models: convert molecules to fixed-length vectors called fingerprints. These binary vectors encode structural features:
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np
def get_morgan_fingerprint(smiles: str, radius: int = 2, n_bits: int = 2048):
"""Convert SMILES to Morgan fingerprint vector."""
mol = Chem.MolFromSmiles(smiles)
if mol is None:
return None
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
return np.array(fp)
# Compare two molecules
aspirin_fp = get_morgan_fingerprint("CC(=O)OC1=CC=CC=C1C(=O)O")
ibuprofen_fp = get_morgan_fingerprint("CC(C)CC1=CC=C(C=C1)C(C)C(=O)O")
# Tanimoto similarity
similarity = np.sum(aspirin_fp & ibuprofen_fp) / np.sum(aspirin_fp | ibuprofen_fp)
print(f"Aspirin-Ibuprofen similarity: {similarity:.3f}")Fingerprints work well with classical ML (random forests, SVMs) and are fast to compute—useful for screening millions of compounds.
17.5 Drug-Target Interaction Prediction
Clinical Context: A pharmaceutical company has a protein target implicated in cancer. They want to find molecules that bind to this target. Experimentally testing millions of candidates is expensive; computational prediction narrows the search.
17.5.1 The Prediction Task
Drug-target interaction (DTI) prediction asks: given a drug (molecule) and a target (protein), will they interact? This is a matching problem—similar to question-answering or retrieval tasks in NLP.
17.5.2 Embedding-Based Approach
Embed both drug and protein into the same space; predict interaction from embedding similarity:
import torch
import torch.nn as nn
class DTIPredictor(nn.Module):
"""Predict drug-target interaction from embeddings."""
def __init__(self, drug_dim, protein_dim, hidden_dim=256):
super().__init__()
# Project both to same space
self.drug_proj = nn.Linear(drug_dim, hidden_dim)
self.protein_proj = nn.Linear(protein_dim, hidden_dim)
# Interaction prediction
self.classifier = nn.Sequential(
nn.Linear(hidden_dim * 2, hidden_dim),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(hidden_dim, 1),
nn.Sigmoid()
)
def forward(self, drug_embedding, protein_embedding):
drug_proj = self.drug_proj(drug_embedding)
protein_proj = self.protein_proj(protein_embedding)
# Concatenate and predict
combined = torch.cat([drug_proj, protein_proj], dim=-1)
return self.classifier(combined)
# Usage:
# 1. Get drug embedding from ChemBERTa or fingerprint
# 2. Get protein embedding from ESM
# 3. Predict interaction
17.5.3 End-to-End Pipeline
A complete DTI pipeline:
import torch
from transformers import AutoTokenizer, AutoModel
import esm
class DrugTargetPipeline:
"""End-to-end drug-target interaction prediction."""
def __init__(self):
# Drug encoder (ChemBERTa)
self.drug_tokenizer = AutoTokenizer.from_pretrained(
"seyonec/ChemBERTa-zinc-base-v1"
)
        self.drug_model = AutoModel.from_pretrained(
            "seyonec/ChemBERTa-zinc-base-v1"
        )
        self.drug_model.eval()
        # Protein encoder (ESM-2, small 8M-parameter variant)
        self.protein_model, self.alphabet = esm.pretrained.esm2_t6_8M_UR50D()
        self.protein_model.eval()
        self.batch_converter = self.alphabet.get_batch_converter()
# Interaction predictor (would be trained on DTI data)
self.predictor = DTIPredictor(
drug_dim=768, # ChemBERTa hidden size
protein_dim=320 # ESM-2 8M hidden size
)
def encode_drug(self, smiles: str) -> torch.Tensor:
inputs = self.drug_tokenizer(smiles, return_tensors="pt")
with torch.no_grad():
outputs = self.drug_model(**inputs)
return outputs.last_hidden_state[:, 0, :] # CLS token
def encode_protein(self, sequence: str) -> torch.Tensor:
data = [("protein", sequence)]
_, _, tokens = self.batch_converter(data)
with torch.no_grad():
results = self.protein_model(tokens, repr_layers=[6])
embeddings = results["representations"][6]
return embeddings[0, 1:-1].mean(0, keepdim=True) # Mean pool
def predict(self, smiles: str, protein_seq: str) -> float:
drug_emb = self.encode_drug(smiles)
protein_emb = self.encode_protein(protein_seq)
with torch.no_grad():
score = self.predictor(drug_emb, protein_emb)
return score.item()
# Example usage
pipeline = DrugTargetPipeline()
interaction_score = pipeline.predict(
smiles="CC(=O)OC1=CC=CC=C1C(=O)O", # Aspirin
protein_seq="MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH"
)
print(f"Predicted interaction: {interaction_score:.3f}")17.6 Practical Considerations
17.6 Practical Considerations
17.6.1 Computational Resources
Genomic and molecular models vary widely in compute requirements:
| Model | Parameters | GPU Memory | Use Case |
|---|---|---|---|
| DNABERT-6 | 110M | ~4 GB | Variant classification |
| ESM-2 (8M) | 8M | ~1 GB | Quick protein embeddings |
| ESM-2 (650M) | 650M | ~8 GB | High-quality embeddings |
| ESM-2 (3B) | 3B | ~24 GB | Best performance |
| Nucleotide Transformer | 500M–2.5B | 8–32 GB | Genomics research |
For clinical deployment, smaller models (DNABERT, ESM-2 8M) often suffice. Research applications may benefit from larger models.
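When a larger model is needed but GPU memory is tight, loading the weights in half precision is a common mitigation; the sketch below applies it to the Nucleotide Transformer checkpoint used earlier (memory savings are approximate and depend on the hardware).
# Sketch: load a model in half precision to roughly halve its weight memory
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained(
    "InstaDeepAI/nucleotide-transformer-500m-human-ref",
    torch_dtype=torch.float16,  # fp16 weights instead of fp32
)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
print(f"Parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.0f}M")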
17.6.2 Data Sources
Key databases for training and evaluation:
- ClinVar: Curated database of genetic variants and clinical significance
- gnomAD: Population-level variant frequencies
- UniProt: Protein sequences and functional annotations
- ChEMBL: Bioactivity data for drug-like molecules
- PDB: Experimentally determined protein structures
# Example: query ClinVar for variant annotations
import requests
def query_clinvar(variant_id: str):
"""Query ClinVar for variant clinical significance."""
url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
params = {
"db": "clinvar",
"id": variant_id,
"retmode": "json"
}
response = requests.get(url, params=params)
    return response.json()
17.6.3 Connecting to Clinical Workflows
These models integrate into clinical pipelines:
- Variant interpretation: Patient sequencing → variant calling → AI classification → clinical report (see the sketch after this list)
- Pharmacogenomics: Genotype → drug response prediction → dosing recommendation
- Drug repurposing: Existing drugs → DTI prediction → new indications
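As an illustration of the variant-interpretation pipeline above, the sketch below wires the pieces together; the flanking-sequence lookup and the classifier callable are hypothetical placeholders for components built elsewhere (for example, a fine-tuned VariantClassifier wrapped in a scoring function).
# Sketch of a variant-interpretation pipeline (every component here is hypothetical).
def interpret_variant(chrom, pos, ref, alt, get_flanking_sequence, classify_variant,
                      window=64):
    """Assemble a minimal report for one variant.

    get_flanking_sequence(chrom, pos, window): reference DNA of length 2*window + 1
        centered on the variant position (hypothetical lookup, e.g. from a FASTA).
    classify_variant(sequence): pathogenicity probability in [0, 1], e.g. from a
        fine-tuned VariantClassifier behind a scoring function.
    """
    context = get_flanking_sequence(chrom, pos, window)
    assert context[window] == ref, "reference base mismatch"
    mutated = context[:window] + alt + context[window + 1:]
    score = classify_variant(mutated)
    return {
        "variant": f"{chrom}:{pos} {ref}>{alt}",
        "pathogenicity_score": round(score, 3),
        "interpretation": "likely pathogenic" if score >= 0.5 else "likely benign",
        "note": "Model output only; requires review by a clinical geneticist.",
    }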
The deployment considerations from Chapter 16 apply: model validation, regulatory requirements, integration with clinical systems.
17.7 Summary
This chapter extended sequence modeling techniques to biological domains:
- DNA sequences use the same transformer architectures as clinical text, just with a 4-letter alphabet
- Protein sequences can be embedded with ESM models, enabling mutation effect prediction
- Molecules represented as SMILES strings allow NLP-style property prediction
- Drug-target interaction combines molecular and protein embeddings for binding prediction
The key insight: if you understand transformers and embeddings from Chapters 8–10, you already have the conceptual foundation for genomic AI. The applications differ, but the techniques transfer directly.
For clinical deployment, start with smaller models (DNABERT, ESM-2 8M, molecular fingerprints) that run efficiently. Scale up to larger models when accuracy requirements demand it and compute resources allow.