17 Genomics & Drug Discovery
The transformer architectures you learned for clinical text in Chapter 10 have revolutionized biological sequence analysis. DNA, RNA, and proteins are sequences—just with different alphabets. This chapter shows how the same deep learning concepts apply to genomics and drug discovery, connecting these fields to techniques you already understand.
17.1 Biological Sequences as Language
Clinical Context: A genetic counselor reviews a patient’s sequencing results and finds a variant of uncertain significance (VUS) in BRCA2. Is this mutation pathogenic? Traditional methods rely on databases of known variants, but most variants are rare or novel. Machine learning models trained on millions of sequences can predict pathogenicity even for never-before-seen variants.
17.1.1 The Sequence Analogy
Consider the parallels between clinical text and biological sequences:
| Property | Clinical Text | DNA | Protein |
|---|---|---|---|
| Alphabet | ~50,000 words | 4 letters (A, T, G, C) | 20 amino acids |
| Sequence length | 100–5,000 tokens | 100–100,000+ bases | 100–2,000 residues |
| Structure | Sentences, paragraphs | Genes, regulatory regions | Secondary/tertiary folds |
| Meaning | Semantics from word order | Function from base sequence | Function from 3D shape |
The key insight: the same architectures work across all these domains. Tokenization, embeddings, attention, transformers—all transfer directly.
17.1.2 From Words to Bases
In Chapter 10, we tokenized clinical notes into subwords. For DNA, we tokenize into k-mers—subsequences of length k:
def tokenize_dna(sequence: str, k: int = 6) -> list:
    """Tokenize DNA sequence into k-mers."""
    sequence = sequence.upper()
    kmers = []
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if all(base in 'ATGC' for base in kmer):
            kmers.append(kmer)
    return kmers
# Example
dna = "ATGCGATCGATCG"
print(tokenize_dna(dna, k=3))
# ['ATG', 'TGC', 'GCG', 'CGA', 'GAT', 'ATC', 'TCG', 'CGA', 'GAT', 'ATC', 'TCG']
With k=6, there are \(4^6 = 4,096\) possible tokens—comparable to a small vocabulary. Larger k captures more context but increases sparsity.
Newer models such as DNABERT-2 replace fixed k-mers with learned byte-pair-encoding-style tokenization, letting the model discover meaningful subsequences.
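As a quick look at what learned tokenization produces, the sketch below loads the DNABERT-2 tokenizer from the Hugging Face Hub and prints the tokens for a short sequence; the repository name and the trust_remote_code flag are assumptions about how that checkpoint is published, not something used elsewhere in this chapter.
# Minimal sketch: inspect a learned DNA tokenizer (repository ID is an assumption)
from transformers import AutoTokenizer

# DNABERT-2 ships custom code, so trust_remote_code is typically required
tokenizer = AutoTokenizer.from_pretrained(
    "zhihan1996/DNABERT-2-117M", trust_remote_code=True
)
print(tokenizer.tokenize("ATGCGATCGATCGATCGATCG"))
# Tokens are variable-length subsequences discovered during pre-training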
17.1.3 Why This Matters Clinically
Genomic AI has direct clinical applications:
- Variant interpretation: Classifying genetic variants as pathogenic or benign
- Gene expression: Predicting how mutations affect protein production
- Pharmacogenomics: Predicting drug response based on genetic variants
- Cancer genomics: Identifying driver mutations from tumor sequencing
These are classification and regression problems—exactly what you’ve been doing with other data types.
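To make that framing concrete, here is a toy sketch that casts variant pathogenicity as binary classification on k-mer count features; the sequences and labels are synthetic, and scikit-learn is assumed to be available.
# Toy sketch: variant pathogenicity as binary classification on 3-mer counts.
# Sequences and labels are synthetic and exist only to illustrate the framing.
from itertools import product
import numpy as np
from sklearn.linear_model import LogisticRegression

KMERS = ["".join(p) for p in product("ATGC", repeat=3)]  # all 64 possible 3-mers

def kmer_counts(seq: str, k: int = 3) -> np.ndarray:
    """Represent a sequence as a fixed-length vector of k-mer counts."""
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
    return np.array([counts[kmer] for kmer in KMERS], dtype=float)

rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list("ATGC"), size=50)) for _ in range(200)]
labels = rng.integers(0, 2, size=200)  # 0 = benign, 1 = pathogenic (synthetic)

X = np.stack([kmer_counts(s) for s in seqs])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict_proba(X[:3])[:, 1])  # predicted pathogenicity probabilities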
17.2 DNA Language Models
Clinical Context: Whole genome sequencing generates 3 billion base pairs per patient. Interpreting this data requires models that understand DNA “grammar”—the patterns that distinguish functional regions from noise, pathogenic variants from benign polymorphisms.
17.2.1 DNABERT: Transformers for Genomes
DNABERT applies the BERT architecture to DNA sequences. Pre-trained on the human genome through masked language modeling (predicting masked k-mers from context), it learns representations useful for downstream tasks.
from transformers import AutoTokenizer, AutoModel
import torch
# Load DNABERT
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-6")
model = AutoModel.from_pretrained("zhihan1996/DNABERT-6")
# Tokenize a DNA sequence
sequence = "ATGCGATCGATCGATCGATCGATCGATCGATCG"
# DNABERT expects k-mer tokenized input with spaces
kmers = " ".join([sequence[i:i+6] for i in range(len(sequence) - 5)])
inputs = tokenizer(kmers, return_tensors="pt")
# Get embeddings
with torch.no_grad():
outputs = model(**inputs)
embeddings = outputs.last_hidden_state # (1, seq_len, 768)
print(f"Embedding shape: {embeddings.shape}")17.2.2 Nucleotide Transformer: Scaling Up
Nucleotide Transformer models scale to billions of parameters and are trained on thousands of human genomes plus reference genomes from hundreds of other species. They capture evolutionary patterns that inform function:
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
# Load Nucleotide Transformer
model_name = "InstaDeepAI/nucleotide-transformer-500m-human-ref"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
# The tokenizer splits sequences into non-overlapping 6-mers (with single bases as fallback)
sequence = "ATGCGATCGATCGATCG"
inputs = tokenizer(sequence, return_tensors="pt")
# Get embeddings for downstream tasks
with torch.no_grad():
outputs = model(**inputs, output_hidden_states=True)
    embeddings = outputs.hidden_states[-1]  # Last layer
17.2.3 Variant Effect Prediction
A key clinical task: given a genetic variant, predict its effect. This is a classification problem using sequence context:
import torch
import torch.nn as nn
class VariantClassifier(nn.Module):
"""Classify variants using DNA embeddings."""
def __init__(self, dna_model, hidden_dim=256, num_classes=2):
super().__init__()
self.dna_model = dna_model
# Freeze pre-trained weights (optional)
for param in self.dna_model.parameters():
param.requires_grad = False
self.classifier = nn.Sequential(
nn.Linear(768, hidden_dim), # DNABERT hidden size
nn.ReLU(),
nn.Dropout(0.1),
nn.Linear(hidden_dim, num_classes)
)
def forward(self, input_ids, attention_mask):
# Get DNA embeddings
outputs = self.dna_model(input_ids=input_ids, attention_mask=attention_mask)
# Use [CLS] token representation
cls_embedding = outputs.last_hidden_state[:, 0, :]
return self.classifier(cls_embedding)
# Training follows standard PyTorch patterns from earlier chapters
The workflow mirrors clinical NLP: pre-trained model → fine-tune on labeled variants → deploy for classification.
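A minimal fine-tuning loop for the classifier might look like the sketch below; the train_loader yielding tokenized variant windows with binary labels is assumed and not constructed here.
# Sketch: fine-tune VariantClassifier on labeled variants.
# Assumes `train_loader` yields batches with input_ids, attention_mask, labels;
# building that DataLoader from a labeled variant set is not shown here.
import torch
import torch.nn as nn

def train_variant_classifier(model, train_loader, epochs=3, lr=1e-4, device="cpu"):
    model.to(device)
    model.train()
    model.dna_model.eval()  # keep the frozen DNA encoder's dropout disabled
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=lr
    )
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        total_loss = 0.0
        for batch in train_loader:
            optimizer.zero_grad()
            logits = model(
                input_ids=batch["input_ids"].to(device),
                attention_mask=batch["attention_mask"].to(device),
            )
            loss = loss_fn(logits, batch["labels"].to(device))
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"Epoch {epoch + 1}: mean loss {total_loss / len(train_loader):.4f}")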
17.3 Protein Sequence Analysis
Clinical Context: A patient has a novel missense mutation in a cardiac ion channel gene. The variant changes one amino acid in the protein sequence. Will this disrupt protein function and cause arrhythmia? Protein language models predict functional effects from sequence alone.
17.3.1 Proteins as Sequences
Proteins are chains of amino acids, each represented by a single letter (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y). A protein sequence might look like:
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH
This is the start of hemoglobin’s alpha subunit, the protein that carries oxygen in blood. The sequence determines the protein’s 3D structure, which in turn determines its function.
17.3.2 ESM: Evolutionary Scale Modeling
ESM (Evolutionary Scale Modeling) from Meta AI is a family of protein language models trained on millions of protein sequences. Like BERT for text, ESM learns contextualized embeddings that capture functional and structural properties.
import torch
import esm
# Load ESM-2 model
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()
# Prepare protein sequence
data = [("protein1", "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH")]
batch_labels, batch_strs, batch_tokens = batch_converter(data)
# Get embeddings
with torch.no_grad():
results = model(batch_tokens, repr_layers=[33], return_contacts=True)
# Token embeddings (per-residue)
token_embeddings = results["representations"][33] # (1, seq_len, 1280)
# Mean pooling for sequence-level embedding
sequence_embedding = token_embeddings[0, 1:-1].mean(0) # Exclude start/end tokens
print(f"Sequence embedding shape: {sequence_embedding.shape}") # (1280,)17.3.3 Predicting Mutation Effects
A simple zero-shot proxy for a mutation’s effect is to compare ESM embeddings of the wild-type and mutant sequences:
def predict_mutation_effect(model, alphabet, wild_type_seq, position, mutant_aa):
"""Predict effect of single amino acid mutation using ESM."""
batch_converter = alphabet.get_batch_converter()
# Create mutant sequence
mutant_seq = wild_type_seq[:position] + mutant_aa + wild_type_seq[position+1:]
# Get embeddings for both
data = [("wild_type", wild_type_seq), ("mutant", mutant_seq)]
_, _, batch_tokens = batch_converter(data)
with torch.no_grad():
results = model(batch_tokens, repr_layers=[33])
wt_embedding = results["representations"][33][0, position+1] # +1 for start token
mut_embedding = results["representations"][33][1, position+1]
# Cosine similarity as effect proxy
similarity = torch.cosine_similarity(wt_embedding, mut_embedding, dim=0)
# Lower similarity = more disruptive mutation
return similarity.item()
# Example: predict effect of mutation at position 10
effect = predict_mutation_effect(model, alphabet, "MVLSPADKTNVKAAW...", 10, "A")
print(f"Mutation effect score: {effect:.3f}")17.3.4 Connecting to Structure: AlphaFold
While this chapter focuses on sequence models, it’s worth noting that AlphaFold revolutionized protein structure prediction (Jumper et al. 2021). AlphaFold takes a sequence and predicts its 3D structure—enabling analysis of how mutations affect protein shape and function.
For clinical applications, you can:
1. Use ESM for sequence-level predictions (fast, scalable)
2. Use AlphaFold for structure-level analysis (slower, more detailed)
# AlphaFold predictions are often accessed via databases
# rather than run locally (requires significant compute)
# Example: fetch predicted structure from AlphaFold DB
import requests
def get_alphafold_structure(uniprot_id):
"""Fetch AlphaFold predicted structure for a protein."""
url = f"https://alphafold.ebi.ac.uk/files/AF-{uniprot_id}-F1-model_v4.pdb"
response = requests.get(url)
if response.status_code == 200:
return response.text # PDB format structure
return None
# Hemoglobin subunit alpha
pdb_data = get_alphafold_structure("P69905")
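One detail worth knowing: AlphaFold DB stores its per-residue confidence score (pLDDT, 0–100) in the B-factor column of these PDB files, so a rough quality check on the fetched structure takes only a few lines. A minimal sketch operating on the pdb_data string fetched above:
# Average pLDDT over C-alpha atoms (AlphaFold writes pLDDT into the B-factor column)
def mean_plddt(pdb_text: str) -> float:
    scores = []
    for line in pdb_text.splitlines():
        # ATOM records: atom name is columns 13-16, B-factor is columns 61-66 (1-based)
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return sum(scores) / len(scores) if scores else float("nan")

if pdb_data is not None:
    print(f"Mean pLDDT: {mean_plddt(pdb_data):.1f}")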
17.4 Molecules as Sequences: SMILES
Clinical Context: Drug discovery involves screening millions of compounds for potential therapeutic activity. Representing molecules as text strings enables the same sequence models used for language and genomics to predict molecular properties—toxicity, solubility, binding affinity.
17.4.1 SMILES Notation
SMILES (Simplified Molecular Input Line Entry System) represents molecules as text strings:
| Molecule | SMILES |
|---|---|
| Water | O |
| Ethanol | CCO |
| Aspirin | CC(=O)OC1=CC=CC=C1C(=O)O |
| Caffeine | CN1C=NC2=C1C(=O)N(C(=O)N2C)C |
This text representation means we can apply NLP techniques directly:
from rdkit import Chem
from rdkit.Chem import Draw, rdMolDescriptors
# Parse SMILES to molecule object
aspirin_smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
mol = Chem.MolFromSmiles(aspirin_smiles)
# Basic properties
print(f"Molecular formula: {rdMolDescriptors.CalcMolFormula(mol)}")
print(f"Exact molecular weight: {rdMolDescriptors.CalcExactMolWt(mol):.2f}")
print(f"Number of heavy atoms: {mol.GetNumAtoms()}")
# Visualize (returns a PIL image)
Draw.MolToImage(mol)
17.4.2 Molecular Property Prediction
Given a SMILES string, predict properties like toxicity or solubility. This is text classification:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# ChemBERTa: BERT for chemistry
model_name = "seyonec/ChemBERTa-zinc-base-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
# Note: the classification head is randomly initialized here; it must be
# fine-tuned on labeled property data before its predictions are meaningful
def predict_property(smiles: str) -> float:
"""Predict molecular property from SMILES."""
inputs = tokenizer(smiles, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
outputs = model(**inputs)
probs = torch.softmax(outputs.logits, dim=-1)
return probs[0, 1].item() # Probability of positive class
# Example: predict property
score = predict_property("CC(=O)OC1=CC=CC=C1C(=O)O")
print(f"Property prediction: {score:.3f}")17.4.3 Molecular Fingerprints
An alternative to sequence models: convert molecules to fixed-length vectors called fingerprints. These binary vectors encode structural features:
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np
def get_morgan_fingerprint(smiles: str, radius: int = 2, n_bits: int = 2048):
"""Convert SMILES to Morgan fingerprint vector."""
mol = Chem.MolFromSmiles(smiles)
if mol is None:
return None
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
return np.array(fp)
# Compare two molecules
aspirin_fp = get_morgan_fingerprint("CC(=O)OC1=CC=CC=C1C(=O)O")
ibuprofen_fp = get_morgan_fingerprint("CC(C)CC1=CC=C(C=C1)C(C)C(=O)O")
# Tanimoto similarity
similarity = np.sum(aspirin_fp & ibuprofen_fp) / np.sum(aspirin_fp | ibuprofen_fp)
print(f"Aspirin-Ibuprofen similarity: {similarity:.3f}")Fingerprints work well with classical ML (random forests, SVMs) and are fast to compute—useful for screening millions of compounds.
17.5 Drug-Target Interaction Prediction
Clinical Context: A pharmaceutical company has a protein target implicated in cancer. They want to find molecules that bind to this target. Experimentally testing millions of candidates is expensive; computational prediction narrows the search.
17.5.1 The Prediction Task
Drug-target interaction (DTI) prediction asks: given a drug (molecule) and a target (protein), will they interact? This is a matching problem—similar to question-answering or retrieval tasks in NLP.
17.5.2 Embedding-Based Approach
Embed both drug and protein into the same space; predict interaction from embedding similarity:
import torch
import torch.nn as nn
class DTIPredictor(nn.Module):
"""Predict drug-target interaction from embeddings."""
def __init__(self, drug_dim, protein_dim, hidden_dim=256):
super().__init__()
# Project both to same space
self.drug_proj = nn.Linear(drug_dim, hidden_dim)
self.protein_proj = nn.Linear(protein_dim, hidden_dim)
# Interaction prediction
self.classifier = nn.Sequential(
nn.Linear(hidden_dim * 2, hidden_dim),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(hidden_dim, 1),
nn.Sigmoid()
)
def forward(self, drug_embedding, protein_embedding):
drug_proj = self.drug_proj(drug_embedding)
protein_proj = self.protein_proj(protein_embedding)
# Concatenate and predict
combined = torch.cat([drug_proj, protein_proj], dim=-1)
return self.classifier(combined)
# Usage:
# 1. Get drug embedding from ChemBERTa or fingerprint
# 2. Get protein embedding from ESM
# 3. Predict interaction
17.5.3 End-to-End Pipeline
A complete DTI pipeline:
import torch
from transformers import AutoTokenizer, AutoModel
import esm
class DrugTargetPipeline:
"""End-to-end drug-target interaction prediction."""
def __init__(self):
# Drug encoder (ChemBERTa)
self.drug_tokenizer = AutoTokenizer.from_pretrained(
"seyonec/ChemBERTa-zinc-base-v1"
)
        self.drug_model = AutoModel.from_pretrained(
            "seyonec/ChemBERTa-zinc-base-v1"
        )
        self.drug_model.eval()
        # Protein encoder (ESM-2, small 8M-parameter variant)
        self.protein_model, self.alphabet = esm.pretrained.esm2_t6_8M_UR50D()
        self.protein_model.eval()
        self.batch_converter = self.alphabet.get_batch_converter()
# Interaction predictor (would be trained on DTI data)
self.predictor = DTIPredictor(
drug_dim=768, # ChemBERTa hidden size
protein_dim=320 # ESM-2 8M hidden size
)
def encode_drug(self, smiles: str) -> torch.Tensor:
inputs = self.drug_tokenizer(smiles, return_tensors="pt")
with torch.no_grad():
outputs = self.drug_model(**inputs)
return outputs.last_hidden_state[:, 0, :] # CLS token
def encode_protein(self, sequence: str) -> torch.Tensor:
data = [("protein", sequence)]
_, _, tokens = self.batch_converter(data)
with torch.no_grad():
results = self.protein_model(tokens, repr_layers=[6])
embeddings = results["representations"][6]
return embeddings[0, 1:-1].mean(0, keepdim=True) # Mean pool
def predict(self, smiles: str, protein_seq: str) -> float:
drug_emb = self.encode_drug(smiles)
protein_emb = self.encode_protein(protein_seq)
with torch.no_grad():
score = self.predictor(drug_emb, protein_emb)
return score.item()
# Example usage
pipeline = DrugTargetPipeline()
interaction_score = pipeline.predict(
smiles="CC(=O)OC1=CC=CC=C1C(=O)O", # Aspirin
protein_seq="MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH"
)
print(f"Predicted interaction: {interaction_score:.3f}")17.6 Practical Considerations
17.6 Practical Considerations
17.6.1 Computational Resources
Genomic and molecular models vary widely in compute requirements:
| Model | Parameters | GPU Memory | Use Case |
|---|---|---|---|
| DNABERT-6 | 110M | ~4 GB | Variant classification |
| ESM-2 (8M) | 8M | ~1 GB | Quick protein embeddings |
| ESM-2 (650M) | 650M | ~8 GB | High-quality embeddings |
| ESM-2 (3B) | 3B | ~24 GB | Best performance |
| Nucleotide Transformer | 500M–2.5B | 8–32 GB | Genomics research |
For clinical deployment, smaller models (DNABERT, ESM-2 8M) often suffice. Research applications may benefit from larger models.
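When a larger model is needed but GPU memory is tight, loading the weights in half precision is a common mitigation; the sketch below applies it to the Nucleotide Transformer checkpoint used earlier (memory savings are approximate and depend on the hardware).
# Sketch: load a model in half precision to roughly halve its weight memory
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained(
    "InstaDeepAI/nucleotide-transformer-500m-human-ref",
    torch_dtype=torch.float16,  # fp16 weights instead of fp32
)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
print(f"Parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.0f}M")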
17.6.2 Data Sources
Key databases for training and evaluation:
- ClinVar: Curated database of genetic variants and clinical significance
- gnomAD: Population-level variant frequencies
- UniProt: Protein sequences and functional annotations
- ChEMBL: Bioactivity data for drug-like molecules
- PDB: Experimentally determined protein structures
# Example: query ClinVar for variant annotations
import requests
def query_clinvar(variant_id: str):
"""Query ClinVar for variant clinical significance."""
url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
params = {
"db": "clinvar",
"id": variant_id,
"retmode": "json"
}
response = requests.get(url, params=params)
    return response.json()
17.6.3 Connecting to Clinical Workflows
These models integrate into clinical pipelines:
- Variant interpretation: Patient sequencing → variant calling → AI classification → clinical report (see the sketch after this list)
- Pharmacogenomics: Genotype → drug response prediction → dosing recommendation
- Drug repurposing: Existing drugs → DTI prediction → new indications
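As an illustration of the variant-interpretation pipeline above, the sketch below wires the pieces together; the flanking-sequence lookup and the classifier callable are hypothetical placeholders for components built elsewhere (for example, a fine-tuned VariantClassifier wrapped in a scoring function).
# Sketch of a variant-interpretation pipeline (every component here is hypothetical).
def interpret_variant(chrom, pos, ref, alt, get_flanking_sequence, classify_variant,
                      window=64):
    """Assemble a minimal report for one variant.

    get_flanking_sequence(chrom, pos, window): reference DNA of length 2*window + 1
        centered on the variant position (hypothetical lookup, e.g. from a FASTA).
    classify_variant(sequence): pathogenicity probability in [0, 1], e.g. from a
        fine-tuned VariantClassifier behind a scoring function.
    """
    context = get_flanking_sequence(chrom, pos, window)
    assert context[window] == ref, "reference base mismatch"
    mutated = context[:window] + alt + context[window + 1:]
    score = classify_variant(mutated)
    return {
        "variant": f"{chrom}:{pos} {ref}>{alt}",
        "pathogenicity_score": round(score, 3),
        "interpretation": "likely pathogenic" if score >= 0.5 else "likely benign",
        "note": "Model output only; requires review by a clinical geneticist.",
    }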
The deployment considerations from Chapter 16 apply: model validation, regulatory requirements, integration with clinical systems.
17.7 Summary
This chapter extended sequence modeling techniques to biological domains:
- DNA sequences use the same transformer architectures as clinical text, just with a 4-letter alphabet
- Protein sequences can be embedded with ESM models, enabling mutation effect prediction
- Molecules represented as SMILES strings allow NLP-style property prediction
- Drug-target interaction combines molecular and protein embeddings for binding prediction
The key insight: if you understand transformers and embeddings from Chapters 8–10, you already have the conceptual foundation for genomic AI. The applications differ, but the techniques transfer directly.
For clinical deployment, start with smaller models (DNABERT, ESM-2 8M, molecular fingerprints) that run efficiently. Scale up to larger models when accuracy requirements demand it and compute resources allow.