17  Genomics & Drug Discovery

The transformer architectures you learned for clinical text in Chapter 10 have revolutionized biological sequence analysis. DNA, RNA, and proteins are sequences—just with different alphabets. This chapter shows how the same deep learning concepts apply to genomics and drug discovery, connecting these fields to techniques you already understand.

17.1 Biological Sequences as Language

Clinical Context: A genetic counselor reviews a patient’s sequencing results and finds a variant of uncertain significance (VUS) in BRCA2. Is this mutation pathogenic? Traditional methods rely on databases of known variants, but most variants are rare or novel. Machine learning models trained on millions of sequences can predict pathogenicity even for never-before-seen variants.

17.1.1 The Sequence Analogy

Consider the parallels between clinical text and biological sequences:

Property        | Clinical Text             | DNA                         | Protein
----------------|---------------------------|-----------------------------|--------------------------
Alphabet        | ~50,000 words             | 4 letters (A, T, G, C)      | 20 amino acids
Sequence length | 100–5,000 tokens          | 100–100,000+ bases          | 100–2,000 residues
Structure       | Sentences, paragraphs     | Genes, regulatory regions   | Secondary/tertiary folds
Meaning         | Semantics from word order | Function from base sequence | Function from 3D shape

The key insight: the same architectures work across all these domains. Tokenization, embeddings, attention, transformers—all transfer directly.

17.1.2 From Words to Bases

In Chapter 10, we tokenized clinical notes into subwords. For DNA, we tokenize into k-mers—subsequences of length k:

def tokenize_dna(sequence: str, k: int = 6) -> list:
    """Tokenize DNA sequence into k-mers."""
    sequence = sequence.upper()
    kmers = []
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if all(base in 'ATGC' for base in kmer):
            kmers.append(kmer)
    return kmers

# Example
dna = "ATGCGATCGATCG"
print(tokenize_dna(dna, k=3))
# ['ATG', 'TGC', 'GCG', 'CGA', 'GAT', 'ATC', 'TCG', 'CGA', 'GAT', 'ATC', 'TCG']

With k=6, there are \(4^6 = 4,096\) possible tokens—comparable to a small vocabulary. Larger k captures more context but increases sparsity.
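
A quick check of how the vocabulary grows with k:

# 4^k possible k-mers of length k
for k in range(3, 9):
    print(f"k={k}: {4 ** k:,} possible k-mers")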

The original DNABERT uses fixed k-mer tokenization; newer models such as DNABERT-2 use learned tokenization similar to byte-pair encoding, letting the model discover meaningful subsequences.
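
To make the idea concrete, here is a minimal sketch of learning a BPE-style vocabulary over raw DNA with the Hugging Face tokenizers library. The toy corpus and vocabulary size are placeholders, and this is not DNABERT-2's actual tokenizer:

# Minimal sketch: learn a subword vocabulary over raw DNA strings.
# Assumes the `tokenizers` library is installed; `dna_corpus` is a toy
# placeholder for a real collection of sequences.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

dna_corpus = [
    "ATGCGATCGATCGTAGCTAGCTAGGCTAACGTTAGC",
    "GGCTAGCTAGCATCGATCGATTTACGCGATCGATCG",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(vocab_size=256,
                     special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train_from_iterator(dna_corpus, trainer)

print(tokenizer.encode("ATGCGATCGATCG").tokens)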

17.1.3 Why This Matters Clinically

Genomic AI has direct clinical applications:

  • Variant interpretation: Classifying genetic variants as pathogenic or benign
  • Gene expression: Predicting how mutations affect protein production
  • Pharmacogenomics: Predicting drug response based on genetic variants
  • Cancer genomics: Identifying driver mutations from tumor sequencing

These are classification and regression problems—exactly what you’ve been doing with other data types.

17.2 DNA Language Models

Clinical Context: Whole genome sequencing generates 3 billion base pairs per patient. Interpreting this data requires models that understand DNA “grammar”—the patterns that distinguish functional regions from noise, pathogenic variants from benign polymorphisms.

17.2.1 DNABERT: Transformers for Genomes

DNABERT applies the BERT architecture to DNA sequences. Pre-trained on the human genome through masked language modeling (predicting masked k-mers from context), it learns representations useful for downstream tasks.

from transformers import AutoTokenizer, AutoModel
import torch

# Load DNABERT
tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-6")
model = AutoModel.from_pretrained("zhihan1996/DNABERT-6")

# Tokenize a DNA sequence
sequence = "ATGCGATCGATCGATCGATCGATCGATCGATCG"
# DNABERT expects k-mer tokenized input with spaces
kmers = " ".join([sequence[i:i+6] for i in range(len(sequence) - 5)])
inputs = tokenizer(kmers, return_tensors="pt")

# Get embeddings
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state  # (1, seq_len, 768)

print(f"Embedding shape: {embeddings.shape}")

17.2.2 Nucleotide Transformer: Scaling Up

Nucleotide Transformer models scale to billions of parameters, trained on genomes from thousands of species. They capture evolutionary patterns that inform function:

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load Nucleotide Transformer
model_name = "InstaDeepAI/nucleotide-transformer-500m-human-ref"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# The tokenizer accepts raw DNA directly (no manual k-mer preprocessing)
sequence = "ATGCGATCGATCGATCG"
inputs = tokenizer(sequence, return_tensors="pt")

# Get embeddings for downstream tasks
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
    embeddings = outputs.hidden_states[-1]  # Last layer
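
Continuing from the block above, a minimal sketch of pooling the per-token embeddings into one vector per sequence, using the attention mask so any padded positions are ignored:

# Mean-pool token embeddings into a single sequence-level vector
mask = inputs["attention_mask"].unsqueeze(-1)               # (1, seq_len, 1)
pooled = (embeddings * mask).sum(dim=1) / mask.sum(dim=1)   # (1, hidden_dim)
print(f"Pooled embedding shape: {pooled.shape}")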

17.2.3 Variant Effect Prediction

A key clinical task: given a genetic variant, predict its effect. This is a classification problem using sequence context:

import torch
import torch.nn as nn

class VariantClassifier(nn.Module):
    """Classify variants using DNA embeddings."""

    def __init__(self, dna_model, hidden_dim=256, num_classes=2):
        super().__init__()
        self.dna_model = dna_model
        # Freeze pre-trained weights (optional)
        for param in self.dna_model.parameters():
            param.requires_grad = False

        self.classifier = nn.Sequential(
            nn.Linear(768, hidden_dim),  # DNABERT hidden size
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_classes)
        )

    def forward(self, input_ids, attention_mask):
        # Get DNA embeddings
        outputs = self.dna_model(input_ids=input_ids, attention_mask=attention_mask)
        # Use [CLS] token representation
        cls_embedding = outputs.last_hidden_state[:, 0, :]
        return self.classifier(cls_embedding)

# Training follows standard PyTorch patterns from earlier chapters

The workflow mirrors clinical NLP: pre-trained model → fine-tune on labeled variants → deploy for classification.
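
A minimal sketch of that fine-tuning step, assuming a hypothetical DataLoader named train_loader that yields batches with "input_ids", "attention_mask", and "label" tensors for labeled variants:

import torch
import torch.nn as nn

# `dna_model` is the pre-trained DNA encoder loaded earlier; `train_loader`
# is a hypothetical DataLoader of labeled variant sequences.
model = VariantClassifier(dna_model)
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(3):
    for batch in train_loader:
        optimizer.zero_grad()
        logits = model(batch["input_ids"], batch["attention_mask"])
        loss = criterion(logits, batch["label"])
        loss.backward()
        optimizer.step()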

17.3 Protein Sequence Analysis

Clinical Context: A patient has a novel missense mutation in a cardiac ion channel gene. The variant changes one amino acid in the protein sequence. Will this disrupt protein function and cause arrhythmia? Protein language models predict functional effects from sequence alone.

17.3.1 Proteins as Sequences

Proteins are chains of amino acids, each represented by a single letter (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y). A protein sequence might look like:

MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH

This is the start of the hemoglobin alpha subunit, the protein that carries oxygen in blood. The sequence determines the protein’s 3D structure, which determines its function.

17.3.2 ESM: Evolutionary Scale Modeling

ESM (Evolutionary Scale Modeling) from Meta AI is a family of protein language models trained on millions of protein sequences. Like BERT for text, ESM learns contextualized embeddings that capture functional and structural properties.

import torch
import esm

# Load ESM-2 model
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

# Prepare protein sequence
data = [("protein1", "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH")]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

# Get embeddings
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[33], return_contacts=True)

# Token embeddings (per-residue)
token_embeddings = results["representations"][33]  # (1, seq_len, 1280)

# Mean pooling for sequence-level embedding
sequence_embedding = token_embeddings[0, 1:-1].mean(0)  # Exclude start/end tokens
print(f"Sequence embedding shape: {sequence_embedding.shape}")  # (1280,)

17.3.3 Predicting Mutation Effects

ESM enables zero-shot prediction of mutation effects. One simple heuristic compares embeddings of the wild-type and mutant sequences:

def predict_mutation_effect(model, alphabet, wild_type_seq, position, mutant_aa):
    """Predict effect of single amino acid mutation using ESM."""

    batch_converter = alphabet.get_batch_converter()

    # Create mutant sequence
    mutant_seq = wild_type_seq[:position] + mutant_aa + wild_type_seq[position+1:]

    # Get embeddings for both
    data = [("wild_type", wild_type_seq), ("mutant", mutant_seq)]
    _, _, batch_tokens = batch_converter(data)

    with torch.no_grad():
        results = model(batch_tokens, repr_layers=[33])

    wt_embedding = results["representations"][33][0, position+1]  # +1 for start token
    mut_embedding = results["representations"][33][1, position+1]

    # Cosine similarity as effect proxy
    similarity = torch.cosine_similarity(wt_embedding, mut_embedding, dim=0)

    # Lower similarity = more disruptive mutation
    return similarity.item()

# Example: predict effect of mutation at position 10
effect = predict_mutation_effect(model, alphabet, "MVLSPADKTNVKAAW...", 10, "A")
print(f"Mutation effect score: {effect:.3f}")

17.3.4 Connecting to Structure: AlphaFold

While this chapter focuses on sequence models, it’s worth noting that AlphaFold revolutionized protein structure prediction (Jumper et al. 2021). AlphaFold takes a sequence and predicts its 3D structure—enabling analysis of how mutations affect protein shape and function.

For clinical applications, you can:

  1. Use ESM for sequence-level predictions (fast, scalable)
  2. Use AlphaFold for structure-level analysis (slower, more detailed)

# AlphaFold predictions are often accessed via databases
# rather than run locally (requires significant compute)

# Example: fetch predicted structure from AlphaFold DB
import requests

def get_alphafold_structure(uniprot_id):
    """Fetch AlphaFold predicted structure for a protein."""
    url = f"https://alphafold.ebi.ac.uk/files/AF-{uniprot_id}-F1-model_v4.pdb"
    response = requests.get(url)
    if response.status_code == 200:
        return response.text  # PDB format structure
    return None

# Hemoglobin subunit alpha
pdb_data = get_alphafold_structure("P69905")
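
AlphaFold DB models store the per-residue confidence score (pLDDT, 0–100) in the B-factor column of the PDB file. A minimal sketch of reading it, assuming Biopython is installed and the download above succeeded:

import io
from Bio.PDB import PDBParser

# Parse the downloaded model; every atom in a residue carries that
# residue's pLDDT in its B-factor field.
parser = PDBParser(QUIET=True)
structure = parser.get_structure("alphafold_model", io.StringIO(pdb_data))

plddt = [residue["CA"].get_bfactor()
         for residue in structure.get_residues() if "CA" in residue]
print(f"Residues: {len(plddt)}, mean pLDDT: {sum(plddt) / len(plddt):.1f}")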

17.4 Molecules as Sequences: SMILES

Clinical Context: Drug discovery involves screening millions of compounds for potential therapeutic activity. Representing molecules as text strings enables the same sequence models used for language and genomics to predict molecular properties—toxicity, solubility, binding affinity.

17.4.1 SMILES Notation

SMILES (Simplified Molecular Input Line Entry System) represents molecules as text strings:

Molecule | SMILES
---------|------------------------------
Water    | O
Ethanol  | CCO
Aspirin  | CC(=O)OC1=CC=CC=C1C(=O)O
Caffeine | CN1C=NC2=C1C(=O)N(C(=O)N2C)C

This text representation means we can apply NLP techniques directly:

from rdkit import Chem
from rdkit.Chem import Draw

# Parse SMILES to molecule object
aspirin_smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
mol = Chem.MolFromSmiles(aspirin_smiles)

# Basic properties
print(f"Molecular formula: {Chem.rdMolDescriptors.CalcMolFormula(mol)}")
print(f"Molecular weight: {Chem.rdMolDescriptors.CalcExactMolWt(mol):.2f}")
print(f"Number of atoms: {mol.GetNumAtoms()}")

# Visualize
Draw.MolToImage(mol)

17.4.2 Molecular Property Prediction

Given a SMILES string, predict properties like toxicity or solubility. This is text classification:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# ChemBERTa: BERT for chemistry
model_name = "seyonec/ChemBERTa-zinc-base-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=2 attaches a new, randomly initialized classification head;
# it must be fine-tuned on labeled property data before its outputs are meaningful.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def predict_property(smiles: str) -> float:
    """Predict molecular property from SMILES."""
    inputs = tokenizer(smiles, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
        probs = torch.softmax(outputs.logits, dim=-1)
    return probs[0, 1].item()  # Probability of positive class

# Example: predict property
score = predict_property("CC(=O)OC1=CC=CC=C1C(=O)O")
print(f"Property prediction: {score:.3f}")

17.4.3 Molecular Fingerprints

An alternative to sequence models: convert molecules to fixed-length vectors called fingerprints. These binary vectors encode structural features:

from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

def get_morgan_fingerprint(smiles: str, radius: int = 2, n_bits: int = 2048):
    """Convert SMILES to Morgan fingerprint vector."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp)

# Compare two molecules
aspirin_fp = get_morgan_fingerprint("CC(=O)OC1=CC=CC=C1C(=O)O")
ibuprofen_fp = get_morgan_fingerprint("CC(C)CC1=CC=C(C=C1)C(C)C(=O)O")

# Tanimoto similarity
similarity = np.sum(aspirin_fp & ibuprofen_fp) / np.sum(aspirin_fp | ibuprofen_fp)
print(f"Aspirin-Ibuprofen similarity: {similarity:.3f}")

Fingerprints work well with classical ML (random forests, SVMs) and are fast to compute—useful for screening millions of compounds.
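
A minimal sketch pairing Morgan fingerprints with a scikit-learn random forest; the SMILES strings and activity labels below are toy placeholders for a real labeled screening set:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy placeholder data: SMILES strings with made-up binary activity labels
smiles_list = ["CC(=O)OC1=CC=CC=C1C(=O)O", "CCO",
               "CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "O"]
labels = [1, 0, 1, 0]

X = np.array([get_morgan_fingerprint(s) for s in smiles_list])
y = np.array(labels)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)

# Score a new compound (ibuprofen) with the toy model
new_fp = get_morgan_fingerprint("CC(C)CC1=CC=C(C=C1)C(C)C(=O)O")
print(f"Predicted activity probability: {clf.predict_proba([new_fp])[0, 1]:.2f}")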

17.5 Drug-Target Interaction Prediction

Clinical Context: A pharmaceutical company has a protein target implicated in cancer. They want to find molecules that bind to this target. Experimentally testing millions of candidates is expensive; computational prediction narrows the search.

17.5.1 The Prediction Task

Drug-target interaction (DTI) prediction asks: given a drug (molecule) and a target (protein), will they interact? This is a matching problem—similar to question-answering or retrieval tasks in NLP.

17.5.2 Embedding-Based Approach

Embed both drug and protein into the same space; predict interaction from embedding similarity:

import torch
import torch.nn as nn

class DTIPredictor(nn.Module):
    """Predict drug-target interaction from embeddings."""

    def __init__(self, drug_dim, protein_dim, hidden_dim=256):
        super().__init__()
        # Project both to same space
        self.drug_proj = nn.Linear(drug_dim, hidden_dim)
        self.protein_proj = nn.Linear(protein_dim, hidden_dim)

        # Interaction prediction
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid()
        )

    def forward(self, drug_embedding, protein_embedding):
        drug_proj = self.drug_proj(drug_embedding)
        protein_proj = self.protein_proj(protein_embedding)

        # Concatenate and predict
        combined = torch.cat([drug_proj, protein_proj], dim=-1)
        return self.classifier(combined)

# Usage:
# 1. Get drug embedding from ChemBERTa or fingerprint
# 2. Get protein embedding from ESM
# 3. Predict interaction

17.5.3 End-to-End Pipeline

A complete DTI pipeline:

import torch
from transformers import AutoTokenizer, AutoModel
import esm

class DrugTargetPipeline:
    """End-to-end drug-target interaction prediction."""

    def __init__(self):
        # Drug encoder (ChemBERTa)
        self.drug_tokenizer = AutoTokenizer.from_pretrained(
            "seyonec/ChemBERTa-zinc-base-v1"
        )
        self.drug_model = AutoModel.from_pretrained(
            "seyonec/ChemBERTa-zinc-base-v1"
        )

        # Protein encoder (ESM)
        self.protein_model, self.alphabet = esm.pretrained.esm2_t6_8M_UR50D()
        self.batch_converter = self.alphabet.get_batch_converter()

        # Interaction predictor (would be trained on DTI data)
        self.predictor = DTIPredictor(
            drug_dim=768,    # ChemBERTa hidden size
            protein_dim=320  # ESM-2 8M hidden size
        )

    def encode_drug(self, smiles: str) -> torch.Tensor:
        inputs = self.drug_tokenizer(smiles, return_tensors="pt")
        with torch.no_grad():
            outputs = self.drug_model(**inputs)
        return outputs.last_hidden_state[:, 0, :]  # CLS token

    def encode_protein(self, sequence: str) -> torch.Tensor:
        data = [("protein", sequence)]
        _, _, tokens = self.batch_converter(data)
        with torch.no_grad():
            results = self.protein_model(tokens, repr_layers=[6])
        embeddings = results["representations"][6]
        return embeddings[0, 1:-1].mean(0, keepdim=True)  # Mean pool

    def predict(self, smiles: str, protein_seq: str) -> float:
        drug_emb = self.encode_drug(smiles)
        protein_emb = self.encode_protein(protein_seq)
        with torch.no_grad():
            score = self.predictor(drug_emb, protein_emb)
        return score.item()

# Example usage (the predictor here is untrained, so the score is illustrative)
pipeline = DrugTargetPipeline()
interaction_score = pipeline.predict(
    smiles="CC(=O)OC1=CC=CC=C1C(=O)O",  # Aspirin
    protein_seq="MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH"
)
print(f"Predicted interaction: {interaction_score:.3f}")

17.6 Practical Considerations

17.6.1 Computational Resources

Genomic and molecular models vary widely in compute requirements:

Model                  | Parameters | GPU Memory | Use Case
-----------------------|------------|------------|--------------------------
DNABERT-6              | 110M       | ~4 GB      | Variant classification
ESM-2 (8M)             | 8M         | ~1 GB      | Quick protein embeddings
ESM-2 (650M)           | 650M       | ~8 GB      | High-quality embeddings
ESM-2 (3B)             | 3B         | ~24 GB     | Best performance
Nucleotide Transformer | 500M–2.5B  | 8–32 GB    | Genomics research

For clinical deployment, smaller models (DNABERT, ESM-2 8M) often suffice. Research applications may benefit from larger models.

17.6.2 Data Sources

Key databases for training and evaluation:

  • ClinVar: Curated database of genetic variants and clinical significance
  • gnomAD: Population-level variant frequencies
  • UniProt: Protein sequences and functional annotations
  • ChEMBL: Bioactivity data for drug-like molecules
  • PDB: Experimentally determined protein structures

Many of these are accessible programmatically; for example, ClinVar can be queried through the NCBI E-utilities API:

# Example: query ClinVar for variant annotations
import requests

def query_clinvar(variant_id: str):
    """Query ClinVar for variant clinical significance."""
    url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
    params = {
        "db": "clinvar",
        "id": variant_id,
        "retmode": "json"
    }
    response = requests.get(url, params=params)
    return response.json()

17.6.3 Connecting to Clinical Workflows

These models integrate into clinical pipelines:

  1. Variant interpretation: Patient sequencing → variant calling → AI classification → clinical report
  2. Pharmacogenomics: Genotype → drug response prediction → dosing recommendation
  3. Drug repurposing: Existing drugs → DTI prediction → new indications

The deployment considerations from Chapter 16 apply: model validation, regulatory requirements, integration with clinical systems.

17.7 Summary

This chapter extended sequence modeling techniques to biological domains:

  • DNA sequences use the same transformer architectures as clinical text, just with a 4-letter alphabet
  • Protein sequences can be embedded with ESM models, enabling mutation effect prediction
  • Molecules represented as SMILES strings allow NLP-style property prediction
  • Drug-target interaction combines molecular and protein embeddings for binding prediction

The key insight: if you understand transformers and embeddings from Chapters 8–10, you already have the conceptual foundation for genomic AI. The applications differ, but the techniques transfer directly.

For clinical deployment, start with smaller models (DNABERT, ESM-2 8M, molecular fingerprints) that run efficiently. Scale up to larger models when accuracy requirements demand it and compute resources allow.