What Can Epigenomic Foundation Models Reveal About Biology?
Welcome to the first in a series of posts from Epigenome Technologies on applying foundation model techniques (or "deep learning") to epigenetic datasets. The series aims to provide a variety of perspectives on this area of research, including overviews of major models, potential real-world applications, worked examples of fine-tuning, and deep dives into technical underpinnings. This first post centers on sequence-to-function models that predict molecular outcomes from DNA sequence alone: EVO-2, AlphaGenome, Borzoi, and the GENA-LM suite of models. A subsequent post will concentrate on function-to-function models, which predict (or impute) molecular states from other, related molecular states.
Biological cells maintain protein levels on timescales far exceeding protein half-lives to sustain cellular function; under pathological conditions, however, protein levels drift into dysfunction or even malignancy. Both maintenance and drift are orchestrated in the nucleus, the site of production of the mRNA encoding the next generation of proteins. The epigenetic landscape coordinates the activity of RNA polymerase and its cofactors and therefore serves as a feedback nexus responsible for environmental buffering, memory, and entrapping cells in malfunctioning states. Indeed, epigenetic predisposition determines the efficacy of (re-)programming somatic cells to pluripotency; modern differentiation protocols combine epigenetic drugs (e.g., EZH2, HDAC, or DOT1L inhibitors) with signaling factors in (de-)differentiation-inducing media, which reveals how chromatin state can stymie even "robust" somatic cell reprogramming with reprogramming factor cocktails such as the classic "OSKM" or "OSNL" combinations.
Understanding the fine details of homeostasis, environmental perturbation, disease onset, and progression will require cell-resolution data and models that can generalize despite high complexity and partial observability. Given the physical nature of the nucleus (nearly 2 meters of linear DNA packed into a sphere roughly 4 micrometers in radius), only molecular methods such as Paired-Tag and single-cell (sc)CUT&Tag can generate the necessary high-resolution data (at least for the time being!). As such, foundation models that extrapolate from next-generation sequencing data remain of paramount importance for understanding the long-term dynamics of cellular processes.
Sequence-to-Function Models
The sequence-to-function model, the most straightforward (though far from simple) type of genomic foundation model, uses DNA sequence alone to predict functional outcomes such as open chromatin, histone modifications, gene expression, DNA methylation, transcription factor binding, and splicing. Because all cell types effectively share the same DNA sequence, the predicted outcomes must differ by cell type, so every sequence-to-function mapping fans out (i.e., single input, multiple outputs).
This post explores the details of three sequence-to-function foundation models (AlphaGenome, EVO-2, and Borzoi) and one earlier model (GENA-LM) and examines their components to compare inputs, outputs, latent representations, architectures, and training methodologies.
Because these are sequence-to-function models, variant effect prediction represents their core application, particularly predicting the impact of non-coding variants, with derived applications such as plasmid or promoter sequence optimization, quantitative trait locus prioritization, and de novo sequence generation. Nevertheless, each foundation model takes a distinct approach to the task of function prediction.
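To make the variant-effect framing concrete, the sketch below scores a single-nucleotide variant as the change in predicted tracks between reference and alternate sequences. The `model.predict` call and the aggregation over the sequence axis are hypothetical placeholders rather than any particular model's API; real pipelines typically score log-ratios over gene bodies or local bins.

```python
import numpy as np

# Minimal sketch of sequence-based variant effect scoring, assuming a
# hypothetical `model.predict(seq)` that returns a (positions, tracks)
# array of coverage predictions for a fixed-length input window.

BASES = "ACGT"

def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    idx = {b: i for i, b in enumerate(BASES)}
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in idx:              # 'N' and other symbols stay all-zero
            out[pos, idx[base]] = 1.0
    return out

def variant_effect(model, window: str, offset: int, alt: str) -> np.ndarray:
    """Score a single-nucleotide variant as the per-track change (alt - ref).

    `window` is the reference sequence centred on the variant, `offset` is the
    0-based position of the variant within the window, `alt` the alternate allele."""
    ref_pred = model.predict(one_hot(window))
    alt_seq = window[:offset] + alt + window[offset + 1:]
    alt_pred = model.predict(one_hot(alt_seq))
    # Aggregate over the sequence axis; published methods often sum or take
    # log-ratios of coverage over a gene body or local bin instead.
    return alt_pred.sum(axis=0) - ref_pred.sum(axis=0)
```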
Overview
"Predict what a given DNA sequence does" describes the general area these models fit. A fully-fledged "virtual cell" that provides phenotypic outputs (e.g., proliferation or secretion) may represent the most helpful tool; however, these models aim for an intermediate step: predicting molecular genetic outcomes such as RNA expression, DNA accessibility, chromatin state, or transcription factor binding.
Of note, these recently published models (2024/2025) highlight different headline results. GENA-LM highlights an expanded input size compared with prior models (DNABERT/BigBird), strong performance on histone and transcription factor occupancy, and the incorporation of recurrent memory transformers to boost species classification performance. Borzoi predicts gene expression and open chromatin in a tissue- and species-specific manner from DNA sequence alone, as well as variants that impact gene expression and alternative polyadenylation. AlphaGenome (heavily inspired by Borzoi) predicts chromatin conformation, handles 1 Mb input sequences, and improves variant effect prediction for splicing and gene expression. EVO-2 (strictly speaking, a sequence-to-sequence model) predicts variant pathogenicity (including splice-altering mutations), provides embeddings as inputs for fine-tuned output tasks (such as identifying BRCA1-inactivating mutations), and incorporates downstream models to generate sequences with specific chromatin features.
Of note, the field remains in its early days, and benchmark metrics (such as correlations and the area under the precision-recall curve [AUPRC]), while improving, remain modest. Data constitutes a significant bottleneck for improving model performance and incorporating clinically relevant information. Tissue-level data remains valuable, but not as valuable as cell-type- (or cell-subtype-) resolved molecular information. Nevertheless, these models reflect the cutting edge of what we can currently do; a later post in this series will describe the mechanics of fine-tuning open models on new data and potential applications beyond those described in the associated papers.
Input
Sequence-based foundation models obviously take DNA sequences as input; however, critical decisions include how much sequence to use, how to represent the sequence, and the effective resolution at which to represent the sequence.
| Model | Input Window | Raw Encoding | Sequence Encoding | Latent Resolution |
|---|---|---|---|---|
| AlphaGenome | 1 Mb | One-hot | Convolutional stack | 128 bp bins, 1536 channels |
| EVO-2 | 1 Mb; phylogenetic tag every 131 kb | UTF-8 character | Embedding | 1 bp (full resolution) |
| Borzoi | 524 kb | One-hot | Convolutional stack | 128 bp bins, 1536 channels |
| GENA-LM | ~36 kb (4096 tokens) | Byte-pair encoding | Embedding | 1 token ("full" resolution) |
The species-aware nature of EVO-2 and AlphaGenome requires that the species of origin be provided to the model; Borzoi and GENA-LM, as species-agnostic models, process DNA sequences without reference to their phylogenetic origin.
EVO-2 employs a full (UTF-8) alphabet and directly incorporates phylogeny into the sequence (every 131 kb) in the form of a whole-phylogeny string ("|D__BACTERIA;P__PSEUDOMONADOTA;C__GAMMAPROTEOBACTERIA;O__ENTEROBACTERALES;F__ENTEROBACTERIACEAE;G__ESCHERICHIA;S__ESCHERICHIA|"). As such, the interspersed attention mechanisms can recognize motifs enriched in, or specific to, particular domains or phyla. AlphaGenome (at this point) trains only on human and mouse sequences and, unlike EVO-2, sticks to strict DNA bases with one-hot encoding; the species input is separate from the DNA input. (Of note, the AlphaGenome paper remains unclear on how the species identifier conditions the model, and speculation on this point appears in the Architecture section below. Borzoi, also trained on human and mouse, simply intersperses the training data, with human data computing loss from human output heads and mouse data computing loss from mouse output heads.)
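As a toy illustration of the tag-interspersion idea, the sketch below inserts the quoted phylogeny string at the start of every 131,072-character chunk of a genomic character stream; the exact spacing, placement, and delimiters used in EVO-2's actual data pipeline may differ in detail.

```python
# Illustrative interspersion of a phylogenetic tag into a genomic character
# stream, following the tag format quoted above; chunking every 131,072
# characters is an assumption for this sketch.

TAG = ("|D__BACTERIA;P__PSEUDOMONADOTA;C__GAMMAPROTEOBACTERIA;"
       "O__ENTEROBACTERALES;F__ENTEROBACTERIACEAE;G__ESCHERICHIA;"
       "S__ESCHERICHIA|")

def tag_sequence(seq: str, tag: str = TAG, every: int = 131_072) -> str:
    """Insert the phylogeny tag at the start of every `every`-character chunk."""
    chunks = [seq[i:i + every] for i in range(0, len(seq), every)]
    return "".join(tag + chunk for chunk in chunks)

# Example: a toy 300 kb sequence receives three tags.
toy = "ACGT" * 75_000
print(tag_sequence(toy).count(TAG))  # -> 3
```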
For input encodings, AlphaGenome and Borzoi take similar approaches: raw one-hot DNA sequences pass through a stack of 7-8 convolutional layers with a large (and increasing) number of filters. In both cases, the final layer has 128-bp resolution: shape (8192, 1536) for AlphaGenome and shape (4096, 1536) for Borzoi; importantly, achieving an output resolution below 128 bp requires up-sampling the latent sequence. Unlike the other models, which use only DNA bases, EVO-2 employs a complete alphabet, providing information on non-repetitive ("ACGT") and repeat-masked ("acgt") sequences, as well as "N" bases, contig delimiters ("#"), and species delimiters ("@"). The model uses an nn.Embedding subclass to represent each UTF-8 character as an 8192-dimensional vector, which serves as the input to the attention/Hyena stack. GENA-LM employs an intermediate approach, collapsing common k-mers into tokens via byte-pair encoding with a vocabulary size of 32,000, followed by nn.Embedding on the resulting token set. This approach sets the window size in token space (4,096 tokens); however, 4,096 tokens may correspond to a different number of nucleotides (the input length is "about" 36 kb) depending on k-mer frequencies in the training set.
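The following is a minimal PyTorch sketch of the one-hot-plus-convolutional-stack encoding style used by Borzoi and AlphaGenome, with seven pooling blocks that reduce a 524 kb one-hot input to 4,096 bins of 128 bp with 1,536 channels. The filter progression, kernel sizes, and activations are illustrative rather than the published hyperparameters.

```python
import torch
import torch.nn as nn

# Illustrative "downres" encoder: one-hot DNA -> stacked Conv1d + pooling
# blocks, each halving the sequence length, ending at 128 bp bins with 1536
# channels. Channel progression and kernel size are placeholders.

class ConvEncoder(nn.Module):
    def __init__(self, channels=(4, 192, 384, 576, 768, 1024, 1280, 1536)):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(c_in, c_out, kernel_size=5, padding=2),
                nn.GELU(),
                nn.MaxPool1d(kernel_size=2),   # each block halves the length
            )
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )

    def forward(self, x):                       # x: (batch, 4, length)
        for block in self.blocks:
            x = block(x)
        return x                                # (batch, 1536, length / 128)

with torch.no_grad():
    seq = torch.zeros(1, 4, 524_288)            # Borzoi-sized input window
    seq[0, torch.randint(0, 4, (524_288,)), torch.arange(524_288)] = 1.0
    latent = ConvEncoder()(seq)
print(latent.shape)                             # torch.Size([1, 1536, 4096])
```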
The "final" sequence representations (immediately before attention) are the outputs of the convolutional stacks for Borzoi and AlphaGenome (at an effective resolution of 128 bp) and the character/token embeddings for EVO-2 and GENA-LM (at an effective resolution of 1 bp or 1 token, respectively).
Output
As a DNA sequence predictor, EVO-2 has a single output, the sequence itself; the remaining models have multiple outputs, each with its own specialized "head." Typically, an "output head" is a small, dedicated network that specializes the latent space for a specific output; this is almost the case for GENA-LM, which attaches small task-specific heads but fine-tunes the entire network (not just the output head) for each task. Borzoi and AlphaGenome, by contrast, train every output head concurrently with the entire network. Both AlphaGenome and Borzoi employ a "trick": each output track corresponds to one of the filters of a 1-D convolutional output (with separate filter banks for human and mouse, as the tracks remain separate). This approach provides a rich mapping (1536 x conv_size), though less rich than a fully connected network (4096 x 1536).
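A minimal sketch of this per-filter output head follows, assuming a width-1 convolution whose output channels are the predicted tracks and separate heads for human and mouse; the track counts and the softplus non-linearity are illustrative assumptions, not the published configurations.

```python
import torch
import torch.nn as nn

# Per-filter output head: a width-1 Conv1d applied to the upsampled latent,
# where each output channel is one genomic track. Track counts are placeholders,
# and separate heads keep human and mouse tracks distinct.

class TrackHead(nn.Module):
    def __init__(self, latent_dim=1536, n_tracks=6000):
        super().__init__()
        self.proj = nn.Conv1d(latent_dim, n_tracks, kernel_size=1)

    def forward(self, latent):
        # latent: (batch, latent_dim, n_bins) -> (batch, n_tracks, n_bins)
        return nn.functional.softplus(self.proj(latent))

human_head = TrackHead(n_tracks=6000)     # placeholder track counts
mouse_head = TrackHead(n_tracks=2000)
latent = torch.randn(1, 1536, 4096)
print(human_head(latent).shape)           # torch.Size([1, 6000, 4096])
```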
To align with genomic track resolution, AlphaGenome and Borzoi require up-sampling, performed with a "U-Net" design (see Architecture) that increases the length of the latent sequence from 4,096 bins (128 bp resolution) to 6,384 (32 bp resolution). AlphaGenome includes five additional up-sampling layers to align with the raw input at 1 bp resolution.
In addition to tracks, AlphaGenome adapts the latent space to predict chromatin contacts: the final embeddings produce an auxiliary pairwise (matrix) feature, aligned with the contact map resolution, that feeds a dedicated contact-map output head. This is the major addition over Borzoi (specifics are discussed in the Architecture section).
Architecture
Attention is expensive. All of these models have a stack of transformer layers with some version of multi-head attention at their core; however, different architectural strategies mitigate the transformer's expense. Borzoi and AlphaGenome reduce the resolution so that attention operates over fewer positions, while GENA-LM employs a smaller input space and byte-pair encoding to the same effect. EVO-2 introduces a novel set of plug-in kernels ("short explicit," "medium regularized," and "long implicit") as replacements for attention; of the 50 layers in the 1B model's transformer stack, 8 are attention layers (Borzoi also employs 8, AlphaGenome 9, and GENA-LM 12 [bigbird-base-t2t]). The specifics of EVO-2's approach ("StripedHyena 2") merit comparison with standard attention-based transformers and will be detailed in a future post on kernel alternatives to attention (Hyena's state-space model being one such example). Borzoi and AlphaGenome, after applying attention at 128-bp resolution, must then up-scale to higher resolutions for genomic track predictions: Borzoi up to 32 bp, and AlphaGenome back to 1 bp. Both models employ the U-Net architecture, in which intermediates from the down-resolution (downscaling) stack are added to the inputs of the up-resolution (upscaling) stack; this occurs twice for Borzoi to reach a 32-bp final resolution and seven times for AlphaGenome to reach 1-bp resolution.
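Below is a minimal sketch of this U-Net-style skip-connection pattern: the latent is upsampled 2x per step and the matched-resolution intermediate from the downscaling stack is added back in. Channel counts, the skip projection, and the activation are assumptions for illustration, not either model's exact design.

```python
import torch
import torch.nn as nn

# U-Net-style up-resolution step: upsample the latent 2x, add the projected
# skip tensor from the matching down-resolution stage, then refine with a
# convolution. Channel sizes and layer choices are illustrative.

class UpresBlock(nn.Module):
    def __init__(self, channels=1536):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.skip_proj = nn.Conv1d(channels, channels, kernel_size=1)
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x, skip):
        x = self.up(x) + self.skip_proj(skip)   # add matched-resolution residual
        return torch.relu(self.conv(x))

# Two steps take 128 bp bins to 32 bp bins (Borzoi-style); seven steps would
# return all the way to 1 bp (AlphaGenome-style).
with torch.no_grad():
    x = torch.randn(1, 1536, 4096)                        # 128 bp resolution
    skips = [torch.randn(1, 1536, 8192), torch.randn(1, 1536, 16384)]
    for block, skip in zip([UpresBlock(), UpresBlock()], skips):
        x = block(x, skip)
print(x.shape)                                            # torch.Size([1, 1536, 16384])
```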
Two significant differences exist between Borzoi and AlphaGenome in the construction of the U-Net. First, while both models take the convolutional outputs immediately before max pooling, Borzoi inserts an extra convolutional block that adds modified residuals to the upsampled representation, since the channel counts would not match otherwise. Second, AlphaGenome includes residuals throughout all convolutions [much like Pad(x, …) + Conv1D(x)], whereas the Borzoi U-Net does not employ these residual connections. Finally, AlphaGenome averages the latent values into 2048-bp bins (AveragePool) within the transformer blocks, then learns query and key matrices (attention analogues) of shape (512, 32, 128). Their outer sum forms a (512, 512, 32, 128)-shaped tensor, further refined by row attention and a one-hidden-layer MLP (fully connected network). The contact-specific output head then symmetrizes the tensor and passes it through a small network to predict chromatin contacts.
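The toy-sized sketch below walks through that pairwise construction: average-pool the latent into coarser bins, project to query- and key-like tensors, take their outer sum, and symmetrize. The quoted AlphaGenome shapes are (512, 32, 128) per position and (512, 512, 32, 128) pairwise; the dimensions here are shrunk so the example runs cheaply, and the specific layers are assumptions.

```python
import torch
import torch.nn as nn

# Toy pairwise (contact-map-like) feature: pool the latent into coarse bins,
# project to query/key analogues, outer-sum over the two sequence axes, and
# symmetrize. All sizes are scaled-down stand-ins for the quoted shapes.

bins, channels, heads, dim = 64, 256, 4, 16
latent = torch.randn(1, bins * 16, channels)                        # finer-resolution latent
pooled = nn.AvgPool1d(16)(latent.transpose(1, 2)).transpose(1, 2)   # (1, bins, channels)

to_q = nn.Linear(channels, heads * dim)
to_k = nn.Linear(channels, heads * dim)
q = to_q(pooled).view(1, bins, heads, dim)
k = to_k(pooled).view(1, bins, heads, dim)

pairwise = q.unsqueeze(2) + k.unsqueeze(1)               # (1, bins, bins, heads, dim)
symmetric = 0.5 * (pairwise + pairwise.transpose(1, 2))  # contact maps are symmetric
print(symmetric.shape)                                    # torch.Size([1, 64, 64, 4, 16])
```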
As a supplementary note, while the field appears to have gone "all-in" on attention, one paper published earlier this year revisits convolution-only frameworks for masked-language-model pretraining and task-specific fine-tuning, demonstrating that convolution-only models of similar parameter size to (last-generation) sequence transformers achieve superior performance using a stack of purely gated convolutional layers. An emerging body of literature also exists on the advantages of deep models in general (particularly for predicting perturbations) and on whether existing performance metrics suit such comparisons.
| Architecture | GENA-LM (bigbird-base-t2t) | Borzoi | AlphaGenome | EVO-2 |
|---|---|---|---|---|
| Encoding | BPE + nn.Embedding | 7x CNN-1D | 8x CNN-1D | UTF-8 character embedding |
| Transformer Block | 12x self-attn | 8x self-attn | 9x self-attn | StripedHyena 2 |
| Upscaling | None | U-Net | U-Net | None |
| Output Heads | Per-task fine-tuning | 2 CNN (filters = tracks) | 2 CNN, 2 splice-junction, 2 contact-map | 1 (next-token logits) |
Training
GENA-LM undergoes the simplest training: masked language modeling (cross-entropy on masked tokens) with standard AdamW-based gradient descent and learning-rate decay; special-cased fine-tunings use mean squared error (MSE; continuous outputs) or cross-entropy (binomial/multinomial outputs) with the same optimizer. Depending on the specific model, training data comprised human (T2T reference + 1000 Genomes extension), yeast, Drosophila, Arabidopsis, or all genomes ("Multispecies").
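As a concrete (toy-scale) illustration of this masked-language-model objective, the sketch below masks roughly 15% of BPE tokens and computes cross-entropy only on the masked positions; the encoder depth, hidden size, masking rate, and mask token ID are placeholders rather than GENA-LM's actual configuration.

```python
import torch
import torch.nn as nn

# Toy masked-language-model step: mask random tokens, predict them with a
# small transformer encoder, and take cross-entropy on masked positions only.
# All sizes (hidden dim, depth, masking rate, mask ID) are placeholders.

vocab_size, mask_id, seq_len = 32_000, 4, 512
tokens = torch.randint(5, vocab_size, (2, seq_len))          # (batch, seq)
mask = torch.rand(tokens.shape) < 0.15
inputs = tokens.masked_fill(mask, mask_id)

embed = nn.Embedding(vocab_size, 256)
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
to_vocab = nn.Linear(256, vocab_size)

logits = to_vocab(encoder(embed(inputs)))                    # (batch, seq, vocab)
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
loss.backward()                                              # an AdamW step would follow
print(float(loss))
```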
Borzoi and AlphaGenome take similar approaches at a high level: both build and employ ensembles of trained networks (Borzoi: 4; AlphaGenome: 64) to generate a final model (Borzoi: simple ensemble; AlphaGenome: distillation). During initial training, AlphaGenome (but not Borzoi) augments data via random shifts and/or reverse complementation. For tracks (expression or chromatin), AlphaGenome and Borzoi treat total interval coverage as a Poisson distribution and the conditional assignment of coverage to bins as a multinomial distribution; the loss is the weighted negative log-likelihood, with a 5-fold weight on the multinomial ("shape") term. AlphaGenome adds output types with their own losses: genomic contacts (MSE), splice site usage (cross-entropy), and splice junction counts (multi-term: acceptor cross-entropy, donor cross-entropy, donor count Poisson, acceptor count Poisson). AlphaGenome distillation employs an identical student architecture, distilled from 64 pre-trained teachers, with additional data augmentation via in-silico mutagenesis (single-nucleotide variants, insertions, deletions). Borzoi employed vanilla Adam optimization, trained to a validation plateau ("~25 days"), while AlphaGenome used AdamW with a learning-rate warm-up for 5K steps and cosine decay for 10K steps (15K steps total). Training for both models employed ENCODE, GTEx, FANTOM5, and CATlas datasets; AlphaGenome added data from the 4D Nucleome portal.
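A minimal sketch of that Poisson-plus-multinomial coverage loss follows; the epsilon handling, reductions, and test tensors are illustrative, and constant terms of both likelihoods are dropped.

```python
import torch

# Coverage loss as described above: a Poisson term on the total counts over
# the interval plus a 5x-weighted multinomial term on the per-bin "shape".
# Epsilons and reductions are placeholders; likelihood constants are dropped.

def coverage_loss(pred, target, shape_weight=5.0, eps=1e-6):
    """pred, target: (batch, tracks, bins) of non-negative coverage values."""
    pred_total = pred.sum(dim=-1)                     # (batch, tracks)
    target_total = target.sum(dim=-1)

    # Poisson negative log-likelihood on the interval totals (up to constants).
    poisson_nll = pred_total - target_total * torch.log(pred_total + eps)

    # Multinomial negative log-likelihood on how counts distribute over bins.
    pred_probs = pred / (pred_total.unsqueeze(-1) + eps)
    multinomial_nll = -(target * torch.log(pred_probs + eps)).sum(dim=-1)

    return (poisson_nll + shape_weight * multinomial_nll).mean()

pred = torch.rand(2, 8, 4096)
target = torch.rand(2, 8, 4096) * 3
print(coverage_loss(pred, target))
```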
EVO-2 trains on all available genomic assemblies (subject to quality constraints; "OpenGenome2") and employs simple cross-entropy loss (0.1 weight on repeats, 0 weight on special characters, including phylogenetic tags) using AdamW and a cosine learning-rate schedule. Training proceeds in three stages (for the 40B model): 6.6 trillion tokens at a context length of 1,024 tokens; 1.1 trillion tokens at a context length of 8,192 tokens; and 200 billion tokens at lengths of 131.1K, 262.1K, and 1,048.6K tokens.
Given EVO-2's "decoder-only" architecture, this extension pre-pads the attention weight matrices with zeros to the new context length (Hyena kernels remain implicit and scale automatically), which supports the computation of "initial" keys, queries, and values from the previous context length, with the gradients then modifying the weights to incorporate new positions. Positional embedding remains a potential hitch, with the retention of the same encoded distance between elements impossible– a difference of 20 bp becomes represented differently after increasing the context window than originally, due to re-scaling and altering the base frequency of RoPE (Rotary Positional Embeddings).
Evaluations and Benchmarks
Given that AlphaGenome (a) increases context size, (b) performs data augmentation, (c) increases output resolution to 1 bp, (d) adds skip connections to the convolutions, (e) adds splicing and contact-map output modalities, and (f) undergoes longer training, it is no surprise that its extensions to Borzoi improve on the baseline across nearly all metrics. Matching Borzoi's resolution (32 bp) and output types (RNA and chromatin) yields modest gains (3-5% relative improvement); however, these translate into a substantial (25%) improvement in expression quantitative trait locus prediction and add splicing quantitative trait locus prediction (effectively unsupported by Borzoi). This results in an AUPRC of 0.57-0.66 for non-coding variant pathogenicity prediction (although PhyloP scores provide an AUPRC of > 0.95).
EVO-2 excels at variant pathogenicity, achieving state-of-the-art performance for coding non-single-nucleotide variants (indels) and non-coding single-nucleotide variants, as well as for predicting gene essentiality. Perhaps the most interesting benchmark in EVO-2 is the in-silico "needle in a haystack" test, which asks: how does the probability of predicting [SUFFIX] from the context [PREFIX] [SUFFIX] [BRIDGE] [PREFIX] change if [SUFFIX] is mutated in the context? The probability may shift significantly (high sensitivity to the prefix) or not at all (no sensitivity to the prefix), depending on the length of the intervening (random) bridge. Sensitivity is measured on an average logit scale (mutate each base, take the Euclidean distance between the mutated and unmutated logits, and average over all mutations and positions), with 0.8 as an empirically set cutoff (dependent on the number of possible tokens and overall confidence; not an absolute scale). Under this benchmark, at the 8,192 bp (pre-training) length, scores change most with short bridges (nearby shared prefix), but the decay remains modest. Paradoxically, for the longest sequences (1 Mbp), the longest bridges (shared prefixes furthest apart) have the largest impact on the change. Because the repetitive sequences present in nearly all genomes should induce a recency bias, this result is unexpected; the preprint does not address the paradox.
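The following toy sketch of the needle-in-a-haystack score assumes a hypothetical autoregressive `model.logits(text)` returning a (positions, vocab) array; it mutates each base of the in-context [SUFFIX] copy, recomputes the logits over the final [SUFFIX] region, and averages the Euclidean distances to the unmutated logits, mirroring the description above in spirit rather than reproducing the exact EVO-2 evaluation code.

```python
import numpy as np

# Toy needle-in-a-haystack sensitivity score, assuming a hypothetical
# autoregressive `model.logits(text)` that returns a (positions, vocab) array.

BASES = "ACGT"

def haystack_score(model, prefix, suffix, bridge):
    context = prefix + suffix + bridge + prefix
    ref = model.logits(context + suffix)[-len(suffix):]   # logits over the target suffix
    distances = []
    for pos in range(len(suffix)):
        for alt in BASES:
            if alt == suffix[pos]:
                continue
            mutated = suffix[:pos] + alt + suffix[pos + 1:]
            mut_ctx = prefix + mutated + bridge + prefix
            mut = model.logits(mut_ctx + suffix)[-len(suffix):]
            distances.append(np.linalg.norm(mut - ref, axis=-1).mean())
    return float(np.mean(distances))   # compared against the ~0.8 cutoff described above
```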
Finally, GENA-LM, though a family of models, benchmarks its bigbird-base-t2t version on species classification across various sequence lengths, revealing surprisingly robust performance in predicting species from arbitrary 1 kb sequences.
Limitations
Predicting epigenetic context from DNA sequence alone is a challenging task. Across all models, performance on chromatin immunoprecipitation (ChIP) targets (histone modifications or transcription factors) remains consistently the worst, which may partly reflect assay noise or suggest that context beyond DNA sequence alone is needed to improve predictions. One might argue that these models predict the average occupancy of modified histones or transcription factors (averaged over cell types, cell states, and time); however, substantial dynamics remain unaddressed by these models. Indeed, analysis from the Altemose lab presented at a recent conference revealed that differences in chromatin accessibility between haplotypes and cells from the same population can exceed 40%, yet inter-haplotype differences drop below 5% when averaged across the population. This inevitably leads to difficulties with genome generation: lacking epigenetic information in the model, attempts to use EVO-2 to generate "optimized" expression vectors for eukaryotic systems may generate sequences that are transcriptionally active under unpredictable conditions. Auxiliary models such as Borzoi can mitigate this, so long as the cell type is part of the training set. Finally, in all models, the functional information is learned by the decoders or output heads, which remain far simpler than the transformer trunk itself; we may be able to achieve gains simply by increasing the expressiveness of the output networks. More problematically, these models are not designed to, and cannot, predict functional changes caused by epigenetic modifications, meaning that predicting the functional consequences of, for example, open chromatin remains impossible.
Epigenetic Context: Dispensable or Essential?
Major branches of the biological foundation model literature (RNA foundation models such as GEM-1, the sequence-to-function foundation models featured here, and spatial models such as CONCH) "understand" the epigenomic layer only implicitly. A deep understanding of causal gene regulation remains the primary goal of these models, with the hope of enhancing target discovery through in-silico experimentation, potentially ushering in an era of fine-grained "nudge drug" treatments.
We learned 10-20 years ago in the induced pluripotent stem cell (iPSC) space that epigenetic state strongly determines induction response and capacity, leading to the (now commonplace) inclusion of epigenetic drugs in differentiation media. If massive transcription-factor-cocktail-induced perturbations depend on epigenetic state, then we should expect finer-grained perturbations to show the same dependence. Unfortunately, massive perturbation atlases, such as Xaira's X-Atlas/Orion, measure only RNA as a modality. If perturbation response is conditional on cell type or state (as it almost certainly is), then epigenetic context remains essential for in-silico models, as it is the only means to generalize to unseen cell types or states. The significantly stronger performance of SCARF (an RNA + ATAC foundation model) compared with other scRNA foundation models provides some evidence in this direction, even though SCARF is limited to open chromatin alone.
Multi-modality represents the future of foundation models, with epigenetics as the key modality for modeling gene expression responses: not merely what genes a cell expresses, but the mechanisms that induce and retain their activation or silencing. A significant limitation is the lack of data linking modalities together to support model construction. Paired-Tag, the only technology that profiles RNA and epigenetics jointly, can feed into these essential models; our current work aims to extend these modalities to incorporate DNA methylation and multiple DNA-binding modalities simultaneously, facilitating the construction of future systems-biology foundation models.
Stay tuned for our next post, focusing on foundation models of DNA methylation.