6  A Review of Current DNA Language Models

Abstract

This is not an academic review; those are obviously far more comprehensive. See Benegas et al. (2025) for a review I really liked. This chapter provides a brief overview of key innovations and breakthrough models published over the last few years and discusses key challenges and next steps for the field. We also briefly acknowledge a balanced critique of what DNA language models actually do, namely whether they memorize sequences or learn fundamental rules/laws of biology; see (Consens et al. 2025) and (Hassan et al. 2025).

6.1 Modeling Paradigms

As we learned in the first 5 chapters, gLMs (genomic Language Models) borrow from natural language processing by treating DNA sequences as “text” composed of four characters (A, C, G, T). Early models such as DNABERT used k-mer tokenization (e.g., overlapping 3-mers or 6-mers), while newer approaches experiment with single-nucleotide and subword tokenizations to better capture biological semantics.
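As a concrete illustration, below is a minimal sketch (plain Python, no dependencies) of the two tokenization styles mentioned above: overlapping k-mer tokens in the DNABERT style, and single-nucleotide tokens as used by models like HyenaDNA. The function names and the toy sequence are mine, and real tokenizers add special tokens and vocabulary handling that is omitted here.

```python
def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Overlapping k-mer tokenization (DNABERT-style): one token per k-base window."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def nucleotide_tokenize(seq: str) -> list[str]:
    """Single-nucleotide tokenization (HyenaDNA-style): one token per base."""
    return list(seq)


seq = "ACGTACGGTC"
print(kmer_tokenize(seq, k=6))   # ['ACGTAC', 'CGTACG', 'GTACGG', 'TACGGT', 'ACGGTC']
print(nucleotide_tokenize(seq))  # ['A', 'C', 'G', 'T', 'A', 'C', 'G', 'G', 'T', 'C']
```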

6.2 Architectural Innovations

There are two specific architectural innovations in DNA language models worth discussing. The first is GPN-MSA, the core idea behind a model we discussed in Chapter 4. In this model, the trainable embedding layer is replaced with a biologically informed, deterministic embedding that reflects the evolutionary history of the genome at a given base.
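To make this idea tangible, here is a minimal sketch of how a deterministic, alignment-derived embedding could look, using per-position nucleotide frequencies from an MSA column. This is an illustration of the concept only, under my own simplifying assumptions; it is not GPN-MSA's exact embedding scheme.

```python
import numpy as np

BASES = "ACGT-"  # include the gap character from the alignment


def msa_column_embedding(msa_column: list[str]) -> np.ndarray:
    """Deterministic embedding for one genomic position: the empirical frequency
    of each base (and gap) across aligned species. Nothing here is learned;
    evolutionary information enters the model directly through this vector."""
    counts = np.array([sum(b == base for b in msa_column) for base in BASES], dtype=float)
    return counts / counts.sum()


# Toy alignment column: human, chimp, mouse, dog, chicken at one position
column = ["A", "A", "A", "G", "-"]
print(msa_column_embedding(column))  # [0.6, 0.0, 0.2, 0.0, 0.2]
```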

A second innovation was necessitated by the need to model long-range dependence in DNA (changes to DNA can have effects over thousands of bases “downstream”). While transformer-based models initially dominated the field, their quadratic scaling with sequence length has prompted the development of more efficient architectures. Models such as HyenaDNA extend context lengths up to 1 million tokens at single-nucleotide resolution, and hybrid architectures like HybriDNA combine transformers with selective state-space models (Mamba2) to process sequences up to 131 kilobases. Omni-DNA and GENERator further illustrate the trend toward unified, cross-modal genomic foundation models capable of multitask learning.
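To see why quadratic scaling matters at genomic context lengths, the back-of-the-envelope sketch below estimates the memory footprint of a single dense attention matrix at different sequence lengths. The numbers are illustrative only and ignore many real-world details (multiple heads and layers, FlashAttention-style kernels that avoid materializing the full matrix, and so on).

```python
def attention_matrix_gib(seq_len: int, bytes_per_value: int = 2) -> float:
    """Memory for one dense seq_len x seq_len attention matrix (one head, one layer),
    assuming 16-bit values."""
    return seq_len ** 2 * bytes_per_value / 1024 ** 3


for length in (4_096, 131_072, 1_000_000):
    print(f"{length:>9} tokens -> {attention_matrix_gib(length):10.2f} GiB")
# ~0.03 GiB at 4k tokens, ~32 GiB at 131k, ~1863 GiB at 1M:
# this quadratic growth is what motivates Hyena- and Mamba-style architectures.
```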

Below is a summary table of several prominent DNA language models with innovative architectures, along with links to their corresponding paper and GitHub (or related resource) repositories:

| Model | Description | Paper Link | GitHub / Resource Link |
|---|---|---|---|
| DNABERT | A transformer-based model that learns bidirectional representations from DNA sequences using k-mer tokenization. | DNABERT paper | GitHub |
| Nucleotide Transformer (NT-v2) | A large transformer pretrained on large-scale human and multi-species genomic data to learn robust DNA representations for various downstream tasks. | Nucleotide Transformer v2 | GitHub / Hugging Face |
| GPN | The Genomic Pre-trained Network, which leverages unsupervised DNA language modeling to predict genome-wide variant effects. | GPN paper (PNAS 2023) | GitHub |
| GPN-MSA | A Genomic Pre-trained Network variant that leverages multiple sequence alignments across species to build evolution-aware token embeddings. | GPN-MSA preprint | GitHub |
| HyenaDNA | A long-range genomic language model operating at single-nucleotide resolution, using Hyena's implicit convolutional approach to overcome transformer scaling issues. | HyenaDNA (arXiv) | GitHub |
| HybriDNA | A hybrid model combining Transformer and Mamba2 (state-space) blocks for efficient long-range DNA modeling. | HybriDNA (arXiv) | |
| Omni-DNA | A unified genomic foundation model that supports cross-modal and multi-task learning across a wide range of genomic applications. | Omni-DNA (arXiv) | Hugging Face collection |
| GENERator | A long-context generative genomic foundation model designed for sequence generation and optimization tasks, with a context length of up to 98 kb. | GENERator (arXiv) | GitHub |
| Evo 2 | A state-of-the-art DNA language model (available in 1B, 7B, and 40B sizes) for long-context modeling and design. Evo 2 models DNA sequences at single-nucleotide resolution with up to 1 million base pairs of context using the StripedHyena 2 architecture, was pretrained using the Savanna framework, and was trained autoregressively on OpenGenome2, a dataset containing 8.8 trillion tokens from all domains of life. | Preprint | GitHub / Hugging Face collection |

This table highlights each model’s core features and provides direct access to the publication and code repository (or resource page) where available.

6.3 Applications

These models have demonstrated state-of-the-art performance across multiple downstream tasks, including:

- Variant Effect Prediction: unsupervised approaches can predict deleterious mutations by modeling the “grammar” of the genome (see the sketch after this list).
- Regulatory Element Identification: by learning long-range interactions, gLMs help detect promoters, enhancers, and other regulatory motifs.
- Sequence Generation and Protein Tasks: some models generate synthetic regulatory sequences or transform coding DNA into meaningful protein representations, bridging genomics and proteomics.
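As an example of the first task, unsupervised variant effect prediction is often framed as a log-likelihood ratio between the alternate and the reference allele under the model. The sketch below assumes you already have per-position probabilities over {A, C, G, T} from some gLM (here just a toy array); the scoring function is generic and not tied to any particular model's API.

```python
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def variant_llr(probs: np.ndarray, pos: int, ref: str, alt: str) -> float:
    """Log-likelihood ratio log P(alt) - log P(ref) at one position.
    probs: (L, 4) array of per-position base probabilities from a gLM.
    Strongly negative scores mean the alternate allele is disfavored by the
    learned sequence 'grammar', i.e. potentially deleterious."""
    p = probs[pos]
    return float(np.log(p[BASE_INDEX[alt]]) - np.log(p[BASE_INDEX[ref]]))


# Toy example: the model is confident the base at position 2 should be 'G'
toy_probs = np.full((5, 4), 0.25)
toy_probs[2] = [0.05, 0.05, 0.85, 0.05]
print(variant_llr(toy_probs, pos=2, ref="G", alt="T"))  # ~ -2.83
```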

6.4 Challenges and Future Directions

There are some interesting immediate challenges that become apparent from the literature. One very obvious one is a better grasp of the relationship between training data (multi-species sequences vs. intra-human variation) and model performance. The nucleotide transformer paper (Dalla-Torre et al. 2024) highlights how, as things stand, training on human sequences does not improve the model over multi-species training. It is obviously true that most bases are identical for most people, so the human training data has low variability, which might adversely impact the model. It could also be that, for the specific validation tests, evolutionary constraint conveys more information than the limited human variation in the 1000 Genomes data.

One specific avenue for exploration could therefore be to deeply consider the order in which training data is presented (first train across species, then train within humans) and the effect of the learning rate in specific segments of training. You could imagine training at a high learning rate on the high-variance multi-species data and then training at a lower learning rate on the low-variance human sequences, only modestly updating the model in that phase. Alternatively, you could consider “low-rank training”, where the model is first trained on multi-species data and then only fine-tuned on human data, restricting the degrees of freedom in that second phase through low-rank matrix approximation (LoRA) (Hu et al. 2021), which learns less, but also forgets less from previous training epochs (Biderman et al. 2024).
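To make the second idea concrete, here is a minimal sketch of what the low-rank second phase could look like with the Hugging Face peft library. The checkpoint name is a placeholder, the target_modules names depend on the backbone architecture, and the learning rates are arbitrary; treat this as an outline of the approach rather than a recipe.

```python
import torch
from transformers import AutoModelForMaskedLM
from peft import LoraConfig, get_peft_model

# Phase 1 (not shown): pretrain on multi-species data at a relatively high
# learning rate, updating all weights.

# Phase 2: adapt to intra-human variation with low-rank updates only.
backbone = AutoModelForMaskedLM.from_pretrained("some-org/dna-lm-base")  # placeholder checkpoint

lora_config = LoraConfig(
    r=8,                                # rank of the update matrices; small r = few degrees of freedom
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query", "value"],  # module names vary per architecture
)
model = get_peft_model(backbone, lora_config)
model.print_trainable_parameters()      # only the low-rank adapters are trainable

# A deliberately lower learning rate for the human-variation phase, so the
# model is only modestly updated and forgets less of what phase 1 learned.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```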

In any field that receives outsized attention (and I feel we can conclude AI is currently such a field) it is always critical to evaluate innovations. There is a growing literature around DNA language model evaluation that you should familiarize yourself with if you are going to evaluate these models for academic or industry use (Tang et al. 2024; Patel et al. 2024; Marin et al. 2023). For specific tasks it is good to continually evaluate whether DNA language models are overkill. Does your model outperform AlphaMissense, for which the scores are already available? How does it fare against older, and computationally likely lighter, supervised models like CADD (Schubach et al. 2024)? Don't just trust the original authors, they are biased (we all are); consider independent evaluations, for example (Ljungdahl et al. 2023).
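If you do run such a comparison, it helps to keep the evaluation boring and symmetric: the same labeled variants and the same metric for your gLM score and for the baselines (CADD, AlphaMissense, and so on). A minimal sketch with scikit-learn, using entirely made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical benchmark: 1 = pathogenic, 0 = benign (made-up labels)
labels = np.array([1, 1, 0, 1, 0, 0, 1, 0])

scores = {
    # gLM log-likelihood ratios, negated so that higher = more deleterious
    "my_gLM": -np.array([-3.1, -2.4, -0.2, -1.9, -0.5, -0.1, -2.8, -0.9]),
    # Precomputed scores from a lighter supervised baseline (made-up values)
    "baseline": np.array([25.0, 18.0, 4.0, 22.0, 30.0, 2.0, 27.0, 6.0]),
}

for name, score in scores.items():
    print(f"{name:8s} AUROC = {roc_auc_score(labels, score):.3f}")
# If the heavier model does not clearly beat the cheap baseline on the
# variants you care about, it may well be overkill for that task.
```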

6.4.1 Key challenge: Do these models memorize sequences, or learn biology?

Genomic language models have shown a remarkable ability to capture statistical patterns in DNA sequences, yet recent analyses underscore a fundamental limitation: these models often rely on memorizing recurring motifs rather than internalizing the complex regulatory “grammar” of the genome. Some critique focuses on the fact that simple evaluations might not distinguish between memorization and understanding or learning (Consens et al. 2025). In Chapter 4 we discussed language models that lean on Multiple Sequence Alignments, and while these arguably do very well at some tasks, they clearly memorize the values in the MSA. A recent critical analytical review tries to empirically illustrate that DNA language models more broadly are largely memorizing (Hassan et al. 2025), by assessing the models' relative performance on species that are represented well or poorly in the training datasets.

This tendency toward memorization becomes especially apparent when examining early gLM architectures such as DNABERT and GROVER. Research on DNABERT revealed that performance gains often stem from recalling frequent k-mer patterns in the training corpus, rather than from modeling the long-range dependencies essential for regulatory element function (Consens et al. 2025). Similarly, Consens et al. (2025) note that benchmarks relying on distinguishing real from randomly generated sequences are predominantly tests of motif frequency and fail to probe higher-order sequence logic.

Taken together, these critiques argue that merely increasing parameter counts or context windows will not suffice to endow DNA language models with a genuine understanding of genomic “laws.” Instead, both works call for integrating evolutionary constraints, structural information, and explicit functional representations into model design. Hassan et al. (2025) emphasize that overcoming inherent architectural limitations might require paradigms beyond autoregressive next-token prediction (the authors do not go into how CLM and MLM might differ in this respect), while Consens et al. (2025) urge the development of biologically meaningful benchmarks that can distinguish true comprehension from pattern memorization. Without such advances, DNA language models risk remaining powerful pattern matchers: valuable for certain tasks, but ultimately limited in their ability to unveil the deep regulatory principles critical for biomedical discovery.