What is this Book About?
This book is me trying to keep up with the current state of the art in ML for biology, with an initial focus on language models, though more accurately all models we discuss are “attention” models. The fundamental goal of these models is to learn the dependencies (the joint probability distribution) between elements in a sequence (a DNA or protein sequence), a 2D field (an image or a protein contact map), or a 3D structure (the full protein structure, for example).
When studying the latest hyped tools, it’s good to resist the temptation to be awed by the models. Some are great, and it can feel magical to see an unsupervised model pick up important biological signals and become highly predictive just by processing sequence data! However, the current state of the art in biology, and genetics especially, is remarkable: we know a lot about the genome, about how DNA is transcribed into RNA and translated into proteins, and about which proteins are conserved across evolution (and are often essential for life). So throughout, we have to keep in mind that while it may feel (and actually be) remarkable that a language model picks up fundamental biology just from processing data, in some domains it might not be state of the art or even close to it.
A Brief Glossary of ML Model Types
Supervised machine learning is a type of machine learning where a model learns to make predictions based on examples that come with known answers, or “labels.” In biology, this could mean training a model to predict whether a DNA sequence comes from a healthy or diseased tissue, or identifying which species a DNA sample belongs to. The model sees many examples where the input (the DNA sequence) is paired with the correct output (the label, like “healthy” or “diseased”), and learns to find patterns that link the two. Supervised learning is very powerful when we have lots of high-quality labeled data, but in biology, obtaining these labels can be expensive, time-consuming, and sometimes even impossible if we don’t know the “right answer” in advance.
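To make this concrete, here is a minimal sketch of the supervised setup described above: inputs (DNA sequences) paired with known labels, a simple featurization, and a model fit to link the two. The sequences, the “healthy”/“diseased” labels, and the k-mer featurization are toy stand-ins I chose for illustration, not from any real dataset.

```python
# Minimal supervised-learning sketch: toy DNA sequences with known labels.
from collections import Counter
from itertools import product

from sklearn.linear_model import LogisticRegression

def kmer_counts(seq, k=3):
    """Represent a DNA sequence as counts of its overlapping k-mers."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts[km] for km in kmers]

# Each input sequence is paired with the correct output label.
sequences = ["ATGCGTACGTTAGC", "ATGCGTACGTAAGC", "GGGCCCGGGCCCAA", "GGGCCCGGGCCCTT"]
labels = ["healthy", "healthy", "diseased", "diseased"]

X = [kmer_counts(s) for s in sequences]
model = LogisticRegression(max_iter=1000).fit(X, labels)

# Predict the label of an unseen sequence (likely 'healthy' on this toy data).
print(model.predict([kmer_counts("ATGCGTACGTTAAC")]))
```

The model never sees biology directly; it only sees patterns (here, k-mer counts) that happen to correlate with the labels, which is why label quality matters so much.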
Unsupervised machine learning, in contrast, is used when we don’t have labels—the model only sees the raw data and has to find patterns on its own. This is especially useful in biology when exploring large datasets where the structure isn’t fully understood, such as grouping similar cells in single-cell RNA sequencing or discovering new subtypes of proteins. In the case of biological language models, the “language” is made up of sequences like DNA, RNA, or proteins. Unsupervised models, such as transformers trained on genome sequences, learn the “grammar” and “vocabulary” of these biological molecules just by seeing lots of sequences, without being told what they mean. This allows them to uncover hidden rules of biology, like which sequences are likely to code for stable proteins or which mutations might disrupt function.
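A small sketch of the unsupervised case, using the single-cell example: the model is given only a matrix of expression values, no labels, and has to find groups on its own. The tiny expression matrix is synthetic, and a real single-cell pipeline would normalize and reduce dimensionality before clustering.

```python
# Minimal unsupervised-learning sketch: cluster cells by expression, no labels given.
import numpy as np
from sklearn.cluster import KMeans

# Rows are cells, columns are genes (toy counts).
expression = np.array([
    [10, 0, 1, 12],   # cells 0-1 have similar profiles
    [11, 1, 0, 10],
    [0, 9, 12, 1],    # cells 2-3 have similar profiles
    [1, 10, 11, 0],
])

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(expression)
print(clusters)  # e.g. [0 0 1 1]: the grouping is discovered from the data alone
```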
Biological language models have become particularly important because DNA, RNA, and proteins all follow sequential, language-like patterns—just as words form sentences, nucleotides form genes, and amino acids form proteins. By training on vast amounts of biological data in an unsupervised way, these models can learn useful representations of biological sequences, even without human-provided labels. Researchers can then use these pretrained models for many downstream tasks, such as predicting gene function, identifying regulatory regions, or studying how genetic variation might affect disease—combining the power of unsupervised learning to understand biology’s “language” with supervised learning for more targeted, disease-specific predictions.
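The pretrain-then-reuse workflow can be sketched in a few lines. Here `embed_sequence` is a hypothetical stand-in for a pretrained biological language model that maps a sequence to a fixed-length vector (real models, and their APIs, differ); a small supervised classifier is then fit on top of those representations for a downstream task.

```python
# Sketch of the pretrain-then-reuse pattern: pretrained embedding + supervised head.
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed_sequence(seq):
    """Hypothetical placeholder for a pretrained model; here, just base frequencies."""
    return np.array([seq.count(b) / len(seq) for b in "ACGT"])

# Small labeled set for the downstream, supervised step (toy labels).
train_seqs = ["ATATATATGCGC", "ATATATATATAT", "GCGCGCGCGCGC", "GCGCGCATATGC"]
train_labels = [1, 1, 0, 0]  # e.g. 1 = putative regulatory region, 0 = background

X = np.stack([embed_sequence(s) for s in train_seqs])
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)
print(clf.predict([embed_sequence("ATATATGCATAT")]))
```

The point of the pattern is that the expensive, label-free pretraining is done once, while many cheap supervised heads can be trained on top of it for different questions.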
In some cases, the boundary between supervised and unsupervised learning is blurry, for example in protein language models trained to predict 3D structure from amino acid sequences. These models are not given simple “labels” like “healthy” or “diseased,” but they are provided with 3D structural information that acts as an open-ended target rather than a strict classification label. The model isn’t being asked to sort sequences into a few categories, but rather to learn a very rich and flexible relationship between sequence and structure. This kind of learning, where the system uses biological context to guide its training without explicit classification tasks, occupies a middle ground between supervised and unsupervised methods, illustrating how biological complexity often resists fitting into neat ML categories. A key difference between learning labels and learning open-ended structures is that learning labels is data reduction (from a complex 1D sequence to one of a few labels), while structure prediction is data expansion (from a 1D protein sequence to a full 3D spatial map of the molecule), as the sketch below illustrates.
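A tiny illustration of that reduction-versus-expansion point, with made-up sizes: a classification head collapses a length-L sequence into a handful of numbers, while structure prediction emits coordinates for every residue.

```python
# Output sizes for a toy protein of L residues: classification vs. structure prediction.
import numpy as np

L = 120                               # residues in a toy protein
class_output = np.zeros(2)            # e.g. probabilities over two labels: 2 numbers
structure_output = np.zeros((L, 3))   # x, y, z coordinates per residue: 3 * L numbers

print(class_output.size, structure_output.size)  # 2 vs 360
```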
In sequence analysis, there are also biologically-driven models that sit outside traditional machine learning entirely, or only use minimal regression or statistical modeling. For example, methods to predict whether a missense mutation (a single amino acid change) is deleterious often rely on biological theory, such as identifying mutation-depleted regions: parts of the genome or protein where mutations are rarely seen in healthy populations, suggesting that changes there are poorly tolerated. These models leverage evolutionary conservation, functional annotations, and biochemical properties to prioritize mutations for further study, sometimes incorporating simple regression to combine different biological signals into a final score. These biologically-informed approaches are critical in genomic medicine and show how biology itself can provide a strong prior for prediction, even without extensive machine learning.
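As a sketch of that last step, combining biological signals into a single score with simple regression might look like the following. The three features (a conservation score, a flag for sitting in a mutation-depleted region, a biochemical-change score) and the labels are synthetic illustrations, not real annotations or any published scoring method.

```python
# Combine biological signals into a deleteriousness score with logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [evolutionary conservation, in mutation-depleted region (0/1), biochemical change]
features = np.array([
    [0.95, 1, 0.8],   # highly conserved, constrained region, drastic change
    [0.90, 1, 0.6],
    [0.20, 0, 0.1],   # poorly conserved, tolerated region, mild change
    [0.10, 0, 0.3],
])
labels = np.array([1, 1, 0, 0])  # 1 = known deleterious, 0 = benign (toy labels)

scorer = LogisticRegression(max_iter=1000).fit(features, labels)

# Score a new missense variant: a probability-like deleteriousness score.
new_variant = np.array([[0.85, 1, 0.7]])
print(scorer.predict_proba(new_variant)[0, 1])
```

Here the machine learning is doing very little; the biology (conservation, constraint, biochemistry) carries almost all of the predictive signal, which is exactly the point of these approaches.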