How to Read this Book

The book is accompanied by scripts in both the R and Python programming languages. I had to make some choices—some of the biological data repositories have great integrated Perl and R packages. I wouldn’t want to force people into Perl (especially not myself!). I am more comfortable wrangling the initial data in R than in Python, so here we are.

If you want to code along, rest assured you can run most of this on a MacBook. Maybe you’ll need to do a training run overnight a few times. If you want a bit more performance, or do not want your MacBook to turn into a space heater for 24 hours, you can use Google Colab, a notebook environment with reasonably priced access to A100 GPUs. Training the DNABERT model we outline in Chapter 2 on 500k coding sequences from 13 species took about 6 hours on an A100 on Colab, which cost me roughly $4 in Colab credit. In the chapters that explicitly discuss scaling training up, we’ll cover how to set up a training run on a server with any number of high-powered datacentre GPUs.
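If you want to check what hardware you’ll actually be training on before committing to an overnight run, a minimal sketch like the one below (assuming a PyTorch setup, as in the later chapters) picks the best available device: CUDA on Colab or a server, Apple’s MPS backend on a MacBook, and plain CPU otherwise.

```python
import torch

# Pick the best available device: CUDA on Colab or a server,
# Apple's Metal backend (MPS) on a MacBook, plain CPU otherwise.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Training on: {device}")
```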

Keeping things manageable means we’ll be working with small models on relatively small, or narrow, datasets. At times we’ll build prototype or toy models that lack certain features, purely to further our understanding of a more complex model. I can’t emphasize enough that, so far, none of the models trained in the making of this book are state of the art, or trained on sufficient data to be used for actual biological understanding. I am seriously weighing, pending someone donating the compute time needed, adding “let’s put all the parts together” chapters at the end of the sections about DNA and proteins, where I train one, still modestly sized, state-of-the-art model integrating the lessons on data, architecture, scaling, and evaluation into the kind of project that would form the basis for an academic paper, or an initial prototype in the biomedical industry.

The GitHub repo that hosts the book will be populated with all the scripts I discuss and use. The data used to train the models, and some of the models themselves, will be hosted on Hugging Face (a platform that hosts ML datasets and models). I will try to make Jupyter notebooks available, though given my R background, I usually run Python from a script, or interactively in a REPL, because that is what R people do.
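As a quick taste of what pulling that data down looks like, here is a minimal sketch using the Hugging Face `datasets` library; the repository name below is a placeholder, and the real dataset paths are given in the relevant chapters.

```python
from datasets import load_dataset

# Placeholder repository name -- substitute the dataset path
# given in the relevant chapter (or in the book's GitHub repo).
dataset = load_dataset("your-username/dna-coding-sequences", split="train")

print(dataset)     # summary: features and number of rows
print(dataset[0])  # first record
```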

If you come at this from an R background, and Python isn’t your native language, I can highly recommend using Positron as an IDE when following along with this book. It’s a VSCode derivative developed by Posit, the company behind RStudio. Positron has integrated tooling for data science in both Python and R, and I can switch between Python and R sessions instantly!

Structure

There were two potential ways to structure the book. I could have structured it around elements of, or classes of, deep-learning models (attention, tokenization, embedding, diffusion), or I could have structured it around biological themes (DNA, RNA, proteins). I opted for the second structure, but that doesn’t mean I don’t cover ML/AI topics extensively. Tokenization is discussed at length in 2  Training our first DNA Language Model, attention is covered in detail in 10  Protein contact maps from attention maps, and diffusion is covered fully in 11  Integrated protein diffusion language models.

The book is divided into sections that deal with a specific biological modality or data type. The last chapter in each section is a review of current models for that modality or data type. It’s more common to begin a chapter by reviewing what’s already out there, but given the novelty of these models, it makes sense to learn how they work before reading a review of the field. There is a risk of the reader attributing insights to me simply because I describe them first. I’ll cite my sources as if this were a peer-reviewed article, and you should assume most models we build together are directly, or indirectly, influenced by the literature. I also ask that you do not cite this book other than for novel content, or to refer to it as teaching material; for models, architectures, and insights, please cite the underlying empirical literature.

DNA Language Models

1  Preparing DNA data for training covers downloading and processing (DNA) sequence data from Ensembl and uploading it to Hugging Face. 2  Training our first DNA Language Model covers training a first small DNA sequence language model; the model is a bog-standard language model meant for natural languages, simply applied to DNA. In 3  Evaluating DNA Language Models, we explore how you’d evaluate whether a DNA model is any good: is our model learning anything at all? Then in 4  Evolution-Aware Encoders, we explore evolution-aware encoders and how they relate to DNA language models. In 5  Comparing Models we compare the two models we trained on a number of tasks, getting a feel for comparative evaluation. If you stick with it and get to 6  A Review of Current DNA Language Models, you are ready for a brief review of existing DNA language models.

Scaling Training

After the book section on DNA models, we step back and consider scaling up model training. To train a full “production” model you’d need to scale from running things interactively on a MacBook, to a single GPU in the cloud, to 8 GPUs in a server. Conditional on me getting some funds and/or arranging HPC compute access, I might even write about (and run) training on a whole rack of servers, each with 1-8 GPUs. When scaling up we are confronted with a whole host of new issues around training stability and parallel compute.
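To give a flavour of what that entails in code, below is a minimal sketch of multi-GPU data parallelism with PyTorch’s DistributedDataParallel, the kind of boilerplate that section will unpack properly. The tiny linear layer is just a stand-in for a real model, and the script assumes it is launched with torchrun on a machine with multiple GPUs.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launched with e.g. `torchrun --nproc_per_node=8 train.py`;
# torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each process.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# A tiny stand-in model; each process holds a replica on its own GPU,
# and gradients are averaged across processes after every backward pass.
model = torch.nn.Linear(128, 2).to(local_rank)
model = DDP(model, device_ids=[local_rank])

# ...the usual training loop follows, with each process reading its own
# shard of the data (e.g. via a DistributedSampler).

dist.destroy_process_group()
```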

Protein Language Models

In 7  Proteins: from sequence to structure we discuss the advent of protein language models (initially deep learning models rather than language models, though they share the key attention mechanism) and their incredible success at solving a core problem in biology: predicting the shape of a folded protein from the protein sequence. Their success won the Google DeepMind team a Nobel Prize, rightfully so. In 8  Selecting and curating protein sequences we discuss one of the key ingredients of Google’s success: the ubiquitous (taxpayer-funded) access to protein (structure) data. Not only are sequences accessible, but there are also bi-annual drops of protein sequence and structure experiments that haven’t been shared yet, creating a perfect holdout set which ensures healthy competition, no risk of over-fitting, and a lush test bed for protein models. Then in 9  Training our first Protein Language Model we train a small protein language model, mostly just to get acquainted with the code. Later, in the “putting it all together” chapter, we’ll train a reasonably sized protein language model on a balanced data mixture.