Scale up Training
This chapter isn’t a part of the “DNA” section of the book, because the lessons are really quite general, but it comes after because we needed a little bit of experience with language model training before even considering training a serious model. This is also a somewhat awkward chapter for me to write, especially for the part of the readership that has a background in ML. See, I am a psychologist by training (though I have worked in genetic epidemiology for years and years), and while a lot of my academic work is fairly computational, I am not an expert in language model scaling by any means! Remember, the preamble to the book explains that this book is an account of me learning about biological language models and taking others along for the ride, not an authoritative text!
Don’t Try to Win the Compute Race
Among the DNA models I could find on Hugging Face are 7B parameter models like https://huggingface.co/genbio-ai/AIDO.DNA-7B. AIDO is trained on “256 H100 GPUs” in “8 days”. The training data consisted of 10.6 billion bases. That’s not even a particularly large model in the grand scheme of things, but if you consider a cost of approximately $2 per hour per H100, you are going to spend $100k. Obviously, there are academic compute resources you can get access to by appointment or based on fair use at your institute, university, or through collaborative national infrastructure, but even those are finite.
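To make that arithmetic explicit (the $2/hour figure is my rough assumption; real cloud and cluster prices vary):

# Back-of-the-envelope cost of the AIDO.DNA-7B training run described above
gpus = 256             # H100 GPUs
days = 8               # reported training time
usd_per_gpu_hour = 2   # rough assumption; varies by provider

gpu_hours = gpus * days * 24
cost = gpu_hours * usd_per_gpu_hour
print(f"{gpu_hours:,} GPU-hours ≈ ${cost:,}")   # 49,152 GPU-hours ≈ $98,304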
You have to consider feasibility. Today (March 2025), the Dutch national computer cluster for research (Snellius at SURF Sara) has 88 nodes with 4 H100 GPUs and 72 nodes with 4 A100 GPUs. TACC, the University of Texas at Austin compute provider, has approximately 80 A100 nodes (each with 3 GPUs). Those are two examples of reasonably well-funded HPC providers in academia. In my experience, you could get time reserved for your research at your local academic HPC provider at steep discounts, and these systems are likely large enough to train that 7B model I linked to. However, note how on either TACC or Snellius, 256 GPUs for 8 days would block the entire system for over a week. Perhaps you could apply for access to larger national research clusters, like Isambard-AI in the UK (being built in Bristol right now, a motivation for me to write this) which has 5,000 H100 GPUs. However, in general, it is likely you are going to be relatively limited by compute resources. Don’t be discouraged though—most breakthroughs are not going to be compute-based, and there are immense efficiency gains to be made that will level the playing field.
Focus on a specific question
It is tempting to train a model on all DNA in the known universe, and honestly, there is far more DNA out there than people have even started training on. The models discussed so far often train on the reference sequence/genome, a sort of modal or consensus genome, but individual people's genomes differ from that reference. You could consider thousands of species, or (tens or hundreds of) thousands of individual human genomes. That would require a lot of bioinformatics, which, if your background is in bio, might set you apart from other ML/AI researchers. You'd have to phase the genomes to untangle the maternal and paternal strands, and you'd have to decide whether to get rid of the reference entirely and build a personal genome for each individual, keep some reference, or use a graph genome. It's also worth considering whether your task really requires the whole genome. Are you performing gene-centric tasks (mutation consequence prediction, gene expression prediction, alternative splicing modeling)? If your specific tasks don't require the whole genome, why not consider training on coding sequences only, or on the sequences of genes plus a few thousand bases around them?
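As a toy illustration of that last option, here is a minimal sketch of gene-centric data selection; the chromosome sequence and gene coordinates are made-up placeholders, and in practice you would parse them from a FASTA file and a GFF/GTF annotation:

# Extract gene sequences plus a few thousand bases of flanking context
flank = 2000  # bases of context on either side of each gene

chromosomes = {"chr1": "ACGT" * 100_000}  # placeholder sequence (400 kb)
genes = [("chr1", 15_000, 18_500), ("chr1", 90_000, 96_250)]  # (chrom, start, end)

training_sequences = []
for chrom, start, end in genes:
    seq = chromosomes[chrom]
    window = seq[max(0, start - flank): min(len(seq), end + flank)]
    training_sequences.append(window)

print(len(training_sequences), "gene-centred training sequences")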
Smart Architectures
In Chapter 4, we studied smarter, DNA-specific model architectures. The GPN model inspired by Benegas et al. (2023) that we introduced can outperform a standard BERT after an hour of training on my 2022 MacBook Air (the BERT we compared our GPN-BERT against trained for approximately 8 hours on a strong GPU). That kind of efficiency gain may mean you can beat the 7B BERT-like model we used as a compute-cost example with a fraction of the compute! As briefly remarked in Chapter 6, researchers have also designed alternatives to the transformer module that expand the context window up to 1 million bases with far less compute than a transformer requires (Nguyen et al. 2023). If you design and run your own model, it will likely pay off to implement some of these architectural innovations.
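To give a flavour of what such an architectural change looks like in code, here is a minimal sketch of a dilated-convolution block in the spirit of GPN; it is not the exact model from Chapter 4 or from Benegas et al., just the core trick of growing the receptive field without attention:

import torch
import torch.nn as nn

class DilatedConvBlock(nn.Module):
    # One convolutional block with configurable dilation; stacking blocks with
    # exponentially growing dilations covers long ranges of sequence cheaply.
    def __init__(self, hidden_size, kernel_size=9, dilation=1):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2   # keep the sequence length unchanged
        self.conv = nn.Conv1d(hidden_size, hidden_size, kernel_size,
                              padding=padding, dilation=dilation)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x):                              # x: (batch, seq_len, hidden)
        residual = x
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm(torch.relu(x) + residual)

blocks = nn.Sequential(*[DilatedConvBlock(256, dilation=2 ** i) for i in range(6)])
out = blocks(torch.randn(4, 512, 256))                 # 4 sequences of 512 bases
print(out.shape)                                       # torch.Size([4, 512, 256])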
Get the most out of your GPU
There is a healthy culture of extreme optimization. A good early example is the paper "Cramming: Training a Language Model on a Single GPU in One Day" (Geiping and Goldstein 2022). Another neat example is this GitHub repo of repeated attempts at training a GPT-2 equivalent (the OG OpenAI model that sort of set the LLM hype cycle in motion) as fast as possible, now in under 3 minutes on 8 H100 GPUs (https://github.com/KellerJordan/modded-nanogpt). Some of the innovations people made cramming these models won't generalize to your model, but consider giving their Muon optimizer a go (for GPT-2 it's a serious efficiency gain), spend time finding optimal learning rates, or consider spending a few extra days/weeks cleaning data. Doing these things before your big run will save serious compute, which means you can push more data through the model in the same compute budget.
Optimization steps anyone should take
Optimization doesn’t have to be a full-time job though; there are some easy steps anyone can take to get more training out of the same hardware. Full writeups on simple optimizations are found here, here and here, but I’ll cover the low-hanging fruit right here. Follow these steps and you’ll likely get nearly the same results in half the compute.
Batch processing
Transformers are designed with batch processing in mind. The tensors flowing through the model carry an extra batch dimension (second or third, depending on the tensor) so the same weights can be applied to multiple sequences in parallel, and the training step is computed over all of them at once. We can very easily change the batch size in the training arguments:
training_args = TrainingArguments(...,
    per_device_train_batch_size=16,
    ...,)
If you increase the batch size too much, you’ll get a crash with an out-of-memory error:
RuntimeError: CUDA out of memory. Tried to allocate X MiB ...
Pick a power of 2 for the batch size. We do so because many elements of a GPU (tensor cores, shader units, CUDA cores) come in powers of 2, and if your batch is 17 and you happen to have 16 tensor cores (or whatever element in the stack), that means processing 16, then 1, then 16, and so on. You can keep increasing the batch size all the way until you run out of memory, but I wouldn’t. Training is bound by the limits of compute (as of writing; perhaps NVIDIA or AMD innovates rapidly in 2025 and this might change). So once you find a batch size that keeps the GPU at 100% utilization during training (you can check with the nvidia-smi command line tool, or rocm-smi for AMD GPUs), there is little to gain from pushing it further.
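If you’d rather not search by hand, you can automate the doubling with short trial runs. This is a rough sketch, not a tuned utility: it assumes you already have a model and a tokenized dataset in scope, a recent PyTorch (for torch.cuda.OutOfMemoryError), and it happily throws away the few training steps it runs while probing.

import torch
from transformers import Trainer, TrainingArguments

def largest_power_of_two_batch(model, dataset, start=8, limit=1024):
    # Double the per-device batch size until we hit an out-of-memory error,
    # then return the last size that ran a few steps successfully.
    best = None
    batch_size = start
    while batch_size <= limit:
        try:
            args = TrainingArguments(
                output_dir="batch_size_probe",
                per_device_train_batch_size=batch_size,
                max_steps=5,          # a handful of steps is enough to hit peak memory
                report_to=[],
            )
            Trainer(model=model, args=args, train_dataset=dataset).train()
            best = batch_size
            batch_size *= 2
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # free what we can before returning
            break
    return best

Remember to reload the model from its initial checkpoint afterwards, since the probe runs a few real optimizer steps.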
Lower numerical precision (quantize)
By default, numbers are stored in 32-bit floating point (fp32), which effectively means you have 6-9 significant digits, and a number can be zero or can range from \(-3.4\times10^{38}\) to \(-1.2\times10^{-38}\), or from \(1.2\times10^{-38}\) to \(3.4\times10^{38}\). It’s not entirely uncommon for scientific computations to run into numbers that can’t be represented well in 32 bits, but in order to speed up large models, people have actually gone down to 16-bit numbers.
In the transformers training arguments you can specify 16-bit training, but with 32-bit parameter accumulation (storage) of results where higher precision is needed; this is so-called mixed precision training, enabled with a single argument:
training_args = TrainingArguments(...,
    fp16=True,
    ...,)
Or if you have a GPU capable of it (and most are), you can use the more efficient bf16 mixed precision, which has worse precision than fp16 but more dynamic range (it can represent a larger range of values).
training_args = TrainingArguments(...,
    bf16=True,
    ...,)
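If you want to see for yourself what each format trades away, PyTorch can report the limits of every data type; a quick check (a minimal sketch, nothing model-specific) makes the precision-versus-range trade-off concrete:

import torch

# Compare the range and precision of the formats discussed here
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):16s} max ≈ {info.max:.2e}   smallest normal ≈ {info.tiny:.2e}   eps ≈ {info.eps:.2e}")

You’ll see that fp16 tops out around 6.5e4, while bf16 reaches roughly the same 3.4e38 as fp32, at the cost of a much larger eps (worse precision).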
Finally, on NVIDIA GPUs, you can use some serious dark magic: tf32, which is actually a 19-bit number (it drops precision but keeps dynamic range) and which for most purposes is as precise as fp32. In many versions of PyTorch and transformers, tf32 is automatically enabled. But if you work on a cluster with older pre-compiled versions of PyTorch, you can manually activate tf32, and you can combine it with bf16 for mixed precision training:
import torch
torch.backends.cuda.matmul.allow_tf32 = True

training_args = TrainingArguments(...,
    bf16=True,
    tf32=True,
    ...,)
The combination of bf16 and tf32 can result in 4x to 12x speedups, though as tf32 is often already activated by default, part of that advantage is already baked in. There are ways to take this further and use 8-bit numerical representations on modern hardware, but the advantage of 8-bit training is greatest for really big models; if you have the means and skills to train those kinds of models, you have no need for this book.
Final recommendation: use bf16 with tf32, unless you are training models that are so large as to require 8-bit data formats, which is so far unheard of in biological deep learning.
Optimizer choice
We haven’t really spoken about the optimizer itself; the default optimizers used these days are almost always variations of adam (Kingma and Ba 2014). Adam, and modern implementations of it like adamW, are workhorses. Because they store a rolling average of recent gradients, they are robust if the gradients are very noisy. Because they don’t rely on 2nd-order derivatives, they are able to deal with very large models, and they are relatively efficient (when optimizing a model with adamW we store 8 bytes of optimizer state per parameter). Papers that promise to beat adamW, only to then fail, are a bit of a running gag in machine learning. However, you can in fact do better, both in terms of speed and in terms of memory utilization.
adafactor: adafactor is slower, but far more memory efficient. Where adam stores rolling averages of the gradients for each weight matrix, adafactor stores row and column averages of those, meaning it only requires 4 bytes per parameter, significantly reducing memory usage. This optimizer is a drop-in replacement, and while slightly less efficient (it takes longer to get to the same minimum), it is a great option if you are short on memory.
training_args = TrainingArguments(...,
    optim="adafactor",
    ...)
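To put those per-parameter figures in perspective, here is the rough optimizer-state arithmetic for a 300M-parameter model, using the 8 bytes per parameter for adamW and the approximate 4 bytes for adafactor mentioned above:

# Rough optimizer-state memory for a 300M-parameter model
params = 300_000_000

adamw_state_gb = params * 8 / 1e9      # two fp32 moments: ~8 bytes per parameter
adafactor_state_gb = params * 4 / 1e9  # factored statistics: ~4 bytes per parameter

print(f"adamW:     ~{adamw_state_gb:.1f} GB of optimizer state")      # ~2.4 GB
print(f"adafactor: ~{adafactor_state_gb:.1f} GB of optimizer state")  # ~1.2 GB

That difference comes on top of the model weights, gradients, and activations, which is why a memory-bound setup feels the switch immediately.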
paged_adam_8bit: bitsandbytes is a library that deeply compresses certain optimizer states to 8 bits during training, and even further during fine-tuning or inference. It’s less of a direct drop-in replacement, but it’s fast, almost as fast as adamW. If you are really memory-bound (say you have a 12 GB or 24 GB GPU but bigger ambitions), then this can be an option. The integration section of the manual is found here.
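If you want to try it through the Trainer, it is exposed via the same optim argument as adafactor. A minimal sketch, assuming a recent transformers with bitsandbytes installed; I believe the relevant option string is paged_adamw_8bit, but the accepted names can differ between versions, so check what your version offers:

from transformers import TrainingArguments

# Hypothetical minimal setup; requires the bitsandbytes package.
training_args = TrainingArguments(
    output_dir="dna_model_8bit",
    per_device_train_batch_size=16,
    bf16=True,
    optim="paged_adamw_8bit",   # check the exact name in your transformers version
)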
muon: You can refer to this writeup for details of the optimizer. It’s a bit of a weird optimizer, as it’s meant ONLY for the inner layers of a transformer; the first and the last layer are still optimized with adamW. This also means you’ll need to write your own code; this certainly isn’t a simple case of dropping an argument into Trainer. The advantage of muon is that it’s both faster and the loss drops more steeply. In other words, it learns more per token than other optimizers. This is actually very relevant for protein and DNA language models. As we’ll learn in the next few chapters on protein language models, there is way less usable data in the biological sphere than there is for natural language models. Facebook’s Llama is trained on 15 trillion tokens; the largest protein language models on 780 billion tokens, which required including massive amounts of metagenomic and synthetic protein data. We are running out of data, and considering an optimizer that squeezes a little more out of each amino acid might be worth your time! To implement muon you’d need to apply it to all 2D layers, and apply a separate optimizer to all other (1D) layers. From the GitHub:
# Before: a single AdamW optimizer for the whole model
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.90, 0.95), weight_decay=0.01)

import torch
from muon import Muon

# Find ≥2D parameters in the body of the network -- these should be optimized by Muon
muon_params = [p for p in model.body.parameters() if p.ndim >= 2]
# Find everything else -- these should be optimized by AdamW
adamw_params = ([p for p in model.body.parameters() if p.ndim < 2]
                + [*model.head.parameters(), *model.embed.parameters()])
# Create the optimizers
optimizers = [Muon(muon_params, lr=0.02, momentum=0.95, rank=0, world_size=1),
              torch.optim.AdamW(adamw_params, lr=3e-4, betas=(0.90, 0.95), weight_decay=0.01)]

...

# in the training step
for opt in optimizers:
    opt.step()
The authors mention that muon isn’t tested for fine-tuning, and they don’t think it’ll work well with small batches.
Final recommendation: Stick with adamW unless you are training a mid-size (>300M parameters) foundational model and risk running out of training data.
Parallel Training
After you have squeezed every drop out of your GPUs, the next performance step is training a model on multiple GPUs. Huggingface has a great read-me on training across multiple GPUs you should check out; below I cover the basics. If you stick with the Huggingface transformers library, it’s actually quite simple! Their accelerate library enables us to launch a single script across multiple GPUs in one machine, or even multiple GPUs in multiple machines. It’s truly as simple as writing a training script (say train.py) and launching it with the accelerate command line tool from the terminal:
accelerate launch --multi_gpu train.py
If your model comes with the Huggingface transformers library, this command should just find the GPUs and get going, though you’ll very likely want a bit more control. In my case, for example, this command kept trying to start a job on my 2 GPUs AND on the integrated GPU in the computer’s CPU package… That built-in GPU is of a different nature altogether and would only slow things down, so I had to specify the two GPUs I wanted to use:
CUDA_VISIBLE_DEVICES="0,1" accelerate launch {script_name.py}
This level of control can also come in handy if you want separate GPUs to run separate experiments. It is remarkable how plug-and-play accelerate really is, provided you have one machine with multiple GPUs and your model fits in the GPUs’ memory. In a few test runs on a cloud machine with 8 A100 GPUs or 8 H100 GPUs, I got my protein language model running across 8 GPUs, munging through 500+ proteins a second (so ±500,000 tokens a second) without any script modifications.
Different kinds of parallelism
data parallelism
There are a few different kinds of parallelism. The simplest are data parallel (DP) and distributed data parallel (DDP), where each GPU holds a copy of the model and gets a unique batch of data (this is why the trainer argument is per_device_train_batch_size: each GPU gets its own batch), and the losses and gradients are averaged or pooled across GPUs after each training step.
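One consequence worth internalizing: the batch size you set is per device, so the effective batch the optimizer sees grows with the number of GPUs (and with gradient accumulation, if you use it). A quick sanity check:

# Effective (global) batch size under data parallel training
per_device_train_batch_size = 16
num_gpus = 8
gradient_accumulation_steps = 4   # 1 if you don't accumulate gradients

effective_batch = per_device_train_batch_size * num_gpus * gradient_accumulation_steps
print(f"Each optimizer update sees {effective_batch} sequences")  # 512

If you move a run from 1 GPU to 8, your effective batch is suddenly 8x bigger, and the learning rate that worked before may no longer be optimal.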
Data parallelism requires your model and a reasonable batch of data to fit on each GPU. If they do, then DDP and DP are the most straightforward options. If your model doesn’t fit on your GPU (and large models frequently won’t!), you’ll need a different strategy. If you need to pick between DP and DDP, consider that DP is less communication-heavy but slightly less optimal. So if you have 4 GPUs on a slow motherboard PCIe connection, I’d go with DP, but if you have the GPUs linked via NVLink or similar high-speed card-to-card connections, then DDP might be faster. You could ask your HPC team which might be best for you, but I find it easier to just try DDP and DP and see which is fastest over a 100-500 step trial run.
The most advanced version of data parallelism is “fully sharded data parallelism” (FSDP), where a full copy of the model isn’t kept on each GPU; instead the model is sharded across all GPUs, minimizing memory use but increasing communication overhead. It’s a great option for large models on modern GPUs with fast (NVLink) interconnects; you can read more about its implementation in accelerate here.
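For a flavour of what turning it on looks like through the Trainer (rather than via an accelerate config file), TrainingArguments exposes an fsdp option. This is a minimal sketch under the assumption that your transformers version accepts the "full_shard" option string; the accepted values and recommended settings change between releases, so treat it as a pointer rather than a recipe:

from transformers import TrainingArguments

# Minimal sketch: launch with `accelerate launch` or `torchrun` across your GPUs.
training_args = TrainingArguments(
    output_dir="dna_model_fsdp",
    per_device_train_batch_size=8,
    bf16=True,
    fsdp="full_shard",   # shard parameters, gradients and optimizer states across GPUs
)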
model/tensor parallelism
Once a model exceeds the size of the GPU, you have no choice but to distribute layers of the model across multiple GPUs. You could do so naively, say layers 1-4 on GPU 1, layers 5-8 on GPU 2, and so on. But then each GPU would have to wait for the GPU before it to finish, meaning you’d have a lot of GPUs idling, waiting for their layers to be called on. Usually there are smarter ways to pack things; companies like Google pioneered GPipe for this. There are PyTorch and Huggingface tutorials on model parallel training.
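To make the naive version concrete, here is a toy sketch (hypothetical layers, two GPUs assumed) of splitting a stack of layers across devices and shuttling activations between them; pipeline libraries in the GPipe family avoid the idling by splitting each batch into micro-batches that flow through the stages in parallel:

import torch
import torch.nn as nn

# Naive model parallelism: first half of the layers on GPU 0, second half on GPU 1
first_half = nn.Sequential(*[nn.Linear(512, 512) for _ in range(4)]).to("cuda:0")
second_half = nn.Sequential(*[nn.Linear(512, 512) for _ in range(4)]).to("cuda:1")

x = torch.randn(16, 512, device="cuda:0")
hidden = first_half(x)                     # GPU 1 sits idle while GPU 0 computes
output = second_half(hidden.to("cuda:1"))  # then GPU 0 sits idle while GPU 1 computes
print(output.shape)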
Final recommendation: If you are in the audience for this book, then you might be ready for training or fine-tuning a model in the 200M to 600M parameter range. You should be able to do that on a single machine with 4 or ideally 8 powerful GPUs. Stick with data parallel strategies that are easy to implement while you figure out the thousands of other choices you need to make to arrive at an optimal model. The reason to work your way up from smaller to larger models is that there are always new surprises around each corner. I just trained a protein language model where the training data became progressively richer in mammalian proteins, so the model saw increasingly similar proteins. I thought that would enhance some applications on human proteins, but it didn’t, in either a 45 million or a 70 million parameter model. Had I trained a 300M parameter model on commercial hardware as I intended to without testing, I would have been out $1500 and would have ended up with a “meh” model at best. I did the math on training a 150M or 300M model on 8x A6000 GPUs or 8x A100 GPUs: in ~40 hours, and using the suggestions outlined above, you could train a near cutting-edge DNA/protein language model on one of these machines, and more powerful 8x H100 machines are coming online everywhere. Distributed data parallel training across 2, 4, or 8 GPUs in one machine should be sufficient for the kind of learning you need to do at this stage; once you are ready to scale further, there are serious resources to draw on.
Aim for the stars…
Should you get to the point where you exceed what’s feasible on a single very powerful machine with 8 GPUs, then you can move to a cluster. Tools like accelerate will work across multiple machines, and that is likely sufficient compute for almost anyone. But if you ever get to this stage, it’s good to read up on “ultra-scale” training; there are two great manuals you should flick through.
Google DeepMind wrote a guide to scaling large language models on TPU/JAX, but it covers all the math and concepts you’ll need in any framework in great detail.
Huggingface wrote an interactive playbook on ultra-scale training, which is more practical.