Sequence Language Models & Deep Learning in Genomics

A book for those late to the AI party and trying to catch up.

Author

Michel Nivard

Published

May 8, 2025

Preface

These are my study notes, gathered over the last year to year and a half, on training DNA/RNA/Protein sequence models and other biological deep-learning models. They are arranged into a public alpha/draft of a book/study guide on biological language models, deep-learning models, or sequence models, aimed at academics and trainees who missed the boat on AI and want to catch up.

The text/book is intended for people who want to on-ramp into the field and begin to actually train large DNA/biological language models, but also for their PIs: anxious, aging millennials or Gen-Xers who want to understand the next generation of computational genomics tools that is about to wash over us all.

As I come at this from my background in statistical genetics/psychiatric genetics/epidemiology, I'll clearly be interested in how mutations or variants I might identify in GWAS alter protein function, and how protein models can help me assess/model those consequences. In later chapters (currently being drafted) I tackle issues more closely related to statistical genetics, like fine-mapping, and how (language) models of protein complexes might capture long-range co-evolution between amino acids, help us prioritize potential GxG interactions between distal variants, or help study biological mechanisms underlying trans-QTLs.

At all times, I'll try to add some biological context (though I am no biologist!) for people who have an ML background but no college bio experience, and I'll try to add context on ML concepts (though I am not an AI engineer!) for those with a bio background but limited experience with language models/deep learning.

If you have formal training in Machine Learning and a background in biology or genomics, this book might be too basic for you, and in fact I'd love it if you could flag any errors or omissions (which I am sure exist); each page has a comments section, or you can email me at: m.g.nivard (at) bristol (dot) ac.uk

The Times We Live In

More than natural language models, biological/sequence language models rely heavily on NIH-funded databases, datasets, resources, and scientists. The data we train on was bought and paid for by taxpayers all over the globe. The Human Genome Project was to a great extent funded, directed, and conceived under the auspices of the US federal government, under both Democratic and Republican presidents. Had it not been, pharmaceutical companies might have done it instead, and while those companies can be highly innovative, there would have been no space for startups, no space for Google DeepMind to come in and iterate, revolutionize, or grow biological modeling. There is no uproar over training data in biology because, under the firm guidance of US federal policy, the data sequencers generate is generally in the public domain, or accessible to those willing and able to meet ethical standards. All scientists reading this know this. Should you find yourself as someone from Silicon Valley, from a well-funded startup even, take a beat and think through whether you'd stand a snowball's chance in hell of competing if the next wave of sequence data isn't public but generated inside Google/Microsoft/pharma. Then adjust your politics accordingly.

Acknowledgements

These notes are written by me, Michel Nivard, a professor of Epidemiology at the University of Bristol. Because this book is not a core output of my job, I rely on LLMs to help me with grammar, spelling, review, coding, and formatting more heavily than I would for publications or university teaching materials.

These study notes are influenced by discussions with Robbee Wedow and Seyedeh Zahra Paylakhi, with whom I work on related projects.