8 Selecting and curating protein sequences
Protein data comes from multiple sources, each providing different types of information useful for training machine learning models and protein language models. Given we can “convert” DNA to proteins using the genetic code, it also makes sense to consider the sources of DNA data we discussed in Chapter 1 for training protein models.
8.1 Protein Sequence Data
UniProt | NCBI RefSeq | MGnify
The most widely used source of protein sequences is UniProt, which contains curated (Swiss-Prot) and unreviewed (TrEMBL) protein sequences from a wide range of organisms. Other databases, like NCBI RefSeq, provide reference protein sequences linked to genomic data. Large-scale metagenomics resources, such as MGnify, add protein sequences predicted from environmental samples, offering even broader diversity. These sequence datasets serve as the foundation for protein language models (PLMs).
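As a concrete starting point, the sketch below pulls a single sequence from UniProt’s REST interface. The endpoint pattern and the example accession (P69905, human hemoglobin subunit alpha) reflect the current public API as I understand it, so treat them as assumptions and adapt to your own accessions.

```python
import requests


def fetch_uniprot_fasta(accession: str) -> str:
    """Download a single UniProtKB entry in FASTA format."""
    # Endpoint pattern assumed from UniProt's public REST API.
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


if __name__ == "__main__":
    # P69905 = human hemoglobin subunit alpha (example accession).
    fasta = fetch_uniprot_fasta("P69905")
    header, *seq_lines = fasta.strip().splitlines()
    print(header)
    print(f"Length: {len(''.join(seq_lines))} residues")
```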
8.1.1 UniRef50 and UniRef90
UniRef
UniRef50 and UniRef90 are clustered versions of UniProt protein sequences designed to reduce redundancy while maintaining sequence diversity. UniRef90 groups sequences that share at least 90% identity and have 80% overlap, merging nearly identical sequences into single representative entries. UniRef50 applies a looser 50% identity threshold, further condensing the dataset while preserving diverse functional and structural information. These clustered datasets are particularly useful for training protein language models and other ML applications, as they help mitigate biases from highly redundant sequences while still providing broad coverage across species. By using UniRef datasets, models can learn more generalizable representations of protein sequence space while improving computational efficiency.
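To make the clustering idea concrete, here is a toy greedy-clustering sketch in the spirit of UniRef. Real UniRef clusters are built with alignment-based tools such as MMseqs2 or CD-HIT; the `difflib` ratio used here is only a crude stand-in for percent identity, and the sequences are made up.

```python
from difflib import SequenceMatcher


def crude_identity(a: str, b: str) -> float:
    """Very rough stand-in for alignment-based percent identity (0..1)."""
    return SequenceMatcher(None, a, b).ratio()


def greedy_cluster(sequences: list[str], threshold: float) -> dict[str, list[str]]:
    """Greedy clustering: longer sequences become representatives first;
    each remaining sequence joins the first representative it matches at
    >= `threshold` identity, otherwise it starts a new cluster."""
    clusters: dict[str, list[str]] = {}
    for seq in sorted(sequences, key=len, reverse=True):
        for rep in clusters:
            if crude_identity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            clusters[seq] = [seq]  # seq becomes a new representative
    return clusters


if __name__ == "__main__":
    toy = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLPKQR", "GSHMLEDPRA"]
    print(len(greedy_cluster(toy, 0.9)), "clusters at 90% identity")
    print(len(greedy_cluster(toy, 0.5)), "clusters at 50% identity")
```

Lowering the threshold merges more sequences into each cluster, which is exactly how UniRef50 ends up smaller and more diverse per entry than UniRef90.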
8.1.2 Protein Structure Data
Protein Data Bank (PDB) | AlphaFold DB | SCOP
For structural information, the Protein Data Bank (PDB) is the most comprehensive resource, containing experimentally determined 3D structures from X-ray crystallography, NMR, and cryo-electron microscopy. Building on recent breakthroughs in structure prediction, AlphaFold DB provides high-quality computational structure predictions for almost all known proteins, greatly expanding the available structural information. Another key resource is SCOP (Structural Classification of Proteins), which organizes protein structures into hierarchical families based on evolutionary relationships, helping models understand structure-function relationships.
8.1.3 Evolutionary and Functional Data
Pfam | InterPro | KEGG
Evolutionary conservation is captured in Pfam, a database of protein families built from multiple sequence alignments, which is useful for understanding functional motifs. InterPro integrates multiple databases to annotate protein sequences with conserved domains, GO (Gene Ontology) terms, and other functional descriptors. KEGG (Kyoto Encyclopedia of Genes and Genomes) links proteins to biochemical pathways, aiding in training models to understand biological context.
8.1.4 Protein-Protein Interactions and Kinetics
STRING | BioGRID | BindingDB | PDBBind
Databases like STRING and BioGRID compile large-scale protein-protein interaction networks, valuable for models predicting binding affinity or molecular interactions. Meanwhile, BindingDB and PDBBind offer datasets of protein-ligand interactions with binding affinities, supporting drug discovery applications.
8.1.5 Specialized Datasets for ML and PLMs
TAPE Dataset | SwissProt50/TrEMBL50 | ProteinNet
Several datasets have been specifically designed for training ML models and protein LMs. TAPE (Tasks Assessing Protein Embeddings) provides benchmarks for PLMs on tasks like secondary structure prediction, fluorescence, and stability prediction. SwissProt50/TrEMBL50 datasets are commonly used for training PLMs by balancing redundancy in sequence data. ProteinNet bundles the sequence, multiple sequence alignments, and atomic structure of CASP target proteins. The cool thing is that the releases are ordered chronologically, so you can train on CASP 7/8/9/10/11 and use CASP 12 or 13 as holdouts for validation.
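If you want to reproduce that kind of temporal holdout, a minimal sketch could look like the following; the `ProteinRecord` fields are hypothetical placeholders for whatever your ProteinNet parser actually produces.

```python
from dataclasses import dataclass


@dataclass
class ProteinRecord:
    casp_edition: int  # CASP edition the target belongs to (hypothetical field)
    sequence: str      # amino-acid sequence
    # coordinates, MSAs, PSSMs, etc. would live here in a real record


def chronological_split(records, holdout_edition=12):
    """Train on targets from earlier CASP editions, validate on the holdout
    edition, mimicking ProteinNet's chronological ordering."""
    train = [r for r in records if r.casp_edition < holdout_edition]
    valid = [r for r in records if r.casp_edition == holdout_edition]
    return train, valid


if __name__ == "__main__":
    toy = [ProteinRecord(7, "MKTAYIAKQR"),
           ProteinRecord(11, "GSHMLEDPRA"),
           ProteinRecord(12, "ACDEFGHIKL")]
    train, valid = chronological_split(toy, holdout_edition=12)
    print(f"{len(train)} training records, {len(valid)} validation records")
```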
8.2 Sequence diversity in training
When training protein sequence models, there is a risk of “overtraining” on abundantly measured or frequently observed protein sequences. Not all protein families (groups of proteins sharing a common evolutionary origin) are equally represented, so uniform sampling from UniProt will skew towards certain types of proteins while undersampling others.
Various empirical projects have considered sequence diversity when training protein (language) models. The ESM-1 paper (Rives et al. 2021) evaluated three different levels of sequence diversity. The low-diversity dataset (UR100) consists of the UniRef100 representative sequences, i.e., all proteins in UniProt with identical copies removed (clustering sequences that share 100% identity); this dataset still contains many near-identical sequences. The high-diversity sparse dataset (UR50/S) is derived from the UniRef50 representative sequences; it contains many unique sequences but limited variation among similar ones, since all sequences with > 50% shared identity are collapsed into a single cluster from which one representative sequence is retained. The high-diversity dense dataset (UR50/D) achieves broader coverage by uniformly sampling UniRef100 sequences across all UniRef50 clusters, combining uniform cluster representation with dense coverage. When models were compared using the exponentiated cross-entropy (ECE) metric, which exponentiates the average per-token cross-entropy loss (lower is better, with 1 corresponding to a perfect model), the dense, high-diversity data performed best (see Table 1; a sketch of this sampling scheme follows the table).
| Model | Variant | Params | Training data | ECE |
|---|---|---|---|---|
| Transformer | 34-layer | 669.2 M | UR100 (low diversity) | 10.32 |
| Transformer | 34-layer | 669.2 M | UR50/S (high diversity, sparse) | 8.54 |
| Transformer | 34-layer | 669.2 M | UR50/D (high diversity, dense) | 8.46 |

Table 1: Effect of training-data diversity on validation ECE (Rives et al. 2021).
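The UR50/D strategy boils down to two-level sampling: pick a UniRef50 cluster uniformly at random, then pick one of its UniRef100 members uniformly at random. A minimal sketch, with a made-up cluster mapping, could look like this:

```python
import random


def sample_dense_diverse(clusters: dict[str, list[str]], n_samples: int,
                         seed: int = 0) -> list[str]:
    """UR50/D-style sampling: choose a UniRef50 cluster uniformly at random,
    then choose one of its UniRef100 members uniformly at random, so large
    redundant families no longer dominate the training stream."""
    rng = random.Random(seed)
    cluster_ids = list(clusters)
    samples = []
    for _ in range(n_samples):
        cid = rng.choice(cluster_ids)              # uniform over clusters...
        samples.append(rng.choice(clusters[cid]))  # ...then over members
    return samples


if __name__ == "__main__":
    # Hypothetical UniRef50 cluster -> UniRef100 members mapping.
    clusters = {
        "UniRef50_A": ["MKTAYIAKQR"] * 1000,  # a huge, redundant family
        "UniRef50_B": ["GSHMLEDPRA"],         # a singleton cluster
    }
    batch = sample_dense_diverse(clusters, n_samples=10)
    print(batch.count("GSHMLEDPRA"), "of 10 samples come from the rare cluster")
```

With naive uniform sampling over sequences, the singleton cluster would almost never be seen; with cluster-balanced sampling it contributes roughly half the batch.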
If you are going to develop your own base protein (language) model, competing with the likes of Facebook and Google, then selecting the optimal training set is critically important. You are unlikely to be able to compete in terms of raw compute, but good data selection makes training more efficient and can save a lot of that compute.
8.3 Are we running out of training data?
An insightful study (Cheng et al. 2024) shows that if you train on UniRef50 (about 15 billion tokens, or amino acids), larger models (3 billion parameters) start to deteriorate on validation data after training for three or more epochs, whereas if you expand training to metagenomic sources (which cover eukaryotic species, viruses, etc.), you can expect continued improvements.
The most recent generation of protein deep learning models, like ESM-3 (Hayes et al. 2025) and ESM C (ESM Team 2024), train on a kind of UniRef70 (a sweet spot between UniRef50 and UniRef90) augmented with millions of metagenomic sequences from EMBL’s MGnify and JGI’s IMG databases. A good open source of that data is OMG_prot50, a clustered version of the Open MetaGenomic dataset (OMG, which combines MGnify and IMG) clustered at 50% sequence identity to minimize duplication and maximize diversity. Across OMG_prot50 and UniRef50 or 90, there are a few billion relatively unique proteins and several hundred billion amino acids to train on. That sounds like a lot (and it is), but the latest natural language models are trained on tens of trillions of tokens and show no signs of saturating.