Summary

Protein language models (PLMs) are biased because the sequence databases used for training sample the tree of life unevenly and are skewed toward prokaryotes (1,2). The bias is stronger in larger models, and stronger in ProGen (trained on UniRef90) than in ESM (trained on UniRef50). Augmenting these datasets with metagenomic data can improve generalization (3), but it is unclear whether this would fix the bias specifically.

Details

Most organisms have only a few proteins assigned to them (2).
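
One way to check this kind of skew on a database dump is to count sequences per taxon. The sketch below is illustrative only: it assumes UniRef-style FASTA headers that carry a `TaxID=` field, the function name `proteins_per_taxon` is hypothetical, and the toy records are invented rather than taken from any database.

```python
import re
from collections import Counter
from statistics import median

# Assumption: UniRef-style headers carry a "TaxID=<digits>" field.
TAXID_RE = re.compile(r"TaxID=(\d+)")


def proteins_per_taxon(fasta_lines):
    """Count FASTA header lines per TaxID; sequence lines are skipped."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            match = TAXID_RE.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts


# Hypothetical toy records (invented identifiers, not real database entries).
toy_fasta = []
for i in range(5):
    toy_fasta += [
        f">UniRef50_TOY{i} Example protein n=1 Tax=Taxon A TaxID=111 RepID=TOY{i}_A",
        "MKTAYIAK",
    ]
toy_fasta += [">UniRef50_TOY5 Example protein n=1 Tax=Taxon B TaxID=222 RepID=TOY5_B", "MSLLTEVE"]
toy_fasta += [">UniRef50_TOY6 Example protein n=1 Tax=Taxon C TaxID=333 RepID=TOY6_C", "MADEEKLP"]

counts = proteins_per_taxon(toy_fasta)
total = sum(counts.values())
top_taxon, top_count = counts.most_common(1)[0]

print("taxa:", len(counts))
print("median proteins per taxon:", median(counts.values()))
print(f"share held by the largest taxon: {top_count / total:.2f}")
```

For a real file, an open file handle can be passed in place of the list, since the function only iterates over lines. A low median combined with a large share for the top taxon is the pattern the statement above describes.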

Figures

Model            R²_species  R²_protein  R²_both  R²_S|P
PROGEN2-XLARGE   0.50        0.42        0.81     0.67
PROGEN2-BFD90    0.49        0.51        0.85     0.69
PROGEN2-LARGE    0.46        0.60        0.87     0.67
PROGEN2-BASE     0.25        0.64        0.84     0.55
PROGEN2-MEDIUM   0.44        0.59        0.86     0.66
ESM2-15B         0.25        0.42        0.60     0.32
ESM2-3B          0.26        0.46        0.63     0.32
ESM2-650M        0.19        0.62        0.72     0.26
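
The note does not define the last column. One reading, which is an assumption here and not stated in the source, is that R²_S|P is the partial R² of species given protein: the share of variance left unexplained by protein identity that species identity then accounts for, (R²_both - R²_protein) / (1 - R²_protein). The sketch below applies that formula to the table values; the function name `partial_r2` is hypothetical.

```python
def partial_r2(r2_both, r2_protein):
    """Partial R² of species given protein (assumed definition of R²_S|P)."""
    return (r2_both - r2_protein) / (1.0 - r2_protein)


# (R²_protein, R²_both, reported R²_S|P), copied from the table above.
rows = {
    "PROGEN2-XLARGE": (0.42, 0.81, 0.67),
    "PROGEN2-BFD90": (0.51, 0.85, 0.69),
    "PROGEN2-LARGE": (0.60, 0.87, 0.67),
    "PROGEN2-BASE": (0.64, 0.84, 0.55),
    "PROGEN2-MEDIUM": (0.59, 0.86, 0.66),
    "ESM2-15B": (0.42, 0.60, 0.32),
    "ESM2-3B": (0.46, 0.63, 0.32),
    "ESM2-650M": (0.62, 0.72, 0.26),
}

for model, (r2_protein, r2_both, reported) in rows.items():
    computed = partial_r2(r2_both, r2_protein)
    print(f"{model:15s} computed={computed:.3f} reported={reported:.2f}")
```

Under this reading, the computed values agree with the reported column to within about 0.01, which is compatible with the inputs being rounded to two decimals. On that interpretation, species identity explains roughly two thirds of the remaining variance for most ProGen2 models and about a third or less for ESM2.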

Figures from (1)

Figure from (2)

See also

1. Ding F, Steinhardt J. Protein language models are biased by unequal sequence sampling across the tree of life. openRxiv; 2024. Available from: https://doi.org/10.1101/2024.03.07.584001
2. Avasthi P, York R. The known protein universe is phylogenetically biased. 2024. Available from: https://thestacks.org/publications/result-protein-universe-phylogenetic-bias
3. Cheng X, Chen B, Li P, Gong J, Tang J, Song L. Training Compute-Optimal Protein Language Models. openRxiv; 2024. Available from: https://doi.org/10.1101/2024.06.06.597716