Summary

Training on updated releases of sequence databases (such as UniRef) does not guarantee better performance of language models and protein language models (PLMs) (1). The largest performance boost that (1) observed when training PLMs on UniRef100 coincided with an unusually large culling of data from the training databases.

Figures

Ref (1)

1. Fournier Q, Vernon RM, van der Sloot A, Schulz B, Chandar S, Langmead CJ. Protein Language Models: Is Scaling Necessary? openRxiv; 2024. Available from: https://doi.org/10.1101/2024.09.23.614603