Summary

BERT-style protein language models (PLMs) trained on sequence-clustered databases such as UniRef50 or UniRef90 outperform those trained on UniRef100, which merges only identical sequences and is therefore effectively unclustered (1).
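To make the notion of "clustered" concrete, the following minimal sketch shows how raising or lowering an identity threshold changes how many representative sequences remain. It is a toy illustration only, not the procedure used to build UniRef: the function names, the position-wise identity measure (defined here only for equal-length sequences) and the example sequences are all assumptions made for this example.

```python
def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions (equal-length sequences only)."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Keep a sequence as a new representative only if it matches no
    existing representative at or above the identity threshold."""
    representatives = []
    for s in seqs:
        if not any(identity(s, r) >= threshold for r in representatives):
            representatives.append(s)
    return representatives


# Hypothetical sequences: one exact duplicate, two near-duplicates, one unrelated.
seqs = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQK", "MKTGYLAKQK", "GSHMLEDPVV"]

reps_100 = greedy_cluster(seqs, 1.0)  # only exact duplicates collapse
reps_90 = greedy_cluster(seqs, 0.9)   # near-identical sequences collapse too
reps_50 = greedy_cluster(seqs, 0.5)   # only clearly distinct sequences remain
```

Lower thresholds leave fewer, more diverse representatives, so a model trained on them sees each family less redundantly.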

See also

1. Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences. 2021;118(15). Available from: https://doi.org/10.1073/pnas.2016239118