Summary

The sequence-identity threshold used to cluster the training data affects how well protein language models predict variant effects (1). Among the thresholds compared, clustering at 90% sequence identity gave the best performance when training ESM models.
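
To make "clustering at 90%" concrete, here is a toy sketch of greedy clustering at an identity threshold. It is an illustration only, not the pipeline used in (1): the function names `identity` and `greedy_cluster` and the example sequences are hypothetical, and it assumes pre-aligned sequences of equal length, whereas real data sets are clustered with dedicated tools on unaligned sequences.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences.

    Assumption: sequences are already aligned and have the same length.
    """
    if len(a) != len(b):
        raise ValueError("this toy sketch expects equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs: List[str], threshold: float = 0.9) -> Dict[str, List[str]]:
    """Assign each sequence to the first representative it matches at >= threshold.

    A sequence that matches no existing representative starts a new cluster
    and becomes its representative.
    """
    clusters: Dict[str, List[str]] = {}
    for seq in seqs:
        for rep in clusters:
            if identity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            clusters[seq] = [seq]
    return clusters


# Hypothetical 10-residue sequences, chosen only to show the threshold effect.
toy_seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYIAKAA", "GGGGGGGGGG"]

for t in (1.0, 0.9, 0.8):
    print(t, len(greedy_cluster(toy_seqs, threshold=t)))
```

Lowering the threshold merges more sequences into each cluster, so a training set built from one representative per cluster becomes smaller and less redundant. The threshold therefore trades off how many close homologs the model sees against how diverse the training set is.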

Figures

Ref (1)

See also

1. Meier J, Rao R, Verkuil R, Liu J, Sercu T, Rives A. Language models enable zero-shot prediction of the effects of mutations on protein function. openRxiv; 2021. Available from: https://doi.org/10.1101/2021.07.09.450648