Sequence clustering of training data affects variant effect prediction performance by PLMs

Summary

The sequence clustering applied to the training data affects how well protein language models (PLMs) predict variant effects (1). When training ESM models, clustering at 90% sequence identity gave the best performance.
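One way to see why clustering can matter is to look at how it changes the mix of sequences a model sees. The sketch below is a toy illustration, not the procedure used in (1). It assumes cluster assignments are already available (for example from a tool such as MMseqs2 or CD-HIT at a chosen identity threshold). The function names, sequence IDs, and family sizes are all hypothetical.

```python
import random
from collections import defaultdict


def group_by_cluster(assignments):
    """Map each cluster id to the list of sequence ids assigned to it."""
    clusters = defaultdict(list)
    for seq_id, cluster_id in assignments:
        clusters[cluster_id].append(seq_id)
    return dict(clusters)


def sample_cluster_balanced(clusters, n, rng):
    """Pick a cluster uniformly, then a member uniformly within that cluster."""
    cluster_ids = sorted(clusters)
    return [rng.choice(clusters[rng.choice(cluster_ids)]) for _ in range(n)]


def sample_uniform(clusters, n, rng):
    """Pick uniformly over all sequences, ignoring cluster structure."""
    all_ids = sorted(s for members in clusters.values() for s in members)
    return [rng.choice(all_ids) for _ in range(n)]


# Hypothetical toy data: one large family (98 sequences) and two singletons.
assignments = [(f"famA_{i}", "A") for i in range(98)]
assignments += [("famB_0", "B"), ("famC_0", "C")]

clusters = group_by_cluster(assignments)
rng = random.Random(0)

balanced = sample_cluster_balanced(clusters, 3000, rng)
uniform = sample_uniform(clusters, 3000, rng)

# Fraction of drawn sequences that come from the large family.
frac_a_balanced = sum(s.startswith("famA") for s in balanced) / len(balanced)
frac_a_uniform = sum(s.startswith("famA") for s in uniform) / len(uniform)

print(f"large-family share, cluster-balanced: {frac_a_balanced:.2f}")
print(f"large-family share, uniform:          {frac_a_uniform:.2f}")
```

In this toy setup the large family dominates uniform sampling but contributes only about a third of cluster-balanced draws. A looser identity threshold merges more sequences into each cluster, so the threshold controls how strongly large families are down-weighted.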

Figures

Ref (1)

See also

Correlation between sequence log-likelihood and variant effect prediction performance breaks down as PLMs get larger
BERT models trained on sequence clusters outperform those trained on all data
Alternate sequence clustering schemes outperform uniform sampling when training protein language models
Sequence homology composition can affect performance of fine-tuned protein language models for variant effect prediction

(1) Meier J, Rao R, Verkuil R, Liu J, Sercu T, Rives A. Language models enable zero-shot prediction of the effects of mutations on protein function. bioRxiv; 2021. Available from: https://doi.org/10.1101/2021.07.09.450648
