Summary
The sequence identity threshold used to cluster training data affects how well protein language models (PLMs) predict variant effects (1). Clustering at 90% sequence identity leads to the best performance when training ESM models.
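To make "clustering at 90%" concrete, here is a minimal sketch of greedy identity-based clustering. It is an illustration only, not the pipeline used in (1): the function names are hypothetical, and identity is computed position by position on equal-length sequences as a toy stand-in for the alignment-based identity that tools such as MMseqs2 or CD-HIT compute.

```python
from typing import List


def identity(a: str, b: str) -> float:
    """Fraction of matching positions (toy measure, assumes no alignment).

    Sequences of different length are treated as unrelated (identity 0.0).
    """
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs: List[str], threshold: float = 0.9) -> List[List[str]]:
    """Greedy incremental clustering.

    Sequences are visited longest first. Each one joins the first cluster
    whose representative (the cluster's first member) it matches at or
    above `threshold`; otherwise it founds a new cluster.
    """
    clusters: List[List[str]] = []
    for seq in sorted(seqs, key=len, reverse=True):
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


def representatives(clusters: List[List[str]]) -> List[str]:
    """One sequence per cluster, i.e. the redundancy-reduced training set."""
    return [cluster[0] for cluster in clusters]


# Hypothetical toy sequences, 20 residues each.
toy = [
    "MKTAYIAKQRQISFVKSHFS",  # founds cluster 1
    "MKTAYIAKQRQISFVKSHFA",  # 19/20 identical, joins cluster 1
    "MKTAYIAKQRGGGGGGGGGG",  # 10/20 identical, founds cluster 2
]
toy_clusters = greedy_cluster(toy, threshold=0.9)
toy_reps = representatives(toy_clusters)
```

Raising the threshold keeps more near-duplicate sequences as separate clusters, while lowering it collapses more distant homologs together, which is the trade-off the summary refers to.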
Figures
Ref (1)
See also
- Correlation between sequence log-likelihood and variant effect prediction performance breaks down as PLMs get larger
- BERT models trained on sequence clusters outperform those trained on all data
- Alternate sequence clustering schemes outperform uniform sampling when training protein language models
- Sequence homology composition can affect performance of fine-tuned protein language models for variant effect prediction
1. Meier J, Rao R, Verkuil R, Liu J, Sercu T, Rives A. Language models enable zero-shot prediction of the effects of mutations on protein function. bioRxiv; 2021. Available from: https://doi.org/10.1101/2021.07.09.450648