Summary

Larger protein language models predict thermostability more accurately (1), likely because they model protein structure better.

Figures

| Model | EPA | FLIP | Difference (FLIP - EPA) |
|---|---|---|---|
| ESM-1b (650M), frozen | 0.620* | 0.680* | +0.060 |
| ESM2 (35M), frozen | 0.497 | 0.610 | +0.113 |
| ESM2 (650M), frozen | 0.523 | 0.618 | +0.095 |
| ESM2 (3B), frozen | 0.674 | 0.699 | +0.025 |
| ESM2 (15B), frozen | 0.670 | 0.721 | +0.051 |
| ESM2 (35M), finetuned | 0.436 | 0.478 | +0.042 |
| ESMFold seq. rep. (650M), frozen | 0.416 | 0.546 | +0.130 |
| Mean over ESM variants | ~0.548 | ~0.603 | +0.055 |

Source: Ref. (1)
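
The Difference column is the FLIP score minus the EPA score for each model. As a minimal sketch, the per-model values can be recomputed from the rows listed above. The dictionary layout and the helper name `split_gap` are illustrative assumptions, not part of the source, and the asterisks on the ESM-1b row are dropped because their meaning is not given here.

```python
# (EPA, FLIP) scores transcribed from the table above.
# The dict layout and the name `split_gap` are hypothetical, chosen for illustration.
scores = {
    "ESM-1b (650M), frozen": (0.620, 0.680),
    "ESM2 (35M), frozen": (0.497, 0.610),
    "ESM2 (650M), frozen": (0.523, 0.618),
    "ESM2 (3B), frozen": (0.674, 0.699),
    "ESM2 (15B), frozen": (0.670, 0.721),
    "ESM2 (35M), finetuned": (0.436, 0.478),
    "ESMFold seq. rep. (650M), frozen": (0.416, 0.546),
}


def split_gap(epa: float, flip: float) -> float:
    """Return FLIP minus EPA, rounded to three decimals like the table."""
    return round(flip - epa, 3)


# Per-model gap between the two columns.
gaps = {model: split_gap(epa, flip) for model, (epa, flip) in scores.items()}

# Print models from largest to smallest gap.
for model, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:35s} {gap:+.3f}")
```

Sorting by gap makes it easy to see which models differ most between the two columns. The sketch covers only the seven rows listed here and does not attempt to reproduce the reported mean row.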

See also

1. Hermann L, Fiedler T, Nguyen HA, Nowicka M, Bartoszewicz JM. Beware of Data Leakage from Protein LLM Pretraining. openRxiv; 2024. Available from: https://doi.org/10.1101/2024.07.23.604678