Summary
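Larger protein language models tend to predict thermostability more accurately (1). This is likely related to their better modeling of protein structure. The trend is not strictly monotonic: among the frozen ESM2 models, the 15B variant scores slightly below the 3B variant on EPA (0.670 vs. 0.674).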
Figures
| Model | EPA | FLIP | Difference (FLIP - EPA) |
|---|---|---|---|
| ESM-1b (650M), frozen | 0.620* | 0.680* | +0.060 |
| ESM2 (35M), frozen | 0.497 | 0.610 | +0.113 |
| ESM2 (650M), frozen | 0.523 | 0.618 | +0.095 |
| ESM2 (3B), frozen | 0.674 | 0.699 | +0.025 |
| ESM2 (15B), frozen | 0.670 | 0.721 | +0.051 |
| ESM2 (35M), finetuned | 0.436 | 0.478 | +0.042 |
| ESMFold seq. rep. (650M), frozen | 0.416 | 0.546 | +0.130 |
| Mean over ESM variants | ~0.548 | ~0.622 | +0.074 |

Source: Ref (1)
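
The Difference column and the mean row are simple arithmetic on the per-model scores, so they can be rechecked with a short script. The sketch below assumes the difference is FLIP minus EPA and that the mean is an unweighted average over all seven rows; the variable names are illustrative and do not come from Ref (1).

```python
# Scores copied from the table above: model -> (EPA, FLIP).
scores = {
    "ESM-1b (650M), frozen": (0.620, 0.680),
    "ESM2 (35M), frozen": (0.497, 0.610),
    "ESM2 (650M), frozen": (0.523, 0.618),
    "ESM2 (3B), frozen": (0.674, 0.699),
    "ESM2 (15B), frozen": (0.670, 0.721),
    "ESM2 (35M), finetuned": (0.436, 0.478),
    "ESMFold seq. rep. (650M), frozen": (0.416, 0.546),
}

# Per-model gap between the two columns (assumed: FLIP minus EPA).
differences = {
    name: round(flip - epa, 3) for name, (epa, flip) in scores.items()
}

# Unweighted column means over all rows (assumed definition of the mean row).
mean_epa = sum(epa for epa, _ in scores.values()) / len(scores)
mean_flip = sum(flip for _, flip in scores.values()) / len(scores)
mean_diff = mean_flip - mean_epa

for name, diff in differences.items():
    print(f"{name}: {diff:+.3f}")
print(f"Mean: EPA {mean_epa:.3f}, FLIP {mean_flip:.3f}, difference {mean_diff:+.3f}")
```

Because both columns cover the same rows, the difference of the column means equals the mean of the per-row differences, which gives a quick consistency check on the last row.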
See also
1. Hermann L, Fiedler T, Nguyen HA, Nowicka M, Bartoszewicz JM. Beware of Data Leakage from Protein LLM Pretraining. openRxiv; 2024. Available from: https://doi.org/10.1101/2024.07.23.604678