Summary
ML models trained exclusively on experimental structures perform worse when applied to computationally predicted structures (1,2). (1) attribute this to “structure embedding bias”, hypothesize that further improvements in structure prediction will not remove this bias, and address it using contrastive learning. (2) observe the same effect in a version of SaProt trained on Foldseek tokens derived from PDB structures rather than AlphaFold2 models.
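As a rough illustration of the contrastive idea, and not the specific objective used in (1), the sketch below computes an InfoNCE-style loss that pulls together the embeddings of an experimental structure and a predicted structure of the same protein, and pushes apart embeddings of different proteins. The function names, the temperature value, and the toy embeddings are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(experimental, predicted, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    experimental[i] and predicted[i] are assumed to embed the same protein;
    every other predicted[j] in the batch acts as a negative.
    """
    losses = []
    for i, anchor in enumerate(experimental):
        logits = [cosine(anchor, p) / temperature for p in predicted]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy 2-D embeddings (made-up numbers, for illustration only).
experimental = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]    # predicted structures embed near their partners
shuffled = [[0.1, 0.9], [0.9, 0.1]]   # predicted structures embed near the wrong partner

loss_aligned = info_nce(experimental, aligned)
loss_shuffled = info_nce(experimental, shuffled)
```

The loss is small when each predicted-structure embedding sits closest to its experimental counterpart and large when the pairing is broken, which is the signal a contrastive objective would use to reduce the gap between the two kinds of structure.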
See also
- Training inverse folding and diffusion models exclusively on predicted protein structures worsens performance due to how locally perfect they are
- Contrastive fine-tuning PLMs on inverse folding embeddings of experimental structures but not computational models improves downstream tasks
1. Huang Y, Li S, Wu L, Su J, Lin H, Zhang O, et al. Protein 3D Graph Structure Learning for Robust Structure-Based Protein Property Prediction. Proceedings of the AAAI Conference on Artificial Intelligence. 2024;38(11):12662–70. Available from: https://doi.org/10.1609/aaai.v38i11.29161
2. Su J, Han C, Zhou Y, Shan J, Zhou X, Yuan F. SaProt: Protein Language Modeling with Structure-aware Vocabulary. openRxiv; 2023. Available from: https://doi.org/10.1101/2023.10.01.560349