Summary

ML models trained exclusively on experimental structures perform worse when applied to computationally predicted structures (1,2). Huang et al. (1) attribute this to “structure embedding bias”, hypothesize that improvements in structure prediction will not remove this bias, and address it using contrastive learning. Su et al. (2) observe the same effect in a version of SaProt trained on Foldseek tokens derived from PDB structures rather than AlphaFold2 models.

See also

1. Huang Y, Li S, Wu L, Su J, Lin H, Zhang O, et al. Protein 3D Graph Structure Learning for Robust Structure-Based Protein Property Prediction. Proceedings of the AAAI Conference on Artificial Intelligence. 2024;38(11):12662–70. Available from: https://doi.org/10.1609/aaai.v38i11.29161
2. Su J, Han C, Zhou Y, Shan J, Zhou X, Yuan F. SaProt: Protein Language Modeling with Structure-aware Vocabulary. openRxiv; 2023. Available from: https://doi.org/10.1101/2023.10.01.560349