Summary

Training machine learning models for inverse folding or diffusion-based protein backbone design exclusively on predicted structures worsens performance (1,2). This was observed when training ESM-IF and GVP, as well as when training with the Evoformer or the hybrid sequence-structure method MIF-ST, but not with SaProt, which uses tokens from the Foldseek alphabet. The latter study (2) also looked at downstream performance and saw worse results. The effect was shown to arise because predicted structures are “too perfect” at a local level (3).

Details

(4) found that training their diffusion model on both predicted models and experimental structures worsened designability and novelty relative to a model trained on experimental structures only.

(5) trained a version of the backbone diffusion model GENIE on AlphaFold2 models from SwissProt and found that although designability was greater than for a model trained on the PDB, diversity was lower.

Figures

| Model | Exp only | AF2+Exp | AF2 only |
| --- | --- | --- | --- |
| GVP-GNN | 5.43 | 6.06 | 6.52 |
| GVP-GNN-Large | 6.17 | 4.08 | 11.51 |
| GVP-Transformer | 6.44 | 4.01 | 10.95 |

Table from (1)
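
To make the size of the effect easier to read, the table values can be converted into percent changes relative to the "Exp only" column. The sketch below is a minimal illustration, not part of (1): the function names `relative_change` and `summarize` are hypothetical, and it assumes that the reported metric is one where lower is better (the values look like perplexities, but the metric is not named here).

```python
# Values copied from the table from (1): (Exp only, AF2+Exp, AF2 only).
# Assumption: lower is better for this metric.
scores = {
    "GVP-GNN": (5.43, 6.06, 6.52),
    "GVP-GNN-Large": (6.17, 4.08, 11.51),
    "GVP-Transformer": (6.44, 4.01, 10.95),
}


def relative_change(baseline: float, value: float) -> float:
    """Percent change of `value` relative to `baseline`."""
    return 100.0 * (value - baseline) / baseline


def summarize(table: dict) -> dict:
    """Percent change of each training regime versus 'Exp only', per model."""
    summary = {}
    for model, (exp_only, af2_exp, af2_only) in table.items():
        summary[model] = {
            "af2_plus_exp_pct": round(relative_change(exp_only, af2_exp), 1),
            "af2_only_pct": round(relative_change(exp_only, af2_only), 1),
        }
    return summary


if __name__ == "__main__":
    for model, row in summarize(scores).items():
        print(
            f"{model}: AF2+Exp {row['af2_plus_exp_pct']:+.1f}%, "
            f"AF2 only {row['af2_only_pct']:+.1f}%"
        )
```

Under the lower-is-better assumption, a positive percentage means the model got worse than its "Exp only" counterpart. Training on AF2 models alone gives a positive change for all three architectures, while combining AF2 models with experimental structures gives a negative change (an improvement) for the two larger models only.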

Ref (2)

![[bafkreiawfxhyqc4grpfhhgjsyezzahtsrehsxzughw6vmpsuw2tqsazz64@jpeg.jpg]] Figures from (3)

See also

21 March 2026: https://biomlzk.ghost.io/training-protein-structure-based-neural-networks-exclusively-on-predicted-protein-structures-worsens-performance-on-experimental-structures-due-to-how-locally-perfect-the-training-data-/

1. Hsu C, Verkuil R, Liu J, Lin Z, Hie B, Sercu T, et al. Learning inverse folding from millions of predicted structures. openRxiv; 2022. Available from: https://doi.org/10.1101/2022.04.10.487779
2. Su J, Han C, Zhou Y, Shan J, Zhou X, Yuan F. SaProt: Protein Language Modeling with Structure-aware Vocabulary. openRxiv; 2023. Available from: https://doi.org/10.1101/2023.10.01.560349
3. Tan C, Cao Z, Gao Z, Li S, Huang Y, Li SZ. AlphaFold Database Debiasing for Robust Inverse Folding. 2025. Available from: https://arxiv.org/abs/2506.08365
4. Huguet G, Vuckovic J, Fatras K, Thibodeau-Laufer E, Lemos P, Islam R, et al. Sequence-Augmented SE(3)-Flow Matching For Conditional Protein Backbone Generation. 2024. Available from: https://arxiv.org/abs/2405.20313
5. Lin Y, AlQuraishi M. Generating Novel, Designable, and Diverse Protein Structures by Equivariantly Diffusing Oriented Residue Clouds. In: International Conference on Machine Learning. PMLR; 2023. p. 20978–1002. Available from: https://proceedings.mlr.press/v202/lin23a.html