Training inverse folding and diffusion models exclusively on predicted protein structures worsens performance due to how locally perfect they are

Summary

Training machine learning models for either inverse folding or protein backbone design via diffusion exclusively on predicted models worsens performance (1,2). This was observed when training ESM-IF and GVP as well as when training using the (Evoformer) or the hybrid sequence-structure method MIF-ST, but not (SaProt) (which uses tokens from the (Foldseek) alphabet). The latter study also looked at downstream performance and saw worse results. This was shown to be because predicted models are “too perfect” at a local level (3).

Details

(4) found that training their diffusion model on both predicted models and experimental structures worsened designability and novelty relative to a model trained on experimental structures only.

(5) trained a version of the backbone diffusion model GENIE on (AlphaFold2) models from SwissProt and found that although designability increased was greater than a model trained on the PDB, diversity was lower.

Figures

	Exp only	AF2+Exp	AF2 only
GVP-GNN	5.43	6.06	6.52
GVP-GNN-Large	6.17	4.08	11.51
GVP-Transformer	6.44	4.01	10.95

Table from (1)

Ref (2)

![[bafkreiawfxhyqc4grpfhhgjsyezzahtsrehsxzughw6vmpsuw2tqsazz64@jpeg.jpg]] Figures from (3)

Quartz 4

Explorer

Training inverse folding and diffusion models exclusively on predicted protein structures worsens performance due to how locally perfect they are

Summary

Details

Figures

See also

Graph View

Backlinks