Summary

Protein sequences designed by inverse folding can be used to augment datasets of natural sequences when training protein language models and protein folding neural networks (1,2). This was demonstrated with ProFam1 and AlphaFold2, where inference on synthetic MSAs outperformed single-sequence prediction. Similar results were reported for ESM3, although no ablation data are available to show the impact of the synthetic sequences.

See also

1. Hayes T, Rao R, Akin H, Sofroniew NJ, Oktay D, Lin Z, et al. Simulating 500 million years of evolution with a language model. Science. 2025;387(6736):850–8. Available from: https://doi.org/10.1126/science.ads0018
2. Wells J, Hooker AH, Livne M, Lin W, Miller D, Dallago C, et al. ProFam: Open-Source Protein Family Language Modelling for Fitness Prediction and Design. openRxiv; 2025. Available from: https://doi.org/10.64898/2025.12.19.695431