Fitness prediction is the problem of predicting a protein’s fitness from its sequence, with or without structural data. Fitness depends on many underlying properties (stability, correct folding, etc.), and these dependencies can give rise to epistasis: the effect of a mutation depends on which other mutations are present, so fitness cannot be modeled as a linear combination of the effects of individual mutations.

Ref (1)
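
As a rough illustration of the definition above, pairwise epistasis can be written as the gap between the measured double-mutant fitness and what an additive model would expect from the two single mutants. The function name and the fitness values below are hypothetical and chosen only for the arithmetic; fitness is assumed to be on an additive (e.g., log) scale.

```python
def pairwise_epistasis(f_wt, f_a, f_b, f_ab):
    """Deviation of a double mutant from the additive expectation.

    All inputs are fitness values on an additive (e.g., log) scale.
    A result of 0 means the two mutations combine additively.
    """
    additive_expectation = f_wt + (f_a - f_wt) + (f_b - f_wt)
    return f_ab - additive_expectation


# Hypothetical values, not taken from any dataset.
eps = pairwise_epistasis(f_wt=0.0, f_a=-0.5, f_b=-0.25, f_ab=-2.0)
print(eps)  # -1.25: the double mutant is worse than the additive expectation
```

A negative value here means the combination is more damaging than the sum of its parts; a linear model fitted to single-mutant effects would overestimate this double mutant.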

Notes

  • Sequence-based predictors (e.g., PSSMs and Potts models) could miss high-fitness naturally occurring mutations (2).
  • Including nonfunctional sequences during training improves prediction of poor performers but not top performers (3). This was demonstrated using several ML models.
  • Complex models perform worse than ridge regression when the number of training examples is low (4).
  • Threshold robustness refers to the observation that slightly deleterious mutations that negatively affect stability may have no impact on fitness up to a certain point, but their combined effect can be devastating beyond that point (5).
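
The threshold behaviour in the last note can be sketched with a toy two-state folding model. Everything here is an assumption made for illustration, not the model used in (5): fitness is taken to be proportional to the fraction of folded protein, the wild type is assumed to be stable by 3 kcal/mol, destabilizing effects (ΔΔG) are assumed to add up, and RT is set to 0.593 kcal/mol (roughly 298 K).

```python
import math


def fraction_folded(ddg_total, dg_wt=-3.0, rt=0.593):
    """Fraction of folded protein in a two-state model (energies in kcal/mol).

    dg_wt is the assumed folding free energy of the wild type (negative = stable);
    ddg_total is the summed destabilization of all mutations present.
    """
    dg = dg_wt + ddg_total
    return 1.0 / (1.0 + math.exp(dg / rt))


# Hypothetical mutations that each destabilize the fold by 1.5 kcal/mol.
f_wt = fraction_folded(0.0)
f_single = fraction_folded(1.5)
f_double = fraction_folded(3.0)

print(f"wild type: {f_wt:.3f}")
print(f"one mutation: {f_single:.3f}")
print(f"two mutations: {f_double:.3f}")
```

Under these assumptions a single mutation costs only a few percent of the folded fraction, while two together use up the stability margin and halve it. Stability is additive in this sketch, yet fitness is not, which is one way a stability threshold can produce epistasis.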

Evaluation

  • The Spearman correlation coefficient is the most widely used metric for evaluating how well methods can predict protein fitness.

  • Normalized Discounted Cumulative Gain (NDCG) is used in other papers (e.g., (6,7)) and measures how well a method ranks the variants at the top end of the fitness distribution. The rank cutoff is user-defined.

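    In its common linear-gain form:

    $$\mathrm{NDCG}_k = \frac{\mathrm{DCG}_k}{\mathrm{IDCG}_k}, \qquad \mathrm{DCG}_k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \qquad \mathrm{IDCG}_k = \sum_{i=1}^{k} \frac{rel_i^{\mathrm{ideal}}}{\log_2(i+1)}$$

    $\mathrm{DCG}_k$: Discounted cumulative gain at rank $k$. This penalizes the appearance of important results far from the top of the distribution.
    $\mathrm{IDCG}_k$: Ideal discounted cumulative gain at rank $k$; the same quantity computed with a perfect rank-ordering.
    $rel_i$: Relevance score of result $i$.
    $rel_i^{\mathrm{ideal}}$: Relevance score of the result at position $i$ in the ideal ordering.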
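
Both metrics above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, Spearman is computed as the Pearson correlation of average ranks (so ties are handled), and NDCG assumes the linear-gain form DCG_k = sum over i of rel_i / log2(i + 1) with non-negative relevance scores (for example, measured fitness rescaled to [0, 1]).

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(measured, predicted, k):
    """NDCG@k: measured values act as relevance, predictions set the ranking."""
    ranked = sorted(zip(predicted, measured), key=lambda p: p[0], reverse=True)
    by_prediction = [m for _, m in ranked]
    ideal = sorted(measured, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(by_prediction, k) / idcg if idcg > 0 else 0.0


# Hypothetical measured fitness and model scores for five variants.
measured = [0.9, 0.1, 0.5, 0.3, 0.7]
predicted = [2.1, 0.2, 1.9, 0.4, 1.0]
print(spearman(measured, predicted))
print(ndcg_at_k(measured, predicted, k=3))
```

Spearman weighs every variant equally, whereas NDCG@k only looks at the k variants the model ranks highest, so the two can disagree when a model orders the bulk of the distribution well but misplaces the best variants.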

1. Sandhu M, Chen JZ, Matthews DS, Spence MA, Pulsford SB, Gall B, et al. Computational and Experimental Exploration of Protein Fitness Landscapes: Navigating Smooth and Rugged Terrains. Biochemistry. 2025;64(8):1673–84. Available from: https://doi.org/10.1021/acs.biochem.4c00673
2. Johnston KE, Almhjell PJ, Watkins-Dulaney EJ, Liu G, Porter NJ, Yang J, et al. A combinatorially complete epistatic fitness landscape in an enzyme active site. Proceedings of the National Academy of Sciences. 2024;121(32). Available from: https://doi.org/10.1073/pnas.2400439121
3. Moreno-Paz S, van der Hoek R, Eliana E, Zwartjens P, Gosiewska S, Martins dos Santos VAP, et al. Machine Learning-Guided Optimization of p-Coumaric Acid Production in Yeast. ACS Synthetic Biology. 2024;13(4):1312–22. Available from: https://doi.org/10.1021/acssynbio.4c00035
4. Singh R, Im C, Qiu Y, Mackness B, Gupta A, Joren T, et al. Learning the language of antibody hypervariability. Proceedings of the National Academy of Sciences. 2024;122(1). Available from: https://doi.org/10.1073/pnas.2418918121
5. Sarkisyan KS, Bolotin DA, Meer MV, Usmanova DR, Mishin AS, Sharonov GV, et al. Local fitness landscape of the green fluorescent protein. Nature. 2016;533(7603):397–401. Available from: https://doi.org/10.1038/nature17995
6. Paul S, Kollasch A, Notin P, Marks D. Combining Structure and Sequence for Superior Fitness Prediction. In: GenBio@NeurIPS2023. 2023. Available from: https://openreview.net/forum?id=8PbTU4exnV
7. Ruffolo JA, Bhatnagar A, Beazer J, Nayfach S, Russ J, Hill E, et al. Adapting protein language models for structure-conditioned design. openRxiv; 2024. Available from: https://doi.org/10.1101/2024.08.03.606485