Fitness prediction

Fitness prediction is the problem of predicting a protein’s fitness from its sequence, with or without structural data. Fitness depends on many underlying properties (stability, correct folding, etc.), and their interplay can give rise to epistasis: the effect of a mutation depends on the background in which it occurs, so fitness cannot be modeled as a linear combination of the effects of individual mutations.

Ref (1)
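As a minimal sketch of the definition above, pairwise epistasis can be written as the gap between the measured fitness of a double mutant and the value expected if the two single-mutant effects simply added up. The function names and fitness values below are hypothetical and chosen only for illustration; in practice the additive model is often applied to log-transformed fitness.

```python
def additive_expectation(f_wt: float, f_a: float, f_b: float) -> float:
    """Fitness expected for the double mutant if single effects are additive."""
    return f_wt + (f_a - f_wt) + (f_b - f_wt)


def pairwise_epistasis(f_wt: float, f_a: float, f_b: float, f_ab: float) -> float:
    """Deviation of the measured double-mutant fitness from additivity.

    Zero means no epistasis; negative means the pair is worse than expected.
    """
    return f_ab - additive_expectation(f_wt, f_a, f_b)


# Hypothetical fitness values (wild type, two singles, double mutant).
epsilon = pairwise_epistasis(f_wt=1.0, f_a=0.8, f_b=0.7, f_ab=0.2)
print(f"epistasis = {epsilon:.2f}")
```

A linear model fitted to single-mutant effects would predict the additive expectation for the double mutant, so any nonzero deviation is exactly the part such a model cannot capture.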

Notes

Sequence-based predictors (e.g., PSSMs and Potts models) could miss high-fitness naturally occurring mutations (2).
Including nonfunctional sequences during training improves prediction of poor performers but not top performers (3). This was demonstrated using several ML models.
Complex models perform worse than ridge regression when the number of training examples is low (4).
Threshold robustness refers to the observation that slightly deleterious mutations that negatively affect stability may have no impact on fitness up to a certain point, beyond which the effects can be devastating. Ref (5)
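Threshold robustness can be illustrated with a toy model in which stability effects add up but fitness is a sigmoid function of the total destabilization. This is only a sketch under stated assumptions: the function name `fitness_from_stability`, the sigmoid form, and the `margin` and `steepness` values are hypothetical and not taken from the cited work.

```python
import math


def fitness_from_stability(
    ddg_total: float, margin: float = 3.0, steepness: float = 0.3
) -> float:
    """Toy sigmoid: fitness stays near 1 until ddg_total approaches `margin`.

    ddg_total: summed destabilization (kcal/mol) relative to wild type.
    margin, steepness: illustrative values, not fitted to any dataset.
    """
    return 1.0 / (1.0 + math.exp((ddg_total - margin) / steepness))


# Each hypothetical mutation destabilizes the protein by 1 kcal/mol,
# and the stability effects are assumed to be additive.
ddg_per_mutation = 1.0
fitness_by_count = {
    n: fitness_from_stability(n * ddg_per_mutation) for n in range(6)
}
for n, f in fitness_by_count.items():
    print(f"{n} mutations: fitness = {f:.3f}")
```

In this sketch the first one or two mutations are nearly neutral, while fitness collapses once the summed destabilization crosses the margin. The same mutation therefore has a very different effect depending on how much stability has already been spent, which is one way epistasis can arise from an additive underlying property.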

Evaluation

The Spearman correlation coefficient is the most widely used metric for evaluating how well methods can predict protein fitness.
Normalized Discounted Cumulative Gain (NDCG) is used by other papers ((6,7), others), and measures how well the top end of the fitness distribution is ranked. The cutoff rank $p$ is user-defined.
$\mathrm{NDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p} = \frac{\sum_{i=1}^{p} \frac{\mathrm{rel}_i}{\log_2(i+1)}}{\sum_{i=1}^{p} \frac{\mathrm{rel}_i^{\mathrm{ideal}}}{\log_2(i+1)}}$
$\mathrm{DCG}_p$: Discounted cumulative gain at rank $p$. This penalizes important results that appear far from the top of the ranking.
$\mathrm{IDCG}_p$: Ideal discounted cumulative gain at rank $p$, i.e., the DCG obtained when the rank ordering is perfect.
$\mathrm{rel}_i$: Relevance score of the result at rank $i$.
$\mathrm{rel}_i^{\mathrm{ideal}}$: Relevance score of the result at rank $i$ in the ideal ordering.
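Both metrics in this section can be sketched in a few lines of standard-library Python. The sketch assumes that the measured fitness is used directly as the relevance score and that it is non-negative; the function names and the example values are hypothetical. Spearman is computed here as the Pearson correlation of average ranks, which handles ties.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Pearson correlation of the rank vectors (undefined for constant input)."""
    rx, ry = average_ranks(measured), average_ranks(predicted)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def dcg(relevances: Sequence[float], p: int) -> float:
    """Discounted cumulative gain over the first p entries of a ranking."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:p], start=1))


def ndcg(measured: Sequence[float], predicted: Sequence[float], p: int) -> float:
    """DCG of the predicted ranking divided by the DCG of the ideal ranking.

    Assumes measured fitness is the relevance and that the ideal DCG is nonzero.
    """
    pairs = sorted(zip(predicted, measured), key=lambda t: t[0], reverse=True)
    ranked_by_prediction = [m for _, m in pairs]
    ideal = sorted(measured, reverse=True)
    return dcg(ranked_by_prediction, p) / dcg(ideal, p)


# Hypothetical measured fitness values and model scores for four variants.
measured = [0.1, 0.9, 0.5, 0.3]
predicted = [0.2, 0.8, 0.4, 0.6]
print(f"Spearman = {spearman(measured, predicted):.3f}")
print(f"NDCG@2   = {ndcg(measured, predicted, p=2):.3f}")
```

In this example the model identifies the best variant correctly but swaps the second and third, so Spearman penalizes the swap over the whole list while NDCG at $p = 2$ only sees that a weaker variant took the second slot.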

1. Sandhu M, Chen JZ, Matthews DS, Spence MA, Pulsford SB, Gall B, et al. Computational and Experimental Exploration of Protein Fitness Landscapes: Navigating Smooth and Rugged Terrains. Biochemistry. 2025;64(8):1673–84. Available from: https://doi.org/10.1021/acs.biochem.4c00673

2. Johnston KE, Almhjell PJ, Watkins-Dulaney EJ, Liu G, Porter NJ, Yang J, et al. A combinatorially complete epistatic fitness landscape in an enzyme active site. Proceedings of the National Academy of Sciences. 2024;121(32). Available from: https://doi.org/10.1073/pnas.2400439121

3. Moreno-Paz S, van der Hoek R, Eliana E, Zwartjens P, Gosiewska S, Martins dos Santos VAP, et al. Machine Learning-Guided Optimization of p-Coumaric Acid Production in Yeast. ACS Synthetic Biology. 2024;13(4):1312–22. Available from: https://doi.org/10.1021/acssynbio.4c00035

4. Singh R, Im C, Qiu Y, Mackness B, Gupta A, Joren T, et al. Learning the language of antibody hypervariability. Proceedings of the National Academy of Sciences. 2024;122(1). Available from: https://doi.org/10.1073/pnas.2418918121

5. Sarkisyan KS, Bolotin DA, Meer MV, Usmanova DR, Mishin AS, Sharonov GV, et al. Local fitness landscape of the green fluorescent protein. Nature. 2016;533(7603):397–401. Available from: https://doi.org/10.1038/nature17995

6. Paul S, Kollasch A, Notin P, Marks D. Combining Structure and Sequence for Superior Fitness Prediction. In: GenBio@NeurIPS2023. 2023. Available from: https://openreview.net/forum?id=8PbTU4exnV

7. Ruffolo JA, Bhatnagar A, Beazer J, Nayfach S, Russ J, Hill E, et al. Adapting protein language models for structure-conditioned design. openRxiv; 2024. Available from: https://doi.org/10.1101/2024.08.03.606485
