Summary

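Sequence perplexity is a metric used with protein language models and inverse folding models to quantify how well a model predicts a sequence. It is the exponential of the average per-residue negative log-likelihood, so lower values mean the model assigns higher probability to the observed residues. Self-consistency perplexity is a derived metric in which the perplexity is calculated using a forward-folded model rather than the original model/structure.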
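As a minimal sketch of this definition, assuming the model exposes a natural-log probability for the observed residue at each position (the function name `perplexity` and the example values are illustrative, not taken from any cited work):

```python
import math

def perplexity(log_probs):
    """Perplexity from per-residue natural-log probabilities of the observed residues."""
    if not log_probs:
        raise ValueError("need at least one residue")
    # Average negative log-likelihood per residue, then exponentiate.
    mean_nll = -sum(log_probs) / len(log_probs)
    return math.exp(mean_nll)

# Illustrative input: a model that assigns probability 0.25 to every observed residue.
example = [math.log(0.25)] * 10
print(perplexity(example))  # close to 4, i.e. "as uncertain as a 4-way uniform choice"
```

A perplexity of k can be read as the model being, on average, as uncertain as a uniform choice among k residues.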

Details

The perplexity values of null models were calculated by (1), as shown below:

Table from (1)
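To make the idea of a null model concrete, the sketch below computes perplexity for position-independent background models. It is an illustration only, not the procedure used in (1): the helper names, the pseudocount, and the toy sequences are all assumptions. By construction, a uniform distribution over the 20 standard amino acids gives a perplexity of 20.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def background_frequencies(sequences, pseudocount=1.0):
    """Estimate position-independent residue frequencies (hypothetical helper)."""
    counts = Counter(aa for seq in sequences for aa in seq)
    total = sum(counts[aa] + pseudocount for aa in AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}

def null_model_perplexity(sequences, frequencies):
    """Perplexity of sequences under a model that ignores position and context."""
    nll = 0.0
    n_residues = 0
    for seq in sequences:
        for aa in seq:
            nll -= math.log(frequencies[aa])
            n_residues += 1
    return math.exp(nll / n_residues)

uniform = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
toy_sequences = ["MKTAYIAKQR", "MKLVAAGKTA"]  # made-up sequences for illustration

print(null_model_perplexity(toy_sequences, uniform))
print(null_model_perplexity(toy_sequences, background_frequencies(toy_sequences)))
```

Any model that uses structure or sequence context should score below these baselines to be considered informative.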

Hie et al. found that sequence length does not consistently correlate with perplexity values within protein families. (2)

Meier et al., the authors of ESM-1v, introduced ways of calculating probabilities or “energies” of sequences using masked LMs (i.e., how favored or disfavored they are given evolution). They propose four approaches for this (3):

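  • Masked marginal probability: Masks are introduced at all mutated positions, and the log-probability of each mutant residue relative to the wildtype residue is summed over the set of mutated positions $T$: $\sum_{t \in T} \left[ \log p(x_t = x_t^{mt} \mid x_{\setminus T}) - \log p(x_t = x_t^{wt} \mid x_{\setminus T}) \right]$
  • Mutant marginal probability: Requires a single forward pass for each mutated sequence: $\sum_{t \in T} \left[ \log p(x_t = x_t^{mt} \mid x^{mt}) - \log p(x_t = x_t^{wt} \mid x^{mt}) \right]$. The mutated sequence is passed to the model unmasked, and the preference for the mutation of interest, normalized to the WT residue, is calculated. This was used for structure-based fitness prediction in (4), with Spearman values ranging from 0.39 to 0.71.
  • Wildtype marginal probability: The same as mutant marginal probability, except the background is the wildtype sequence, so one forward pass covers all mutations: $\sum_{t \in T} \left[ \log p(x_t = x_t^{mt} \mid x^{wt}) - \log p(x_t = x_t^{wt} \mid x^{wt}) \right]$
  • Pseudo-likelihood: Each position is masked in turn and the log-probabilities of the observed residues are summed; the score is the difference between the mutant and wildtype sums. (5) found that ESM-2 pseudo-log-likelihood values scaled linearly with the length of the sequence, and introduced a length-invariant correction (see (5) for the formula).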

Ref (5)

The Spearman correlations against ProteinGym were: masked marginal 0.582, mutant marginal 0.578, wildtype marginal 0.572, pseudolikelihood 0.552.
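The four scoring schemes above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ESM-1v implementation: `toy_log_probs` is a hypothetical stand-in for a masked LM forward pass (a real model would replace it), `#` is a made-up mask symbol, and mutations are assumed to be 0-based `(position, wildtype, mutant)` tuples.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "#"  # hypothetical mask symbol used only by the toy model below
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def toy_log_probs(seq):
    """Stand-in for a masked LM forward pass (hypothetical, not a real model).

    Returns one list of 20 log-probabilities per position; each position's
    distribution depends on its own token and its two (wrapped) neighbours.
    """
    rng = random.Random(0)  # fixed seed so the toy "weights" are reproducible
    emb = {c: [rng.gauss(0.0, 1.0) for _ in AMINO_ACIDS] for c in AMINO_ACIDS + MASK}
    n = len(seq)
    out = []
    for i in range(n):
        left, right = seq[i - 1], seq[(i + 1) % n]
        logits = [emb[seq[i]][k] + 0.5 * (emb[left][k] + emb[right][k])
                  for k in range(len(AMINO_ACIDS))]
        log_z = math.log(sum(math.exp(v) for v in logits))  # normalise to log-probs
        out.append([v - log_z for v in logits])
    return out

def log_odds(log_probs, mutations):
    """Sum over mutated positions of log p(mutant) - log p(wildtype)."""
    return sum(log_probs[pos][AA_INDEX[mt]] - log_probs[pos][AA_INDEX[wt]]
               for pos, wt, mt in mutations)

def mutate(wt_seq, mutations):
    """Apply the mutations to the wildtype sequence, checking the wildtype residues."""
    seq = list(wt_seq)
    for pos, wt, mt in mutations:
        if seq[pos] != wt:
            raise ValueError(f"expected {wt} at position {pos}, found {seq[pos]}")
        seq[pos] = mt
    return "".join(seq)

def masked_marginal(wt_seq, mutations, model=toy_log_probs):
    # Mask every mutated position, then compare mutant vs wildtype residues there.
    masked = list(wt_seq)
    for pos, _, _ in mutations:
        masked[pos] = MASK
    return log_odds(model("".join(masked)), mutations)

def mutant_marginal(wt_seq, mutations, model=toy_log_probs):
    # Condition on the unmasked mutant sequence.
    return log_odds(model(mutate(wt_seq, mutations)), mutations)

def wildtype_marginal(wt_seq, mutations, model=toy_log_probs):
    # Condition on the unmasked wildtype sequence.
    return log_odds(model(wt_seq), mutations)

def pseudo_log_likelihood(seq, model=toy_log_probs):
    # Mask one position at a time and sum log p(observed residue); needs len(seq) passes.
    total = 0.0
    for i, aa in enumerate(seq):
        masked = seq[:i] + MASK + seq[i + 1:]
        total += model(masked)[i][AA_INDEX[aa]]
    return total

def pseudo_likelihood_score(wt_seq, mutations, model=toy_log_probs):
    return pseudo_log_likelihood(mutate(wt_seq, mutations), model) - pseudo_log_likelihood(wt_seq, model)

wt = "MKTAYIAKQR"  # made-up sequence
muts = [(2, "T", "S"), (6, "A", "G")]
for fn in (masked_marginal, mutant_marginal, wildtype_marginal, pseudo_likelihood_score):
    print(fn.__name__, round(fn(wt, muts), 3))
```

The schemes differ mainly in cost: the wildtype marginal needs one forward pass for all variants of a protein, the masked and mutant marginals need one pass per variant, and the pseudo-likelihood needs one pass per residue of each sequence scored.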

Gordon et al. propose a way to approximate sequence log-likelihood in a single pass using masked LMs: (6)

Figures from (6)

See also

1. Ingraham J, Garg V, Barzilay R, Jaakkola T. Generative Models for Graph-Based Protein Design. Advances in Neural Information Processing Systems. 2019;32. Available from: https://proceedings.neurips.cc/paper/2019/hash/f3a4ff4839c56a5f460c88cce3666a2b-Abstract.html
2. Hie BL, Yang KK, Kim PS. Evolutionary velocity with protein language models predicts evolutionary dynamics of diverse proteins. Cell Systems. 2022;13(4):274-285.e6. Available from: https://doi.org/10.1016/j.cels.2022.01.003
3. Meier J, Rao R, Verkuil R, Liu J, Sercu T, Rives A. Language models enable zero-shot prediction of the effects of mutations on protein function. bioRxiv; 2021. Available from: https://doi.org/10.1101/2021.07.09.450648
4. Ding D, Shaw AY, Sinai S, Rollins N, Prywes N, Savage DF, et al. Protein design using structure-based residue preferences. Nature Communications. 2024;15(1). Available from: https://doi.org/10.1038/s41467-024-45621-4
5. Devkota K, Shonai D, Mao J, Ko YS, Wang W, Soderling S, et al. Miniaturizing, Modifying, and Magnifying Nature’s Proteins with Raygun. bioRxiv; 2024. Available from: https://doi.org/10.1101/2024.08.13.607858
6. Gordon C, Lu AX, Abbeel P. Protein Language Model Fitness Is a Matter of Preference. bioRxiv; 2024. Available from: https://doi.org/10.1101/2024.10.03.616542