Found 104 total tags.

affinity-maturation

Affinity maturation is a process by which antibodies are iteratively improved in vivo. It has been described as “hyper-Darwinian”.

29 items with this tag. Showing first 10 tags.

alignment

Alignment collects notes on comparing proteins by either sequence-derived representations or structural information. The main split here is between sequence-based alignment approaches, including PLM embedding methods and homology retrieval, and structure-based alignment approaches that operate on coordinates, structural descriptors, or structure-derived embeddings.

alignment/sequence-based

14 items with this tag. Showing first 10 tags.

alignment/structure-based

alphafold2

(OpenFold redirects here)

AlphaFold2 is a protein structure prediction method from 2020 and the first to make extensive use of the Transformer architecture. It consists of 97 million parameters. AlphaMissense is a derivative of this method.

Architecture of AlphaFold2 from Jumper et al. (1)

Architectural and ML contributions

1.
Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–9. Available from: https://doi.org/10.1038/s41586-021-03819-2

alphafold3

AlphaFold3 is a diffusion-based all-atom structure prediction method that is widely seen as state-of-the-art.

Architecture of AlphaFold3 from Abramson et al. (1)

Architectural and ML contributions

1.
Abramson J, Adler J, Dunger J, Evans R, Green T, Pritzel A, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;630(8016):493–500. Available from: https://doi.org/10.1038/s41586-024-07487-w

17 items with this tag. Showing first 10 tags.

ancestral-sequence-reconstruction

Ancestral sequence reconstruction refers to the process of inferring ancestral sequences using extant sequences. Figure from (1)

1.
Dishman AF, Tyler RC, Fox JC, Kleist AB, Prehoda KE, Babu MM, et al. Evolution of fold switching in a metamorphic protein. Science. 2021;371(6524):86–90. Available from: https://doi.org/10.1126/science.abd8700

antibodies

Antibodies are proteins with two heavy chains and two light chains produced by B cells, central to the adaptive immune system. Their structure consists of a variable region (containing CDRs and a framework region) and three constant regions (CH1, CH2, CH3). The variable region and CH1 form the Fab, while the remainder forms the Fc region. B cell receptors are antibodies with an additional CH4 domain.

Types of antibodies

IgA is found in mucous membranes, cannot activate the complement system, and makes up roughly two-thirds of antibodies in healthy adults. It is notable for being double-sided.

IgE has only one binding site, mainly protects against large organisms like parasites, and is responsible for allergic reactions.

IgG is the standard antibody class used in therapeutic design. It can activate the complement system (to a degree that varies by subclass) and can pass through the placenta.

  • IgG1 makes up 67% of all antibodies in the human body, is capable of antibody-dependent cellular phagocytosis, and has the longest hinge region. IgG1 immune repertoires consist mostly of just a few dozen dominant clones and are unique to each individual, remaining largely stable over time.
  • IgG2 — mice have IgG2a and IgG2b instead; some strains have IgG2c instead of IgG2a.
  • IgG3 makes up ~7% of IgGs and has a notably shorter half-life due to poor binding to FcRn, attributed to an H435R mutation.
  • IgG4 is the rarest IgG and undergoes Fab-arm exchange, dissociating into half-bodies and forming novel combinations via R409 in the hinge. Therapeutic IgG4 antibodies use the S228P substitution to stabilize the hinge and prevent this.

IgM is far less subject to affinity maturation than IgG, responds to lipids and polysaccharides, and activates the complement system.

Datasets

SAbDab is a database of all antibody and nanobody structures in the PDB. The full list can be downloaded with:

curl -s https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab/summary/all/

antibodies/engineering-and-design

antibodies/nanobodies

antibody-antigen-interactions

45 items with this tag. Showing first 10 tags.

antibody-antigen-interactions/binding-affinity

14 items with this tag. Showing first 10 tags.

antibody-antigen-interactions/complex-prediction

16 items with this tag. Showing first 10 tags.

antibody-antigen-interactions/misc

15 items with this tag. Showing first 10 tags.

antibody-developability

Antibody developability refers to a series of properties linked to the development of antibodies as therapeutics. Developability problems arise either because natural antibodies do not possess these properties to begin with, or because liabilities are introduced during optimization, for example as a consequence of affinity maturation via techniques such as Yeast display.

General observations

  • Raybould et al. (1) outlined five criteria (in a parallel to Lipinski’s rule of five):
    • Length of CDRs, specifically CDRH3
    • Hydrophobic patches
    • Negative patches
    • Positive patches
    • Viscosity

Thermostability

See Stability and thermostability

Immunogenicity

  • Glycine residues are less immunogenic than other residues (2). This makes them useful to include when designing cyclic peptides or peptide therapeutics.
  • Antibody humanization can lead to immunogenic reactions.
1.
Raybould MIJ, Marks C, Krawczyk K, Taddese B, Nowak J, Lewis AP, et al. Five computational developability guidelines for therapeutic antibody profiling. Proceedings of the National Academy of Sciences. 2019;116(10):4025–30. Available from: https://doi.org/10.1073/pnas.1810576116
2.
Aina A, Hsueh SCC, Gibbs E, Peng X, Cashman NR, Plotkin SS. De Novo Design of a β-Helix Tau Protein Scaffold: An Oligomer-Selective Vaccine Immunogen Candidate for Alzheimer’s Disease. ACS Chemical Neuroscience. 2023;14(15):2603–17. Available from: https://doi.org/10.1021/acschemneuro.3c00007

antibody-developability/general

antibody-developability/polyspecificity

antibody-structure-prediction

The antibody structure prediction problem is a subclass of the protein structure prediction problem that is mostly focused on predicting the CDRs, specifically the CDRH3 loop that mediates antigen binding. It also sometimes includes the antibody-antigen docking problem.

Models

1.
Peng C, Wang Z, Zhao P, Ge W, Huang C. AbFold – an AlphaFold Based Transfer Learning Model for Accurate Antibody Structure Prediction. openRxiv; 2023. Available from: https://doi.org/10.1101/2023.04.20.537598

25 items with this tag. Showing first 10 tags.

antibody-structure-prediction/cdr

antibody-structure-prediction/complex-prediction

b-cells

B cells are a type of lymphocyte that synthesizes and secretes antibodies as well as B-cell receptors. Each cell synthesizes a single antibody sequence throughout its life cycle (a phenomenon termed allelic exclusion). B cells can be categorized as either naive B cells or memory B cells. Activation by antigen presentation causes them to proliferate into effector cells that secrete antibodies.

blosum62

The BLOSUM62 matrix quantifies the similarity of amino acids to one another for use in computing evolutionary distances between sequences (1). Interestingly, its original computation contains math errors that nonetheless make it a more effective substitution matrix (2). A custom version for TCRs has also been presented (3).
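
To make the scoring concrete: BLOSUM entries are log-odds scores that compare how often two residues are observed aligned with how often they would be expected to align by chance. The sketch below assumes the half-bit scaling, round(2 · log2(observed / expected)). The function name and the frequencies are made up for illustration and are not values from the BLOCKS database.

```python
import math


def log_odds_score(observed_freq: float, expected_freq: float, scale: float = 2.0) -> int:
    """Rounded log-odds score; scale=2.0 corresponds to half-bit units."""
    return round(scale * math.log2(observed_freq / expected_freq))


# Hypothetical pair frequencies (illustrative only).
score_enriched = log_odds_score(0.02, 0.005)   # pair seen 4x more often than chance
score_neutral = log_odds_score(0.01, 0.01)     # pair seen exactly as often as chance
score_depleted = log_odds_score(0.001, 0.004)  # pair seen 4x less often than chance
```

Positive scores mark substitutions that are enriched in aligned blocks, and negative scores mark substitutions that are depleted.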

1.
Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences. 1992;89(22):10915–9. Available from: https://doi.org/10.1073/pnas.89.22.10915
2.
Styczynski MP, Jensen KL, Rigoutsos I, Stephanopoulos G. BLOSUM62 miscalculations improve search performance. Nature Biotechnology. 2008;26(3):274–5. Available from: https://doi.org/10.1038/nbt0308-274
3.
Postovskaya A, Vercauteren K, Meysman P, Laukens K. tcrBLOSUM: an amino acid substitution matrix for sensitive alignment of distant epitope-specific TCRs. Briefings in Bioinformatics. 2024;26(1). Available from: https://doi.org/10.1093/bib/bbae602

cdrh3

citation-fix

conformational-dynamics

63 items with this tag. Showing first 10 tags.

conformational-dynamics/allostery

conformational-dynamics/evolution

conformational-dynamics/experimental-ensembles

conformational-dynamics/kinetics

conformational-dynamics/modeling

23 items with this tag. Showing first 10 tags.

conformational-dynamics/molecular-dynamics

contrastive-learning

Contrastive learning is a supervised method for driving the latent representations of data points towards or away from each other by adding custom losses.

Implementations

  • The triplet margin loss used by Yu et al. (1), written here in its standard form: $\mathcal{L} = \max(\lVert z_a - z_p \rVert_2 - \lVert z_a - z_n \rVert_2 + \alpha,\ 0)$, where $z_a$ is the enzyme embedding, $z_p$ is the positive case, $z_n$ is the negative case (selected to have EC numbers close in Euclidean space to the positive case), and $\alpha$ is the margin (set to 1).
  • The SupCon-Hard loss used by Yu et al. (1), with the temperature set to 0.1 in the paper.
  • Noise contrastive estimation (variant 1), defined in terms of an input text, another example (either positive or negative), a scoring function (usually cosine similarity, dot product, or logit “produced by input-sample matcher sub-network”; from Rethmeier and Augenstein 2021), and a scaling function (usually sigmoid).
  • Noise contrastive estimation (variant 2), which ranks a single positive pair over negative pairs.
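
Two of the losses listed above can be sketched in a few lines. This is a minimal illustration, assuming the standard hinge form of the triplet margin loss and the common softmax form of a "one positive versus many negatives" ranking loss. The function names are hypothetical, and neither function is claimed to match the exact formulation in (1).

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is farther from the anchor than the positive by at least `margin`."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


def ranking_nce_loss(positive_score, negative_scores, temperature=0.1):
    """Negative log-probability of picking the positive pair among all candidates."""
    logits = [positive_score / temperature] + [s / temperature for s in negative_scores]
    top = max(logits)  # subtract the maximum for numerical stability
    log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_norm - logits[0]


anchor, near, far = [0.0, 0.0], [3.0, 4.0], [6.0, 8.0]
easy_triplet = triplet_margin_loss(anchor, positive=near, negative=far)  # already separated
hard_triplet = triplet_margin_loss(anchor, positive=far, negative=near)  # violates the margin
tied_ranking = ranking_nce_loss(0.5, [0.5])  # positive and negative score equally
```

The triplet loss only looks at one negative at a time, whereas the ranking loss normalizes over all negatives at once, which is why it needs a temperature.
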
1.
Yu T, Cui H, Li JC, Luo Y, Jiang G, Zhao H. Enzyme function prediction using contrastive learning. Science. 2023;379(6639):1358–63. Available from: https://doi.org/10.1126/science.adf2465

diffusion-guidance

Diffusion guidance refers to inference-time methods that steer a diffusion process toward desired properties, constraints, or observations. It is a general concept that applies to protein design, structure prediction, and sequence-based protein design alike; all-atom diffusion is just one application.

Details

Xie et al. (1) outline three broad types of guidance used by diffusion models:

  1. Score guidance, in which gradients from classifiers, constraints, or rewards are used to nudge the diffusion path.
  2. Path-integral reweighting, such as Feynman-Kac potentials, which use importance weights to update trajectories.
  3. Invariant correctors, such as Metropolis-adjusted Langevin methods, which mix within a biased marginal without changing the trajectory weights.

Other search algorithms, such as beam search and Monte Carlo tree search, have also been used in conjunction with diffusion models (2).
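
Score guidance (the first type above) can be illustrated with a toy one-dimensional example. This is a sketch under stated assumptions, not the method of (1): the "model" is a standard normal whose score is known in closed form, the reward is a hypothetical quadratic log-reward centered on a target value, and sampling uses plain Langevin dynamics at a single noise level rather than a full reverse diffusion.

```python
import math
import random


def prior_score(x):
    """Score (gradient of log-density) of a standard normal; stands in for a learned model."""
    return -x


def reward_gradient(x, target=3.0, width=1.0):
    """Gradient of the hypothetical log-reward -(x - target)^2 / (2 * width^2)."""
    return -(x - target) / width**2


def sample(guidance_weight, n_chains=500, n_steps=200, step=0.05, seed=0):
    """Run independent Langevin chains and return their final positions."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_chains):
        x = rng.gauss(0.0, 1.0)
        for _ in range(n_steps):
            # Guided score: model score plus a weighted gradient of the reward.
            score = prior_score(x) + guidance_weight * reward_gradient(x)
            x += step * score + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        finals.append(x)
    return finals


unguided = sample(guidance_weight=0.0)
guided = sample(guidance_weight=1.0)
unguided_mean = sum(unguided) / len(unguided)
guided_mean = sum(guided) / len(guided)
```

With these settings the guided target density is the product of the two Gaussians, so its mean sits halfway between the prior mean (0) and the reward target (3). Increasing `guidance_weight` pulls samples further toward the target at the cost of departing from the model distribution.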

1.
Xie Y, Winkler L, Sun L, Lewis S, Foster AE, Luna JJ, et al. Enhanced Diffusion Sampling: Efficient Rare Event Sampling and Free Energy Calculation with Diffusion Models. 2026; Available from: https://arxiv.org/abs/2602.16634
2.
Didi K, Zhang Z, Zhou G, Reidenbach D, Cao Z, Cha S, et al. Scaling Atomistic Protein Binder Design with Generative Pretraining and Test-Time Compute. 2026; Available from: https://arxiv.org/abs/2603.27950

12 items with this tag. Showing first 10 tags.

diffusion-guidance/protein-design

diffusion-guidance/structure-prediction

8 items with this tag.

diffusion-models

Diffusion models are generative models whose outputs are generated by iteratively denoising Gaussian noise. During inference, the reverse process is executed, whereas during training, the forward process is carried out (converting data to noise). This process can be guided at inference time to respect specific geometric constraints.

Details

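Diffusion models try to recover samples drawn from the data distribution from Gaussian noise by iteratively denoising across time steps. The forward process is defined as

$$dx = f(x, t)\,dt + g(t)\,dw$$

$w$: Standard Wiener process (by which noise is added)

Typically these drift and diffusion functions are used (shown here in the variance-preserving form, with noise schedule $\beta(t)$):

$$f(x, t) = -\tfrac{1}{2}\beta(t)\,x, \qquad g(t) = \sqrt{\beta(t)}$$

The reverse process:

$$dx = \left[f(x, t) - g(t)^2\,\nabla_x \log p_t(x)\right]dt + g(t)\,d\bar{w}$$

The scoring function $\nabla_x \log p_t(x)$ is approximated by a neural network, $s_\theta(x, t)$.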
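
A minimal numerical sketch of the reverse process, assuming a variance-preserving SDE with a constant noise schedule and a one-dimensional Gaussian "data" distribution. Under these assumptions the score is available in closed form, so no neural network is needed. All constants and function names are illustrative choices, not taken from any cited method.

```python
import math
import random
import statistics

BETA = 10.0                      # constant noise schedule (assumed)
DATA_MEAN, DATA_STD = 2.0, 0.5   # toy 1D Gaussian data distribution (assumed)


def marginal(t):
    """Mean and variance of p_t under dx = -0.5*BETA*x dt + sqrt(BETA) dw."""
    decay = math.exp(-BETA * t)
    mean = DATA_MEAN * math.sqrt(decay)
    var = DATA_STD**2 * decay + (1.0 - decay)
    return mean, var


def score(x, t):
    """Exact score of the Gaussian marginal p_t; a trained network would replace this."""
    mean, var = marginal(t)
    return -(x - mean) / var


def reverse_sample(rng, n_steps=400, t_end=1.0):
    """Euler-Maruyama integration of the reverse SDE from t = t_end down to t = 0."""
    dt = t_end / n_steps
    x = rng.gauss(0.0, 1.0)  # p_{t_end} is approximately a standard normal
    for k in range(n_steps, 0, -1):
        t = k * dt
        reverse_drift = -0.5 * BETA * x - BETA * score(x, t)  # f - g^2 * score
        x = x - reverse_drift * dt + math.sqrt(BETA * dt) * rng.gauss(0.0, 1.0)
    return x


rng = random.Random(0)
samples = [reverse_sample(rng) for _ in range(1000)]
sample_mean = statistics.fmean(samples)
sample_std = statistics.pstdev(samples)
```

Starting from pure noise, the samples should end up distributed approximately like the toy data (mean near 2.0, standard deviation near 0.5), up to discretization and sampling error.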

Diffusion models can also be zero-shot classifiers (1).

Diffusion models have been combined with Replica-exchange molecular dynamics (2).

Types of sequence-based diffusion

For protein sequences, which are fundamentally discrete, Yang et al. (3) describe three approaches to generation using diffusion models:

  • Diffusion in pre-trained latent space (e.g., continuous diffusion)
  • Diffusion in discrete space with uniform noise matrices
  • Diffusion in discrete space via absorbing matrices, e.g., one-at-a-time masking/unmasking of individual tokens
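
The absorbing-state variant is the easiest of the three to sketch. The example below shows only the forward (noising) direction, assuming one-at-a-time masking with a hypothetical mask token `#`. In generation, a trained model would run this in reverse by predicting a residue for one masked position at a time.

```python
import random

MASK = "#"  # hypothetical mask token; real models use a dedicated vocabulary entry


def absorb_one(tokens, rng):
    """Replace one randomly chosen unmasked position with the mask token."""
    unmasked = [i for i, tok in enumerate(tokens) if tok != MASK]
    if not unmasked:
        return tokens
    i = rng.choice(unmasked)
    return tokens[:i] + [MASK] + tokens[i + 1:]


def forward_trajectory(sequence, seed=0):
    """Mask residues one at a time until the sequence is fully absorbed."""
    rng = random.Random(seed)
    states = [list(sequence)]
    while any(tok != MASK for tok in states[-1]):
        states.append(absorb_one(states[-1], rng))
    return ["".join(state) for state in states]


trajectory = forward_trajectory("MKTAYIAK")  # made-up 8-residue sequence
```

Once a position is masked it stays masked, which is what makes the mask state "absorbing". The trajectory for a sequence of length 8 therefore has 9 states, from the intact sequence to the fully masked one.
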
1.
Li AC, Prabhudesai M, Duggal S, Brown E, Pathak D. Your Diffusion Model is Secretly a Zero-Shot Classifier. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE; 2023. p. 2206–17. Available from: https://doi.org/10.1109/iccv51070.2023.00210
2.
Wang Y, Herron L, Tiwary P. From data to noise to data for mixing physics across temperatures with generative artificial intelligence. Proceedings of the National Academy of Sciences. 2022;119(32). Available from: https://doi.org/10.1073/pnas.2203656119
3.
Yang J, Chu W, Khalil D, Astudillo R, Wittmann BJ, Arnold FH, et al. Steering Generative Models with Experimental Data for Protein Fitness Optimization. 2025; Available from: https://arxiv.org/abs/2505.15093

31 items with this tag. Showing first 10 tags.

diffusion-models/protein-design

15 items with this tag. Showing first 10 tags.

diffusion-models/structure-prediction

12 items with this tag. Showing first 10 tags.

epistasis

Epistasis refers to the non-additivity of fitness effects arising from specific combinations of mutations. Johnston et al. (1) outline three types of epistasis: magnitude epistasis (“same direction as expected but are not perfectly additive”), sign epistasis (“effect of one of the substitutions changes direction in the context of the other”), and reciprocal sign epistasis (“effects of both substitutions change direction when they are made together”).
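
These three categories can be distinguished from four fitness measurements: wild type, each single mutant, and the double mutant. A minimal sketch, assuming fitness is reported on a scale where independent effects are expected to add (for example, log-fitness). The function name, tolerance, and example values are illustrative.

```python
def classify_epistasis(f_wt, f_a, f_b, f_ab, tol=1e-9):
    """Classify pairwise epistasis between mutations A and B."""
    effect_a_alone = f_a - f_wt     # effect of A on the wild-type background
    effect_a_with_b = f_ab - f_b    # effect of A when B is already present
    effect_b_alone = f_b - f_wt
    effect_b_with_a = f_ab - f_a

    # The deviation from additivity is the same whichever mutation is considered.
    if abs(effect_a_with_b - effect_a_alone) <= tol:
        return "additive"

    a_flips = effect_a_alone * effect_a_with_b < 0
    b_flips = effect_b_alone * effect_b_with_a < 0
    if a_flips and b_flips:
        return "reciprocal sign"
    if a_flips or b_flips:
        return "sign"
    return "magnitude"


# Made-up fitness values: (wild type, A, B, AB).
examples = {
    "additive": (1.0, 1.2, 1.3, 1.5),
    "magnitude": (1.0, 1.2, 1.3, 1.9),
    "sign": (1.0, 1.2, 1.3, 1.25),
    "reciprocal sign": (1.0, 0.8, 0.7, 1.1),
}
results = {name: classify_epistasis(*values) for name, values in examples.items()}
```

With real measurements, `tol` would need to reflect experimental noise, since noise alone can produce apparent magnitude and sign epistasis.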

Figure from (1)

Figure from (2)

Notes

  • Negative epistasis is approximately 100x more common than positive epistasis (3).
  • Rates of magnitude, sign, and reciprocal epistasis are constant across positions (1). These differed from a null additive-only model with noise, which had 74% magnitude, 22% sign, and 4% reciprocal sign epistasis. Figure from (1)

Examples

  • The S373P mutation in the Spike protein is disadvantageous in pre-Omicron-variant versions of SARS-CoV-2 but advantageous in Omicron. Mentioned by Bloom and Neher (4).
  • The combination of N501Y and Q498R in the Spike protein of SARS-CoV-2 increases the binding affinity to ACE2 by 387-fold. This is believed to have led to the immune-evasive mutations in the Omicron variant and was observed in vitro by Zahradník et al. (5) prior to the emergence of Omicron.
  • One example of third-order epistasis is the J-domain, where strong non-additivity is observed among a triad, but disappears if any of the three are mutated to alanine (2). Figure from (2)

Measuring epistasis

Figure from (6)

  • Graph Fourier Transform: A decomposition that breaks down the signal (here, fitness) into distinct “epistatic orders”
  • Dirichlet energy: quantifies how variable (i.e., rugged) the landscape is, with high energy corresponding to high ruggedness
  • NK model: A model where $N$ is the number of sites and $K$ is the number of interacting sites; when $K = 0$, the effect of all mutations is linear, whereas larger $K$ indicates more interactions and therefore more epistasis
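
As an illustration of the Dirichlet energy, the sketch below computes it on a small binary landscape where genotypes are connected if they differ by one mutation. The binary alphabet, the normalization by total squared deviation from the mean, and the two toy landscapes are all assumptions made for this example.

```python
from itertools import product


def dirichlet_energy(fitness):
    """Sum of squared fitness differences over all single-mutation edges."""
    energy = 0.0
    for genotype, f in fitness.items():
        for site in range(len(genotype)):
            if genotype[site] == 0:  # visit each edge once, from its 0 end
                neighbor = genotype[:site] + (1,) + genotype[site + 1:]
                energy += (fitness[neighbor] - f) ** 2
    return energy


def normalized_energy(fitness):
    """Dirichlet energy divided by the total squared deviation from the mean fitness."""
    mean = sum(fitness.values()) / len(fitness)
    spread = sum((f - mean) ** 2 for f in fitness.values())
    return dirichlet_energy(fitness) / spread


genotypes = list(product((0, 1), repeat=3))
additive = {g: float(sum(g)) for g in genotypes}    # each mutation adds 1: no epistasis
parity = {g: float(sum(g) % 2) for g in genotypes}  # fitness flips with every mutation

smooth = normalized_energy(additive)
rugged = normalized_energy(parity)
```

The purely additive landscape gives the lowest normalized energy of the two, and the parity landscape, in which every mutation reverses fitness, gives a value three times higher. The normalization matters because both toy landscapes have the same raw energy.
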
1.
Johnston KE, Almhjell PJ, Watkins-Dulaney EJ, Liu G, Porter NJ, Yang J, et al. A combinatorially complete epistatic fitness landscape in an enzyme active site. Proceedings of the National Academy of Sciences. 2024;121(32). Available from: https://doi.org/10.1073/pnas.2400439121
2.
Tsuboyama K, Dauparas J, Chen J, Laine E, Mohseni Behbahani Y, Weinstein JJ, et al. Mega-scale experimental analysis of protein folding stability in biology and design. Nature. 2023;620(7973):434–44. Available from: https://doi.org/10.1038/s41586-023-06328-6
3.
Sarkisyan KS, Bolotin DA, Meer MV, Usmanova DR, Mishin AS, Sharonov GV, et al. Local fitness landscape of the green fluorescent protein. Nature. 2016;533(7603):397–401. Available from: https://doi.org/10.1038/nature17995
4.
Bloom JD, Neher RA. Fitness effects of mutations to SARS-CoV-2 proteins. Virus Evolution. 2023;9(2):vead055. Available from: https://doi.org/10.1093/ve/vead055
5.
Zahradník J, Marciano S, Shemesh M, Zoler E, Harari D, Chiaravalli J, et al. SARS-CoV-2 variant prediction and antiviral drug design are enabled by RBD in vitro evolution. Nature Microbiology. 2021;6(9):1188–98. Available from: https://doi.org/10.1038/s41564-021-00954-4
6.
Sandhu M, Chen JZ, Matthews DS, Spence MA, Pulsford SB, Gall B, et al. Computational and Experimental Exploration of Protein Fitness Landscapes: Navigating Smooth and Rugged Terrains. Biochemistry. 2025;64(8):1673–84. Available from: https://doi.org/10.1021/acs.biochem.4c00673

evolution-and-natural-selection

Protein evolution is the process and result of gradual sequence changes resulting in functional and/or structural changes. See Epistasis for examples on why evolutionary trajectories are difficult to predict. This note excludes any discussion of somatic hypermutation.

Notes

Paradigms and preliminaries

  • Neutral theory: most observed amino acid changes are neutral (i.e., silent in fitness effects). This leads to genetic drift. Developed by Kimura (1).
  • Nearly neutral theory: deleterious mutations are retained and subsequently compensated for by advantageous mutations (which is consistent with the observation that most missense mutations are destabilizing). Developed by Ohta (2) to explain why the rate of protein evolution was independent of generation time, which is in turn inversely proportional to population size. Figure from (2)
  • The theory of punctuated equilibrium suggests that phenotypes change very little for long stretches of time, followed by abrupt rapid changes (3).
  • Statistical physics approach: population size is equated with inverse temperature (such that infinite population is analogous to zero degrees Kelvin), and log-fitness with energy (4). Advantageous and deleterious mutations are predicted to occur with equal frequency. This framing ignores the imbalance in sequence data (5,6).
  • Fisher’s geometric model: The overall fitness of a phenotype can be quantified along dimensions; Fisher postulated that phenotypes in a population were distributed as a hypersphere centered on a local maximum.
  • Protein evolvability refers to the ability of a protein to 1) evolve new functions in relatively few mutations and 2) be robust to mutations that lead to loss-of-function (7). These are described as contradictory statements by Tokuriki & Tawfik (8) but are described as complementary at the structural level.
  • “The principle of minimal frustration suggests that naturally evolved proteins with the same structure should have similar folding rates and that modulation of thermodynamic stability should occur via unfolding rates” (quoted from (9)). This has been supported by the observation that thioredoxins fold at similar rates but unfold at rates that correlate with their thermostability values.
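
The statistical physics analogy listed above can be checked numerically in a simple setting. The sketch below assumes a Moran process with symmetric mutation, chosen because its fixation probability has a closed form; this is not necessarily the population model used in (4). Under these assumptions, detailed balance gives a stationary distribution proportional to fitness raised to the power N − 1, that is, a Boltzmann distribution in which log-fitness plays the role of negative energy and N − 1 the role of inverse temperature.

```python
def fixation_probability(r, n):
    """Fixation probability of one mutant with relative fitness r in a Moran process of size n."""
    if abs(r - 1.0) < 1e-12:
        return 1.0 / n  # neutral limit
    return (1.0 - 1.0 / r) / (1.0 - r ** (-n))


def stationary_ratio(f_i, f_j, n):
    """Ratio of stationary probabilities pi_j / pi_i implied by detailed balance."""
    return fixation_probability(f_j / f_i, n) / fixation_probability(f_i / f_j, n)


population_size = 50
f_low, f_high = 1.00, 1.02  # made-up fitness values
ratio = stationary_ratio(f_low, f_high, population_size)
boltzmann = (f_high / f_low) ** (population_size - 1)
```

Larger populations sharpen the distribution around the fittest genotype, in the same way that lower temperatures concentrate a physical system in its ground state.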

Observations

  • Protein folds with high sequence diversity also have high functional diversity (7).
  • The sequence capacity of a protein exceeds for even small proteins (35-40 AAs), but the fraction of stable states is extremely small and inversely correlated with protein size. These values were estimated using Potts models (10). Figure from (10)
1.
Kimura M. The Neutral Theory of Molecular Evolution. Cambridge University Press; 1985. Available from: https://books.google.com/books/about/The_Neutral_Theory_of_Molecular_Evolutio.html?id=e_HoAwAAQBAJ
2.
OHTA T. Slightly Deleterious Mutant Substitutions in Evolution. Nature. 1973;246(5428):96–8. Available from: https://doi.org/10.1038/246096a0
3.
Duran-Nebreda S, Bentley RA, Vidiella B, Spiridonov A, Eldredge N, O’Brien MJ, et al. On the multiscale dynamics of punctuated evolution. Trends in Ecology & Evolution. 2024;39(8):734–44. Available from: https://doi.org/10.1016/j.tree.2024.05.003
4.
Sella G, Hirsh AE. The application of statistical physics to evolutionary biology. Proceedings of the National Academy of Sciences. 2005;102(27):9541–6. Available from: https://doi.org/10.1073/pnas.0501865102
5.
Ding F, Steinhardt J. Protein language models are biased by unequal sequence sampling across the tree of life. openRxiv; 2024. Available from: https://doi.org/10.1101/2024.03.07.584001
6.
Weinstein EN, Amin AN, Frazer J, Marks DS. Non-identifiability and the Blessings of Misspecification in Models of Molecular Fitness. openRxiv; 2022. Available from: https://doi.org/10.1101/2022.01.29.478324
7.
Wagner A. Robustness and evolvability: a paradox resolved. Proceedings of the Royal Society B: Biological Sciences. 2007;275(1630):91–100. Available from: https://doi.org/10.1098/rspb.2007.1137
8.
Tokuriki N, Tawfik DS. Protein Dynamism and Evolvability. Science. 2009;324(5924):203–7. Available from: https://doi.org/10.1126/science.1169375
9.
Tzul FO, Vasilchuk D, Makhatadze GI. Evidence for the principle of minimal frustration in the evolution of protein folding landscapes. Proceedings of the National Academy of Sciences. 2017;114(9). Available from: https://doi.org/10.1073/pnas.1613892114
10.
Tian P, Best RB. How Many Protein Sequences Fold to a Given Structure? A Coevolutionary Analysis. Biophysical Journal. 2017;113(8):1719–30. Available from: https://doi.org/10.1016/j.bpj.2017.08.039

immune-repertoires

Immune repertoires are the full breadth of B-cell and T-cell receptors being expressed by a human that are available for potential antigen binding. Rees (1) provides estimates suggesting that humans have naive repertoires of about sequences.

Related:

1.
Rees AR. Understanding the human antibody repertoire. mAbs. 2020;12(1). Available from: https://doi.org/10.1080/19420862.2020.1729683

inverse-folding

Inverse folding describes the problem of designing a sequence for a structure. Designed sequences are typically limited to the twenty canonical amino acids.

Methods

See Hybrid sequence-structure models for a list of methods that incorporate PLMs

Notes

Training

  • Training inverse folding models with backbone dihedral angles as features usually improved sequence recovery (1). Figure from (1)

Execution

  • Forward-folding is a stronger predictor of inverse folding success than sequence recovery ((2), citing Watson et al. (3,4)).
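
For reference, sequence recovery is simply the fraction of positions at which the designed sequence matches the native one. A minimal sketch, assuming equal-length sequences that are already in register; the function name and the example sequences are made up.

```python
def sequence_recovery(native: str, designed: str) -> float:
    """Fraction of positions where the designed residue equals the native residue."""
    if not native or len(native) != len(designed):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)


recovery = sequence_recovery("MKTAYIAK", "MKSAYLAK")  # two substitutions out of eight
```

Forward-folding instead asks whether a structure predictor, given the designed sequence, returns the target structure, so it can reward sequences that differ from the native one but still encode the same fold.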

Datasets

  • PDBench is a dataset of 595 protein structures with diverse, evenly divided topologies for benchmarking of Inverse folding methods (5). Figure 2 from Castorina et al. (5)
1.
Jamasb AR, Morehead A, Joshi CK, Zhang Z, Didi K, Mathis SV, et al. Evaluating Representation Learning on the Protein Structure Universe. In: ICLR 2024. 2024. Available from: https://openreview.net/forum?id=sTYuRVrdK3
2.
Yang JJ, Yim J, Barzilay R, Jaakkola T. Fast non-autoregressive inverse folding with discrete diffusion. 2023; Available from: https://arxiv.org/abs/2312.02447
3.
Watson JL, Juergens D, Bennett NR, Trippe BL, Yim J, Eisenach HE, et al. De novo design of protein structure and function with RFdiffusion. Nature. 2023;620(7976):1089–100. Available from: https://doi.org/10.1038/s41586-023-06415-8
4.
Dauparas J, Anishchenko I, Bennett N, Bai H, Ragotte RJ, Milles LF, et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science. 2022;378(6615):49–56. Available from: https://doi.org/10.1126/science.add2187
5.
Castorina LV, Petrenas R, Subr K, Wood CW. PDBench: evaluating computational methods for protein-sequence design. Bioinformatics. 2023;39(1). Available from: https://doi.org/10.1093/bioinformatics/btad027

inverse-folding/evaluation

inverse-folding/training

10 items with this tag.

ligand-docking

light-chains

The light chain of an antibody makes up part of its variable region and Fab, and is therefore involved in antigen binding. In humans, the kappa and lambda subtypes are split about 60:40, whereas in mice the split is closer to 90:10.

Kappa and lambda subtypes

  • The ratio of kappa to lambda in circulating antibodies is about 60:40; when this falls out of balance, that can be a symptom of B cell lymphoma.
  • Lambda light chains are more flexible than kappa light chains due to an extra glycine in the switch region (citation needed).

low-rank-adaptation

pae

Predicted aligned error (PAE) is a measurement calculated by protein structure prediction neural networks to capture positional errors between two amino acids in a computational model. It was introduced by AlphaFold2. A derivative metric, ipSAE, has been shown to be more robust at identifying potential binders.

Figure from (1)

1.
Chow A, Chu H, Li R, Nalbant BN, Dozic AV, Kida LC, et al. Sequence and structural determinants of efficacious de novo chimeric antigen receptors. openRxiv; 2025. Available from: https://doi.org/10.64898/2025.12.12.694033

plddt

(LDDT redirects here) pLDDT (predicted local distance difference test) is a confidence metric used by neural networks for protein structure prediction. It captures the per-residue accuracy, both in terms of neighborhood and side chain rotamer. It was first directly integrated into structure prediction by AlphaFold2 at the per-residue level and has been widely adopted since. AlphaFold3 adopted per-atom pLDDT.

Figure from (1)

Notes

  • When clustering predicted protein structures, sparse clusters tend to have lower pLDDT (2). This was found to be independent of MSA depth.
  • pLDDT correlates poorly with GDT-TS among AlphaFold2 models in CASP15. This was observed in a repeat that used deeper MSAs (3).
  • De novo sequences designed by inversion with high pLDDT were found by ESM to have high perplexity (4).
  • While the default pLDDT is not continuously differentiable and thus unsuitable for training, Trinquier et al. (5) use a modified version that can be used as a loss function.
  • pLDDT can be used as spatial restraints in biomolecular simulations. The restraint equation was originally presented by Hiranuma et al. (6) and was used by del Alamo et al. (7) as coordinate constraints in Rosetta when refining AlphaFold2 models.
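
pLDDT is a network's prediction of the lDDT score, so it helps to see what the underlying score measures. The sketch below is a simplified Cα-only version, assuming the commonly used 15 Å inclusion radius and the 0.5, 1, 2, and 4 Å tolerance thresholds. The function name and the toy coordinates are illustrative, and the full score works on all atoms with additional stereochemical checks.

```python
import math


def ca_lddt(reference, model, cutoff=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Per-residue fraction of reference distances that are preserved in the model."""
    scores = []
    for i, ref_i in enumerate(reference):
        preserved, total = 0, 0
        for j, ref_j in enumerate(reference):
            if i == j:
                continue
            d_ref = math.dist(ref_i, ref_j)
            if d_ref >= cutoff:
                continue  # only local neighborhoods count
            d_model = math.dist(model[i], model[j])
            total += len(thresholds)
            preserved += sum(abs(d_ref - d_model) < t for t in thresholds)
        scores.append(preserved / total if total else 0.0)
    return scores


# Toy example: four residues on a line, with the last one displaced by 3 Å in the model.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (14.4, 0.0, 0.0)]
per_residue = ca_lddt(reference, model)
```

Because the score compares internal distances rather than superimposed coordinates, the displaced residue is penalized most, while its neighbors lose only the distances that involve it.
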
1.
Terwilliger TC, Afonine PV, Liebschner D, Croll TI, McCoy AJ, Oeffner RD, et al. Accelerating crystal structure determination with iterative AlphaFold prediction. Acta Crystallographica Section D Structural Biology. 2023;79(3):234–44. Available from: https://doi.org/10.1107/s205979832300102x
2.
Nomburg J, Doherty EE, Price N, Bellieny-Rabelo D, Zhu YK, Doudna JA. Birth of protein folds and functions in the virome. Nature. 2024;633(8030):710–7. Available from: https://doi.org/10.1038/s41586-024-07809-y
3.
Lee S, Kim G, Karin EL, Mirdita M, Park S, Chikhi R, et al. Petascale Homology Search for Structure Prediction. openRxiv; 2023. Available from: https://doi.org/10.1101/2023.07.10.548308
4.
Verkuil R, Kabeli O, Du Y, Wicky BIM, Milles LF, Dauparas J, et al. Language models generalize beyond natural proteins. openRxiv; 2022. Available from: https://doi.org/10.1101/2022.12.21.521521
5.
Trinquier J, Petti S, Park S, Herath K, van Kempen M, Feng S, et al. SoftAlign: End-to-end protein structures alignment. openRxiv; 2025. Available from: https://doi.org/10.1101/2025.05.09.653096
6.
Hiranuma N, Park H, Baek M, Anishchenko I, Dauparas J, Baker D. Improved protein structure refinement guided by deep learning based accuracy estimation. Nature Communications. 2021;12(1). Available from: https://doi.org/10.1038/s41467-021-21511-x
7.
del Alamo D, DeSousa L, Nair RM, Rahman S, Meiler J, Mchaourab HS. Integrated AlphaFold2 and DEER investigation of the conformational dynamics of a pH-dependent APC antiporter. Proceedings of the National Academy of Sciences. 2022;119(34). Available from: https://doi.org/10.1073/pnas.2206129119

29 items with this tag. Showing first 10 tags.

protein-backbone-design

Protein backbone design is the generation of protein backbones in three-dimensional space. This section also covers generation and design of entire protein structures in Cartesian space, but most methods uncouple design of the backbone and design of the sequence given the backbone (inverse folding). As of May 2024, the current state of the art uses diffusion.

Methods

Datasets

  • Verkuil et al. (5) use a test set of 39 PDB entries for their validation, which they attribute to an earlier source:
  • 1QYS
  • 2KL8
  • 2KPO
  • 2LN3
  • 2LTA
  • 2LVB
  • 2N2T
  • 2N2U
  • 2N3Z
  • 2N76
  • 4KY3
  • 4KYZ
  • 5CW9
  • 5KPE
  • 5KPH
  • 5L33
  • 5TPJ
  • 5TRV
  • 6CZG
  • 6CZH
  • 6CZI
  • 6CZJ
  • 6D0T
  • 6DG6
  • 6DKM A
  • 6DKM B
  • 6DLM A
  • 6DLM B
  • 6E5C
  • 6LLQ
  • 6MRR
  • 6MRS
  • 6MSP
  • 6NUK
  • 6W3F
  • 6W3W
  • 6WI5
  • 6WVS
  • 7MCD
1.
Ingraham JB, Baranov M, Costello Z, Barber KW, Wang W, Ismail A, et al. Illuminating protein space with a programmable generative model. Nature. 2023;623(7989):1070–8. Available from: https://doi.org/10.1038/s41586-023-06728-8
2.
Watson JL, Juergens D, Bennett NR, Trippe BL, Yim J, Eisenach HE, et al. De novo design of protein structure and function with RFdiffusion. Nature. 2023;620(7976):1089–100. Available from: https://doi.org/10.1038/s41586-023-06415-8
3.
Kim D, Woodbury SM, Ahern W, Tischer D, Kang A, Joyce E, et al. Computational design of metallohydrolases. Nature. 2025;649(8095):246–53. Available from: https://doi.org/10.1038/s41586-025-09746-w
4.
Wang J, Lisanza S, Juergens D, Tischer D, Watson JL, Castro KM, et al. Scaffolding protein functional sites using deep learning. Science. 2022;377(6604):387–94. Available from: https://doi.org/10.1126/science.abn2100
5.
Verkuil R, Kabeli O, Du Y, Wicky BIM, Milles LF, Dauparas J, et al. Language models generalize beyond natural proteins. openRxiv; 2022. Available from: https://doi.org/10.1101/2022.12.21.521521

22 items with this tag. Showing first 10 tags.

protein-backbone-design/designability

20 items with this tag. Showing first 10 tags.

protein-design

Auto-generated

This page is generated automatically from notes tagged protein-design/*. Add prose above the generated marker to preserve it across regenerations.

28 items with this tag. Showing first 10 tags.

protein-design/design

19 items with this tag. Showing first 10 tags.

protein-folding

Not to be confused with protein structure prediction

Protein folding is the process by which an amino acid polypeptide self-organizes into a 3D structure.

Prediction

26 items with this tag. Showing first 10 tags.

protein-folding/structure-prediction

10 items with this tag.

protein-folding/unfolding

protein-language-models

Protein language models (PLMs) are a type of Transformer model trained on either protein sequences or Multiple sequence alignments.

Methods

Single-sequence

  • ESM: currently the most widely-used encoder PLM
  • ProGen: probably the most widely-used decoder PLM
  • ProtBERT and DistilProtBert (1)
  • ProteinNPT
  • xTrimoPGLM
  • CARP: A CNN that performs as well as transformer-based methods on both pretraining and downstream tasks. Anecdotally, these can’t indirectly calculate contact maps via the Categorical Jacobian method as well as transformer-based models.
  • DASM (deep amino acid sequence model), which is trained on germline-descendant point mutation pairs to learn relative mutation frequencies, after normalizing for expected mutation frequencies in the codon table (2).

Notes

General observations

  • PLMs are in-context learners that default to retrieving information from nearby repeats (3).

Representations

  • Multiple instance learning using PLM embeddings of all genes in a viral genome identifies which sequences are responsible for host tropism (4). For example, this ranked the Spike protein as the key contributor of host tropism.
  • Homolog detection using PLM representations can be improved by compression (5). Using the full representations worsened detection AUC by 7.4%.
  • PLMs with a smoother representation space are better predictors of protein function (6). Figure from (6)

Hybrid PLM-inverse folding models

From Hybrid sequence-structure models

Training

  • Matthews et al. (6) found that masking 0.5% of residues when training PLMs improved predictive performance (greater ) relative to 15% used by ESM.
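The masking rate discussed above is the fraction of residues hidden from the model at each training step. The following is a minimal sketch of that step, not the procedure from (6) or from ESM. The function name `mask_residues`, the `<mask>` token string, and the rule of always masking at least one residue are assumptions made for illustration, and BERT-style variants that sometimes substitute random or unchanged tokens are left out.

```python
import random


def mask_residues(sequence, mask_rate, mask_token="<mask>", seed=None):
    """Hide a fraction of residues; return (tokens, targets).

    tokens  : list of residues with some replaced by mask_token
    targets : dict mapping masked position -> original residue
    """
    if not sequence:
        return [], {}
    rng = random.Random(seed)
    # Assumption of this sketch: mask at least one residue per sequence.
    n_mask = min(len(sequence), max(1, round(len(sequence) * mask_rate)))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))
    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        tokens[pos] = mask_token
    return tokens, targets


# Hypothetical 200-residue sequence: 0.5% masks 1 residue, 15% masks 30.
example = "MKTAYIAKQR" * 20
low_tokens, low_targets = mask_residues(example, 0.005, seed=0)
high_tokens, high_targets = mask_residues(example, 0.15, seed=0)
```

At a 0.5% rate most training examples expose nearly the full sequence context, which is the setting the bullet above compares against the 15% rate.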
1.
Geffen Y, Ofran Y, Unger R. DistilProtBert: a distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts. Bioinformatics. 2022;38(Supplement_2):ii95–8. Available from: https://doi.org/10.1093/bioinformatics/btac474
2.
Bitbol A-F. eLife Assessment: Separating selection from mutation in antibody language models. 2026; Available from: https://doi.org/10.7554/elife.109644.3.sa0
3.
Kantroo P, Wagner GP, Machta BB. In-Context Learning can distort the relationship between sequence likelihoods and biological fitness. 2025; Available from: https://arxiv.org/abs/2504.17068
4.
Liu D, Young F, Lamb KD, Robertson DL, Yuan K. Prediction of virus-host associations using protein language models and multiple instance learning. PLOS Computational Biology. 2024;20(11):e1012597. Available from: https://doi.org/10.1371/journal.pcbi.1012597
5.
Kilinc M, Jia K, Jernigan RL. Improved global protein homolog detection with major gains in function identification. Proceedings of the National Academy of Sciences. 2023;120(9). Available from: https://doi.org/10.1073/pnas.2211823120
6.
Matthews DS, Spence MA, Mater AC, Nichols J, Pulsford SB, Sandhu M, et al. Leveraging ancestral sequence reconstruction for protein representation learning. Nature Machine Intelligence. 2024;6(12):1542–55. Available from: https://doi.org/10.1038/s42256-024-00935-2

91 items with this tag. Showing first 10 tags.

protein-language-models/antibodies

21 items with this tag. Showing first 10 tags.

protein-language-models/representations

35 items with this tag. Showing first 10 tags.

protein-language-models/training

protein-protein-interactions

Protein-protein interactions describe when two or more proteins bind to one another.

19 items with this tag. Showing first 10 tags.

protein-structure-tokenization

Protein structure tokenization refers to the process of discretizing protein structure using a learned codebook derived from vector-quantized variational autoencoders. It was first used for search purposes with Foldseek and has since been adopted for use with ESM3, SaProt, and others.
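In a vector-quantized autoencoder, the discretization step amounts to replacing each continuous per-residue descriptor with the index of its nearest codebook vector. Below is a toy sketch of that lookup only. The two-dimensional codebook and descriptors are hand-written assumptions, and `tokenize_structure` is a hypothetical name; the tools named above learn both the encoder and the codebook, and nothing here reflects their actual APIs.

```python
import math


def tokenize_structure(descriptors, codebook):
    """Map each descriptor vector to the index of its nearest codebook entry."""
    tokens = []
    for vec in descriptors:
        distances = [math.dist(vec, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


# Toy codebook with three entries (a learned codebook would be much larger).
toy_codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
# Hypothetical per-residue descriptors produced by some encoder.
toy_descriptors = [(0.1, 0.0), (0.9, 1.2), (4.0, 6.0)]
structure_tokens = tokenize_structure(toy_descriptors, toy_codebook)
```

The resulting integer sequence can then be handled like any other token sequence, which is what makes the representation usable for search and for language-model inputs.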

Notes

rosetta

1 item with this tag.

rosettafold

RosettaFold (usually stylized as RoseTTAFold) is a protein structure prediction method unveiled in mid-2021 (1). A second version, RosettaFold2, was released in early 2023 (2). Its architecture closely tracks that of AlphaFold2, with several changes, such as the use of an SE(3)-Transformer.

1.
Baek M, DiMaio F, Anishchenko I, Dauparas J, Ovchinnikov S, Lee GR, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021;373(6557):871–6. Available from: https://doi.org/10.1126/science.abj8754
2.
Baek M, Anishchenko I, Humphreys IR, Cong Q, Baker D, DiMaio F. Efficient and accurate prediction of protein structure using RoseTTAFold2. openRxiv; 2023. Available from: https://doi.org/10.1101/2023.05.24.542179

structure-prediction

Structure prediction refers to the problem of predicting the 3D shape of a protein or nucleotide sequence without any experimental information. Common metrics used for evaluating the quality of predicted structures include LDDT (residue level), TM-score (whole-structure level), and DockQ (complex level).
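As an illustration of what a residue-level metric looks like, here is a simplified sketch of LDDT computed on Cα atoms only. It assumes the commonly cited settings of a 15 Å inclusion radius and distance tolerances of 0.5, 1, 2, and 4 Å; the full metric works on all atoms and handles details this sketch ignores, so `ca_lddt_per_residue` is a hypothetical helper rather than a reference implementation.

```python
import math


def ca_lddt_per_residue(ref_xyz, pred_xyz, radius=15.0,
                        thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Per-residue fraction of local reference distances preserved in the prediction."""
    if len(ref_xyz) != len(pred_xyz):
        raise ValueError("reference and prediction must have the same length")
    scores = []
    for i in range(len(ref_xyz)):
        preserved = 0
        total = 0
        for j in range(len(ref_xyz)):
            if i == j:
                continue
            d_ref = math.dist(ref_xyz[i], ref_xyz[j])
            if d_ref >= radius:
                continue  # only local contacts in the reference are scored
            diff = abs(d_ref - math.dist(pred_xyz[i], pred_xyz[j]))
            preserved += sum(diff < t for t in thresholds)
            total += len(thresholds)
        scores.append(preserved / total if total else float("nan"))
    return scores


# Hypothetical straight chain with 3.8 Å spacing between consecutive residues.
reference = [(i * 3.8, 0.0, 0.0) for i in range(10)]
perfect = ca_lddt_per_residue(reference, reference)
```

Because only distances within each structure are compared, no superposition of the two structures is needed.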

Methods

MSA-based

  • AlphaFold2: currently viewed as the highest-accuracy method
  • RosettaFold
  • Diffold: A fine-tuned version of AlphaFold2

PLM-based

  • ESMFold: currently the most widely-used method, albeit probably not the most accurate model in this category
  • OmegaFold
  • xTrimoPGLM

Others

  • EquiFold: a method that needs to be fine-tuned on specific families of proteins
  • EigenFold: a method that uses diffusion to model the dynamics of proteins, albeit unsuccessfully

For antibodies

See Antibody structure prediction

Notes

Training

Figure from (1)

Sidechain prediction

  • Formulating the sidechain prediction problem as a classification problem by binning chi angles, rather than as a regression problem, led to improved performance (2).
  • Sidechain prediction methods are not sensitive to B-factor cutoffs. The outcome of the sidechain prediction model PIPPack was not strongly affected by the B-factor values of protein structures in the training set (2).
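The binning idea in the first bullet can be sketched as a pair of conversions between a chi angle and a class index. The choice of 36 bins (10° each) and the function names are assumptions for illustration, not the settings reported in (2).

```python
def chi_to_bin(chi_deg, n_bins=36):
    """Convert a chi angle in degrees to a class index, wrapping at +/-180."""
    width = 360.0 / n_bins
    wrapped = (chi_deg + 180.0) % 360.0  # shift into [0, 360)
    return int(wrapped // width) % n_bins


def bin_to_chi(bin_index, n_bins=36):
    """Return the centre of a bin in degrees, in the range (-180, 180)."""
    width = 360.0 / n_bins
    return -180.0 + (bin_index + 0.5) * width


# A classifier would predict a distribution over bin indices; decoding the
# most likely bin recovers the angle to within half a bin width.
label = chi_to_bin(60.0)
decoded = bin_to_chi(label)
```

Treating the angle as a class sidesteps the periodicity problem that a naive regression loss has near ±180°, since both ends of the range fall into the same bin.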
1.
Ahdritz G, Bouatta N, Floristean C, Kadyan S, Xia Q, Gerecke W, et al. OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. Nature Methods. 2024;21(8):1514–24. Available from: https://doi.org/10.1038/s41592-024-02272-z
2.
Randolph NZ, Kuhlman B. Invariant point message passing for protein side chain packing. Proteins: Structure, Function, and Bioinformatics. 2024;92(10):1220–33. Available from: https://doi.org/10.1002/prot.26705

85 items with this tag. Showing first 10 tags.

structure-prediction/architecture

structure-prediction/complex-prediction

8 items with this tag.

structure-prediction/limitations

19 items with this tag. Showing first 10 tags.

structure-prediction/metrics

structure-prediction/sampling

27 items with this tag. Showing first 10 tags.

structure-prediction/training

13 items with this tag. Showing first 10 tags.

thermostability

Thermostability refers to a protein’s ability to remain folded at high temperatures or under harsh conditions. It is a highly desirable property for engineered proteins.

Prediction

  • Phantom epistasis refers to the inclusion of unnecessary model parameters when building biophysical/statistical fitness models (Fitness prediction). Faure et al. (1) attribute this to the epistasis mechanisms reviewed by Domingo et al. (2).
  • Thermodynamic reversibility can be used for expanding training sets for stability prediction/ddG prediction ML models. However, it has been shown to lead to biases that favor WT amino acids. Diaz et al. (3) claim to mitigate this.
  • The amount of ddG data available for a given residue for training can be expanded using thermodynamic permutation, where measurements are increased to . This was used by MutComputeXGT on the Tsuboyama et al. (4) dataset. It is useful for stability prediction and improves generalization in (3).
  • ddG data is skewed toward hydrophobic amino acids (e.g., alanine scans). This has been reported to increase solvation ddG by 0.8 kcal/mol in studies cited by (3). The Tsuboyama et al. (4) data does not have this bias.
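Both augmentations mentioned above follow from treating folding free energy as a state function. The sketch below assumes each measurement is a ddG from the wild-type residue to a mutant at one position, that the sign convention is consistent across measurements, and that the structural context is shared by all substitutions. The function names and example values are hypothetical, and this is not the implementation used in (3).

```python
from itertools import permutations


def reverse_measurements(measurements):
    """Thermodynamic reversibility: ddG(B -> A) = -ddG(A -> B)."""
    augmented = dict(measurements)
    for (from_aa, to_aa), ddg in measurements.items():
        augmented.setdefault((to_aa, from_aa), -ddg)
    return augmented


def permute_measurements(wt_aa, ddg_from_wt):
    """Thermodynamic permutation at one position.

    ddg_from_wt maps mutant residue -> ddG(WT -> mutant). Every ordered pair
    of observed states gets ddG(A -> B) = ddG(WT -> B) - ddG(WT -> A).
    """
    relative = {wt_aa: 0.0, **ddg_from_wt}
    return {(a, b): relative[b] - relative[a]
            for a, b in permutations(relative, 2)}


# Hypothetical position with wild-type Leu and two measured substitutions.
pairs = permute_measurements("L", {"A": 1.0, "G": 2.5})
reversed_pairs = reverse_measurements({("L", "A"): 1.0})
```

Derived pairs such as `("A", "G")` never involve the wild-type residue as the starting state, which is one way to see how this kind of augmentation could reduce a bias toward WT amino acids.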
1.
Faure AJ, Martí-Aranda A, Hidalgo-Carcedo C, Beltran A, Schmiedel JM, Lehner B. The genetic architecture of protein stability. Nature. 2024;634(8035):995–1003. Available from: https://doi.org/10.1038/s41586-024-07966-0
2.
Domingo J, Baeza-Centurion P, Lehner B. The Causes and Consequences of Genetic Interactions (Epistasis). Annual Review of Genomics and Human Genetics. 2019;20:433–60. Available from: https://doi.org/10.1146/annurev-genom-083118-014857
3.
Diaz DJ, Gong C, Ouyang-Zhang J, Loy JM, Wells J, Yang D, et al. Stability Oracle: A Structure-Based Graph-Transformer for Identifying Stabilizing Mutations. openRxiv; 2023. Available from: https://doi.org/10.1101/2023.05.15.540857
4.
Tsuboyama K, Dauparas J, Chen J, Laine E, Mohseni Behbahani Y, Weinstein JJ, et al. Mega-scale experimental analysis of protein folding stability in biology and design. Nature. 2023;620(7973):434–44. Available from: https://doi.org/10.1038/s41586-023-06328-6

54 items with this tag. Showing first 10 tags.

thermostability/design

thermostability/determinants

thermostability/mutations

thermostability/prediction

tm-score

Summary

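TM-score is an alignment-dependent protein structure similarity term introduced by (1) that is widely used for assessing protein structure prediction methods. It is defined as:

$$\text{TM-score} = \max\left[\frac{1}{L_\text{target}} \sum_{i=1}^{L_\text{common}} \frac{1}{1 + \left(\frac{d_i}{d_0(L_\text{target})}\right)^2}\right]$$

  • $L_\text{target}$: length of the amino acid sequence of the target protein
  • $L_\text{common}$: number of residues in both the target and query proteins
  • $d_i$: distance between the $i$th pair of residues
  • $d_0(L_\text{target}) = 1.24\sqrt[3]{L_\text{target} - 15} - 1.8$: distance scaling factor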
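A minimal sketch of the TM-score sum for one fixed superposition, assuming the two structures are already aligned residue by residue and superposed. The real score takes the maximum over superpositions, which this sketch does not search for. The function names are hypothetical, and the 0.5 Å floor on the distance scale is a guard added here for very short chains.

```python
import math


def distance_scale(l_target):
    """d0 = 1.24 * cbrt(L_target - 15) - 1.8, floored at 0.5 (sketch-only guard)."""
    if l_target <= 15:
        return 0.5
    return max(1.24 * (l_target - 15) ** (1.0 / 3.0) - 1.8, 0.5)


def tm_score_fixed_superposition(target_xyz, query_xyz, l_target=None):
    """Sum 1 / (1 + (d_i / d0)^2) over aligned pairs, normalised by L_target."""
    if len(target_xyz) != len(query_xyz):
        raise ValueError("aligned coordinate lists must have the same length")
    if l_target is None:
        l_target = len(target_xyz)
    if l_target <= 0:
        raise ValueError("l_target must be positive")
    d0 = distance_scale(l_target)
    total = sum(1.0 / (1.0 + (math.dist(t, q) / d0) ** 2)
                for t, q in zip(target_xyz, query_xyz))
    return total / l_target


# Hypothetical 100-residue chain compared with itself and with a shifted copy.
chain = [(i * 3.8, 0.0, 0.0) for i in range(100)]
shifted = [(x, y + 100.0, z) for x, y, z in chain]
identical_score = tm_score_fixed_superposition(chain, chain)
shifted_score = tm_score_fixed_superposition(chain, shifted)
```

Because the normalisation uses the target length rather than the number of aligned residues, a partial alignment cannot reach a score of 1 even if every aligned pair matches exactly.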

1.
Zhang Y, Skolnick J. Scoring function for automated assessment of protein structure template quality. Proteins: Structure, Function, and Bioinformatics. 2004;57(4):702–10. Available from: https://doi.org/10.1002/prot.20264

20 items with this tag. Showing first 10 tags.

variant-effect-prediction

Variant effect prediction covers predicting the changes in properties or fitness (measured in various ways) that result from small sequence-level changes in proteins.

21 items with this tag. Showing first 10 tags.