Distillation of machine learning models refers to training a model on additional data generated either by another neural network (cross-distillation, or student-teacher training) or by an earlier version of the same network (self-distillation). It can be used to expose a model to more training data (AlphaFold2, AlphaMissense, ESM3) or to shrink a model (DistilProtBert; (1)) for better generalization. It has also been used to turn PLMs into variant effect prediction models (2), by adding an additional top layer that is trained on predictions from larger specialized models. Finally, cross-distillation was used when training AlphaFold3 to improve the prediction of disordered regions (see AlphaFold3 uses cross-distillation from AlphaFold2 to avoid hallucinating secondary structure in low-pLDDT regions; (3)).
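
The data-generation step described above can be illustrated with a toy sketch. It is not the pipeline of any of the cited models: the teacher here is a hypothetical linear function, the student is fit by ordinary least squares, and the names `distill`, `fit_linear`, `teacher_w` and `true_w` are assumptions introduced only for illustration. For self-distillation, `teacher_predict` would be an earlier checkpoint of the student itself.

```python
import numpy as np

def distill(teacher_predict, x_unlabeled, x_labeled, y_labeled):
    """Augment a small labeled set with teacher predictions on unlabeled inputs."""
    y_pseudo = teacher_predict(x_unlabeled)  # teacher-generated targets
    x_train = np.concatenate([x_labeled, x_unlabeled])
    y_train = np.concatenate([y_labeled, y_pseudo])
    return x_train, y_train

def fit_linear(x, y):
    """Least-squares fit of y ~ x @ w; a stand-in for training the student."""
    w, *_ = np.linalg.lstsq(x, y, rcond=None)
    return w

rng = np.random.default_rng(0)
true_w = np.array([1.5, -2.0, 0.5])  # hypothetical ground-truth weights
teacher_w = true_w + 0.01            # assumed: a slightly imperfect teacher

def teacher(x):
    return x @ teacher_w

x_labeled = rng.normal(size=(5, 3))      # scarce experimental data
y_labeled = x_labeled @ true_w
x_unlabeled = rng.normal(size=(200, 3))  # abundant unlabeled inputs

x_train, y_train = distill(teacher, x_unlabeled, x_labeled, y_labeled)
student_w = fit_linear(x_train, y_train)
```

In this toy setting the student sees 205 training examples instead of 5, but it also inherits the small systematic error of the teacher, which is the basic trade-off of training on distilled data.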