Summary

Protein language models (PLMs) learn from a sequence context of about twenty to forty amino acids (1). This suggests that they have not learned how proteins fold, but have instead memorized family-specific statistics of protein motif pairings. Evidence that PLMs learn family-specific features was reported in (2) using sparse autoencoders. Nevertheless, the observation that PLMs and PLM-based structure prediction generalize to de novo designed proteins calls this conclusion into question, at least for small globular proteins.

Details

Beginning-of-sequence (BOS) and end-of-sequence (EOS) tokens help reduce the size of the context window required.
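To make the idea of a limited context window concrete, here is a minimal sketch of cropping a sequence to a window around one residue and optionally wrapping it in BOS/EOS tokens before passing it to a PLM. It is an illustration only, not the procedure used in (1). The function name `crop_window` and the token strings `<cls>` and `<eos>` are assumptions; the actual special tokens depend on the model's tokenizer.

```python
def crop_window(sequence, center, window, add_special_tokens=True,
                bos="<cls>", eos="<eos>"):
    """Return a list of tokens covering up to `window` residues around `center`.

    `bos` and `eos` are placeholder strings (assumed names); substitute the
    special tokens of whichever tokenizer is actually used.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    if not 0 <= center < len(sequence):
        raise ValueError("center must index a residue in the sequence")

    half = window // 2
    start = max(0, center - half)
    end = min(len(sequence), start + window)
    # If the window was clipped at the C-terminus, extend it toward the N-terminus.
    start = max(0, end - window)

    tokens = list(sequence[start:end])
    if add_special_tokens:
        # Flank the cropped fragment so it looks like a complete sequence.
        tokens = [bos] + tokens + [eos]
    return tokens


if __name__ == "__main__":
    seq = "ACDEFGHIKLMNPQRSTVWY"  # toy sequence, one of each standard amino acid
    print(crop_window(seq, center=10, window=5))
    print(crop_window(seq, center=19, window=5, add_special_tokens=False))
```

Comparing model outputs for the same window with and without the flanking tokens would be one way to probe the effect described above, under the stated assumptions.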

Evidence of family-specific features in ESM2-650M was reported in (2) (see Figures).

Figures

Excerpt from (1)

Ref (2)

See also

1. Zhang Z, Wayment-Steele HK, Brixi G, Wang H, Kern D, Ovchinnikov S. Protein language models learn evolutionary statistics of interacting sequence motifs. Proceedings of the National Academy of Sciences. 2024;121(45). Available from: https://doi.org/10.1073/pnas.2406285121
2. Adams E, Bai L, Lee M, Yu Y, AlQuraishi M. From Mechanistic Interpretability to Mechanistic Biology: Training, Evaluating, and Interpreting Sparse Autoencoders on Protein Language Models. openRxiv; 2025. Available from: https://doi.org/10.1101/2025.02.06.636901