PLMs learn family-specific protein contacts from sequence context windows of about 20-40 amino acids

Summary

Protein language models (PLMs) learn protein contacts from sequence context windows of about 20 to 40 amino acids (1). This is evidence that they have not learned how proteins fold, but have instead memorized family-specific statistics of protein motif pairings. Evidence that they learn family-specific features was reported in (2) using sparse autoencoders. Nevertheless, the observation that PLMs and PLM-based structure prediction generalize to de novo designed proteins calls this conclusion into question, at least for small globular proteins.

Details

Beginning-of-sequence (BOS) and end-of-sequence (EOS) tokens help reduce the size of the context window required.
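To make the idea of a restricted context window concrete, here is a minimal sketch of how one might build a windowed input around a pair of residues: residues within a fixed flank of either position are kept, everything else is masked, and BOS/EOS tokens are added at the ends. This is an illustration only and not necessarily the procedure used in (1). The function name `windowed_input`, the `flank` parameter, and the token strings `<cls>`, `<eos>`, and `<mask>` are assumptions chosen for the example.

```python
def windowed_input(seq, i, j, flank, mask="<mask>", bos="<cls>", eos="<eos>"):
    """Keep residues within `flank` positions of i or j (0-based); mask the rest.

    Hypothetical helper: the token strings are assumptions and should be
    replaced with whatever the tokenizer of the model in question uses.
    """
    keep = set()
    for center in (i, j):
        lo = max(0, center - flank)
        hi = min(len(seq), center + flank + 1)
        keep.update(range(lo, hi))
    # Replace every residue outside the two windows with the mask token.
    tokens = [aa if k in keep else mask for k, aa in enumerate(seq)]
    # Wrap the sequence in BOS/EOS tokens.
    return [bos] + tokens + [eos]


example = windowed_input("ACDEFGHIKLMNPQRSTVWY", i=3, j=15, flank=2)
print(" ".join(example))
```

With `flank=2`, each residue of the pair keeps a five-residue window, so ten residues of the 20-residue example remain visible. Comparing a model's contact prediction for the pair as `flank` grows is one way to probe how much sequence context the prediction depends on.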

Evidence of family-specific features in ESM2-650M was observed by (2) (see below).

Figures

Excerpt from (1)

Ref (2)
