Summary
Deep learning (DL) models for protein sequences, including protein language models (PLMs), learn the phylogenetic structure of protein evolution. (1) showed this in the 2D latent space of protein family-specific variational autoencoders (VAEs) (figures below), whereas (2) showed that repurposing language model likelihoods (perplexity) as velocities within a manifold of the ESM-1b sequence embedding space recapitulates the evolutionary trajectories of viral and eukaryotic proteins.
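To make the velocity idea in (2) concrete, here is a minimal sketch. It assumes the score for a directed edge a -> b is the mean, over positions where the two aligned sequences differ, of log p(b_i | a) minus log p(a_i | b); check the exact definition against the paper. The function names and the toy log-probability matrix are hypothetical stand-ins for ESM-1b outputs, not the authors' code.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def log_softmax(logits):
    """Row-wise log-softmax: converts per-position logits to log-probabilities."""
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))


def velocity_score(seq_a, seq_b, logp_a, logp_b):
    """Directional score for the edge a -> b (assumed form, see lead-in).

    seq_a, seq_b: aligned sequences of equal length.
    logp_a[i, k]: log-probability of amino acid k at position i given seq_a.
    logp_b[i, k]: the same, given seq_b.
    A positive score means the model prefers b's residues over a's.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = [i for i, (x, y) in enumerate(zip(seq_a, seq_b)) if x != y]
    if not mismatches:
        return 0.0
    diffs = [
        logp_a[i, AA_INDEX[seq_b[i]]] - logp_b[i, AA_INDEX[seq_a[i]]]
        for i in mismatches
    ]
    return float(np.mean(diffs))


# Toy stand-in for a PLM: context-independent logits that favour K at the
# last position. A real model would give a different matrix per sequence.
seq_a, seq_b = "ACDE", "ACDK"
logits = np.zeros((len(seq_a), len(AMINO_ACIDS)))
logits[3, AA_INDEX["K"]] = 2.0
logp = log_softmax(logits)

v_ab = velocity_score(seq_a, seq_b, logp, logp)
v_ba = velocity_score(seq_b, seq_a, logp, logp)
print(f"a -> b: {v_ab:.3f}, b -> a: {v_ba:.3f}")
```

In this toy case the edge points from seq_a towards seq_b, because the model assigns higher likelihood to K than to E at the mutated position. In (2) such scores are attached to edges between neighbouring sequences in embedding space, so that trajectories can be read off the resulting directed graph.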
Figures
Figures from (1)
Ref (2)
See also
- PLMs are biased by uneven distribution of sequence data in datasets such as UniRef and UniProt
- The latent space of VAEs can encode the conformational landscape of dynamic proteins
1. Detlefsen NS, Hauberg S, Boomsma W. Learning meaningful representations of protein sequences. Nature Communications. 2022;13(1). Available from: https://doi.org/10.1038/s41467-022-29443-w
2. Hie BL, Yang KK, Kim PS. Evolutionary velocity with protein language models predicts evolutionary dynamics of diverse proteins. Cell Systems. 2022;13(4):274-285.e6. Available from: https://doi.org/10.1016/j.cels.2022.01.003