PLM embeddings contain enough information to align proteins without fine-tuning

Summary

PLM embeddings contain enough information to be aligned without fine-tuning (1), and these alignments outperform purely sequence-based but not structure-based methods ((2), (3)). This could be since the embeddings of aligned positions in related sequences tend to co-cluster (4). Alignment quality can be further improved by normalization (5), which does not require PLM fine-tuning. (6) surmise that the distance matrices implied by these embeddings are more effective than the BLOSUM62 matrix used by many sequence alignments by default.

Details

TM-vec, a fine-tuned model, was found to be worse than non-fine-tuned models (3).

To improve alignment quality using “normalization” (per (5)), all distances (for their paper, Euclidean distances are used) are computed and converted to Z-scores to normalize relative to other entries in the same column and row. This was shown to improve performance relative to using non-normalized/enhanced values for alignment calculation.

Figures

Ref (4)

Ref (3)

Ref (5); EBA and EBA-plain refer to PLM-based alignment with and without normalization

Pantolini L, Studer G, Pereira J, Durairaj J, Tauriello G, Schwede T. Embedding-based alignment: combining protein language models with dynamic programming alignment to detect structural similarities in the twilight-zone. Bioinformatics. 2024;40(1). Available from: https://doi.org/10.1093/bioinformatics/btad786

Ashrafzadeh S, Golding GB, Ilie S, Ilie L. Scoring alignments by embedding vector similarity. Briefings in Bioinformatics. 2024;25(3). Available from: https://doi.org/10.1093/bib/bbae178

Quartz 4

Explorer

PLM embeddings contain enough information to align proteins without fine-tuning

Summary

Details

Figures

See also

Graph View

Table of Contents

Backlinks