Summary

Protein language model (PLM) embeddings contain enough information to be aligned without fine-tuning (1), and these alignments outperform purely sequence-based methods but not structure-based ones ((2), (3)). This may be because the embeddings of aligned positions in related sequences tend to co-cluster (4). Alignment quality can be further improved by normalization (5), which does not require PLM fine-tuning. The authors of (6) surmise that the distance matrices implied by these embeddings are more effective than the BLOSUM62 matrix that many sequence alignment methods use by default.
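As a rough illustration of what scoring an alignment with embeddings instead of BLOSUM62 could look like, the sketch below runs a basic Needleman-Wunsch recursion over a position-by-position similarity matrix. Cosine similarity, the linear gap penalty of -1.0, the random stand-in embeddings, and the function names are all assumptions made for this example; none of them is taken from the cited papers.

```python
import numpy as np


def cosine_similarity_matrix(x, y):
    """Cosine similarity between every row of x (n, d) and every row of y (m, d)."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return x @ y.T


def global_alignment_score(score, gap=-1.0):
    """Needleman-Wunsch score for an (n, m) score matrix with a linear gap penalty.

    `score[i, j]` plays the role that a BLOSUM62 lookup would play in a
    conventional aligner (hypothetical setup for illustration).
    """
    n, m = score.shape
    dp = np.zeros((n + 1, m + 1))
    dp[:, 0] = gap * np.arange(n + 1)  # leading gaps in the second sequence
    dp[0, :] = gap * np.arange(m + 1)  # leading gaps in the first sequence
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i, j] = max(
                dp[i - 1, j - 1] + score[i - 1, j - 1],  # align position i with j
                dp[i - 1, j] + gap,                      # gap in the second sequence
                dp[i, j - 1] + gap,                      # gap in the first sequence
            )
    return dp[n, m]


# Random vectors stand in for per-residue PLM embeddings (5 residues, 8 dimensions).
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))
self_score = global_alignment_score(cosine_similarity_matrix(emb, emb))
```

Aligning a sequence of embeddings against itself should recover the diagonal, where every cosine similarity is 1, so `self_score` comes out at the sequence length.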

Details

TM-vec, a fine-tuned model, was found to perform worse than non-fine-tuned models (3).

To improve alignment quality through “normalization” (per (5)), all pairwise distances are first computed (the paper uses Euclidean distances). Each distance is then converted to a Z-score relative to the other entries in the same row and column. This was shown to improve performance relative to running the alignment calculation on the raw, non-normalized values.
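The normalization step can be sketched roughly as follows. This is a minimal illustration, not the implementation from (5): averaging the row-wise and column-wise Z-scores, the small `eps` guard against zero variance, the random stand-in embeddings, and the function names are all assumptions made for the example.

```python
import numpy as np


def euclidean_distances(x, y):
    """Pairwise Euclidean distances between rows of x (n, d) and rows of y (m, d)."""
    diff = x[:, None, :] - y[None, :, :]  # shape (n, m, d) via broadcasting
    return np.sqrt((diff ** 2).sum(axis=-1))


def zscore_normalize(dist, eps=1e-8):
    """Z-score each entry against its own row and its own column.

    The two Z-scores are averaged; how they are combined is an assumption
    of this sketch, not something stated in the source text.
    """
    row_z = (dist - dist.mean(axis=1, keepdims=True)) / (dist.std(axis=1, keepdims=True) + eps)
    col_z = (dist - dist.mean(axis=0, keepdims=True)) / (dist.std(axis=0, keepdims=True) + eps)
    return 0.5 * (row_z + col_z)


# Random vectors stand in for per-residue embeddings of two sequences.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(5, 8))  # 5 residues, 8-dimensional embeddings
emb_b = rng.normal(size=(7, 8))  # 7 residues
dist = euclidean_distances(emb_a, emb_b)
norm = zscore_normalize(dist)
```

The normalized matrix keeps the shape of the distance matrix, so it can be passed to whatever dynamic-programming aligner would otherwise consume the raw distances.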

Figures

Ref (4)

Ref (3)

Ref (5); EBA and EBA-plain refer to PLM-based alignment with and without normalization, respectively.

See also

1. Kaminski K, Ludwiczak J, Pawlicki K, Alva V, Dunin-Horkawicz S. pLM-BLAST: distant homology detection based on direct comparison of sequence representations from protein language models. Bioinformatics. 2023;39(10). Available from: https://doi.org/10.1093/bioinformatics/btad579
2. Llinares-López F, Berthet Q, Blondel M, Teboul O, Vert J-P. Deep embedding and alignment of protein sequences. Nature Methods. 2022;20(1):104–11. Available from: https://doi.org/10.1038/s41592-022-01700-2
3. Hamamsy T, Morton JT, Blackwell R, Berenberg D, Carriero N, Gligorijevic V, et al. Protein remote homology detection and structural alignment using deep learning. Nature Biotechnology. 2023;42(6):975–85. Available from: https://doi.org/10.1038/s41587-023-01917-2
4. McWhite CD, Armour-Garb I, Singh M. Leveraging protein language models for accurate multiple sequence alignments. Genome Research. 2023; Available from: https://doi.org/10.1101/gr.277675.123
5. Pantolini L, Studer G, Pereira J, Durairaj J, Tauriello G, Schwede T. Embedding-based alignment: combining protein language models with dynamic programming alignment to detect structural similarities in the twilight-zone. Bioinformatics. 2024;40(1). Available from: https://doi.org/10.1093/bioinformatics/btad786
6. Ashrafzadeh S, Golding GB, Ilie S, Ilie L. Scoring alignments by embedding vector similarity. Briefings in Bioinformatics. 2024;25(3). Available from: https://doi.org/10.1093/bib/bbae178