Summary

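Attention heads in transformer models learn to attend to low-information delimiter tokens, such as <bos> and <eos>, when they need to leave the representation unchanged: softmax forces the attention weights to sum to 1, so a head cannot simply attend to nothing (1).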

Details

The same is true of vision transformers applied to image data.

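Miller (2023) proposes Quiet Attention, also known as Softmax1, to avoid this. It adds 1 to the softmax denominator, so the attention weights of a head can sum to less than 1:

softmax1(x)_i = exp(x_i) / (1 + sum_j exp(x_j))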
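
As a rough illustration, the sketch below compares standard softmax with Softmax1 on a single row of attention scores. It assumes plain Python lists; the function names and example scores are illustrative and are not taken from Miller's post or from (1). Softmax1 is treated here as a softmax with one extra implicit logit fixed at 0, which is why 0 is included in the max shift used for numerical stability.

```python
import math

def softmax(scores):
    """Standard softmax: the outputs always sum to 1."""
    m = max(scores)  # shift by the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax1(scores):
    """Softmax with 1 added to the denominator: outputs can sum to less than 1."""
    # The "+1" equals exp(0), i.e. an implicit extra logit of 0,
    # so that 0 takes part in the stability shift.
    m = max(0.0, max(scores))
    exps = [math.exp(s - m) for s in scores]
    total = math.exp(-m) + sum(exps)  # exp(-m) is the shifted "+1" term
    return [e / total for e in exps]

# Hypothetical scores for a head that finds no key relevant (all very negative).
scores = [-10.0, -12.0, -11.0]
print(sum(softmax(scores)))   # standard softmax still distributes a total weight of 1
print(sum(softmax1(scores)))  # softmax1 lets the total weight fall close to 0
```

With standard softmax the head must place its full weight somewhere, even when every score is very negative. With Softmax1 the total weight shrinks toward 0, so the head can contribute almost nothing without needing a delimiter token to absorb the attention.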

Billera et al. (2) used Softmax1 to modify invariant point attention.

1. Bondarenko Y, Nagel M, Blankevoort T. Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing. Advances in Neural Information Processing Systems. 2023;36:75067–96. Available from: https://papers.nips.cc/paper_files/paper/2023/hash/edbcb7583fd8921dad78adecfe06a99b-Abstract-Conference.html
2. Billera L, Oresten A, Stålmarck A, Sato K, Kaduk M, Murrell B. The Continuous Language of Protein Structure. openRxiv; 2024. Available from: https://doi.org/10.1101/2024.05.11.593685