Transformers attend to gap tokens to "opt out" of updates

Summary

Transformer models attend to gap tokens, such as <bos> and <eos>, to avoid updating a token's representation when calculating attention (1).

Details

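Softmax forces each head's attention weights to sum to 1, so a head with nothing to contribute places its weight on low-information tokens instead. The same behavior appears in vision transformers on image data.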

Miller (2023) proposes Quiet Attention, also known as Softmax1, to avoid this. It adds 1 to the softmax denominator, which lets the attention weights sum to less than 1:

\mathrm{softmax}_{1}(x_{i}) = \frac{\exp(x_{i})}{1 + \sum_{j} \exp(x_{j})}
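As a rough illustration of the Softmax1 formula above, here is a minimal NumPy sketch. The function names `softmax1` and `softmax`, the `axis` parameter, and the example scores are assumptions made for this note, not code from Miller (2023) or from (1). The max-shift is a standard numerical-stability trick and is not part of the definition itself.

```python
import numpy as np

def softmax1(x, axis=-1):
    """softmax_1(x)_i = exp(x_i) / (1 + sum_j exp(x_j)), computed stably."""
    x = np.asarray(x, dtype=np.float64)
    # The "+1" acts like an extra logit fixed at 0, so include 0 in the max.
    # After the shift every exponent is <= 0 and cannot overflow.
    m = np.maximum(x.max(axis=axis, keepdims=True), 0.0)
    e = np.exp(x - m)
    # exp(-m) is what the "+1" in the denominator becomes after the shift.
    return e / (np.exp(-m) + e.sum(axis=axis, keepdims=True))

def softmax(x, axis=-1):
    """Standard softmax, for comparison."""
    x = np.asarray(x, dtype=np.float64)
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical attention scores for a head with nothing useful to attend to.
scores = np.array([-8.0, -9.0, -7.5])
print(softmax(scores).sum())   # standard softmax: weights are forced to sum to 1
print(softmax1(scores).sum())  # softmax1: weights can sum to nearly 0
```

With standard softmax, uniformly low scores still yield weights that sum to 1, so some token has to receive the attention. With Softmax1 the total weight shrinks toward zero, which gives the head a way to abstain without a gap token.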

This was used by (2) to modify invariant point attention.

(1) Bondarenko Y, Nagel M, Blankevoort T. Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing. Advances in Neural Information Processing Systems. 2023;36:75067–96. Available from: https://papers.nips.cc/paper_files/paper/2023/hash/edbcb7583fd8921dad78adecfe06a99b-Abstract-Conference.html

(2) Billera L, Oresten A, Stålmarck A, Sato K, Kaduk M, Murrell B. The Continuous Language of Protein Structure. openRxiv; 2024. Available from: https://doi.org/10.1101/2024.05.11.593685
