Layer normalization

Summary

Layer normalization standardizes the features of each token and then rescales them so that they have mean $β$ and standard deviation $γ$: $y = \frac{x - μ}{σ + ϵ} \cdot γ + β$, where $μ$ and $σ$ are the per-token mean and standard deviation, respectively, $γ$ and $β$ are learned parameters, and $ϵ$ is a small constant that prevents division by zero. The output is not forced to be Gaussian; only its mean and spread are fixed. The original Transformer paper (1) used the Post-LN approach, in which layer normalization follows multi-head attention and again follows the feed-forward network. More recent models (as of June 2024) typically use the Pre-LN approach, in which layer normalization precedes attention and the feed-forward network (2).
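The formula and the two placements can be illustrated with a minimal pure-Python sketch. It is not taken from either reference: the function names `layer_norm`, `post_ln_block`, and `pre_ln_block` are hypothetical, `sublayer` stands in for attention or the feed-forward network, and the residual addition around each sublayer is assumed from the standard Transformer layout. $γ$ and $β$ are treated as per-feature lists, and $σ$ is the population standard deviation.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Apply y = (x - mu) / (sigma + eps) * gamma + beta to one token vector."""
    n = len(x)
    mu = sum(x) / n
    # Population standard deviation over the features of this token.
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / n)
    return [(v - mu) / (sigma + eps) * g + b for v, g, b in zip(x, gamma, beta)]


def post_ln_block(x, sublayer, gamma, beta):
    """Post-LN: run the sublayer, add the residual, then normalize."""
    added = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(added, gamma, beta)


def pre_ln_block(x, sublayer, gamma, beta):
    """Pre-LN: normalize first, run the sublayer, then add the residual."""
    out = sublayer(layer_norm(x, gamma, beta))
    return [a + b for a, b in zip(x, out)]


# Toy usage: a 4-feature token and a stand-in sublayer that doubles its input.
token = [1.0, 2.0, 3.0, 4.0]
ones, zeros = [1.0] * 4, [0.0] * 4
double = lambda v: [2.0 * t for t in v]

normalized = layer_norm(token, ones, zeros)
post = post_ln_block(token, double, ones, zeros)
pre = pre_ln_block(token, double, ones, zeros)
```

In this sketch the Post-LN output is itself normalized, whereas the Pre-LN output keeps the unnormalized input on the residual path, which is the structural difference between the two approaches.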

Figures

Ref (2)

(1) Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention Is All You Need. 2017. Available from: https://arxiv.org/abs/1706.03762

(2) Xiong R, Yang Y, He D, Zheng K, Zheng S, Xing C, et al. On Layer Normalization in the Transformer Architecture. In: International Conference on Machine Learning. PMLR; 2020. p. 10524–33. Available from: https://proceedings.mlr.press/v119/xiong20b.html
