Mixture-of-depths

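The mixture-of-depths (MoD) method enforces a fixed compute budget in individual layers of a Transformer by capping how many tokens of the input sequence take part in each layer's computation (1). A learned router assigns every token a score, and only the top $k$ tokens pass through the block's multi-head attention and MLP; the remaining tokens skip the block through the residual connection. Raposo et al. (1) also combined this with mixture-of-experts, a variant they call MoDE.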
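The top-$k$ routing step can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the names `mod_layer`, `toy_block`, `router_weights` and `capacity` are hypothetical. The linear router and the scaling of the block output by the router score follow my reading of (1), and `toy_block` is only a stand-in for the real attention and MLP computation.

```python
from typing import Callable, List

Vector = List[float]


def mod_layer(
    tokens: List[Vector],
    router_weights: Vector,
    block: Callable[[List[Vector]], List[Vector]],
    capacity: int,
) -> List[Vector]:
    """Send only the `capacity` highest-scoring tokens through `block`."""
    # Assumed linear router: one scalar score per token.
    scores = [sum(w * x for w, x in zip(router_weights, tok)) for tok in tokens]

    # Pick the top-k token indices, then restore sequence order.
    k = min(capacity, len(tokens))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    selected = sorted(top)

    # The block only ever sees the selected tokens, so its cost depends on k,
    # not on the full sequence length.
    updates = block([tokens[i] for i in selected])

    # Unselected tokens are copied through unchanged (residual path only).
    out = [list(tok) for tok in tokens]
    for i, upd in zip(selected, updates):
        # Scaling by the router score keeps the router on the gradient path.
        out[i] = [x + scores[i] * u for x, u in zip(tokens[i], upd)]
    return out


def toy_block(selected: List[Vector]) -> List[Vector]:
    """Hypothetical stand-in for attention + MLP: a constant update."""
    return [[1.0 for _ in tok] for tok in selected]


tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [0.5, 0.5]]
routed = mod_layer(tokens, router_weights=[1.0, 0.0], block=toy_block, capacity=2)
```

With these inputs the router scores are 1.0, 0.0, 3.0 and 0.5, so only the first and third tokens are processed by `toy_block`; the other two leave the layer unchanged.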

Ref (1)

Raposo D, Ritter S, Richards B, Lillicrap T, Humphreys PC, Santoro A. Mixture-of-Depths: Dynamically allocating compute in transformer-based language models. 2024; Available from: https://arxiv.org/abs/2404.02258
