Summary

Disentangled attention is an alternative way to represent position in encoder-type transformers.

Details

  • Disentangled attention splits word info and position info. This leads to four types of attention, though they don’t use position-to-position attention
  • Final attention matrix adds these together
  • New K Q matrices for position vectors
  • P vector is not changed, but the initial assignment matrix is learned
  • Absolute position encodings still required, and added in the last transformer layer - shown to work better
  • The combination is what gives them the performance boost