Masked LMs can be fine-tuned starting from autoregressive LMs, but not vice-versa

Summary

Masked language models (such as BERT) can be fine-tuned starting from autoregressive models (such as GPT) in ways that benefit from scale, but not vice-versa (1). This was demonstrated using protein language models.
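The note does not describe how the adaptation is carried out, so the following is only a minimal sketch of the two generic differences between the model families that such fine-tuning has to bridge: the attention mask (causal versus bidirectional) and the training objective (next-token versus masked-token prediction). All names here (`MASK`, `causal_mask`, `bidirectional_mask`, `next_token_examples`, `masked_token_example`), the 15% mask rate, and the example sequence are assumptions made for illustration and are not taken from the referenced work.

```python
import random

MASK = "<mask>"  # hypothetical mask-token name, chosen for this sketch


def causal_mask(n):
    """Autoregressive attention: position i may attend to positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Masked-LM attention: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def next_token_examples(tokens):
    """Autoregressive objective: predict tokens[i + 1] from the prefix tokens[: i + 1]."""
    return [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]


def masked_token_example(tokens, mask_rate=0.15, seed=0):
    """Masked objective: hide a random subset of positions and predict what was hidden.

    The 15% default is an assumed, commonly used rate, not a value from the note.
    """
    rng = random.Random(seed)
    n_hidden = max(1, round(mask_rate * len(tokens)))
    hidden = sorted(rng.sample(range(len(tokens)), n_hidden))
    corrupted = [MASK if i in hidden else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in hidden}
    return corrupted, targets


# Made-up amino-acid string, used only to mirror the protein setting.
protein = list("MKTAYIAKQR")
corrupted, targets = masked_token_example(protein)
```

Under these assumptions, starting a masked model from autoregressive weights would amount to keeping the weights, replacing `causal_mask` with `bidirectional_mask`, and training on `masked_token_example` pairs instead of `next_token_examples` pairs.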

Figures

Ref (1)
