BERT is a type of encoder-only Transformer model that is typically trained by masking 15% of tokens. It is less widely-used as decoder-only LMs for generative modeling.