Summary

Protein language models are biased toward residues found in germline sequences (1, 2). One reason is that sequenced blood samples contain mostly naive B cells, whose antibodies are still close to germline, and comparatively few memory B cells and plasma cells. Olsen et al. (3) found that focal loss improves prediction of non-germline residues. Paired antibody language models are less sensitive to this bias (Burbach and Briney (4)).

Details

Relatedly, general (non-antibody) protein language models are biased toward sequences from model organisms. This bias arises from the same sensitivity to the composition of the training data.
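
As a rough illustration of why focal loss can help with rare non-germline residues, the sketch below compares plain cross-entropy with focal loss, which scales cross-entropy by (1 - p)^gamma so that confidently predicted positions contribute very little. The probabilities and gamma = 2.0 are illustrative assumptions, not values taken from (3), and the function names are hypothetical.

```python
import math


def cross_entropy(p_true: float) -> float:
    """Negative log-likelihood of the true residue."""
    return -math.log(p_true)


def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Cross-entropy down-weighted by (1 - p_true) ** gamma.

    gamma = 0 recovers plain cross-entropy; gamma = 2.0 is an
    assumed default for illustration only.
    """
    return (1.0 - p_true) ** gamma * cross_entropy(p_true)


# Hypothetical predicted probabilities of the true residue at two masked positions.
examples = {
    "germline (easy)": 0.95,
    "non-germline (hard)": 0.10,
}

for name, p in examples.items():
    print(f"{name}: CE={cross_entropy(p):.4f}  FL={focal_loss(p):.4f}")
```

With these assumed inputs, the easy germline position loses almost all of its weight under focal loss while the hard non-germline position keeps most of it, so training gradients shift toward the residues the model gets wrong.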

Figures

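| Model | Germline, heavy FRW | Germline, heavy CDR1/2 | Germline, light FRW | Germline, light CDR1/2 | Non-germline, heavy FRW | Non-germline, heavy CDR1/2 | Non-germline, heavy CDR3 | Non-germline, light FRW | Non-germline, light CDR1/2 | Non-germline, light CDR3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ESM-2 | 1.91 | 4.12 | 2.54 | 6.11 | 32.03 | 24.36 | 20.85 | 23.20 | 19.37 | 24.29 |
| AntiBERTy | 1.05 | 1.10 | 1.17 | 1.28 | 29.64 | 21.51 | 18.44 | 40.14 | 21.75 | 16.95 |
| AbLang-1 | 1.03 | 1.08 | 1.07 | 1.16 | 25.80 | 17.73 | 14.47 | 52.14 | 25.72 | 16.75 |
| Ab-Unpaired | 1.02 | 1.07 | 1.01 | 1.05 | 26.81 | 18.95 | 14.42 | 37.60 | 19.37 | 17.25 |
| Ab-Paired | 1.02 | 1.06 | 1.02 | 1.05 | 27.24 | 18.70 | 14.23 | 38.95 | 19.25 | 16.98 |
| Ab-FL | 1.10 | 1.17 | 1.09 | 1.16 | 10.33 | 11.18 | 12.69 | 10.82 | 10.24 | 11.04 |
| Ab-ModMask | 1.11 | 1.18 | 1.09 | 1.17 | 10.26 | 11.13 | 13.18 | 10.78 | 10.19 | 11.42 |
| Ab-FT | 1.11 | 1.18 | 1.10 | 1.18 | 10.88 | 11.91 | 13.67 | 11.25 | 10.63 | 12.29 |
| AbLang-2 | 1.10 | 1.17 | 1.09 | 1.16 | 9.92 | 11.13 | 12.47 | 10.09 | 9.54 | 10.77 |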
Table 3 from (3)
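
One way to read the table is to compare, per model, how much worse the score is on non-germline residues than on germline residues. The sketch below does this for the heavy-chain framework (FRW) columns of a few rows, using values copied from the table. It assumes the values are perplexities (lower is better), which the table as reproduced here does not state; `gap_ratio` is a hypothetical helper name.

```python
# (germline, non-germline) heavy-chain FRW values copied from the table above.
heavy_frw = {
    "ESM-2": (1.91, 32.03),
    "AbLang-1": (1.03, 25.80),
    "Ab-FL": (1.10, 10.33),
    "AbLang-2": (1.10, 9.92),
}


def gap_ratio(germline: float, non_germline: float) -> float:
    """How many times larger the non-germline value is than the germline value."""
    return non_germline / germline


ratios = {model: gap_ratio(g, n) for model, (g, n) in heavy_frw.items()}

# List models from largest to smallest germline/non-germline gap.
for model, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: non-germline is {ratio:.1f}x germline")
```

Under that assumption, the models trained with focal loss (Ab-FL, AbLang-2) show a much smaller gap than AbLang-1, at the cost of a slightly higher germline value.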

Figure 3 from Burbach and Briney (4)

See also

1. Olsen TH, Moal IH, Deane CM. AbLang: an antibody language model for completing antibody sequences. Bioinformatics Advances. 2022;2(1). Available from: https://doi.org/10.1093/bioadv/vbac046
2. Nijkamp E, Ruffolo JA, Weinstein EN, Naik N, Madani A. ProGen2: Exploring the boundaries of protein language models. Cell Systems. 2023;14(11):968-978.e3. Available from: https://doi.org/10.1016/j.cels.2023.10.002
3. Olsen TH, Moal IH, Deane CM. Addressing the antibody germline bias and its effect on language models for improved antibody design. Bioinformatics. 2024;40(11). Available from: https://doi.org/10.1093/bioinformatics/btae618
4. Burbach SM, Briney B. Improving antibody language models with native pairing. Patterns. 2024;5(5):100967. Available from: https://doi.org/10.1016/j.patter.2024.100967