Summary
Too little antibody-antigen binding data is available to train statistical or machine learning ΔΔG prediction models to high accuracy, even when the data points are simulated (1). The authors estimate that at least 90,000 structures are required for prediction quality to plateau.
Details
Overfitting was avoided by training on 1 million synthetic measurements generated with FoldX from SAbDab structures. The authors used the AB-Bind database, but propose replacing it with a more limited set of 608 measurements. Overfitting was detected in part by placing specific CDR lengths exclusively in the validation/test set, which reduced the Pearson correlation.
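The length-based hold-out described above can be illustrated with a minimal sketch. It is not the authors' code: the record layout, the field name cdr_length, and the toy values are all hypothetical, and the paper's actual splitting procedure may differ. The idea is only that every record with a chosen CDR length goes to the held-out set, and the Pearson correlation is then computed separately on each side.

```python
import math

def split_by_cdr_length(records, held_out_lengths, key="cdr_length"):
    """Send every record whose CDR length is in held_out_lengths to the held-out set.

    `key` is a hypothetical field name used for illustration only.
    """
    train, held_out = [], []
    for rec in records:
        if rec[key] in held_out_lengths:
            held_out.append(rec)
        else:
            train.append(rec)
    return train, held_out

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Made-up toy records: a CDR length plus a measured and a predicted ddG value.
records = [
    {"cdr_length": 10, "ddg": 0.5, "pred": 0.4},
    {"cdr_length": 10, "ddg": 1.2, "pred": 1.0},
    {"cdr_length": 12, "ddg": -0.3, "pred": 0.1},
    {"cdr_length": 12, "ddg": 0.8, "pred": 0.2},
    {"cdr_length": 14, "ddg": 2.0, "pred": 0.9},
    {"cdr_length": 14, "ddg": 0.1, "pred": 0.6},
]

# Hold out all records with CDR length 14; none of them may appear in training.
train, held_out = split_by_cdr_length(records, held_out_lengths={14})
```

Comparing pearson on predictions for the held-out lengths against the value obtained with a random split gives a rough indication of how much the model relies on having seen similar lengths during training.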
See also
1. Hummer AM, Schneider C, Chinery L, Deane CM. Investigating the volume and diversity of data needed for generalizable antibody–antigen ΔΔG prediction. Nature Computational Science. 2025;5(8):635–47. Available from: https://doi.org/10.1038/s43588-025-00823-8