Summary

The unbalanced composition of protein sequence databases prevents fitness from being predicted from sequence data alone, or from models trained on such data, such as protein language models (PLMs) (1). The authors argue that bigger models and datasets do not guarantee improvements in fitness prediction.

Details

Phylogenetic effects (how sequences have evolved over time) play a large role in shaping the data distribution, and the ideal distribution and model are non-identifiable, even with infinite data. This is because it is impossible to distinguish imbalances caused by sampling related sequences from genuine fitness peaks. The authors claim that a better data representation can be obtained by fitting a misspecified generative model, e.g., one that fits the data distribution more poorly because the fitting model is parametric instead of non-parametric.
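
The non-identifiability argument can be illustrated with a toy sketch. It is not the authors' model: it assumes, purely for illustration, that the observed frequency of a variant is proportional to its fitness multiplied by a sampling bias, and the names `observed_distribution`, `fitness`, and `sampling_bias` are hypothetical. Under that assumption, a real fitness peak and an oversampled group of related sequences produce exactly the same data distribution.

```python
from fractions import Fraction


def observed_distribution(fitness, sampling_bias):
    """Toy assumption: observed frequency is proportional to fitness * sampling bias."""
    weights = [Fraction(f) * Fraction(b) for f, b in zip(fitness, sampling_bias)]
    total = sum(weights)
    # Normalise so the frequencies sum to 1 (exact arithmetic via Fraction).
    return [w / total for w in weights]


# Scenario 1: the first variant is genuinely fitter, sampling is uniform.
fitness_peak = observed_distribution(
    fitness=[4, 1, 1, 1], sampling_bias=[1, 1, 1, 1]
)

# Scenario 2: all variants are equally fit, but the first one is oversampled
# (e.g. many closely related sequences were deposited in the database).
oversampled = observed_distribution(
    fitness=[1, 1, 1, 1], sampling_bias=[4, 1, 1, 1]
)

# Both scenarios give the same observed distribution, so no amount of data
# drawn from it can tell them apart.
print(fitness_peak == oversampled)
```

Because only the product of the two factors is observed in this toy setup, collecting more sequences sharpens the estimate of the data distribution but does not separate fitness from sampling effects.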

See also

1. Weinstein EN, Amin AN, Frazer J, Marks DS. Non-identifiability and the Blessings of Misspecification in Models of Molecular Fitness. openRxiv; 2022. Available from: https://doi.org/10.1101/2022.01.29.478324