Summary
The unbalanced composition of protein sequence databases prevents prediction of fitness from sequence data alone, or from models derived from such data, such as protein language models (PLMs) (1). The authors argue that bigger models and larger datasets do not guarantee improvements in fitness prediction.
Details
Phylogenetic effects (how sequences have evolved over time) play a huge role in shaping the data distribution, and the ideal distribution and model are non-identifiable, even with infinite data. This is because it is impossible to distinguish imbalances caused by sampling related sequences from genuine fitness peaks. The authors claim that a better data representation can be obtained by fitting a generative model that deliberately fits the data distribution more poorly, e.g., by making the fitted model parametric instead of non-parametric.
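The non-identifiability argument can be illustrated with a toy sketch. This is not the authors' model: it assumes, purely for illustration, that the observed frequency of a variant is proportional to exp(fitness) times a sampling weight standing in for phylogenetic imbalance. The function name `observed_distribution` and all numbers are hypothetical.

```python
import math


def observed_distribution(fitness, sampling_weight):
    """Toy assumption: frequency is proportional to exp(fitness) * sampling_weight."""
    unnormalized = [math.exp(f) * w for f, w in zip(fitness, sampling_weight)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Scenario 1: variant 0 is a genuine fitness peak and sampling is even.
peak = observed_distribution([math.log(2), 0.0, 0.0], [1.0, 1.0, 1.0])

# Scenario 2: fitness is flat, but variant 0's lineage is sampled twice as often.
bias = observed_distribution([0.0, 0.0, 0.0], [2.0, 1.0, 1.0])

# Both scenarios yield the same observed distribution (up to rounding),
# so no amount of data drawn from it can tell them apart.
same = all(math.isclose(a, b) for a, b in zip(peak, bias))
print(peak, bias, same)
```

Under this assumed form, a model that matches the observed distribution perfectly still cannot say whether variant 0 is fitter or merely oversampled. That is the sense in which more data alone does not resolve the ambiguity.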
See also
- Protein property prediction using PLMs does not benefit from scale except when inferring features of either structural or sparsely populated sequence families
- PLMs are biased by uneven distribution of sequence data in datasets such as UniRef and UniProt