Unbalanced composition of sequence data prevents protein fitness from being identifiable from sequence data alone

Summary

The unbalanced composition of protein sequence databases prevents fitness from being predicted from sequence data alone, or from models derived from such data, such as protein language models (PLMs) (1). The authors of (1) argue that bigger models and datasets do not guarantee improvements in fitness prediction.

Details

Phylogenetic effects (how sequences have evolved over time) play a large role in shaping the data distribution $p_{0}$, and the ideal distribution $p^{\infty}$ and model $f$ are non-identifiable, even with infinite data. This is because it is impossible to distinguish a genuine fitness peak from an imbalance caused by oversampling related sequences. The authors claim that a better data representation can be obtained by fitting a generative model $M = \{q_{\theta} : \theta \in \Theta\}$ such that $p^{\infty} \in M$ and $p_{0} \notin M$; e.g., by deliberately fitting the data distribution $p_{0}$ more poorly, using a parametric fitting model instead of a non-parametric one.
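
The non-identifiability argument can be illustrated with a toy sketch. It is not the authors' model: it assumes, purely for illustration, that observed frequencies are proportional to fitness multiplied by a sampling bias, and the four variants, the numbers, and the helper `observed_distribution` are all hypothetical. Under that assumption, a real fitness peak and a flat landscape with an oversampled clade yield the same observed distribution, so no amount of data drawn from it can tell them apart.

```python
import numpy as np

# Toy sequence space with four variants (hypothetical labels).
variants = ["A", "B", "C", "D"]


def observed_distribution(fitness, sampling_bias):
    """Observed frequencies under an ASSUMED multiplicative model:
    frequency is proportional to fitness * sampling_bias."""
    weights = np.asarray(fitness, dtype=float) * np.asarray(sampling_bias, dtype=float)
    return weights / weights.sum()


# Scenario 1: variant A is a genuine fitness peak; sampling is unbiased.
p0_peak = observed_distribution(fitness=[4, 1, 1, 1], sampling_bias=[1, 1, 1, 1])

# Scenario 2: fitness is flat, but variant A's clade is oversampled fourfold.
p0_bias = observed_distribution(fitness=[1, 1, 1, 1], sampling_bias=[4, 1, 1, 1])

# Both scenarios produce the same observed distribution.
same = np.allclose(p0_peak, p0_bias)
print(dict(zip(variants, p0_peak.round(3))), "identical:", same)
```

Because `p0_peak` and `p0_bias` coincide, a model that fits the observed distribution perfectly recovers the same answer in both scenarios, even though the underlying fitness landscapes differ.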
