Summary
Focused protein sequence libraries are poor training sets. Yang et al. (1) found that when a sequence search is highly exploitative (i.e., it targets exclusively sequences predicted to be high-fitness), the resulting library becomes a poor training set. In contrast, models trained on more explorative libraries could learn the fitness landscape more effectively.
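The intuition can be illustrated with a toy simulation. The sketch below is not the method from (1). It assumes a hypothetical additive landscape over 4 sites and 20 residues, a ridge regression on one-hot features, and an "exploitative" library restricted to the two best residues at each site. All names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_SITES, N_AA, N_TRAIN = 4, 20, 200  # assumed toy sizes, not from (1)

# Hypothetical additive landscape: one effect per (site, residue).
site_effects = rng.normal(size=(N_SITES, N_AA))

def fitness(seqs):
    """Sum per-site effects for integer-encoded sequences, shape (n, N_SITES)."""
    return site_effects[np.arange(N_SITES), seqs].sum(axis=1)

def one_hot(seqs):
    """Flatten each sequence into an N_SITES * N_AA indicator vector."""
    return np.eye(N_AA)[seqs].reshape(len(seqs), -1)

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression with an appended bias column."""
    X1 = np.hstack([X, np.ones((len(X), 1))])
    return np.linalg.solve(X1.T @ X1 + lam * np.eye(X1.shape[1]), X1.T @ y)

def predict(w, X):
    """Apply ridge weights (including the bias term) to one-hot features."""
    return np.hstack([X, np.ones((len(X), 1))]) @ w

# Exploitative library: sample only the two best residues at each site.
top2 = np.argsort(site_effects, axis=1)[:, -2:]
picks = rng.integers(0, 2, size=(N_TRAIN, N_SITES))
focused = top2[np.arange(N_SITES), picks]

# Explorative library: sample every residue uniformly at every site.
explorative = rng.integers(0, N_AA, size=(N_TRAIN, N_SITES))

# Held-out sequences drawn from the full sequence space.
held_out = rng.integers(0, N_AA, size=(2000, N_SITES))

def held_out_correlation(train_seqs):
    """Train on noisy measurements, return Pearson r on the held-out set."""
    y = fitness(train_seqs) + rng.normal(scale=0.1, size=len(train_seqs))
    w = fit_ridge(one_hot(train_seqs), y)
    return np.corrcoef(predict(w, one_hot(held_out)), fitness(held_out))[0, 1]

r_focus = held_out_correlation(focused)
r_explore = held_out_correlation(explorative)
print(f"focused library:     r = {r_focus:.2f}")
print(f"explorative library: r = {r_explore:.2f}")
```

In this toy setting the focused library never observes most residues, so their learned effects stay at the ridge default of zero and the model generalizes poorly outside the region it sampled. Real landscapes are epistatic and real models differ, so this only shows the mechanism, not the size of the effect.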
See also
- Insufficient ddG data on Abs for training
- Accurate fitness landscapes are unnecessary for productively engineering enzymes or proteins
1. Yang J, Ducharme J, Johnston KE, Li F-Z, Yue Y, Arnold FH. DeCOIL: Optimization of Degenerate Codon Libraries for Machine Learning-Assisted Protein Engineering. ACS Synthetic Biology. 2023;12(8):2444–54. Available from: https://doi.org/10.1021/acssynbio.3c00301