Protein structure tokenization refers to the process of discretizing protein structure using a learned codebook derived from vector-quantized variational autoencoders. It was first used for search purposes with Foldseek and has since been adopted for use with ESM3, SaProt, and others.
Notes
- Structural tokenization circumvents the inability of inverse folding models to be effectively trained on exclusively computational models.