Grokking refers to delayed generalization that emerges when a neural network is trained far beyond the point of overfitting. Wang et al. (1) showed that it can improve the reasoning abilities of transformers, whereas Springer et al. (2) showed that overtraining can make language models more difficult to fine-tune.
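One way to make "far beyond overfitting" concrete is to measure how many logged steps pass between the model fitting its training set and reaching the same accuracy on held-out data. The following is a minimal sketch under stated assumptions, not a method taken from either cited paper: the helper names `first_step_reaching` and `generalization_delay`, the 0.99 threshold, and the accuracy curves are all hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence


def first_step_reaching(accuracies: Sequence[float], threshold: float) -> Optional[int]:
    """Return the index of the first logged accuracy >= threshold, or None."""
    for step, acc in enumerate(accuracies):
        if acc >= threshold:
            return step
    return None


def generalization_delay(
    train_acc: Sequence[float],
    val_acc: Sequence[float],
    threshold: float = 0.99,  # assumed cutoff, not taken from the cited papers
) -> Optional[int]:
    """Logged steps between fitting the training set and generalizing.

    Returns None if either curve never reaches the threshold.
    """
    fit_step = first_step_reaching(train_acc, threshold)
    gen_step = first_step_reaching(val_acc, threshold)
    if fit_step is None or gen_step is None:
        return None
    return gen_step - fit_step


# Made-up curves for illustration only (not results from any experiment).
train_acc = [0.20, 0.70, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00]
val_acc = [0.10, 0.20, 0.25, 0.25, 0.30, 0.60, 0.95, 1.00]

delay = generalization_delay(train_acc, val_acc)
print(delay)
```

A large positive delay on real training logs would be consistent with grokking-like behavior; a delay near zero would suggest that training and held-out accuracy rose together.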
1. Wang B, Yue X, Su Y, Sun H. Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization. 2024. Available from: https://arxiv.org/abs/2405.15071
2. Springer JM, Goyal S, Wen K, Kumar T, Yue X, Malladi S, et al. Overtrained Language Models Are Harder to Fine-Tune. In: International Conference on Machine Learning. PMLR; 2025. p. 56719–89. Available from: https://proceedings.mlr.press/v267/springer25a.html