Summary

Pretraining performance alone does not capture effectiveness on downstream tasks (1,2). Neyshabur et al. (1) found that fine-tuning from different post-plateau checkpoints yielded drastically different downstream performance, with more heavily trained models performing better.

See also

1. Neyshabur B, Sedghi H, Zhang C. What is being transferred in transfer learning? Advances in Neural Information Processing Systems. 2020;33:512–23. Available from: https://proceedings.neurips.cc/paper/2020/hash/0607f4c705595b911a4f3e7a127b44e0-Abstract.html
2. Li F-Z, Amini AP, Yue Y, Yang KK, Lu AX. Feature Reuse and Scaling: Understanding Transfer Learning with Protein Language Models. openRxiv; 2024. Available from: https://doi.org/10.1101/2024.02.05.578959