Pretraining performance does not capture effectiveness on downstream tasks

Summary

Pretraining performance does not capture effectiveness on downstream tasks (1,2). Neyshabur et al. found that fine-tuning from different checkpoints, all taken after pretraining performance had plateaued, produced drastically different results on downstream tasks, with more heavily trained checkpoints performing better.
