Summary

Synthetic data over-represents mean values of the original data distribution and either underrepresents or exaggerates the presence of outliers (1). This can pose a challenge for models trained using Distillation and can be a cause of Catastrophic forgetting.

Figures

Ref (1)

See also

1.
Shumailov I, Shumaylov Z, Zhao Y, Gal Y, Papernot N, Anderson R. The Curse of Recursion: Training on Generated Data Makes Models Forget. 2023; Available from: https://arxiv.org/abs/2305.17493