Poster
Retraining with Predicted Hard Labels Provably Increases Model Accuracy
Rudrajit Das · Inderjit Dhillon · Alessandro Epasto · Adel Javanmard · Jieming Mao · Vahab Mirrokni · Sujay Sanghavi · Peilin Zhong
West Exhibition Hall B2-B3 #W-914
Training machine learning (ML) models with incorrect or noisy supervision (i.e., labels) is a common challenge in the real world. Surprisingly, simply retraining a model on its own predicted labels often improves its performance, even though those predictions come from the very model that was initially trained on the bad data. Despite the practical success of this trick, a solid mathematical understanding of how, when, and why it works has been missing.

We theoretically analyze model retraining for a binary (two-class) classification problem where the given labels are corrupted, and characterize the conditions under which retraining can improve the model's performance.

We also explore how this idea helps with label differential privacy (DP), a private machine learning technique in which the privacy of the training labels is protected by deliberately adding label noise. We propose consensus-based retraining, a method that retrains only on those examples for which the model's prediction matches the given label, and we empirically show that it leads to significant performance gains.

Ultimately, our paper offers theoretical insight and practical value for building better ML models under noisy supervision with the simple idea of retraining.
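To make the two retraining schemes concrete, here is a minimal sketch in Python. It assumes a toy logistic-regression setup with synthetic symmetric label noise; the dataset, model choice, and noise rate are illustrative assumptions, not the paper's experimental setup.

```python
# Minimal sketch: retraining on predicted hard labels vs. consensus-based
# retraining. Toy setup (logistic regression, synthetic data) is assumed
# for illustration; it is not the paper's code.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic binary classification data with clean labels.
X, y_clean = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train_clean, y_test = train_test_split(
    X, y_clean, test_size=0.2, random_state=0
)

# Corrupt a fraction of the training labels (symmetric label noise).
noise_rate = 0.3  # illustrative assumption
flip = rng.random(len(y_train_clean)) < noise_rate
y_noisy = np.where(flip, 1 - y_train_clean, y_train_clean)

# Step 1: train an initial model on the noisy labels.
base = LogisticRegression(max_iter=1000).fit(X_train, y_noisy)
print("base accuracy:", base.score(X_test, y_test))

# Step 2a: full retraining on the model's own predicted hard labels.
y_pred = base.predict(X_train)
full_rt = LogisticRegression(max_iter=1000).fit(X_train, y_pred)
print("full retraining accuracy:", full_rt.score(X_test, y_test))

# Step 2b: consensus-based retraining. Keep only the examples where the
# model's prediction agrees with the given (noisy) label, then retrain.
consensus = y_pred == y_noisy
cons_rt = LogisticRegression(max_iter=1000).fit(
    X_train[consensus], y_noisy[consensus]
)
print("consensus retraining accuracy:", cons_rt.score(X_test, y_test))
```

In this sketch the consensus filter discards examples on which the model and the given label disagree, which tends to remove a disproportionate share of the corrupted labels before the second round of training.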