

Poster
in
Workshop: DataWorld: Unifying data curation frameworks across domains

f-INE: Influence Estimation using Hypothesis Testing

Subhodip Panda · Shashwat Sourav · Prathosh AP · Sai Praneeth Reddy Karimireddy

Keywords: [ data attribution ] [ privacy ] [ explainability ]


Abstract:

Understanding the influence of individual or groups of training examples on a model's predictions lies at the core of data valuation, attribution, debugging, and privacy. Yet, due to the inherent randomness and complexity of modern ML training pipelines, reliably measuring this influence remains elusive. In fact, we show that existing influence estimation methods fail to account for this randomness, leading to highly inconsistent results: small changes in data ordering can produce drastically different influence estimates. Instead, we introduce f-influence, a new definition of influence grounded in hypothesis testing that explicitly captures training-time randomness. We define the influence of a data subset as the statistical ease of distinguishing between the outputs of models trained with and without that subset. We prove that f-influence satisfies desirable properties such as compositionality and asymptotic normality, analogous to central limit theorems. Leveraging these properties, we design f-INfluence Estimation (f-INE), a new algorithm that efficiently estimates the influence of training data in a single training run. Our approach is a theoretically sound, highly scalable, and empirically robust alternative to prior methods, enabling consistent influence estimation.
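The paper's f-INE algorithm itself is not reproduced on this page, but the hypothesis-testing view of influence can be illustrated with a loose sketch. Under training randomness, a model's output on a probe point is a sample from a distribution, so the influence of a subset can be read off from how distinguishable the output distributions are with and without it. The example below is a simplified stand-in, not the paper's method: it uses a rank-based (Mann-Whitney) AUC as the distinguishability statistic, and the Gaussian "model outputs" are entirely hypothetical.

```python
import random

def auc_distinguishability(with_subset, without_subset):
    """Empirical AUC of a rank test (Mann-Whitney style) that tries to
    distinguish the two samples of model outputs. An AUC of 0.5 means
    the samples are indistinguishable (no detectable influence); values
    far from 0.5 mean the subset measurably shifts the output."""
    wins = 0
    ties = 0
    for a in with_subset:
        for b in without_subset:
            if a > b:
                wins += 1
            elif a == b:
                ties += 1
    return (wins + 0.5 * ties) / (len(with_subset) * len(without_subset))

random.seed(0)
# Hypothetical scalar outputs (e.g., loss on a probe example) from many
# retrained models; training randomness makes each a distribution.
with_s = [random.gauss(0.8, 0.2) for _ in range(200)]     # trained with subset
without_s = [random.gauss(1.0, 0.2) for _ in range(200)]  # trained without it

auc = auc_distinguishability(with_s, without_s)
# Symmetric influence score in [0, 1]: 0 = indistinguishable.
influence = 2 * abs(auc - 0.5)
print(auc, influence)
```

In this toy setup the subset lowers the loss on average, so the AUC lands well below 0.5 and the symmetrized score is clearly positive; identical distributions would give a score near zero. The paper's contribution is doing this consistently and efficiently from a single training run rather than by retraining many models, which this sketch does not attempt.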
