

Poster in Workshop: 1st Workshop on Foundation Models for Structured Data (FMSD)

Improving Treatment Effect Estimation with LLM-Based Data Augmentation

Nicolas Huynh · Julianna Piskorz · Jeroen Berrevoets · Max Ruiz Luyten · Mihaela van der Schaar


Abstract: We introduce $\texttt{GATE}$, a framework that improves conditional average treatment effect (CATE) estimation in small-sample regimes. Our framework augments datasets with synthetic _counterfactual_ outcomes using _pre-trained_ generative models. Doing so addresses the covariate shift problem that arises when inferring CATE from observational data. By using pre-trained generative models, $\texttt{GATE}$ equips downstream CATE models with knowledge _beyond the training data_. In particular, we instantiate $\texttt{GATE}$ with large language models (LLMs), which we show to work exceptionally well. LLMs utilise rich contextual information, such as dataset metadata, to generate outcomes grounded in real-world contexts. We demonstrate, both theoretically and empirically, that restricting augmentation to a carefully chosen subset of the covariate space can yield performance gains—_even with imperfect generated outcomes._
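To make the idea in the abstract concrete, below is a minimal sketch of this style of augmentation, not the authors' implementation: each observed unit inside a chosen covariate region receives an LLM-generated outcome for the treatment arm it did not receive, and a simple T-learner is then fit on the augmented data. The function names, the `in_region` rule, the T-learner choice, and the LLM stub are all illustrative assumptions.

```python
# Illustrative sketch (assumptions, not the authors' code): augment a small
# observational dataset with LLM-generated counterfactual outcomes on a
# restricted covariate subset, then fit a T-learner for CATE.
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def llm_counterfactual_outcome(x_row, counterfactual_arm, metadata):
    """Hypothetical LLM call: given one unit's covariates, the arm it did NOT
    receive, and dataset metadata (feature names, outcome description),
    return a synthetic outcome. Swap in a real LLM client here."""
    raise NotImplementedError("plug in an LLM query")


def augment_with_counterfactuals(X, t, y, metadata, in_region):
    """Append synthetic counterfactual rows only where in_region(x) is True,
    restricting augmentation to a chosen subset of the covariate space."""
    X_parts, t_parts, y_parts = [X], [t], [y]
    for x_row, arm in zip(X, t):
        if not in_region(x_row):
            continue
        cf_arm = 1 - arm  # the treatment arm this unit did not receive
        y_cf = llm_counterfactual_outcome(x_row, cf_arm, metadata)
        X_parts.append(x_row[None, :])
        t_parts.append(np.array([cf_arm]))
        y_parts.append(np.array([float(y_cf)]))
    return np.vstack(X_parts), np.concatenate(t_parts), np.concatenate(y_parts)


def fit_t_learner_cate(X, t, y):
    """Fit one outcome model per arm; the CATE estimate is mu1(x) - mu0(x)."""
    mu0 = RandomForestRegressor().fit(X[t == 0], y[t == 0])
    mu1 = RandomForestRegressor().fit(X[t == 1], y[t == 1])
    return lambda X_query: mu1.predict(X_query) - mu0.predict(X_query)
```

With binary treatment `t`, the augmented dataset contains both factual and synthetic counterfactual outcomes inside the chosen region, which is the mechanism the abstract credits with mitigating covariate shift between treatment arms.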
