Poster in Workshop: The 2nd Workshop on Reliable and Responsible Foundation Models
Reliable Statistical Inference with Synthetic Data from Large Language Models
Yewon Byun · Shantanu Gupta · Zachary Lipton · Rachel Childers · Bryan Wilder
Keywords: [ LLMs for Social Science ] [ Human-AI Collaboration ] [ Reliable Statistical Inference ]
There is increasing interest in using large language models to generate entirely new synthetic samples to support social science and human subject research, such as survey responses or simulations of human behavior. However, it is not immediately clear how practitioners can incorporate such data while still drawing reliable insights and conclusions from it. In this work, we introduce a principled framework for reliably incorporating fully synthetic samples from text-based foundation models into downstream statistical analyses. Our estimator offers a hyperparameter-free solution with strong theoretical guarantees, allowing practitioners to retain key statistical properties even when the synthetic data are imperfect and biased. We empirically validate the finite-sample performance of our estimator across different regression tasks in social science applications, demonstrating improved statistical efficiency. To the best of our knowledge, our framework provides the first theoretically sound approach for safely incorporating synthetic samples from foundation models into reliable statistical inference.
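To make the core idea concrete, below is a minimal sketch of one way to combine a small set of real responses with a large pool of biased LLM-generated responses while keeping the point estimate unbiased, in the spirit of debiasing schemes such as prediction-powered inference. This is an illustrative toy example only: the simulated data, the mean-estimation target, and all variable names are assumptions for exposition, not the paper's actual estimator or experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Simulated setting (all names and values are illustrative) ---
n = 200        # small set of real (human) survey responses
N = 20_000     # large pool of LLM-generated synthetic responses

theta_true = 0.35                        # true population mean
y_real = rng.normal(theta_true, 1.0, n)  # gold-standard human responses
bias = 0.15                              # the LLM is systematically off
# LLM-generated answers for the same n surveyed units:
y_synth_paired = y_real + rng.normal(bias, 0.5, n)
# Fully synthetic pool with no matched human response:
y_synth_pool = rng.normal(theta_true + bias, 1.1, N)

# Naive estimator: trusts the synthetic pool and inherits its bias.
theta_naive = y_synth_pool.mean()

# Debiased estimator: synthetic-pool mean plus a correction term
# estimated from the paired real data. Its expectation equals the
# true mean even when the LLM's outputs are biased.
theta_debiased = y_synth_pool.mean() + (y_real - y_synth_paired).mean()

# Plug-in variance for a normal-approximation 95% confidence interval.
var = (y_synth_pool.var(ddof=1) / N
       + (y_real - y_synth_paired).var(ddof=1) / n)
half_width = 1.96 * np.sqrt(var)

print(f"naive:    {theta_naive:.3f}  (true mean {theta_true})")
print(f"debiased: {theta_debiased:.3f} +/- {half_width:.3f}")
```

The efficiency gain in this style of estimator comes from the correction term: when the synthetic responses track the real ones closely, the residuals have low variance, so the interval can be much tighter than one built from the n real samples alone, while the bias correction protects validity when they do not.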