Poster in Workshop: The 2nd Workshop on Reliable and Responsible Foundation Models
Reliable Statistical Inference with Synthetic Data from Large Language Models
Yewon Byun · Shantanu Gupta · Zachary Lipton · Rachel Childers · Bryan Wilder
Keywords: [ LLMs for Social Science ] [ Human-AI Collaboration ] [ Reliable Statistical Inference ]
There is increasing interest in using large language models to generate entirely new synthetic samples to support social science and human subject research, such as survey responses or simulations of human behavior. However, it is not immediately clear how practitioners can incorporate such data while still drawing reliable insights and conclusions from it. In this work, we introduce a principled framework for reliably incorporating fully synthetic samples from text-based foundation models into downstream statistical analyses. Our estimator offers a hyperparameter-free solution with strong theoretical guarantees, allowing practitioners to retain key statistical properties even when the synthetic data are imperfect and biased. We empirically validate the finite-sample performance of our estimator across different regression tasks in social science applications, demonstrating improved statistical efficiency. To the best of our knowledge, our framework provides the first theoretically sound approach for safely incorporating synthetic samples from foundation models into reliable statistical inference.
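To make the core idea concrete, below is a minimal sketch of one way to combine a small set of real responses with a large pool of biased LLM-generated responses while keeping the point estimate unbiased, in the spirit of debiasing schemes such as prediction-powered inference. This is an illustrative toy example only: the simulated data, the mean-estimation target, and all variable names are assumptions for exposition, not the paper's actual estimator or experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Simulated setting (all names and values are illustrative) ---
n = 200        # small set of real (human) survey responses
N = 20_000     # large pool of LLM-generated synthetic responses

theta_true = 0.35                        # true population mean
y_real = rng.normal(theta_true, 1.0, n)  # gold-standard human responses
bias = 0.15                              # the LLM is systematically off
# LLM-generated answers for the same n surveyed units:
y_synth_paired = y_real + rng.normal(bias, 0.5, n)
# Fully synthetic pool with no matched human response:
y_synth_pool = rng.normal(theta_true + bias, 1.1, N)

# Naive estimator: trusts the synthetic pool and inherits its bias.
theta_naive = y_synth_pool.mean()

# Debiased estimator: synthetic-pool mean plus a correction term
# estimated from the paired real data. Its expectation equals the
# true mean even when the LLM's outputs are biased.
theta_debiased = y_synth_pool.mean() + (y_real - y_synth_paired).mean()

# Plug-in variance for a normal-approximation 95% confidence interval.
var = (y_synth_pool.var(ddof=1) / N
       + (y_real - y_synth_paired).var(ddof=1) / n)
half_width = 1.96 * np.sqrt(var)

print(f"naive:    {theta_naive:.3f}  (true mean {theta_true})")
print(f"debiased: {theta_debiased:.3f} +/- {half_width:.3f}")
```

The efficiency gain in this style of estimator comes from the correction term: when the synthetic responses track the real ones closely, the residuals have low variance, so the interval can be much tighter than one built from the n real samples alone, while the bias correction protects validity when they do not.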