Poster in Workshop: DIG-BUGS: Data in Generative Models (The Bad, the Ugly, and the Greats)
In-Context Bias Propagation in LLM-Based Tabular Data Generation
Pol G. Recasens · Alberto Gutierrez-Torre · Jordi Torres · Josep Lluís Berral · Anisa Halimi · Kieran Fraser
Keywords: [ In-Context Learning ] [ Bias Propagation ] [ Synthetic Tabular Data ] [ Fairness in Machine Learning ]
Large Language Models (LLMs) are increasingly used for synthetic tabular data generation through in-context learning (ICL), offering a practical solution for data augmentation in data-scarce scenarios. While prior work has shown that LLMs can improve downstream performance by augmenting underrepresented groups, these benefits often assume access to a subset of in-context examples that is unbiased and representative of the real dataset. In real-world settings, however, data is frequently noisy and demographically skewed. In this paper, we systematically study how statistical biases within in-context examples propagate to the distribution of synthetic tabular data, showing that even mild in-context biases lead to global statistical distortions. We further introduce an adversarial scenario in which a malicious contributor can inject bias into the synthetic dataset via a subset of in-context examples, ultimately compromising the fairness of downstream classifiers for a targeted, protected subgroup. Our findings reveal a new vulnerability in LLM-based data generation pipelines that rely on in-context prompts in sensitive domains.
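To make the setting concrete, the sketch below illustrates one common form of ICL-based tabular generation and how a contributor could skew the in-context subset; it is a minimal illustration under our own assumptions, not the authors' exact pipeline. In-context rows are serialized into a prompt and the LLM is asked to emit further rows, so any demographic skew in the examples carries into the generation context. The helpers `biased_subset` and `make_prompt`, the `llm` call, and the `sex`/`income` attributes are hypothetical.

```python
import random

# Illustrative sketch only: serializes in-context rows as CSV and asks an
# LLM for more rows. A skewed in-context subset stands in for a noisy or
# malicious contributor; the exact pipeline in the paper may differ.

def biased_subset(rows, attr, value, bias_rate, k, seed=0):
    """Select k in-context examples in which a fraction `bias_rate`
    satisfies row[attr] == value, simulating a skewed contributor."""
    rng = random.Random(seed)
    hit = [r for r in rows if r[attr] == value]
    miss = [r for r in rows if r[attr] != value]
    n_hit = min(len(hit), round(bias_rate * k))
    subset = rng.sample(hit, n_hit) + rng.sample(miss, k - n_hit)
    rng.shuffle(subset)
    return subset

def make_prompt(examples, n_new):
    """Serialize in-context rows as CSV and request additional rows."""
    header = ",".join(examples[0])
    body = "\n".join(",".join(str(v) for v in r.values()) for r in examples)
    return (f"Here are rows from a tabular dataset:\n{header}\n{body}\n"
            f"Generate {n_new} additional rows in the same format.")

# Toy pool: balanced on the protected attribute before the skew is applied.
pool = [{"age": 20 + i, "sex": "F" if i % 2 else "M", "income": 30 + i}
        for i in range(40)]

examples = biased_subset(pool, attr="sex", value="M", bias_rate=0.9, k=10)
prompt = make_prompt(examples, n_new=50)
print(prompt)  # `prompt` would then be sent to any LLM completion endpoint
```

Here `bias_rate` plays the role of the in-context bias level discussed in the abstract: in the benign case it reflects an unintentionally skewed sample, while in the adversarial scenario a malicious contributor would supply such a subset deliberately to distort the synthetic distribution for the targeted subgroup.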