Poster
Synthetic Text Generation for Training Large Language Models via Gradient Matching
Dang Nguyen · Zeman Li · MohammadHossein Bateni · Vahab Mirrokni · Meisam Razaviyayn · Baharan Mirzasoleiman
East Exhibition Hall A-B #E-2810
Synthetic data has the potential to improve model performance and training efficiency while preserving the privacy of real training examples. Nevertheless, existing approaches for synthetic text generation are mostly heuristic: they cannot generate human-readable text without compromising the privacy of real data, nor do they provide performance guarantees for training Large Language Models (LLMs). In this work, we propose the first theoretically rigorous approach for generating synthetic human-readable text that provides convergence, performance, and privacy guarantees for fine-tuning LLMs on a target task. To do so, we leverage the Alternating Direction Method of Multipliers (ADMM), which iteratively optimizes the embeddings of synthetic examples to match the noisy gradient of the target training or validation data, and maps them to a sequence of text tokens with low perplexity. The generated synthetic text thus guarantees convergence of the model to a close neighborhood of the solution obtained by fine-tuning on real data, while preserving the privacy of the real examples. Experiments on various classification tasks confirm the effectiveness of our proposed approach. Our code is available at https://github.com/BigML-CS-UCLA/GRADMM.
Large Language Models (LLMs) require massive amounts of high-quality data, but collecting and using such data raises concerns about privacy, cost, and efficiency. Our work introduces GRADMM, the first method that can generate readable synthetic text with strong theoretical guarantees for training LLMs. Unlike existing methods that rely on expensive prompts or unreadable embeddings, GRADMM creates human-like text that mimics how real data trains the model, without leaking sensitive information. We achieve this by matching the training dynamics (gradients) of real data using a technique called ADMM, while ensuring the output is coherent and diverse. This allows us to train LLMs using only a handful of real examples, or to replace real data entirely with synthetic examples. Our experiments show that GRADMM outperforms both traditional data selection and LLM-generated text in accuracy, while being significantly more private and efficient. This opens the door to safer and more accessible LLM training.
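The gradient-matching idea at the core of this work can be illustrated in a toy setting. The sketch below is a minimal, hypothetical illustration (not the authors' GRADMM implementation): it uses a linear regression model instead of an LLM, plain gradient descent instead of ADMM, and omits the mapping from embeddings to low-perplexity text tokens. It optimizes a small synthetic set so that the loss gradient it induces matches the noise-perturbed gradient of the real data.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5                        # toy embedding dimension (assumption)
w = rng.normal(size=d)       # current model parameters

# Real data and its mean gradient for squared loss L = 0.5 * (x @ w - y)^2.
X_real = rng.normal(size=(20, d))
y_real = X_real @ rng.normal(size=d)
g_real = X_real.T @ (X_real @ w - y_real) / len(X_real)
g_real += rng.normal(scale=0.01, size=d)   # noise added for privacy (DP-style)

# Small synthetic set, optimized to minimize ||g_syn(X) - g_real||^2.
X_syn = rng.normal(size=(4, d))
y_syn = rng.normal(size=4)

def syn_grad(X):
    """Mean loss gradient induced by the synthetic set at parameters w."""
    return X.T @ (X @ w - y_syn) / len(X)

init_res = np.linalg.norm(syn_grad(X_syn) - g_real)

lr = 0.05
for _ in range(500):
    u = X_syn @ w - y_syn            # per-example residuals of the toy model
    r = syn_grad(X_syn) - g_real     # gradient-matching residual
    # Gradient of the matching objective w.r.t. X (chain rule, up to a
    # constant factor absorbed into the learning rate):
    grad_X = (np.outer(u, r) + np.outer(X_syn @ r, w)) / len(X_syn)
    X_syn -= lr * grad_X

final_res = np.linalg.norm(syn_grad(X_syn) - g_real)
print(f"gradient-matching residual: {init_res:.4f} -> {final_res:.4f}")
```

In GRADMM the analogous inner problem is solved with ADMM over the embeddings of an LLM, and the optimized embeddings are then decoded into readable token sequences; this toy version only demonstrates that a few synthetic examples can reproduce the gradient signal of a larger real dataset.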