Poster in Workshop: DataWorld: Unifying data curation frameworks across domains
SIEVE: A Scalable and General Purpose Data Filtering System for Large Language Models
Jifan Zhang · Ziyue Luo · Jia (Kevin) Liu · Ness Shroff · Robert Nowak
Keywords: [ Active Learning ] [ Large Language Model Pretraining ] [ Data Filtering ]
Large language models demand vast amounts of high-quality training data, yet efficiently filtering web-scale datasets remains a significant challenge. General-purpose systems like GPT-4o can analyze and filter such data, but doing so at scale is prohibitively expensive. We propose a novel system that effectively and efficiently distills GPT-4o's filtering capabilities into a lightweight model. Our system, called SIEVE, dramatically reduces the cost of data filtering while maintaining accuracy comparable to GPT-4o. The key innovation in our approach is the strategic use of active learning, which fine-tunes these lightweight models with a minimal number of GPT-4o calls, matching GPT-4o's performance at a fraction of the computational cost. Through customizable filtering prompts, SIEVE can efficiently curate high-quality data for both general and specialized domains from web-scale corpora, a particularly valuable capability given the current scarcity of high-quality domain-specific datasets. Our extensive experiments, using both automatic and human evaluation metrics, demonstrate that SIEVE achieves performance comparable to GPT-4o across five highly specific filtering tasks. Furthermore, when applied to web crawl datasets, SIEVE improves upon existing quality filtering methods, establishing a new state of the art on the DataComp-LM challenge.
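The abstract describes the core loop only at a high level. The sketch below illustrates the general idea of uncertainty-based active learning for distilling an expensive teacher filter into a lightweight student; it is not SIEVE's actual implementation. The keyword-based teacher stub, the toy document pool, and the TF-IDF/logistic-regression student are all assumptions chosen so the example runs offline.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for the expensive teacher. In SIEVE this role is
# played by GPT-4o answering a customizable filtering prompt; here we stub
# it with a keyword heuristic so the sketch runs without API calls.
def teacher_label(text: str) -> int:
    return int("theorem" in text or "proof" in text)  # 1 = keep, 0 = discard

# Toy stand-in for an unlabeled web-scale pool.
pool = [
    "We prove the theorem by induction on n.",
    "Click here for amazing weight loss tips!!!",
    "The proof follows from Lemma 2 and compactness.",
    "Buy cheap watches online, free shipping.",
    "A short note on convex optimization and duality.",
    "Celebrity gossip roundup for this week.",
] * 50

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(pool)

# Seed set: label a handful of documents with the expensive teacher.
labeled = set(range(8))
labels = {i: teacher_label(pool[i]) for i in labeled}

student = LogisticRegression(max_iter=1000)

for _ in range(5):
    idx = sorted(labeled)
    student.fit(X[idx], [labels[i] for i in idx])

    # Uncertainty sampling: query the teacher only where the lightweight
    # student is least confident, keeping expensive calls to a minimum.
    margin = np.abs(student.predict_proba(X)[:, 1] - 0.5)
    for i in np.argsort(margin):
        if len(labeled) >= len(idx) + 4:  # small query budget per round
            break
        if i not in labeled:
            labels[i] = teacher_label(pool[i])
            labeled.add(i)

# The distilled student now filters the whole pool at negligible cost.
kept = int(student.predict(X).sum())
print(f"Teacher queried {len(labeled)} times; student keeps {kept}/{len(pool)} docs.")
```

The design point this illustrates is the one the abstract emphasizes: teacher calls are concentrated on the examples where the student is least certain, so the number of expensive queries stays small relative to the size of the corpus being filtered.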