Poster in Workshop: DataWorld: Unifying data curation frameworks across domains
ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment for Code
Elyas Obbad · Brando Miranda · Iddah Mlauzi · Rylan Schaeffer · Kamal Obbad · Suhana Bedi · Sanmi Koyejo
Keywords: [ machine learning ] [ data curation ]
Data selection is crucial for optimizing language model (LM) performance on specific tasks, yet most existing methods fail to account for the target task distribution. Current approaches either ignore task-specific requirements entirely or rely on approximations that miss the nuanced patterns needed for tasks like Autoformalization or code generation. We introduce \texttt{ZIP-FIT}, a data selection framework that uses compression to directly measure the alignment between potential training data and the target task distribution. Our key insight is that compression-based similarity captures both the syntactic and structural patterns relevant to the target task, enabling more precise selection of task-relevant data. In extensive evaluations on Autoformalization and Python code generation, \texttt{ZIP-FIT} significantly outperforms leading baselines such as DSIR and D4. Models trained on \texttt{ZIP-FIT}-selected data reach their lowest cross-entropy loss up to 85.1\% faster than these baselines, demonstrating that better task alignment leads to more efficient learning. \texttt{ZIP-FIT} also performs selection up to 65.8\% faster than DSIR and two orders of magnitude faster than D4, and achieves 18.86\% Pass@1 on HumanEval compared to LESS's 18.06\% while being approximately 2000 times faster. Notably, \texttt{ZIP-FIT} shows that smaller, well-aligned datasets often outperform larger but less targeted ones: a modest amount of high-quality data beats a large amount of lower-quality data.
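To make the compression-based alignment idea concrete, the sketch below shows one plausible way such a score could be computed with gzip and a normalized-compression-distance-style metric. The function names (`zipfit_alignment`, `select_top_k`) and the averaging over target examples are illustrative assumptions for exposition, not the authors' released implementation.

```python
import gzip

def c(s: str) -> int:
    """Length of the gzip-compressed bytes; a cheap proxy for Kolmogorov complexity."""
    return len(gzip.compress(s.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    """Normalized compression distance: small when x and y share redundant structure."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def zipfit_alignment(candidate: str, target_examples: list[str]) -> float:
    """Mean compression-based alignment (1 - NCD) of a candidate against target-task examples.
    Assumed aggregation: a simple average over the target sample."""
    return sum(1.0 - ncd(candidate, t) for t in target_examples) / len(target_examples)

def select_top_k(source: list[str], target: list[str], k: int) -> list[str]:
    """Rank a source corpus by alignment with the target task and keep the top-k examples."""
    return sorted(source, key=lambda x: zipfit_alignment(x, target), reverse=True)[:k]
```

Because the score requires only a standard compressor rather than model embeddings or gradients, each candidate can be ranked with a handful of gzip calls, which is consistent with the large selection-speed gains reported over DSIR, D4, and LESS.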