Workshop
DataWorld: Unifying data curation frameworks across domains
Neha Hulkund · Sara Beery · Benjamin Feuer · Niv Cohen · Thao Nguyen · Ludwig Schmidt · Serena Yeung · Yuhui Zhang
West Meeting Room 208-209
Sat 19 Jul, 8:55 a.m. PDT
Recently, data-centric research, which has historically taken a backseat to model-centric research, has assumed a central role in the machine learning community. Our workshop aims to explore data-centric methods and theory, with a particular emphasis on real-world data curation. By curation, we mean the set of actions taken by some curator(s) to transition from ideation to a complete dataset. Our topic is wide-ranging, with recent work studying everything from sourcing to benchmarks.

One area that remains relatively underexplored is how data-centric methods can perform differently depending on the modality and domain of the data and on the downstream application. Which lessons can be shared across domains and modalities, and which cannot? For example, a common part of the data pipeline involves data filtration. Filtration in domains like medical imaging and wildlife camera traps faces similar challenges, including long-tailed distributions and natural distribution shifts (between hospitals and camera locations, respectively). However, the two domains differ in the types of distribution shift encountered (covariate vs. label vs. subpopulation) and in dataset scale (there are generally more camera trap images than medical scans). Another example is that the most successful filtration methods in the recent DataComp benchmark tend to disproportionately remove images with non-English captions. Such methods not only degrade performance on non-English benchmarks, but also fail to generalize to other domains and to most real-world applications. Our workshop will invite novel research that seeks to unify seemingly disparate frameworks for data curation; where this is impossible, we hope that the necessary trade-offs and domain-specific challenges will be made clearer.
Schedule
Sat 8:55 a.m. - 9:05 a.m. | Opening Remarks
Sat 9:05 a.m. - 9:35 a.m. | Invited Talk 1: Pang Wei Koh (Invited Talk)
Sat 9:35 a.m. - 10:05 a.m. | Invited Talk 2: Meta Llama Team (Invited Talk)
Sat 10:05 a.m. - 11:20 a.m. | Poster Session 1 (Poster Session)
Sat 11:20 a.m. - 11:30 a.m. | Coffee Break
Sat 11:30 a.m. - 12:15 p.m. | Data Curation Across Different Modalities and Domains (Panel)
Sat 12:15 p.m. - 12:30 p.m. | Leveraging Base Language Models for Few-Shot Synthetic Data Generation (Oral) | Alan Zhu · Parth Asawa · Jared Davis · Lingjiao Chen · Boris Hanin · Ion Stoica · Joseph E Gonzalez · Matei Zaharia
Sat 12:30 p.m. - 12:45 p.m. | DataDecide: How to Predict Best Pretraining Data with Small Experiments (Oral) | Ian Magnusson · Tai Nguyen · Ben Bogin · David Heineman · Jena Hwang · Luca Soldaini · Akshita Bhagia · Jiacheng Liu · Dirk Groeneveld · Oyvind Tafjord · Noah Smith · Pang Wei Koh · Jesse Dodge
Sat 12:45 p.m. - 1:30 p.m. | Lunch Break
Sat 1:30 p.m. - 2:00 p.m. | Invited Talk 3: Aditi Raghunathan (Invited Talk)
Sat 2:00 p.m. - 2:30 p.m. | Invited Talk 4: Ari Morcos (Invited Talk)
Sat 2:30 p.m. - 2:45 p.m. | Evaluating Sample Utility for Efficient Data Selection by Mimicking Model Weights (Oral) | Tzu-Heng Huang · Manjot Bilkhu · Frederic Sala
Sat 2:45 p.m. - 3:00 p.m. | KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding (Oral) | Zhangchen Xu · Yang Liu · Yueqin Yin · Mingyuan Zhou · Radha Poovendran
Sat 3:00 p.m. - 3:10 p.m. | Coffee Break
Sat 3:10 p.m. - 4:20 p.m. | Poster Session 2 (Poster Session)
Sat 4:20 p.m. - 4:50 p.m. | Invited Talk 5: James Zou (Invited Talk)
Sat 4:50 p.m. - 5:00 p.m. | Closing Remarks
HiLWS: A Human-in-the-Loop Weak Supervision Framework for Curating Clinical and Home Video Data for Neurological Assessment (Poster) | Atefeh Irani · Maryam Mirian · Alexander Lassooij · Reshad Hosseini · Hadi Moradi · Martin McKeown
FAIM: Fair Imputation with Adversarial Training for Mitigating Bias in Missing Data (Poster) | Rasta Tadayon · Haewon Jeong · Ramtin Pedarsani
How to Get Your LLM to Generate Challenging Problems for Evaluation (Poster) | Arkil Patel · Siva Reddy · Dzmitry Bahdanau
DCA-Bench: A Benchmark for Dataset Curation Agents (Poster) | Benhao Huang · Yingzhuo Yu · Jin Huang · Xingjian Zhang · Jiaqi Ma
FSPO: Few-Shot Preference Optimization of Synthetic Preference Data Elicits LLM Personalization to Real Users (Poster) | Anikait Singh · Sheryl Hsu · Kyle Hsu · Eric Mitchell · Stefano Ermon · Tatsunori Hashimoto · Archit Sharma · Chelsea Finn
SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors (Poster) | Tiancheng Hu · Joachim Baumann · Lorenzo Lupo · Nigel Collier · Dirk Hovy · Paul Röttger
Filter, Augment, Forecast: Online Data Selection for Robust Time Series Forecasting (Poster) | Ege Onur Taga · Halil Alperen Gozeten · Kutay Tire · Rahul Dalvi · Reinhard Heckel · Samet Oymak
DEETS: Detailed Evaluation of Image Text Specificity (Poster) | Yasumasa Onoe · Hailey Joren · Cyrus Rashtchian · Su Wang · Olivia Wiles · Yonatan Bitton · Brian Gordon · Keran Rong · Austin Waters · Jason Baldridge · Roopal Garg · Radu Soricut · Jordi Pont-Tuset
UNREAL: Unlabeled Nodes Retrieval and Labeling for Heavily-imbalanced Node Classification (Poster) | Divin Yan · Shengzhong Zhang · Bisheng Li · Menglin Yang · Chen Yang · Min Zhou · Weiyang Ding · Zengfeng Huang
Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement (Poster) | Simon Yu · Liangyu Chen · Sara Ahmadian · Marzieh Fadaee
Lookahead Bias in Pretrained Language Models (Poster) | Suproteem Sarkar · Keyon Vafa
Aquilon: Towards Building Multimodal Weather LLMs (Poster) | Sumanth Varambally · Veeramakali Vignesh Manivannan · Yasaman Jafari · Luyu Han · Zachary Novack · Zhirui Xia · Salva Ruhling Cachay · Srikar Eranky · Brooks (Ruijia) Niu · Taylor Berg-Kirkpatrick · Duncan Watson-Parris · Yian Ma · Rose Yu
Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models (Poster) | Thao Nguyen · Yang Li · Olga Golovneva · Luke Zettlemoyer · Sewoong Oh · Ludwig Schmidt · Xian Li
Daunce: Data Attribution through Uncertainty Estimation (Poster) | Xingyuan Pan · Chenlu Ye · Joseph Melkonian · Jiaqi Ma · Tong Zhang
SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis (Poster) | Zijian Wu · Jinjie Ni · Xiangyan Liu · Zichen Liu · Hang Yan · Michael Shieh
Multimodal-Guided Dynamic Dataset Pruning for Robust and Generalizable Data-Centric Learning (Poster) | Suorong Yang · Peijia Li · Yujie Liu · Xu Zhiming · Peng Ye · Wanli Ouyang · Furao Shen · Dongzhan Zhou
What Variables Affect Out-of-Distribution Generalization in Pretrained Models? (Poster) | Md Yousuf Harun · Kyungbok Lee · Jhair Gallardo · Giri Krishnan · Christopher Kanan
R&B: Breaking the Data Mixing Bottleneck with Just 0.01% Overhead (Poster) | Albert Ge · Tzu-Heng Huang · John Cooper · Avi Trost · Ziyi Chu · Satya Sai Srinath Namburi GNVV · Jack Cai · Kendall Park · Nicholas Roberts · Frederic Sala
EditBench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits (Poster) | Wayne Chi · Valerie Chen · Ryan Shar · Aditya Mittal · Jenny Liang · Wei-Lin Chiang · Anastasios Angelopoulos · Ion Stoica · Graham Neubig · Ameet Talwalkar · Chris Donahue
Inferring the Invisible: Neuro-Symbolic Rule Discovery for Missing Value Imputation (Poster) | Wendi Ren · Ke Wan · Junyu Leng · Shuang Li
Core Knowledge Deficits in Multi-Modal Language Models (Poster) | Yijiang Li · Qingying Gao · Tianwei Zhao · Bingyang Wang · Haiyun Lyu · Haoran Sun · Robert Hawkins · Nuno Vasconcelos · Tal Golan · Dezhi Luo · Hokin Deng
Active sample selection with stable reversible graph convolutional networks (Poster) | Hichem Sahbi
Less is More? Data Specialization for Self-Supervised Remote Sensing Models (Poster) | Alvard Barseghyan · Ani Vanyan · Hakob Tamazyan · Evan Shelhamer · Hrant Khachatrian
IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems (Poster) | Elad Levi · Ilan Kadar
Pearls from Pebbles: Improved Confidence Functions for Auto-labeling (Poster) | Harit Vishwakarma · Yi Chen · Sui Jiet Tay · Satya Sai Srinath Namburi GNVV · Frederic Sala · Ramya Vinayak
Faithful Group Shapley Value (Poster) | Kiljae Lee · Ziqi Liu · Weijing Tang · Yuan Zhang
VISUALSPHINX: Large-Scale Synthetic Vision Logic Puzzles for RL (Poster) | Yichen Feng · Zhangchen Xu · Fengqing Jiang · Yuetai Li · Bhaskar Ramasubramanian · Luyao Niu · Yuchen Lin · Radha Poovendran
ZIP-FIT: Embedding-Free Data Selection via Compression-Based Alignment for Code (Poster) | Elyas Obbad · Brando Miranda · Iddah Mlauzi · Rylan Schaeffer · Kamal Obbad · Suhana Bedi · Sanmi Koyejo
Learning from the Best: Smoothness-Driven Metrics for Data Quality in Imitation Learning (Poster) | Soham Kulkarni · Raayan Dhar · Yuchen Cui
General and Estimable Learning Bound Unifying Covariate and Concept Shifts (Poster) | Hongbo Chen · Li Xia
How to Recommend a Dataset for Model Training Team? Rethinking Proxy-Model-based Technique (Poster) | Jiachen (Tianhao) Wang · Tong Wu · Kaifeng Lyu · Dawn Song · Ruoxi Jia · Prateek Mittal
The BrainApp Study: Engineering a New Frontier in Brain Tumor Speech Research (Poster) | N. Aizaan Anwar · Elias Allara · Lucia Specia · Matt Williams
Beyond Atomic Geometry Representations in Materials Science: A Human-in-the-Loop Multimodal Framework (Poster) | Can Polat · Hasan Kurban · Erchin Serpedin · Mustafa Kurban
Embrace the Diversity: Avoiding Mode Collapse with Polarized Curation in Generative Retraining (Poster) | Ali Falahati · Mohammad Mohammadi Amiri · Kate Larson · Lukasz Golab
Evaluating Deepfake Detectors in the Wild (Poster) | Viacheslav Pirogov
LARP: Learner-Agnostic Robust Data Prefiltering (Poster) | Kristian Minchev · Dimitar I. Dimitrov · Nikola Konstantinov
Towards Cross-Modal Error Detection with Tables and Images (Poster) | Olga Ovcharenko · Sebastian Schelter
A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning (Poster) | Yuzheng Hu · Fan Wu · Haotian Ye · David Forsyth · James Zou · Nan Jiang · Jiaqi Ma · Han Zhao
Data Curation Matters: Model Collapse and Spurious Shift Performance Prediction from Training on Uncurated Text Embeddings (Poster) | Lucas Mattioli · Youness Ait Hadichou · Sabrina Chaouche · Martin Gonzalez
SIEVE: A Scalable and General Purpose Data Filtering System for Large Language Models (Poster) | Jifan Zhang · Ziyue Luo · Jia (Kevin) Liu · Ness Shroff · Robert Nowak
Domain-Constrained Diffusion Models to Synthesize Tabular Data: A Case Study in Power Systems (Poster) | Milad Hoseinpour · Vladimir Dvorkin
No Metric to Rule Them All: Toward Principled Evaluations of Graph-Learning Datasets (Poster) | Corinna Coupette · Jeremy Wayland · Emily Simons · Bastian Rieck
DataS3: Dataset Subset Selection for Specialization (Poster) | Neha Hulkund · Alaa Maalouf · Levi Cai · Daniel Yang · Abigail O'Neill · Timm Haucke · Sandeep Mukherjee · Vikram V Ramaswamy · Judy Hanwen Shen · Gabriel Tseng · Mike Walmsley · Johnson Tsun-Hsuan Wang · Hannah Kerner · Irene Y. Chen · Yogesh Girdhar · Daniela Rus · Ken Goldberg · Sara Beery
SNAC-DB: The Hitchhiker’s Guide to Building Better Predictive Models of Antibody & NANOBODY® VHH–Antigen Complexes (Poster) | Abhinav Gupta · Bryan Munoz Rivero · Jorge Roel-Touris · Ruijiang Li · Norbert Furtmann · Yves Nanfack · Maria Wendt · Yu Qiu
Robust Reward Modeling via Causal Rubrics and Synthetic Data Curation (Poster) | Pragya Srivastava · Harman Singh · Rahul Madhavan · Gandharv Patil · Sravanti Addepalli · Arun Sai Suggala · Rengarajan Aravamudhan · Soumya Sharma · Anirban Laha · Aravindan Raghuveer · Karthikeyan Shanmugam · Doina Precup
AutoDavis: Automatic and Dynamic Evaluation Protocol of Large Vision-Language Models on Visual Question-Answerin (Poster) | Han Bao · Yue Huang · Yanbo Wang · Jiayi Ye · Xiangqi Wang · Xiuying Chen · Yue Zhao · Tianyi Zhou · Mohamed Elhoseiny · Xiangliang Zhang
Quantifying the Importance of Data Alignment in Downstream Model Performance (Poster) | Krrish Chawla · Aryan Sahai · Mario DePavia · Sudharsan Sundar · Brando Miranda · Elyas Obbad · Sanmi Koyejo
f-INE: Influence Estimation using Hypothesis Testing (Poster) | Subhodip Panda · Shashwat Sourav · Prathosh AP · Sai Praneeth Reddy Karimireddy
Can Performant LLMs Be Ethical? Quantifying the Impact of Web Crawling Opt-Outs (Poster) | Dongyang Fan · Vinko Sabolčec · Matin Ansaripour · Ayush Tarun · Martin Jaggi · Antoine Bosselut · Imanol Schlag
EvalX: A Platform for Code LLM Evaluation in the Wild (Poster) | Wayne Chi · Valerie Chen · Anastasios Angelopoulos · Wei-Lin Chiang · Aditya Mittal · Naman Jain · Tianjun Zhang · Ion Stoica · Chris Donahue · Ameet Talwalkar
Do Data Valuations Make Good Data Prices? (Poster) | Dongyang Fan · Tyler Rotello · Sai Praneeth Reddy Karimireddy