Poster in Workshop: DataWorld: Unifying data curation frameworks across domains
DCA-Bench: A Benchmark for Dataset Curation Agents
Benhao Huang · Yingzhuo Yu · JIN HUANG · Xingjian Zhang · Jiaqi Ma
Keywords: [ LLM Agent ] [ Dataset Curation ] [ Automatic Evaluation ]
Abstract:
The quality of datasets is increasingly vital for modern AI research and development. Despite the rise of open dataset platforms, issues like poor documentation, mislabeled data, and outdated content remain widespread. These issues are often subtle and difficult to detect with rule-based scripts, so they must be identified and verified by dataset users or maintainers, a process that is both time-consuming and prone to human error. With the rapidly advancing capabilities of large language models, LLM agents are a promising way to streamline the discovery of such hidden dataset issues. A key challenge is enabling LLM agents to detect issues in the wild rather than simply fixing known ones. In this work, we carefully curate 221 real-world test cases from eight popular dataset platforms and propose an automatic evaluation framework using GPT-4o. The proposed framework shows strong empirical alignment with expert evaluations, validated through extensive comparisons with human annotations. Without any hints, the most competitive Curator agent reveals only $\sim$30\% of the data quality issues in the proposed benchmark, highlighting the difficulty of this task and indicating that applying LLM agents to real-world dataset curation still requires further in-depth exploration and innovation.
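To make the setup concrete, the following is a minimal, hypothetical sketch of the agent-plus-judge loop the abstract describes: a Curator agent inspects dataset materials with no hints, and a GPT-4o-based judge checks whether the agent's report covers a known reference issue. The function names, prompts, and test-case fields are illustrative assumptions, not the official DCA-Bench code; only the OpenAI chat-completions API calls are taken as given.

```python
# Hypothetical sketch of a Curator agent + GPT-4o judge loop (not the official DCA-Bench code).
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def curator_agent(dataset_card: str, sample_files: str) -> str:
    """Ask an LLM agent to surface hidden quality issues in a dataset, with no hints given."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a dataset curator. Inspect the materials "
                                          "and report any data quality issues you find."},
            {"role": "user", "content": f"Dataset card:\n{dataset_card}\n\nSample files:\n{sample_files}"},
        ],
    )
    return resp.choices[0].message.content

def judge(agent_report: str, reference_issue: str) -> bool:
    """GPT-4o-based automatic evaluation: does the agent's report identify the known issue?"""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer 'yes' or 'no': does the report below identify "
                                          "the reference data quality issue?"},
            {"role": "user", "content": f"Reference issue:\n{reference_issue}\n\nAgent report:\n{agent_report}"},
        ],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

# The fraction of test cases where judge(...) returns True corresponds to the
# "issues revealed" success rate reported in the abstract.
```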