ICML Poster CACTI: Leveraging Copy Masking and Contextual Information to Improve Tabular Data Imputation

Spotlight Poster

Aditya Gorla · Ryan Wang · Zhengtong Liu · Ulzee An · Sriram Sankararaman

East Exhibition Hall A-B #E-1300

Wed 16 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract: We present CACTI, a masked autoencoding approach for imputing tabular data that leverages the structure in missingness patterns and contextual information. Our approach employs a novel median truncated copy masking training strategy that encourages the model to learn from empirical patterns of missingness, while incorporating semantic relationships between features (captured by column names and text descriptions) to better represent feature dependence. These dual sources of inductive bias enable CACTI to outperform state-of-the-art methods across a diverse range of datasets and missingness conditions, with an average $R^2$ gain of 7.8% over the next best method (13.4%, 6.1%, and 5.3% under missing not at random, missing at random, and missing completely at random, respectively). Our results highlight the value of leveraging dataset-specific contextual information and missingness patterns to enhance imputation performance.
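The idea of copy masking can be illustrated with a minimal sketch. It assumes the simplest reading of the term: during training, each row borrows the missingness pattern of another row drawn at random from the same dataset, and the observed values hidden by that borrowed pattern become reconstruction targets. The function name `copy_mask`, the toy matrix, and the uniform donor sampling are illustrative assumptions, not the authors' implementation, and the median truncation step is omitted because the abstract does not describe it.

```python
import numpy as np

def copy_mask(observed, rng):
    """Hypothetical helper: build a training mask by copying missingness patterns.

    observed: boolean array of shape (n_rows, n_cols), True where a value is present.
    Returns (visible, target): entries the model may see, and observed entries
    that are hidden and can serve as reconstruction targets.
    """
    n_rows = observed.shape[0]
    # Assumption: each row picks a donor row uniformly at random (with replacement).
    donors = rng.integers(0, n_rows, size=n_rows)
    donor_observed = observed[donors]
    # An entry stays visible only if it is present in both the row and its donor.
    visible = observed & donor_observed
    # Entries present in the row but missing in the donor are hidden targets.
    target = observed & ~donor_observed
    return visible, target

rng = np.random.default_rng(0)
# Toy table with naturally missing entries marked as NaN.
X = np.array([
    [1.0, np.nan, 3.0],
    [np.nan, 5.0, 6.0],
    [7.0, 8.0, np.nan],
    [10.0, 11.0, 12.0],
])
observed = ~np.isnan(X)
visible, target = copy_mask(observed, rng)
```

Because every hidden target is an entry that was actually observed, its true value is available for computing a reconstruction loss, while the masks themselves follow missingness patterns that occur in the data rather than a uniform random pattern.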

Lay Summary:

Imagine trying to complete a puzzle where some pieces are missing: this is what data scientists face daily when working with incomplete datasets. Missing information in medical records, survey responses, or business data can lead to flawed analyses and poor decisions. Current methods for filling these gaps treat all missing data the same way, like assuming puzzle pieces disappeared randomly. We created CACTI, a new machine learning approach that recognizes that data often goes missing in patterns. CACTI learns these real-world patterns by reusing the missingness structures observed in the data, which improves its predictions when filling in missing values. CACTI also reads column descriptions to understand relationships between different types of information, much like understanding that "blood pressure" and "heart rate" are related health measurements. When tested on real datasets, CACTI outperformed the best existing methods by an average of 7.8%, with an average improvement of 13.4% in the most complex cases. This means researchers and organizations can extract more accurate insights from the same imperfect datasets they already have, leading to more reliable findings, better analyses, and improved downstream decisions.
