Poster in Workshop: DataWorld: Unifying data curation frameworks across domains
Beyond Atomic Geometry Representations in Materials Science: A Human-in-the-Loop Multimodal Framework
Can Polat · Hasan Kurban · Erchin Serpedin · Mustafa Kurban
Keywords: [ dft ] [ crystal dataset ] [ multimodal ] [ benchmark ]
Most materials science datasets are limited to atomic geometries (e.g., XYZ files), restricting their utility for multimodal learning and comprehensive data-centric analysis. These constraints have historically impeded the adoption of advanced machine learning techniques in the field. This work introduces MultiCrystalSpectrumSet (MCS-Set), a curated framework that expands materials datasets by integrating atomic structures with 2D projections and structured textual annotations, including lattice parameters and coordination metrics. MCS-Set enables two key tasks: (1) multimodal property and summary prediction, and (2) constrained crystal generation with partial cluster supervision. Leveraging a human-in-the-loop pipeline, MCS-Set combines domain expertise with standardized descriptors for high-quality annotation. Evaluations using state-of-the-art language and vision-language models reveal substantial modality-specific performance gaps and highlight the importance of annotation quality for generalization. MCS-Set offers a foundation for benchmarking multimodal models, advancing annotation practices, and promoting accessible, versatile materials science datasets. The dataset and implementations are available at https://anonymous.4open.science/r/MultiCrystalSpectrumSet/README.md.
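To make the idea of pairing an atomic geometry with a 2D projection and a structured textual annotation concrete, the following is a minimal sketch and not the MCS-Set pipeline itself. It assumes a plain XYZ input, an orthographic projection onto the xy-plane, and a distance-cutoff definition of coordination number. The names `MultimodalRecord`, `parse_xyz`, `coordination_numbers`, and `build_record`, the cutoff value, and the toy coordinates are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class MultimodalRecord:
    """Hypothetical container pairing one structure with two extra modalities."""
    symbols: list        # element symbols, one per atom
    coords: list         # 3D Cartesian coordinates as (x, y, z) tuples
    projection_xy: list  # 2D orthographic projection as (x, y) tuples
    annotation: str      # structured textual summary


def parse_xyz(text):
    """Parse a standard XYZ block: atom count, comment line, then atom lines."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    symbols, coords = [], []
    for line in lines[2:2 + n_atoms]:
        parts = line.split()
        symbols.append(parts[0])
        coords.append(tuple(float(v) for v in parts[1:4]))
    return symbols, coords


def coordination_numbers(coords, cutoff):
    """Count neighbours within `cutoff` of each atom (assumed definition)."""
    return [
        sum(1 for j, b in enumerate(coords)
            if i != j and math.dist(a, b) <= cutoff)
        for i, a in enumerate(coords)
    ]


def build_record(xyz_text, cutoff=2.5):
    """Assemble geometry, 2D projection, and text annotation for one structure."""
    symbols, coords = parse_xyz(xyz_text)
    # Orthographic projection: drop the z coordinate.
    projection = [(x, y) for x, y, _ in coords]
    cn = coordination_numbers(coords, cutoff)
    mean_cn = sum(cn) / len(cn)
    annotation = (
        f"{len(symbols)} atoms; "
        f"mean coordination number {mean_cn:.2f} (cutoff {cutoff})"
    )
    return MultimodalRecord(symbols, coords, projection, annotation)


# Toy three-atom cluster with made-up coordinates, for illustration only.
EXAMPLE_XYZ = """3
toy cluster (hypothetical coordinates)
Si 0.0 0.0 0.0
O 2.0 0.0 0.0
O 0.0 2.0 1.0
"""

record = build_record(EXAMPLE_XYZ)
print(record.annotation)
```

In a fuller pipeline, the projection would presumably be rendered to an image and the annotation reviewed by a domain expert; this sketch only shows how the three modalities might be kept together in a single record.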