Poster in Workshop: CODEML: Championing Open-source DEvelopment in Machine Learning
Provenance Design and Evolution in a Production ML Library
Adam C Pocock · Joseph Wonsil · Romina Mahinpei · Jack Sullivan · Margo Seltzer
Data provenance is a formal record documenting how a digital artifact came to be in its present state. In the context of a machine learning model, provenance includes the data sources, data transformations, and algorithmic hyperparameters that are used to create the model. We present the design of AnonLib (name anonymised for submission, available on GitHub), an open-source, production ML library with integrated data provenance. AnonLib collects provenance automatically, requiring no user action or intervention. Using the provenance data, we developed systems for reproducing ML models and generating model cards. Like a type system, integrated provenance collection constrains design and provides utility in other parts of the system. Our integrated provenance approach has allowed us to automatically fix bugs in old models, detect non-obvious platform dependencies, and deeply understand and debug models built by other groups. Integrating provenance collection into the library influences the design and evolution of the system, which requires making trade-offs among provenance fidelity, provenance size, and developer flexibility.
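To make the idea of integrated provenance concrete, the following is a minimal sketch in Python. It is not AnonLib's API: every name here (`Provenance`, `Model`, `MeanScaleTrainer`, `reproduce`) is hypothetical and chosen only for illustration. It assumes provenance can be reduced to a hash of the training data, a list of transformation names, and a hyperparameter dictionary, and that recording happens inside the training call so the user takes no extra action.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    """Hypothetical record of how a model was produced."""
    data_sha256: str          # fingerprint of the training data
    transformations: tuple    # names of transformations (recorded only, not applied here)
    hyperparameters: dict     # constructor arguments of the trainer


@dataclass
class Model:
    weights: list
    provenance: Provenance


def _fingerprint(rows):
    # Hash a canonical JSON encoding of the data (an assumption of this sketch).
    return hashlib.sha256(json.dumps(rows).encode("utf-8")).hexdigest()


class MeanScaleTrainer:
    """Toy trainer: the 'model' is the column-wise mean multiplied by a scale."""

    def __init__(self, scale=1.0):
        self.scale = scale

    def train(self, rows, transformations=()):
        # Provenance is captured inside train(), so callers do nothing extra.
        prov = Provenance(
            data_sha256=_fingerprint(rows),
            transformations=tuple(transformations),
            hyperparameters={"scale": self.scale},
        )
        n = len(rows)
        weights = [self.scale * sum(col) / n for col in zip(*rows)]
        return Model(weights, prov)


def reproduce(provenance, rows):
    """Rebuild a model from its provenance, refusing data that has changed."""
    if _fingerprint(rows) != provenance.data_sha256:
        raise ValueError("training data differs from the recorded provenance")
    trainer = MeanScaleTrainer(**provenance.hyperparameters)
    return trainer.train(rows, provenance.transformations)
```

In this sketch, the model returned by `MeanScaleTrainer(scale=2.0).train(rows)` carries its own provenance, and `reproduce(model.provenance, rows)` rebuilds an equivalent model or fails if the data no longer matches. The trade-off named in the abstract shows up even here: storing only a hash keeps the record small but loses fidelity, since the data itself cannot be recovered from it.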