

Poster in Workshop: CODEML: Championing Open-source DEvelopment in Machine Learning

Provenance Design and Evolution in a Production ML Library

Adam C Pocock · Joseph Wonsil · Romina Mahinpei · Jack Sullivan · Margo Seltzer

Fri 18 Jul 2:15 p.m. PDT — 3 p.m. PDT

Abstract:

Data provenance is a formal record documenting how a digital artifact came to be in its present state. In the context of a machine learning model, provenance includes the data sources, data transformations, and algorithmic hyperparameters used to create the model. We present the design of AnonLib (name anonymised for submission, available on GitHub), an open-source, production ML library with integrated data provenance. AnonLib collects provenance automatically, requiring no user action or intervention. Using the provenance data, we developed systems for reproducing ML models and generating model cards. Like a type system, integrated provenance collection constrains design and provides utility in other parts of the system. Our integrated provenance approach has allowed us to automatically fix bugs in old models, detect non-obvious platform dependencies, and deeply understand and debug models built by other groups. Integrating provenance collection into the library influences the design and evolution of the system, which requires making trade-offs among provenance fidelity, provenance size, and developer flexibility.
