Poster in Workshop: Exploration in AI Today (EXAIT)
Stabilizing protein fitness predictors via the PCS framework
Omer Ronen · Alex Zhao · Ron Boger · Chengzhong Ye · Bin Yu
Keywords: [ uncertainty quantification ] [ protein fitness prediction ] [ Bayesian optimization ]
Abstract:
We improve protein fitness prediction by addressing an often-overlooked source of instability in machine learning models: the choice of data representation. Guided by the Predictability–Computability–Stability (PCS) framework for veridical (truthful) data science, we construct $\textit{SP}$ (Stable and Pred-checked) predictors by applying a prediction-based screening procedure (pred-check in PCS) to select predictive representations, followed by ensembling models trained on each, thereby leveraging representation-level diversity. This approach improves predictive accuracy, out-of-distribution generalization, and uncertainty quantification across a range of model classes. Our SP variant of the recently introduced kernel regression method, Kermut, achieves state-of-the-art performance on the ProteinGym supervised fitness prediction benchmark: it reduces mean squared error by up to 20\% and improves Spearman correlation by up to 10\%, with the largest improvements on splits representing a distribution shift. We further demonstrate that SP predictors yield statistically significant improvements in in-silico protein design tasks. Our results highlight the critical role of representation-level variability in fitness prediction and, more broadly, underscore the need to address instability throughout the entire data science lifecycle to advance protein design.
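For illustration only, the screen-then-ensemble idea described in the abstract can be sketched as below. This is a minimal sketch and not the authors' implementation. Several choices are assumptions made for the example: ridge regression stands in for the model class, `min_spearman` is a hypothetical validation threshold, the representations `rep_a`, `rep_b`, and `rep_noise` are synthetic, and the spread across ensemble members is used as a rough uncertainty proxy.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an intercept column (stand-in model)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + alpha * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def predict_ridge(w, X):
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ w

def spearman(a, b):
    """Spearman correlation as Pearson correlation of ranks (assumes no ties)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return float(np.corrcoef(ra, rb)[0, 1])

def sp_predict(reps_train, y_train, reps_val, y_val, reps_test, min_spearman=0.5):
    """Screen representations on held-out data, then average the survivors."""
    kept, preds = [], []
    for name in reps_train:
        w = fit_ridge(reps_train[name], y_train)
        score = spearman(predict_ridge(w, reps_val[name]), y_val)
        if score >= min_spearman:  # hypothetical pred-check criterion
            kept.append(name)
            preds.append(predict_ridge(w, reps_test[name]))
    if not kept:
        raise ValueError("no representation passed the pred-check")
    preds = np.stack(preds)
    # Ensemble mean as the prediction; member spread as a rough uncertainty proxy.
    return kept, preds.mean(axis=0), preds.std(axis=0)

# Synthetic demo: two informative representations and one pure-noise one.
rng = np.random.default_rng(0)
n, d = 600, 5
z = rng.normal(size=(n, d))                      # latent sequence properties
y = z @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
reps = {
    "rep_a": z @ rng.normal(size=(d, 8)) + 0.05 * rng.normal(size=(n, 8)),
    "rep_b": z @ rng.normal(size=(d, 8)) + 0.05 * rng.normal(size=(n, 8)),
    "rep_noise": rng.normal(size=(n, 8)),
}
tr, va, te = slice(0, 300), slice(300, 500), slice(500, 600)
kept, mean, std = sp_predict(
    {k: v[tr] for k, v in reps.items()}, y[tr],
    {k: v[va] for k, v in reps.items()}, y[va],
    {k: v[te] for k, v in reps.items()},
)
print("representations kept:", kept)
```

In this toy setting the screening step is expected to drop the uninformative representation, so only models built on predictive representations contribute to the ensemble.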