

Poster in Workshop: Exploration in AI Today (EXAIT)

Stabilizing protein fitness predictors via the PCS framework

Omer Ronen · Alex Zhao · Ron Boger · Chengzhong Ye · Bin Yu

Keywords: [ uncertainty quantification ] [ protein fitness prediction ] [ bayesian optimization ]


Abstract: We improve protein fitness prediction by addressing an often-overlooked source of instability in machine learning models: the choice of data representation. Guided by the Predictability–Computability–Stability (PCS) framework for veridical (truthful) data science, we construct $\textit{SP}$ (Stable and Pred-checked) predictors by applying a prediction-based screening procedure (pred-check in PCS) to select predictive representations, followed by ensembling models trained on each, thereby leveraging representation-level diversity. This approach improves predictive accuracy, out-of-distribution generalization, and uncertainty quantification across a range of model classes. Our SP variant of the recently introduced kernel regression method, Kermut, achieves state-of-the-art performance on the ProteinGym supervised fitness prediction benchmark: it reduces mean squared error by up to 20% and improves Spearman correlation by up to 10%, with the largest improvements on splits representing a distribution shift. We further demonstrate that SP predictors yield statistically significant improvements in in-silico protein design tasks. Our results highlight the critical role of representation-level variability in fitness prediction and, more broadly, underscore the need to address instability throughout the entire data science lifecycle to advance protein design.
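
The screen-then-ensemble idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: closed-form ridge regression stands in for Kermut, validation $R^2$ stands in for the pred-check metric, and the representation names, the threshold of 0.2, and the synthetic data are all hypothetical choices made for illustration only.

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression weights (a stand-in for any model class)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def r2(y_true, y_pred):
    """Coefficient of determination, used here as the screening score."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def sp_ensemble(reps_train, y_train, reps_val, y_val, reps_test, threshold=0.2):
    """Fit one model per representation, keep those that pass the
    validation screen, and average the survivors' test predictions."""
    kept, preds = [], []
    for name, X_train in reps_train.items():
        w = fit_ridge(X_train, y_train)
        if r2(y_val, reps_val[name] @ w) > threshold:  # prediction-based screen
            kept.append(name)
            preds.append(reps_test[name] @ w)
    if not kept:
        raise ValueError("no representation passed the screen")
    preds = np.stack(preds)
    # Ensemble mean as the prediction; spread across representations
    # as a rough (assumed) uncertainty proxy.
    return kept, preds.mean(axis=0), preds.std(axis=0)

# Synthetic demo: two informative representations and one of pure noise.
rng = np.random.default_rng(0)
n = 300
Z = rng.normal(size=(n, 5))                      # hidden "true" features
y = Z @ rng.normal(size=5) + 0.1 * rng.normal(size=n)
reps = {
    "rep_a": Z @ rng.normal(size=(5, 8)) + 0.1 * rng.normal(size=(n, 8)),
    "rep_b": Z @ rng.normal(size=(5, 8)) + 0.1 * rng.normal(size=(n, 8)),
    "rep_noise": rng.normal(size=(n, 10)),       # carries no signal about y
}
tr, va, te = slice(0, 150), slice(150, 225), slice(225, 300)
kept, mean_pred, spread = sp_ensemble(
    {k: v[tr] for k, v in reps.items()}, y[tr],
    {k: v[va] for k, v in reps.items()}, y[va],
    {k: v[te] for k, v in reps.items()},
)
print("kept representations:", kept)
print("test R^2 of ensemble:", round(r2(y[te], mean_pred), 3))
```

In this toy setting the screen is expected to discard the uninformative representation, so the ensemble averages only models built on representations that carry signal. The spread across surviving models is one simple way to expose representation-level variability.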
