Poster in Workshop: 2nd Generative AI for Biology Workshop
Sparse Autoencoders in Protein Engineering Campaigns: Steering and Model Diffing
Gerard Corominas · Filippo Stocco · Noelia Ferruz
Keywords: [ Sparse autoencoders ] [ Protein language models ] [ Mechanistic interpretability ] [ Enzyme design ]
Protein Language Models (pLMs) have proven to be versatile tools for protein design, but their internal workings remain difficult to interpret. Here, we implement a mechanistic interpretability framework and apply it in two scenarios. First, by training sparse autoencoders (SAEs) on the model's activations, we identify and annotate features relevant to enzyme variant activity through a two-stage process of candidate selection followed by causal intervention. During sequence generation, we steer the model by clamping or ablating key SAE features, which increases the predicted enzyme activity. Second, we compare pLM checkpoints before and after three rounds of Reinforcement Learning (RL) by examining sequence regions with high divergence in per-token log-likelihood, detecting the residues that most strongly align with higher predicted affinities.
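The steering step described above — clamping or ablating SAE features in a model activation — can be sketched as follows. This is a minimal illustration with a randomly initialized toy SAE and made-up dimensions, not the authors' trained model; the helper names (`sae_encode`, `steer`) and the delta-based edit are assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_sae = 16, 64  # hypothetical residual-stream and dictionary sizes
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
b_dec = np.zeros(d_model)

def sae_encode(x):
    # sparse feature activations: f = relu(x @ W_enc + b_enc)
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(f):
    # reconstruct the activation from the feature dictionary
    return f @ W_dec + b_dec

def steer(x, clamp=None, ablate=()):
    """Clamp selected SAE features to fixed values, zero out ablated ones,
    and apply the change to the original activation."""
    f = sae_encode(x)
    f_mod = f.copy()
    for idx, val in (clamp or {}).items():
        f_mod[..., idx] = val      # clamp feature idx to a target magnitude
    for idx in ablate:
        f_mod[..., idx] = 0.0      # ablate feature idx entirely
    # add the edit as a delta so the SAE's reconstruction error is untouched
    return x + (sae_decode(f_mod) - sae_decode(f))

x = rng.normal(size=(d_model,))          # stand-in for one residual-stream vector
x_steered = steer(x, clamp={3: 5.0}, ablate=[7])
```

In a real campaign, a function like `steer` would be applied to the pLM's hidden states at each generation step (e.g. via a forward hook) so that sampled sequences reflect the clamped features.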
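The model-diffing step can likewise be sketched: given per-token log-likelihoods of the same sequence under the pre-RL and post-RL checkpoints, rank positions by how much the RL model raised the likelihood. The function names and the toy log-likelihood values below are illustrative assumptions, not data from the paper.

```python
import numpy as np

def loglik_divergence(logp_base, logp_rl):
    """Per-position change in token log-likelihood between two checkpoints."""
    return np.asarray(logp_rl) - np.asarray(logp_base)

def top_divergent_positions(logp_base, logp_rl, k=3):
    """Positions where the RL checkpoint most increased the sequence likelihood."""
    delta = loglik_divergence(logp_base, logp_rl)
    order = np.argsort(delta)[::-1]          # descending by divergence
    return [(int(i), float(delta[i])) for i in order[:k]]

# toy per-token log-likelihoods for one 8-residue sequence (hypothetical values)
logp_before = [-2.1, -0.9, -3.0, -1.2, -2.5, -0.7, -1.9, -2.2]
logp_after  = [-2.0, -0.8, -1.1, -1.2, -2.6, -0.7, -0.4, -2.1]

hits = top_divergent_positions(logp_before, logp_after, k=2)
```

Here positions 2 and 6 show the largest gains, so those residues would be flagged as candidates most aligned with the reward the RL rounds optimized.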