

Poster in Workshop: The 2nd Workshop on Reliable and Responsible Foundation Models

Semi-Nonnegative GPT: Towards Monosemantic Representations

Junyi Li · Jinqi Liu · Qi Zhang · Yisen Wang

Keywords: [ autoregressive model ] [ monosemantic ] [ interpretability ] [ representation learning ]


Abstract:

Autoregressive models have achieved remarkable success across various modalities and downstream tasks. However, the black-box nature of these models limits their interpretability and broader applicability. To address this, recent efforts have focused on improving interpretability by obtaining monosemantic models, in which each dimension of the representation corresponds to a single natural concept in the data. In this paper, we introduce the Semi-Nonnegative Generative Pretrained Transformer (Semi-NGPT), a theoretically guaranteed model that intrinsically learns monosemantic representations by imposing non-negativity constraints during the pretraining phase. We find that our method yields representations with high sparsity and orthogonality and generalizes well to downstream tasks, both theoretically and empirically. Our findings establish this technique as a simple yet powerful approach for enhancing interpretability in autoregressive models while maintaining strong downstream performance.
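
The paper's implementation is not included on this page. The following is a minimal PyTorch sketch of the general idea only, assuming the non-negativity constraint is imposed elementwise (via a ReLU) on the final hidden states of a toy causal transformer before the language-modeling head; the model sizes, the choice of ReLU, and the training step are illustrative assumptions, not the authors' method.

# Minimal sketch: non-negative token representations in a toy causal LM.
# Everything below (sizes, ReLU as the constraint, the loss computation)
# is an illustrative assumption, not the Semi-NGPT implementation.
import torch
import torch.nn as nn


class SemiNonnegativeLM(nn.Module):
    """Toy causal LM whose final token representations are kept non-negative."""

    def __init__(self, vocab_size=1000, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # Causal mask so each position attends only to earlier tokens.
        seq_len = tokens.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.backbone(self.embed(tokens), mask=mask)
        # Non-negativity constraint: representations are >= 0 elementwise,
        # while the lm_head weights stay unconstrained ("semi"-nonnegative).
        h = torch.relu(h)
        return self.lm_head(h), h


model = SemiNonnegativeLM()
tokens = torch.randint(0, 1000, (2, 16))          # dummy batch of token ids
logits, reps = model(tokens)
loss = nn.functional.cross_entropy(               # next-token prediction loss
    logits[:, :-1].reshape(-1, logits.size(-1)),
    tokens[:, 1:].reshape(-1),
)
loss.backward()
print(bool((reps >= 0).all()))                    # True: non-negative by construction

In this sketch the constraint is built into the forward pass, so pretraining with the usual next-token objective directly optimizes non-negative representations; the paper's stated claim is that such constrained representations end up sparse, near-orthogonal, and more monosemantic.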
