

Poster in Workshop: The 2nd Workshop on Reliable and Responsible Foundation Models

Semi-Nonnegative GPT: Towards Monosemantic Representations

Junyi Li · Jinqi Liu · Qi Zhang · Yisen Wang

Keywords: [ autoregressive model ] [ monosemantic ] [ interpretability ] [ representation learning ]


Abstract:

Autoregressive models have achieved remarkable success across various modalities and downstream tasks. However, the black-box nature of these models limits their interpretability and broader applicability. To address this, recent efforts have focused on improving interpretability by obtaining monosemantic models, in which each dimension of the representation corresponds to a single natural concept in the data. In this paper, we introduce the Semi-Nonnegative Generative Pretrained Transformer (Semi-NGPT), a theoretically guaranteed model that intrinsically learns monosemantic representations by imposing non-negativity constraints during the pretraining phase. We find that our method yields representations with high sparsity and orthogonality and generalizes well to downstream tasks, both theoretically and empirically. Our findings establish this technique as a simple yet powerful approach for enhancing interpretability in autoregressive models while maintaining strong downstream performance.
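
The paper's implementation is not included on this page. The following is a minimal PyTorch sketch of the general idea only, assuming the non-negativity constraint is imposed elementwise (via a ReLU) on the final hidden states of a toy causal transformer before the language-modeling head; the model sizes, the choice of ReLU, and the training step are illustrative assumptions, not the authors' method.

# Minimal sketch: non-negative token representations in a toy causal LM.
# Everything below (sizes, ReLU as the constraint, the loss computation)
# is an illustrative assumption, not the Semi-NGPT implementation.
import torch
import torch.nn as nn


class SemiNonnegativeLM(nn.Module):
    """Toy causal LM whose final token representations are kept non-negative."""

    def __init__(self, vocab_size=1000, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # Causal mask so each position attends only to earlier tokens.
        seq_len = tokens.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.backbone(self.embed(tokens), mask=mask)
        # Non-negativity constraint: representations are >= 0 elementwise,
        # while the lm_head weights stay unconstrained ("semi"-nonnegative).
        h = torch.relu(h)
        return self.lm_head(h), h


model = SemiNonnegativeLM()
tokens = torch.randint(0, 1000, (2, 16))          # dummy batch of token ids
logits, reps = model(tokens)
loss = nn.functional.cross_entropy(               # next-token prediction loss
    logits[:, :-1].reshape(-1, logits.size(-1)),
    tokens[:, 1:].reshape(-1),
)
loss.backward()
print(bool((reps >= 0).all()))                    # True: non-negative by construction

In this sketch the constraint is built into the forward pass, so pretraining with the usual next-token objective directly optimizes non-negative representations; the paper's stated claim is that such constrained representations end up sparse, near-orthogonal, and more monosemantic.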
