ICML Poster An Efficient Private GPT Never Autoregressively Decodes

Zhengyi Li · Yue Guan · Kang Yang · Yu Feng · Ning Liu · Yu Yu · Jingwen Leng · Minyi Guo

East Exhibition Hall A-B #E-904

Tue 15 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract: The wide deployment of generative pre-trained transformers (GPTs) has raised privacy concerns for both clients and servers. Cryptographic primitives can secure GPT inference and protect the privacy of both parties, but they introduce considerable performance overhead. To accelerate secure inference, this study proposes a public decoding and secure verification approach that uses public GPT models, motivated by the observation that securely decoding one token and securely decoding multiple tokens incur similar latency. The client uses the public model to generate a set of tokens, which the private model then securely verifies for acceptance. The efficiency of our approach depends on the acceptance ratio of the tokens proposed by the public model, which we improve in two ways: (1) a private sampling protocol optimized for cryptographic primitives and (2) model alignment using knowledge distillation. Our approach improves the efficiency of secure decoding while maintaining the same level of privacy and generation quality as standard secure decoding. Experiments demonstrate a $2.1\times$ to $6.0\times$ speedup over standard decoding across three pairs of public-private models and different network conditions.
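The draft-then-verify flow described in the abstract can be illustrated with a plaintext stand-in. The sketch below is not the paper's private sampling protocol, which is optimized for cryptographic primitives and is not specified on this page. It uses the standard speculative-sampling acceptance rule instead, and it runs in the clear with no cryptography. The function names `sample` and `verify_draft`, and the representation of model outputs as plain probability lists, are assumptions made for illustration.

```python
import random
from typing import List, Sequence

Dist = Sequence[float]  # probability vector over a toy vocabulary


def sample(dist: Dist, rng: random.Random) -> int:
    """Draw one token id from a probability vector."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def verify_draft(
    draft: Sequence[int],
    q_dists: Sequence[Dist],
    p_dists: Sequence[Dist],
    rng: random.Random,
) -> List[int]:
    """Plaintext stand-in for one verification round (hypothetical helper).

    draft   : k tokens proposed by the public model
    q_dists : k public-model distributions, one per drafted position
    p_dists : k + 1 private-model distributions for the same positions,
              plus one extra position after the draft
    Returns the accepted prefix followed by exactly one token that comes
    from the private model (a correction or a bonus token).
    """
    out: List[int] = []
    for i, tok in enumerate(draft):
        p, q = p_dists[i], q_dists[i]
        # Accept the drafted token with probability min(1, p/q).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the normalized residual max(p - q, 0)
        # and stop, discarding the rest of the draft.
        residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        total = sum(residual)
        out.append(sample([r / total for r in residual], rng))
        return out
    # Every drafted token was accepted: emit one bonus token.
    out.append(sample(p_dists[len(draft)], rng))
    return out
```

With this rule, each round emits between 1 and k + 1 tokens, and the output distribution matches sampling from the private model alone. In the secure setting the verification would run under cryptographic protection, so the sketch shows only the control flow, not the privacy mechanism.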

Lay Summary:

As AI tools like ChatGPT become more widely used, protecting clients' privacy when they access these services becomes increasingly important. One way to ensure privacy is for the client to upload cryptographic ciphertext, allowing the private model to generate responses using this ciphertext. The cryptographic techniques guarantee that the private model cannot see the client's words but can still respond accurately. However, these methods often cause a significant slowdown.

Our research accelerates secure conversations with a private GPT by introducing a public GPT model into the process for the first time. Instead of performing all generation tasks in the secure world, we divide the task into a public part and a private part: the client uses the public GPT to suggest several possible next words. The suggested words are then encrypted, and a private GPT privately checks which words are acceptable. Since verifying suggestions is much easier than generating them, this division greatly reduces the computational burden on the secure world.

Importantly, our method maintains full privacy protection and does not compromise the quality of the generated responses. An especially promising aspect of our approach is that its performance improvement depends on the strength of the public GPT model, meaning it will only get better as public models continue to advance.
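To see why a stronger public model helps, a rough back-of-the-envelope model may be useful. The sketch below assumes, as a simplification not taken from the paper, that each suggested word is accepted independently with the same probability `alpha`, that checking stops at the first rejected word, and that every round still yields one word from the private model. The function name `expected_tokens_per_round` is hypothetical.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected number of tokens produced by one secure verification round.

    alpha : assumed per-token acceptance probability (i.i.d. simplification)
    k     : number of tokens suggested by the public model per round
    """
    if not 0.0 <= alpha <= 1.0 or k < 0:
        raise ValueError("alpha must be in [0, 1] and k must be >= 0")
    # The i-th suggestion is reached and accepted with probability alpha**i,
    # and one private-model token is always emitted:
    # 1 + alpha + alpha**2 + ... + alpha**k
    return sum(alpha ** i for i in range(k + 1))
```

If one secure round costs roughly the same whether it checks one word or several, this quantity approximates the gain per round under the stated assumptions, ignoring the cost of running the public model. It grows toward k + 1 as `alpha` approaches 1, which is the sense in which better public models yield larger gains.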
