Poster
EncryptedLLM: Privacy-Preserving Large Language Model Inference via GPU-Accelerated Fully Homomorphic Encryption
Leo de Castro · Daniel Escudero · Adya Agrawal · Antigoni Polychroniadou · Manuela Veloso
East Exhibition Hall A-B #E-1008
Large language models (LLMs) are typically deployed in cloud environments. To use these models, the user's data must be sent to an external cloud machine. For sensitive queries (e.g., topics related to healthcare or finance), this represents a major privacy concern. This work improves the efficiency of techniques for privately evaluating models over sensitive queries, allowing users to safely send their queries to a cloud machine and receive the model outputs without the cloud learning anything about their data. The main underlying tool is an advanced cryptographic primitive called fully homomorphic encryption (FHE), and a key technical contribution of this work is a new GPU-accelerated implementation of FHE. We also develop methods to evaluate LLMs under FHE while preserving the quality of the model outputs.
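To illustrate the client–cloud FHE workflow described above, the following minimal Python sketch uses the open-source TenSEAL library (CKKS scheme) rather than the paper's GPU-accelerated FHE implementation; the toy linear layer, parameter choices, and variable names are illustrative assumptions, not the authors' system.

# Minimal sketch of privacy-preserving inference with FHE (assumes TenSEAL/CKKS,
# not the paper's GPU-accelerated implementation).
import tenseal as ts

# --- Client: key generation and encryption ---
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
context.global_scale = 2 ** 40
context.generate_galois_keys()

query = [0.25, -1.5, 3.0, 0.75]             # sensitive input features (hypothetical)
enc_query = ts.ckks_vector(context, query)  # ciphertext sent to the cloud

# --- Server: homomorphic evaluation of a toy linear layer ---
# The server operates only on ciphertexts; weights and bias are plaintext model parameters.
weights = [0.5, 0.1, -0.2, 0.8]
bias = 0.3
enc_output = enc_query.dot(weights) + bias  # computed entirely under encryption

# --- Client: decryption of the result ---
print(enc_output.decrypt())  # approximately sum(q * w for q, w in zip(query, weights)) + bias

In the system described in the poster, the toy linear layer above is replaced by full LLM evaluation, with the FHE operations accelerated on GPU.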