Poster
On Exact Bit-level Reversible Transformers Without Changing Architecture
Guoqiang Zhang · John Lewis · W. Bastiaan Kleijn
East Exhibition Hall A-B #E-3505
Nowadays, almost all popular large language models (LLMs) use the transformer architecture, which consists of a sequence of blocks, to learn from data. Fine-tuning a pre-trained transformer-based LLM for downstream tasks usually involves a small or medium-sized dataset and is prone to overfitting, meaning that the LLM tends to memorize the data rather than understand it.

Our work proposes a new technique named bidirectional integration approximation (BDIA) to reduce overfitting when fine-tuning a transformer-based LLM. The basic idea is to fine-tune an ensemble of transformer-based LLMs parameterized by a set of binary random variables, which forces the LLMs in the ensemble to understand the data rather than memorize it. After fine-tuning, we take the average of all LLMs in the ensemble as the final model to be employed in practice.

If needed, BDIA can also be implemented to save GPU memory during fine-tuning. To do so, we quantize the output of each block of each transformer model in the ensemble as input data is fed through the model. With BDIA and quantization, each block becomes exactly reversible at the bit level, so its input can be reconstructed during back-propagation rather than stored, making it feasible to update each block in each model of the ensemble on the fly.

Experiments on natural language generation, translation, and image classification confirm that our new BDIA technique can indeed reduce overfitting and encourage the transformer model to understand the data.
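To make the idea more concrete, below is a minimal, hypothetical sketch of what a BDIA-style coupling between consecutive block states could look like. The abstract does not give the update rule, so the specific form assumed here, x_{k+1} = gamma * x_{k-1} + (1 - gamma) * x_k + f_k(x_k) with a per-sample binary gamma in {-0.5, +0.5} during fine-tuning and gamma = 0 (the ensemble average) at inference, is our illustrative assumption, as are all names such as BDIAStack, ToyBlock, and invert_step. It is a sketch of the general technique, not the authors' implementation.

```python
# Hypothetical sketch of a BDIA-style block coupling (not the authors' code).
# Assumed update (not stated in the abstract):
#     x_{k+1} = gamma * x_{k-1} + (1 - gamma) * x_k + f_k(x_k),
# with a per-sample binary gamma in {-0.5, +0.5} during fine-tuning and
# gamma = 0 (the ensemble average) at inference, which recovers the usual
# residual update x_{k+1} = x_k + f_k(x_k).
import torch
import torch.nn as nn


class ToyBlock(nn.Module):
    """Stand-in for one transformer block's residual branch f_k."""
    def __init__(self, dim: int):
        super().__init__()
        self.ff = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ff(x)


class BDIAStack(nn.Module):
    """Stack of blocks with a (hypothetical) BDIA coupling between states."""
    def __init__(self, dim: int, depth: int):
        super().__init__()
        self.blocks = nn.ModuleList(ToyBlock(dim) for _ in range(depth))

    def forward(self, x: torch.Tensor, training_gamma: bool = True) -> torch.Tensor:
        # One binary gamma per sample: each draw selects one member of the
        # implicit ensemble. At inference gamma = 0, i.e. the averaged model.
        if training_gamma:
            gamma = torch.randint(0, 2, (x.shape[0], 1, 1), device=x.device).float() - 0.5
        else:
            gamma = torch.zeros(x.shape[0], 1, 1, device=x.device)
        x_prev, x_cur = x, x  # initialise both states with the block-0 input
        for block in self.blocks:
            x_next = gamma * x_prev + (1.0 - gamma) * x_cur + block(x_cur)
            x_prev, x_cur = x_cur, x_next
        return x_cur


def invert_step(block: nn.Module, x_next: torch.Tensor, x_cur: torch.Tensor,
                gamma: torch.Tensor) -> torch.Tensor:
    """Recover x_prev from (x_next, x_cur) by inverting the coupling above.

    This is where the memory saving would come from: instead of storing every
    block input, it is recomputed during back-propagation. In floating point
    the subtraction introduces rounding error, so the states would need to be
    quantised for the reconstruction to be exact at the bit level, matching
    the role the abstract assigns to quantisation.
    """
    return (x_next - (1.0 - gamma) * x_cur - block(x_cur)) / gamma


# Usage: gamma is resampled per batch element during fine-tuning.
model = BDIAStack(dim=16, depth=4)
tokens = torch.randn(2, 8, 16)                    # (batch, sequence, dim)
out_train = model(tokens, training_gamma=True)    # one ensemble member per sample
out_infer = model(tokens, training_gamma=False)   # averaged model (gamma = 0)
```

The design point of the sketch is that the binary variable only affects how block states are mixed, so the architecture itself is unchanged at inference (gamma = 0 collapses to the plain residual stack), and that the coupling is algebraically invertible whenever gamma is non-zero, which is what would allow block inputs to be recomputed rather than stored.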