

Poster in Workshop: Tiny Titans: The next wave of On-Device Learning for Foundation Models (TTODLer-FM)

TensorSLM: Energy-efficient Embedding Compression of Sub-billion Parameter Language Models on Low-end Devices

Mingxue Xu · Yao Lei Xu · Danilo Mandic

Fri 18 Jul 1 p.m. PDT — 1:45 p.m. PDT

Abstract: Small Language Models (SLMs, or on-device LMs) have significantly fewer parameters than Large Language Models (LLMs) and are typically deployed on low-end devices, such as mobile phones and single-board computers. Unlike LLMs, which rely on increasing model size for better generalisation, SLMs intended for edge applications are expected to have **adaptivity** to their deployment environments and **energy efficiency** under device battery-life constraints, requirements that datacenter-deployed LLMs do not face. This paper addresses both requirements by proposing a training-free token embedding compression approach using Tensor-Train Decomposition (TTD). Each pre-trained token embedding vector is converted into a lower-dimensional Matrix Product State (MPS). We comprehensively evaluate the extracted low-rank structures in terms of compression ratio, language task performance, latency, and energy consumption on a typical low-end device, i.e. a Raspberry Pi. Taking the sub-billion-parameter versions of the GPT-2/Cerebras-GPT and OPT models as examples, our approach achieves language task performance comparable to the original model with around $2.0\times$ embedding layer compression, while the energy consumption of a single query drops by half.
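The core operation described in the abstract is the standard TT-SVD routine: reshape each embedding vector into a small multi-index tensor and factor it into a chain of MPS cores. Below is a minimal NumPy sketch of that routine, not the authors' implementation; the embedding dimension 768 = 4×4×4×4×3, the chosen reshaping, and the maximum TT-rank of 8 are illustrative assumptions, and the random vector merely stands in for a pre-trained embedding, which is where exploitable low-rank structure would actually come from.

```python
# Minimal sketch (assumptions noted above): TT-SVD of one embedding vector into MPS cores.
import numpy as np

def tt_decompose(vec, shape, max_rank):
    """Reshape a 1-D embedding into a tensor and split it into MPS (TT) cores via sequential SVDs."""
    tensor = vec.reshape(shape)
    cores, rank = [], 1
    unfolding = tensor.reshape(rank * shape[0], -1)
    for k in range(len(shape) - 1):
        U, S, Vt = np.linalg.svd(unfolding, full_matrices=False)
        r = min(max_rank, len(S))                          # truncate to the target TT-rank
        cores.append(U[:, :r].reshape(rank, shape[k], r))  # core k: (r_{k-1}, n_k, r_k)
        unfolding = (np.diag(S[:r]) @ Vt[:r]).reshape(r * shape[k + 1], -1)
        rank = r
    cores.append(unfolding.reshape(rank, shape[-1], 1))    # final core
    return cores

def tt_reconstruct(cores):
    """Contract the MPS cores back into the full embedding vector."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.reshape(-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    embedding = rng.standard_normal(768)                   # stand-in for one pre-trained token embedding
    cores = tt_decompose(embedding, shape=(4, 4, 4, 4, 3), max_rank=8)
    approx = tt_reconstruct(cores)
    stored = sum(c.size for c in cores)
    print(f"compression ratio: {embedding.size / stored:.2f}x")
    print(f"relative error:    {np.linalg.norm(embedding - approx) / np.linalg.norm(embedding):.3f}")
```

Since the decomposition is training-free, the only tunable quantities are the reshaping of the embedding dimension and the maximum TT-rank, which together set the trade-off between compression ratio and reconstruction error reported in the paper.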
