Article Open access Peer-reviewed

Deeploy: Enabling Energy-Efficient Deployment of Small Language Models on Heterogeneous Microcontrollers

2024; Institute of Electrical and Electronics Engineers; Volume: 43; Issue: 11; Language: English

DOI

10.1109/tcad.2024.3443718

ISSN

1937-4151

Authors

Moritz Scherer, Luka Macan, Victor J. B. Jung, Philip Wiese, Luca Bompani, Alessio Burrello, Francesco Conti, Luca Benini

Topic(s)

Parallel Computing and Optimization Techniques

Abstract

With the rise of embodied foundation models (EFMs), most notably small language models (SLMs), adapting Transformers for edge applications has become a very active field of research. However, achieving end-to-end deployment of SLMs on microcontroller (MCU)-class chips without high-bandwidth off-chip main memory access is still an open challenge. In this article, we demonstrate high-efficiency end-to-end SLM deployment on a multicore RISC-V (RV32) MCU augmented with ML instruction extensions and a hardware neural processing unit (NPU). To automate the exploration of the constrained, multidimensional memory-versus-computation tradeoffs involved in aggressive SLM deployment on heterogeneous (multicore+NPU) resources, we introduce Deeploy, a novel deep neural network (DNN) compiler, which generates highly optimized C code requiring minimal runtime support. We demonstrate that Deeploy generates end-to-end code for executing SLMs, fully exploiting the RV32 cores' instruction extensions and the NPU. We achieve leading-edge energy and throughput of 490 μJ per token at 340 tokens per second for an SLM trained on the TinyStories dataset, running for the first time on an MCU-class device without external memory.
