
Knowledge Distillation, Pruning & Quantization: Techniques for Optimizing AI Models

During the long Pentecost weekend, I took advantage of the bad weather for some further education at the Hasso Plattner Institute and enrolled in the course Sustainability in the Digital Age: Efficient AI Techniques in the LLM Era, led by PD Dr. Haojin Yang.

The course provided a comprehensive overview of modern techniques for making deep learning models more efficient without significant losses in performance. Especially at a time when AI models are becoming ever larger and more computationally intensive (think climate change), methods that make these large models scalable and sustainable in practice are crucial.

The course began by explaining the development of the Transformer architecture and its impact on modern language models such as BERT and GPT. It became clear why these models are so powerful, but also why they require enormous resources. The course emphasized the urgency of finding efficient alternatives to keep AI sustainable and accessible.

One of the most exciting techniques is knowledge distillation, where the knowledge of a large teacher model is transferred to a smaller student model. This allows compact models to be developed that are suitable for deployment on end devices.
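To make this concrete, here is a minimal sketch of a typical distillation loss in PyTorch, in the spirit of Hinton et al.'s soft targets; the temperature and weighting values are illustrative defaults, not the exact recipe from the course:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Combine a soft-target loss (teacher) with a hard-target loss (labels)."""
    # Soften both distributions with the temperature before comparing them.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence, scaled by T^2 so gradients keep a comparable magnitude.
    soft_loss = F.kl_div(log_probs, soft_targets,
                         reduction="batchmean") * temperature ** 2
    # Standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

The student is then trained with this combined loss, so it learns both from the correct labels and from the teacher's full output distribution.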

Another important aspect was network pruning, which focuses on selectively removing unimportant neurons or weights from a model. The course presented both unstructured and structured pruning methods and showed how these can drastically reduce model size while minimally affecting accuracy.
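As a rough illustration (not necessarily the exact methods shown in the course), PyTorch's torch.nn.utils.prune module supports both variants; the layer shape and the 30% pruning ratio below are arbitrary choices:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(512, 256)

# Unstructured pruning: zero out the 30% of weights with the smallest magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Structured pruning: remove whole rows (output neurons), ranked by L2 norm.
prune.ln_structured(layer, name="weight", amount=0.3, n=2, dim=0)

# Make the pruning permanent by baking the mask into the weight tensor.
prune.remove(layer, "weight")
print(f"Sparsity: {(layer.weight == 0).float().mean():.1%}")
```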

Low-bit quantization, or the reduction of the precision of model parameters, was another key topic. Both 8-bit and aggressive 2-bit methods were introduced. Impressively, these techniques allow models to be heavily compressed, enabling them to run even on regular devices such as laptops or smartphones.
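To give a feel for the 8-bit case, here is a small sketch using PyTorch's built-in dynamic quantization; the toy model is a stand-in, and the aggressive 2-bit methods discussed in the course require more specialized tooling:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(768, 768),
    nn.ReLU(),
    nn.Linear(768, 2),
)

# Dynamic quantization: weights are stored as int8, and activations are
# quantized on the fly during inference.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# The quantized model stores its weights in roughly a quarter of the
# fp32 footprint while keeping the same interface.
out = quantized(torch.randn(1, 768))
```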

Dynamic networks adapt their structure depending on the complexity of the input. This happens, for example, by activating a variable number of layers or neurons. Through this flexible adaptation, an optimal balance between efficiency and performance can be achieved, which is particularly beneficial for applications with fluctuating resource requirements.
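One simple instance of this idea is an early-exit network that skips its remaining layers once an intermediate classifier is confident. The sketch below is a hypothetical toy example; all layer sizes, names, and the confidence threshold are made up for illustration:

```python
import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    """Toy dynamic-depth network: stop once a block's prediction is confident."""

    def __init__(self, dim=128, num_classes=10, num_blocks=4, threshold=0.9):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
            for _ in range(num_blocks)
        )
        # One lightweight classifier head per block serves as an exit point.
        self.heads = nn.ModuleList(
            nn.Linear(dim, num_classes) for _ in range(num_blocks)
        )
        self.threshold = threshold

    def forward(self, x):
        # Shown for a single input for simplicity; batched early exit
        # would need a per-example decision.
        for block, head in zip(self.blocks, self.heads):
            x = block(x)
            logits = head(x)
            # Exit early when the top class is confident enough, saving
            # the cost of the remaining blocks for easy inputs.
            if logits.softmax(dim=-1).max() >= self.threshold:
                return logits
        return logits  # fall through: the deepest head decides

net = EarlyExitNet()
prediction = net(torch.randn(1, 128))
```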

Although I had enrolled in this course quite some time ago, I only got around to working through the content during the Pentecost weekend. The course was originally designed for two weeks, but I had only a few days left, so in the end I had to hurry a bit, because I kept diving down one rabbit hole or another. It was just too fascinating! I therefore completed the final exam at the very last minute. Another challenge, aside from the time pressure, was that the course was taught in English. Despite the language barrier, I still managed to place among the top 20% of participants.

Author
Tim Peters
I am a detective in the service of technology! Always on the lookout, and yet completely in my element. A software tester with a passion.



Comments

If you would like to leave a comment, please reply to this post. All responses received will then be displayed here.

