Knowledge Distillation in Large Language Models

Introduction

LLM Distillation is a specialized form of Knowledge Distillation (KD) aimed at compressing large-scale language models while preserving as much of their capability as possible. It enables smaller, more efficient models to approximate the behavior of massive models, making them deployable on a broader range of devices.

Knowledge Distillation (KD)

Knowledge Distillation (KD) is a technique in deep learning where a smaller model (called the student model) learns from a larger, pre-trained model (called the teacher model). The idea is that instead of just training the smaller model from scratch on raw data, it learns from the soft predictions (probabilistic outputs) of the teacher, which contain more information than hard labels.
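
As a rough illustration of what those soft predictions look like, the snippet below applies a temperature-scaled softmax to some hypothetical teacher logits; the logit values and the temperature of 4 are made up for the example:

import torch
import torch.nn.functional as F

# Hypothetical teacher logits for a 3-class problem (illustrative values only)
teacher_logits = torch.tensor([1.2, 2.0, 4.5])

hard_prediction = teacher_logits.argmax()                   # index of the single best class
soft_targets_t1 = F.softmax(teacher_logits, dim=-1)         # standard softmax (temperature T = 1)
soft_targets_t4 = F.softmax(teacher_logits / 4.0, dim=-1)   # T = 4: a smoother distribution

print(hard_prediction)   # tensor(2)
print(soft_targets_t1)   # roughly [0.03, 0.07, 0.89]: sharply peaked
print(soft_targets_t4)   # roughly [0.22, 0.27, 0.51]: smoother, carries more relative information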

How LLM Distillation Works

1. Selecting the Teacher Model

Choose a pre-trained large language model (e.g., GPT-4, LLaMA, PaLM, BERT-large) as the teacher model. The teacher should already be trained on a large dataset and should perform well on the desired NLP tasks.

2. Creating the Student Model

The student model should be a smaller, computationally efficient version of the teacher model. It can have fewer layers, reduced embedding sizes, or fewer attention heads while maintaining core capabilities.

Some commonly used student models include DistilBERT (from BERT), TinyBERT, MobileBERT, and MiniLM.

3. Distillation Training Process

The student model is trained using knowledge from the teacher model through various techniques:

  • Soft Labels: The student learns from the probability distributions of the teacher’s predictions instead of just hard labels.
  • Feature-Based Learning: Intermediate layer representations of the teacher model are transferred to the student.
  • Loss Function Optimization: The training loss combines standard cross-entropy on the hard labels with KL divergence (a measure of the difference between two probability distributions), as sketched below.
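
A minimal sketch of such a combined objective, written in PyTorch; the hyperparameter names (temperature, alpha) and their default values are illustrative assumptions, not taken from any specific paper:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=2.0, alpha=0.5):
    # Standard cross-entropy against the ground-truth (hard) labels
    ce_loss = F.cross_entropy(student_logits, hard_labels)

    # KL divergence between temperature-softened teacher and student distributions
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    kd_loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

    # The T^2 factor keeps the gradient scale of the soft term comparable across temperatures
    return alpha * ce_loss + (1.0 - alpha) * (temperature ** 2) * kd_loss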

4. Fine-tuning and Evaluation

Fine-tune the student model on task-specific datasets to enhance performance, and apply quantization or pruning techniques to further reduce memory consumption.
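
For instance, post-training dynamic quantization in PyTorch is one way to shrink a distilled student further; the toy model below is only a stand-in for a real student network:

import torch
import torch.nn as nn

# Toy stand-in for a distilled student model (a real student would be a Transformer)
student_model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 3))

# Dynamic quantization stores Linear weights in int8 and quantizes activations on the fly at inference
quantized_student = torch.quantization.quantize_dynamic(
    student_model, {nn.Linear}, dtype=torch.qint8
)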

Compare the student model’s performance against the teacher model using:

  • Accuracy & F1 Score (for classification tasks)
  • Perplexity (for language modeling tasks), as illustrated below
  • Inference speed & latency
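
Perplexity is simply the exponential of the average per-token cross-entropy, so the comparison takes only a couple of lines; the loss values here are made up for illustration:

import math

# Average per-token cross-entropy (in nats) measured on a held-out set; values are illustrative
teacher_ce = 2.10
student_ce = 2.35

teacher_ppl = math.exp(teacher_ce)   # roughly 8.17
student_ppl = math.exp(student_ce)   # roughly 10.49

print(f"teacher perplexity: {teacher_ppl:.2f}, student perplexity: {student_ppl:.2f}")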

What Are Hard and Soft Labels?

Example: Sentiment Classification

Input Sentence:

“I love this movie!”

Classes:

  • Negative
  • Neutral
  • Positive

Hard Label (Ground Truth)

This is the actual label from a human-annotated dataset:

hard_label = [0, 0, 1]  # One-hot: Positive sentiment

This tells the model:

“The correct answer is Positive. All other answers are completely wrong.”

Soft Label (From Teacher Model)

This comes from the teacher model’s softmax output:

soft_label = [0.05, 0.10, 0.85]  # Probabilities from the teacher

This tells the model:

“Positive is most likely, but Neutral has some chance, and Negative is unlikely.”

Why Soft Labels Help in Distillation

Soft labels carry rich signals about the teacher's confidence and about how it ranks the incorrect classes, which help the student learn more generalizable patterns rather than just memorizing the right answer.
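
To make this concrete, the toy calculation below compares the cross-entropy a student would incur against the hard label versus against the teacher's soft label from the example above; the student probabilities are made up:

import math

student_probs = [0.10, 0.30, 0.60]   # hypothetical student output
hard_label    = [0.0, 0.0, 1.0]      # one-hot: Positive
soft_label    = [0.05, 0.10, 0.85]   # teacher's soft prediction

def cross_entropy(target, predicted):
    # H(target, predicted) = -sum_i target_i * log(predicted_i)
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

print(cross_entropy(hard_label, student_probs))   # ~0.51: only the "Positive" column matters
print(cross_entropy(soft_label, student_probs))   # ~0.67: the student is also graded on Negative and Neutral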

Techniques Used in LLM Distillation

1. Logit-Based Distillation (Response-Based)

The student model is trained to match the soft probability distributions (soft labels) output by the teacher, rather than relying only on hard labels.

  • Loss Function:
    Kullback–Leibler (KL) divergence is commonly used to measure the difference between the teacher and student output distributions, as written out below.
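
For reference, with teacher distribution P and student distribution Q over the classes (or the vocabulary), the divergence being minimized is:

D_KL(P ‖ Q) = Σ_i P(i) · log( P(i) / Q(i) )

It is zero only when the student reproduces the teacher's distribution exactly, and it is not symmetric in P and Q.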

2. Feature-Based Distillation

Instead of only mimicking the final output, the student is trained to match the hidden representations from one or more intermediate layers of the teacher.

  • Loss Function:
    Typically uses Mean Squared Error (MSE) between teacher and student layer outputs (see the sketch after this list).
  • Purpose:
    Encourages the student to learn not just what the teacher predicts, but how it arrives at its predictions.
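
A rough sketch of the idea, assuming the teacher and student use different hidden sizes so the student's features must first be projected into the teacher's space; all dimensions and names are illustrative:

import torch
import torch.nn as nn
import torch.nn.functional as F

batch, seq_len = 8, 128
teacher_dim, student_dim = 1024, 512

# Hidden states from one chosen layer of each model (random stand-ins here)
teacher_hidden = torch.randn(batch, seq_len, teacher_dim)
student_hidden = torch.randn(batch, seq_len, student_dim)

# Learned projection that maps student features into the teacher's representation space
projection = nn.Linear(student_dim, teacher_dim)

feature_loss = F.mse_loss(projection(student_hidden), teacher_hidden)
print(feature_loss)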

3. Progressive Layer Dropping

This technique gradually removes or skips teacher layers during training, helping the student to learn from fewer, more meaningful signals.

  • Helps in reducing redundancy in deep models.
  • Encourages student networks to generalize better from sparse supervision.

4. Task-Specific Distillation

After general distillation, the student is further fine-tuned on downstream tasks such as:

  • Sentiment analysis
  • Summarization
  • Code generation
  • Question answering

This stage helps the student specialize in real-world applications and, in some cases, match or outperform other models of its size.

Benefits of LLM Distillation (Summary)

LLM distillation enhances the efficiency and deployability of large language models by transferring knowledge from a large teacher model to a smaller student model. Key advantages include:

1. Reduced Model Size

  • Student models are significantly smaller while retaining high performance.
  • Leads to:
    • Faster loading and inference
    • Lower storage requirements

2. Improved Inference Speed

  • Ideal for real-time applications like chatbots and virtual assistants.
  • Enables deployment on resource-constrained devices (e.g., smartphones, edge devices).

3. Lower Computational Costs

  • Requires less compute power, reducing cloud and on-premise infrastructure costs.
  • More energy efficient, making it suitable for large-scale and sustainable deployments.

4. Wider Deployment & Accessibility

  • Easily deployable on mobile and edge devices.
  • Expands AI access to more industries, including healthcare, finance, and education.
  • Supports offline use cases and improves data privacy.

Applications of LLM Distillation

  1. Deploying LLMs on Edge Devices: Mobile apps, IoT devices, and embedded systems benefit from lightweight LLMs that maintain high accuracy.
  2. Optimizing Chatbots and Virtual Assistants: Virtual assistants like Siri, Google Assistant, and Alexa can use distilled models for fast and efficient responses.
  3. Efficient Search and Recommendation Systems: Search engines and personalized recommendation models can utilize small but effective LLMs to deliver results quickly.
  4. Privacy-Preserving AI: Distilled models allow AI to be deployed on-device, reducing the need for cloud-based processing and improving privacy.

Example: DeepSeek-R1-Distill-Llama-70B

DeepSeek-R1-Distill-Llama-70B is a distilled large language model based on Llama-3.3-70B-Instruct, fine-tuned on reasoning outputs generated by DeepSeek-R1. The distilled model achieves strong results across multiple reasoning benchmarks, including:

  • AIME 2024 pass@1: 70.0
  • MATH-500 pass@1: 94.5
  • CodeForces Rating: 1633