Can anyone ELI5 what it means to distill Deepseek into Llama?
First, welcome to the community.
Distilling a larger model into a smaller model is essentially using the larger one to "teach" the smaller one how to do its reasoning. The smaller model will not be 100% on par reasoning-wise, but the trade-off is worth it: it needs far less compute to run and is much faster to spin up.
Here is a Medium write-up on the process of distilling
-Coby
It's just a fine-tuned version of Llama 70B.
Only use it if you don't have a CoT system prompt (rough example of what I mean below).
If you have a good CoT system prompt, use Llama 405B instead.
In my tests the DeepSeek Llama distill performs worse than plain Llama 3.3 70B.
TL;DR: it's far worse than the full R1 (671B).
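For anyone wondering what a "CoT system prompt" actually looks like, here's my own illustration of the idea (not the exact prompt the poster uses), written as an OpenAI-style chat message list:

```python
# A "chain-of-thought" system prompt just tells a non-reasoning model (e.g. Llama 3.1 405B)
# to reason step by step before answering, instead of relying on weights trained to do it.
COT_SYSTEM_PROMPT = (
    "You are a careful problem solver. Before giving your final answer, "
    "reason through the problem step by step inside <thinking> tags, "
    "then state only the final result after 'Answer:'."
)

messages = [
    {"role": "system", "content": COT_SYSTEM_PROMPT},
    {"role": "user", "content": "A bat and a ball cost $1.10 total; the bat costs $1 more than the ball. How much is the ball?"},
]
# `messages` can be sent to any chat-completion style API. The R1 distills bake this
# step-by-step behaviour into the weights, so they don't need a prompt like this.
```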
OK sure, I understand that, but how are they technically "distilling" it into the other model? How do they "upgrade" it and make it better?
@johgananda Essentially, the larger DeepSeek model is used to generate training data, which is then used to fine-tune/instruction-tune the smaller Llama model. The distill can be improved over time through further fine-tuning, or with new tuning datasets derived from DeepSeek if it goes through additional reinforcement training. That's the technique whenever you hear that something is "distilled."
- Coby
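To make that concrete, here's a minimal sketch of the recipe in Python. This is just an illustration of the general idea, not DeepSeek's actual pipeline: the model names are the public Hugging Face checkpoints, the prompt list is a toy stand-in for the hundreds of thousands of curated examples they actually used, and in reality the teacher runs on a large inference cluster rather than in a little loop like this.

```python
# Step 1: the big "teacher" (DeepSeek-R1) generates reasoning traces for a pile of prompts.
# Step 2: the smaller "student" (Llama) is fine-tuned on those (prompt, trace) pairs with
#         ordinary next-token-prediction training. Only the student's weights ever change.

import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# --- Step 1: teacher generates the distillation dataset ---------------------
teacher_name = "deepseek-ai/DeepSeek-R1"  # 671B teacher; in practice served on a cluster
tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name, torch_dtype=torch.bfloat16)

# Toy prompt list; the real dataset is hundreds of thousands of curated, filtered examples.
prompts = [
    "Prove that the sum of two even numbers is even.",
    "A train travels 120 km in 1.5 hours. What is its average speed?",
]

with open("distill_data.jsonl", "w") as f:
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        out = teacher.generate(ids, max_new_tokens=2048)  # output includes the <think> chain of thought
        f.write(json.dumps({"prompt": p, "completion": tok.decode(out[0], skip_special_tokens=True)}) + "\n")

# --- Step 2: ordinary supervised fine-tuning of the student -----------------
student_name = "meta-llama/Llama-3.3-70B-Instruct"  # base for DeepSeek-R1-Distill-Llama-70B
student = AutoModelForCausalLM.from_pretrained(student_name, torch_dtype=torch.bfloat16)
# From here it's a standard SFT run: tokenize the (prompt, completion) pairs from
# distill_data.jsonl and train the student to predict the teacher's tokens
# (e.g. with trl's SFTTrainer or a plain PyTorch training loop).
# The teacher's weights are never touched; it's purely a data generator.
```

The "upgrade" is just that: the Llama weights get nudged toward imitating R1's step-by-step outputs, while the architecture stays the same.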
Yes, I understand that this is the concept - but how is it done? Do they have some script that generates millions of calls to the model, and then they somehow put that into a structured format and update the weights? How do they do it since Llama hasn't released the weights?