What is DeepSeek Distill R1 Llama 70B compared to DeepSeek?

Can anyone ELI5 what it means to distill Deepseek into Llama?


@johgananda

First, welcome to the community.

Distilling a larger model into a smaller model essentially means using the larger model to "teach" the smaller one how to do its reasoning. The result won't be 100% on par reasoning-wise, but the trade-off of needing far less compute to run it, and the speed with which you can bring it up, makes it worth it.

Here is a Medium write-up on the process of distilling.
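As a rough illustration, the data-generation half of that process might look something like the sketch below. The client setup, model name, prompts, and output file here are placeholders of my own, not DeepSeek's actual pipeline:

```python
# Hypothetical sketch: use a large "teacher" model to generate reasoning-heavy
# answers that a smaller "student" model will later be fine-tuned on.
import json

from openai import OpenAI  # any chat-completions-style client would work here

# Placeholder endpoint, e.g. a locally hosted inference server for the teacher.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

prompts = [
    "If a train travels 60 km in 45 minutes, what is its average speed in km/h?",
    "Prove that the sum of two even numbers is even.",
]

with open("distill_data.jsonl", "w") as f:
    for prompt in prompts:
        # Ask the teacher (e.g. DeepSeek-R1) to answer, including its chain of thought.
        reply = client.chat.completions.create(
            model="deepseek-r1",  # placeholder teacher model name
            messages=[{"role": "user", "content": prompt}],
        )
        answer = reply.choices[0].message.content
        # Save (prompt, teacher answer) pairs; these become the supervised
        # fine-tuning examples for the smaller student model.
        f.write(json.dumps({"prompt": prompt, "response": answer}) + "\n")
```

A real pipeline does this at much larger scale and filters the teacher outputs for quality, but the shape of the data is the same: prompts paired with the teacher's reasoning and answers.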

-Coby

It's just a fine-tuned version of Llama 70B.

Only use it if you don't have a CoT (chain-of-thought) system prompt.

If you have a good CoT system prompt, use Llama 405B instead.

In my tests, the DeepSeek Llama distill performed worse than Llama 3.3 70B.

TL;DR: it's far worse than the full R1 (671B).


OK, sure, I understand that, but how do they technically 'distill' it into the other model? How do they 'upgrade' it and make it better?

@johgananda Essentially, the larger DeepSeek model is used to create training data to fine-tune / instruction-tune the smaller Llama model. It can be improved over time via further fine-tuning, or with new tuning datasets derived from DeepSeek if it undergoes further reinforcement training. This is the technique any time you hear that something is "distilled."

  • Coby
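To make that fine-tuning step concrete, here is a minimal sketch of ordinary supervised fine-tuning on the teacher-generated pairs using Hugging Face transformers. The student model name, file path, and hyperparameters are assumptions for illustration, not DeepSeek's actual recipe (and a real run would use batching, a chat template, and many epochs):

```python
# Hypothetical sketch: supervised fine-tuning of a smaller student model on
# (prompt, response) pairs generated by the larger teacher model.
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

student_name = "meta-llama/Llama-3.1-8B"  # placeholder student; the actual distill is Llama 70B
tokenizer = AutoTokenizer.from_pretrained(student_name)
model = AutoModelForCausalLM.from_pretrained(student_name, torch_dtype=torch.bfloat16)

# Load the teacher-generated data (see the generation sketch earlier in the thread).
texts = []
with open("distill_data.jsonl") as f:
    for line in f:
        ex = json.loads(line)
        texts.append(ex["prompt"] + "\n" + ex["response"] + tokenizer.eos_token)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for text in texts:
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
    # Standard causal-LM objective: labels are the input ids, so the student
    # learns to reproduce the teacher's reasoning token by token.
    out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

So "distilling" here is really just generating a dataset with the big model and running standard fine-tuning of the small model on it; the student's weights are updated with the usual next-token-prediction loss.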

Yes, I understand that this is the concept - but how is it done? Do they have some script that generates millions of calls to the model, and then they somehow put that into a structured format and then update the weights? How do they do it, since Llama hasn't released the weights?