Can anyone ELI5 what it means to distill Deepseek into Llama?
First, welcome to the community.
Distilling a larger model into a smaller model is essentially using the larger one to "teach" the smaller one how to do its reasoning. The smaller model will not be 100% on par reasoning-wise, but the trade-off is worth it: it needs far less compute to run and is much faster to spin up.
Here is a Medium write-up on the process of distilling
-Coby
It's just a fine-tuned version of Llama 70B.
Only use it if you don't have a CoT system prompt (rough example of what I mean below).
If you have a good CoT system prompt, use Llama 405B instead.
In my tests the DeepSeek Llama distill performs worse than plain Llama 3.3 70B.
TL;DR: it's far worse than the full R1 (671B).
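For anyone wondering what a "CoT system prompt" actually looks like, here's my own illustration of the idea (not the exact prompt the poster uses), written as an OpenAI-style chat message list:

```python
# A "chain-of-thought" system prompt just tells a non-reasoning model (e.g. Llama 3.1 405B)
# to reason step by step before answering, instead of relying on weights trained to do it.
COT_SYSTEM_PROMPT = (
    "You are a careful problem solver. Before giving your final answer, "
    "reason through the problem step by step inside <thinking> tags, "
    "then state only the final result after 'Answer:'."
)

messages = [
    {"role": "system", "content": COT_SYSTEM_PROMPT},
    {"role": "user", "content": "A bat and a ball cost $1.10 total; the bat costs $1 more than the ball. How much is the ball?"},
]
# `messages` can be sent to any chat-completion style API. The R1 distills bake this
# step-by-step behaviour into the weights, so they don't need a prompt like this.
```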
OK sure, I understand that, but how are they technically "distilling" it into the other model? How do they "upgrade" it and make it better?
@johgananda Essentially, the larger DeepSeek model is used to generate training data, which is then used to fine-tune/instruction-tune the smaller Llama model. The distill can be improved over time through further fine-tuning, or with new tuning datasets derived from DeepSeek if it goes through additional reinforcement training. That's the technique whenever you hear that something is "distilled."
- Coby
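To make that concrete, here's a minimal sketch of the recipe in Python. This is just an illustration of the general idea, not DeepSeek's actual pipeline: the model names are the public Hugging Face checkpoints, the prompt list is a toy stand-in for the hundreds of thousands of curated examples they actually used, and in reality the teacher runs on a large inference cluster rather than in a little loop like this.

```python
# Step 1: the big "teacher" (DeepSeek-R1) generates reasoning traces for a pile of prompts.
# Step 2: the smaller "student" (Llama) is fine-tuned on those (prompt, trace) pairs with
#         ordinary next-token-prediction training. Only the student's weights ever change.

import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# --- Step 1: teacher generates the distillation dataset ---------------------
teacher_name = "deepseek-ai/DeepSeek-R1"  # 671B teacher; in practice served on a cluster
tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name, torch_dtype=torch.bfloat16)

# Toy prompt list; the real dataset is hundreds of thousands of curated, filtered examples.
prompts = [
    "Prove that the sum of two even numbers is even.",
    "A train travels 120 km in 1.5 hours. What is its average speed?",
]

with open("distill_data.jsonl", "w") as f:
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        out = teacher.generate(ids, max_new_tokens=2048)  # output includes the <think> chain of thought
        f.write(json.dumps({"prompt": p, "completion": tok.decode(out[0], skip_special_tokens=True)}) + "\n")

# --- Step 2: ordinary supervised fine-tuning of the student -----------------
student_name = "meta-llama/Llama-3.3-70B-Instruct"  # base for DeepSeek-R1-Distill-Llama-70B
student = AutoModelForCausalLM.from_pretrained(student_name, torch_dtype=torch.bfloat16)
# From here it's a standard SFT run: tokenize the (prompt, completion) pairs from
# distill_data.jsonl and train the student to predict the teacher's tokens
# (e.g. with trl's SFTTrainer or a plain PyTorch training loop).
# The teacher's weights are never touched; it's purely a data generator.
```

The "upgrade" is just that: the Llama weights get nudged toward imitating R1's step-by-step outputs, while the architecture stays the same.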
Yes, I understand that this is the concept - but how is it done? Do they have some script that generates millions of calls to the model, and then they somehow put that into a structured format and update the weights? How do they do it since Llama hasn't released the weights?