Quantization of DeepSeek Models

Hey!

I was wondering, as I didn’t find an explicit answer to this:

Are the DeepSeek models quantized in any way, or are they the full models with the exact same weights as in DeepSeek’s initial release?

So are they FP8, or the full weights?


Hello, KeepingITSound.

We are running the full weights, which are FP8.


Hi again,

Thanks for the info earlier!

I’ve noticed something strange when using your API:

It often feels like the model suddenly forgets the entire conversation context, as if it’s starting from scratch, even though I’m sure I’m sending the messages correctly with the full history (see the sketch below).

Is the context window extremely limited on your end, or are you filtering out previous messages in some way?
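
For reference, here is the kind of history handling I mean (a minimal sketch, not my exact code; the endpoint URL and model name are placeholders for an OpenAI-compatible chat completions API):

```python
# Minimal sketch of sending the full history on every request.
# The endpoint URL and model name are placeholders, not the provider's real values.
import requests

API_URL = "https://api.example-provider.com/v1/chat/completions"  # placeholder
API_KEY = "sk-..."  # placeholder

history = [{"role": "system", "content": "You are a helpful assistant."}]

def chat(user_message: str) -> str:
    """Append the user turn, send the entire history, store and return the reply."""
    history.append({"role": "user", "content": user_message})
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "deepseek-v3", "messages": history},
        timeout=60,
    )
    resp.raise_for_status()
    reply = resp.json()["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply
```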

Also, since you mentioned that the models run in FP8 — wouldn’t that technically mean it’s not the most precise version of the model available? I understand it’s efficient, but it still implies some trade-off in accuracy, right?
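
To make the trade-off concrete, here is a tiny sketch of the rounding error introduced by casting values down to FP8 and back (assuming PyTorch 2.1+ with `torch.float8_e4m3fn`, the E4M3 format described in the DeepSeek-V3 report):

```python
# Tiny illustration of FP8 rounding error (requires PyTorch >= 2.1).
import torch

x = torch.randn(8, dtype=torch.float32)

# Round-trip through FP8: cast down, then back up to FP32 for comparison.
x_fp8 = x.to(torch.float8_e4m3fn)
x_roundtrip = x_fp8.to(torch.float32)

print("original: ", x)
print("after FP8:", x_roundtrip)
print("max abs error:", (x - x_roundtrip).abs().max().item())
```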

Would love to hear your thoughts on both points. Thanks in advance!


One more follow-up question:

According to the DeepSeek-V3 technical report, not everything in DeepSeek runs at 8-bit. The team specifically keeps certain components in higher precision — such as the embedding layers, final output layer, MoE gating, layer normalization, and attention softmax (using BF16 or FP32).
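
To make the question concrete, here is a rough illustration (my own sketch with made-up module names, not DeepSeek’s actual code) of the kind of per-component precision assignment the report describes:

```python
# Illustrative sketch only: made-up module names, not DeepSeek's implementation.
# The point is that some components stay in BF16/FP32 while the bulk of the
# linear layers run in FP8, per the DeepSeek-V3 technical report.
import torch

HIGH_PRECISION_KEYWORDS = (
    "embed",    # embedding layers
    "lm_head",  # final output projection
    "gate",     # MoE routing / gating
    "norm",     # layer normalization
)

def dtype_for(module_name: str) -> torch.dtype:
    """Pick a dtype for a module by name (illustrative heuristic only)."""
    if any(k in module_name.lower() for k in HIGH_PRECISION_KEYWORDS):
        return torch.bfloat16       # keep sensitive components in higher precision
    return torch.float8_e4m3fn      # everything else in FP8

for name in ["model.embed_tokens", "layers.0.mlp.gate",
             "layers.0.mlp.experts.3.up_proj", "lm_head"]:
    print(f"{name:40s} -> {dtype_for(name)}")
```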

Do you follow the same setup on your end, or is your FP8 execution applied more broadly across all components?

Would be great to understand how closely your deployment aligns with the original design choices.


@KeepingITSound

Thank you for your thoughtful questions about our model implementation. We appreciate your interest in understanding the technical details behind our API.

Regarding the conversation context, we can assure you that our model is designed to maintain context throughout the conversation. However, we do have limitations on the context window to ensure efficient processing and response times. We’re constantly working to optimize this balance between context retention and performance.

Regarding the use of FP8, you’re correct that it’s a trade-off between efficiency and precision. While FP8 allows us to achieve significant performance gains, we also recognize the potential impact on model accuracy. To mitigate this, we employ a mixed-precision approach, where certain components of the model, such as embedding layers and output layers, are maintained in higher precision (e.g., BF16 or FP32) to preserve accuracy. This approach enables us to balance efficiency with the need for reliable and accurate results.

We’ve conducted extensive testing to ensure that our deployment aligns with the original design choices and produces high-quality results that are consistent with the reference model. We’re committed to ongoing optimization and improvement to provide the best possible experience for our users.

We will look further into why you seem to be losing conversation context. Could you provide us with an example of a conversation where this occurred and a sample of the code you use to process these conversations?

-Coby

Hey, I am pretty sure now that the loss of context is just due to a small context length on your side. It happens at roughly 8K tokens, so for long chats the provider needs to be switched at a minimum.

Also, I see that in the last few days the TTFT has been quite high again in your API. Is there a time frame in which this should stabilize? It seems to have been up and down over the last few weeks.
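
Coming back to the context point, something like this rough probe is how the ~8K cutoff could be confirmed (the endpoint and model name are placeholders again, and the token counts are only rough estimates):

```python
# Rough probe for the effective context length: hide a "needle" at the start,
# pad the conversation, and check whether the model can still recall it.
# Endpoint/model names are placeholders; token counts assume ~4 chars per token.
import requests

API_URL = "https://api.example-provider.com/v1/chat/completions"  # placeholder
API_KEY = "sk-..."  # placeholder

NEEDLE = "The secret codeword is BLUEBERRY-42."
FILLER = "This sentence is only here to take up space in the context window. "

def recalls_needle(approx_tokens: int) -> bool:
    padding = FILLER * (approx_tokens * 4 // len(FILLER))  # ~4 chars per token
    messages = [
        {"role": "user", "content": NEEDLE},
        {"role": "assistant", "content": "Noted."},
        {"role": "user", "content": padding + "\n\nWhat is the secret codeword?"},
    ]
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "deepseek-v3", "messages": messages},
        timeout=120,
    )
    resp.raise_for_status()
    return "BLUEBERRY-42" in resp.json()["choices"][0]["message"]["content"]

for n in (4_000, 8_000, 16_000):
    print(f"~{n} tokens: recalled = {recalls_needle(n)}")
```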