High Time to First Token

Hey there,

I am surprised to see that, while tokens/second is still high, the time to first token is about 2 to sometimes 10 seconds with four-digit input token counts, on DeepSeek's models for example.

As I have read articles from you guys directly talking about how important it is to keep this very low, I don't know what's going on.

Does somebody have some input on that? (Of course I am talking about serverless, not a dedicated deployment, as that could be much faster with every provider.)


Hello @KeepingITSound,
Thanks for reaching out. We'll look into the issue with the time to first token and get back to you. Feel free to reach out if you have any additional details or questions in the meantime!

Regards,
Rohit

@KeepingITSound Thank you for bringing this to our attention, and also for testing and trying out SambaNova Cloud.

While our team is investigating, would you be able to share more about your use case and what you are building? Are you mostly using DeepSeek models and calling them via our API?

Hey, yeah, I do use DeepSeek models, but I would be surprised if this issue didn't arise with others, too. And yes, I am talking about the API, but the same thing can be seen in the playground, or whatever you guys call it (although I am a bit puzzled by the question; the playground surely is just an API terminal, and I already said that I am talking about serverless inference; maybe I am missing something here).

The use case is simply getting the fastest possible completion of the inference at all input token counts. With your platform we have seen great t/s speeds (once the stream has started, and only if you count the time after the first token, which to us would in the end be artificial), but we see a rapid and dramatic decline in time to first token as the input token count grows. Both metrics are crucial to us, and in some cases a very fast first token along with good t/s matters even more; if we can't get both from your platform, we can't consider it for production use.
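For reference, here is a minimal sketch of how the two metrics being discussed can be measured separately: time to first token (TTFT, dominated by prompt prefill) versus post-first-token throughput (t/s, the decode rate). The helper below is generic over any iterable of streamed chunks, so it would apply to an OpenAI-compatible streaming client as well; the `fake_stream` generator is purely a hypothetical stand-in used to simulate a prefill delay followed by steady token generation.

```python
import math
import time


def measure_stream(chunks):
    """Consume an iterable of streamed token chunks and return
    (ttft_seconds, tokens_per_second_after_first_token)."""
    start = time.perf_counter()
    first_arrival = None
    count = 0
    for _ in chunks:
        now = time.perf_counter()
        if first_arrival is None:
            first_arrival = now  # time to first token ends here
        count += 1
    end = time.perf_counter()

    ttft = (first_arrival - start) if first_arrival is not None else None
    gen_time = (end - first_arrival) if first_arrival is not None else 0.0
    # Throughput counted only after the first token, as discussed above.
    tps = (count - 1) / gen_time if count > 1 and gen_time > 0 else math.nan
    return ttft, tps


def fake_stream(prefill_delay, n_tokens, per_token_delay):
    """Hypothetical stand-in for a streaming response: a prefill
    pause, then tokens emitted at a steady rate."""
    time.sleep(prefill_delay)
    for _ in range(n_tokens):
        yield "tok"
        time.sleep(per_token_delay)


ttft, tps = measure_stream(fake_stream(0.05, 20, 0.001))
print(f"TTFT: {ttft:.3f}s, throughput after first token: {tps:.0f} tok/s")
```

The point of splitting the measurement this way is exactly the complaint above: a provider can show excellent t/s after the stream starts while still taking seconds of prefill before the first token arrives, and only measuring both exposes that.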