LLaMA 3.1 70B API Latency

Is there any way to get the latency down? I am testing with a very short message, {"role": "user", "content": "Hello"}, and seeing a 700-2400 ms delay to [DONE] using LLaMA 3.1 70B. If I ask for something more complex, the tokens/sec throughput is crazy fast, but the time to the first few chunks is still in the 700-2400 ms range.
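
Roughly, the measurement looks like this (a minimal sketch against an OpenAI-compatible streaming endpoint, not my actual script; the endpoint URL and the API-key environment variable name are assumptions):

```python
# Minimal sketch: measure time to first chunk and time to [DONE] on an
# OpenAI-compatible streaming endpoint. URL and env var name are placeholders.
import os
import time

import requests

URL = "https://api.sambanova.ai/v1/chat/completions"  # assumed endpoint

payload = {
    "model": "Meta-Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,
}
headers = {"Authorization": f"Bearer {os.environ['SAMBANOVA_API_KEY']}"}

start = time.monotonic()
first = None
with requests.post(URL, json=payload, headers=headers, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue  # skip blank SSE separator lines
        now = time.monotonic()
        if first is None:
            first = now
            print(f"time to first chunk: {(first - start) * 1000:.0f} ms")
        if line.strip() == b"data: [DONE]":
            print(f"time to DONE: {(now - start) * 1000:.0f} ms")
```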

Nathan,

Please give me a couple of starting timestamps for these larger latencies so that I can get the infrastructure team to investigate.

Sun Sep 1 06:58:02 PM EDT 2024

nathan@bart ~ $ time ./sambanova
data: {"id": "39d68f5f-4f2e-40a7-9fcd-7d0bb598c910", "object": "chat.completion.chunk", "created": 1725231407, "model": "Meta-Llama-3.1-70B-Instruct", "system_fingerprint": "fastcoe", "choices": [{"index": 0, "delta": {"content": ""}, "logprobs": null, "finish_reason": null}]}

data: {"id": "39d68f5f-4f2e-40a7-9fcd-7d0bb598c910", "object": "chat.completion.chunk", "created": 1725231407, "model": "Meta-Llama-3.1-70B-Instruct", "system_fingerprint": "fastcoe", "choices": [{"index": 0, "delta": {"content": "Hello! How can I "}, "logprobs": null, "finish_reason": null}]}

data: {"id": "39d68f5f-4f2e-40a7-9fcd-7d0bb598c910", "object": "chat.completion.chunk", "created": 1725231407, "model": "Meta-Llama-3.1-70B-Instruct", "system_fingerprint": "fastcoe", "choices": [{"index": 0, "delta": {"content": "assist "}, "logprobs": null, "finish_reason": null}]}

data: {"id": "39d68f5f-4f2e-40a7-9fcd-7d0bb598c910", "object": "chat.completion.chunk", "created": 1725231407, "model": "Meta-Llama-3.1-70B-Instruct", "system_fingerprint": "fastcoe", "choices": [{"index": 0, "delta": {"content": "you today?"}, "logprobs": null, "finish_reason": null}]}

data: {"id": "39d68f5f-4f2e-40a7-9fcd-7d0bb598c910", "object": "chat.completion.chunk", "created": 1725231407, "model": "Meta-Llama-3.1-70B-Instruct", "system_fingerprint": "fastcoe", "choices": [{"index": 0, "delta": {}, "logprobs": null, "finish_reason": "end_of_text"}]}

data: [DONE]

0.05user 0.00system 0:05.82elapsed 0%CPU (0avgtext+0avgdata 13952maxresident)k
0inputs+0outputs (0major+1103minor)pagefaults 0swaps

nathan@bart ~ $ time ./sambanova
data: {"id": "0c69da5a-fa24-47bf-8699-7786d2b966d8", "object": "chat.completion.chunk", "created": 1725231556, "model": "Meta-Llama-3.1-70B-Instruct", "system_fingerprint": "fastcoe", "choices": [{"index": 0, "delta": {"content": ""}, "logprobs": null, "finish_reason": null}]}

data: {"id": "0c69da5a-fa24-47bf-8699-7786d2b966d8", "object": "chat.completion.chunk", "created": 1725231556, "model": "Meta-Llama-3.1-70B-Instruct", "system_fingerprint": "fastcoe", "choices": [{"index": 0, "delta": {"content": "Hello! How can I "}, "logprobs": null, "finish_reason": null}]}

data: {"id": "0c69da5a-fa24-47bf-8699-7786d2b966d8", "object": "chat.completion.chunk", "created": 1725231556, "model": "Meta-Llama-3.1-70B-Instruct", "system_fingerprint": "fastcoe", "choices": [{"index": 0, "delta": {"content": "assist "}, "logprobs": null, "finish_reason": null}]}

data: {"id": "0c69da5a-fa24-47bf-8699-7786d2b966d8", "object": "chat.completion.chunk", "created": 1725231556, "model": "Meta-Llama-3.1-70B-Instruct", "system_fingerprint": "fastcoe", "choices": [{"index": 0, "delta": {"content": "you today?"}, "logprobs": null, "finish_reason": null}]}

data: {"id": "0c69da5a-fa24-47bf-8699-7786d2b966d8", "object": "chat.completion.chunk", "created": 1725231556, "model": "Meta-Llama-3.1-70B-Instruct", "system_fingerprint": "fastcoe", "choices": [{"index": 0, "delta": {}, "logprobs": null, "finish_reason": "end_of_text"}]}

data: [DONE]

0.05user 0.00system 0:01.20elapsed 5%CPU (0avgtext+0avgdata 14080maxresident)k
0inputs+0outputs (0major+1104minor)pagefaults 0swaps

Also, it looks like you are using AWS us-west-1; I am about 70 ms away with less than 2 ms of jitter. I assume AWS is just acting as a load balancer, and traffic is then routed over the internet or tunneled to your data center.
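
For what it's worth, the ~70 ms figure is just transport round-trip time; a quick way to sanity-check it is to time a bare TCP connect to the endpoint (a rough sketch; the hostname is an assumption):

```python
# Rough network check: time a bare TCP connect to the API host. One connect
# approximates connection setup plus a round trip; repeat to observe jitter.
import socket
import time

HOST = "api.sambanova.ai"  # assumed API hostname

for _ in range(5):
    start = time.monotonic()
    sock = socket.create_connection((HOST, 443), timeout=5)
    print(f"TCP connect: {(time.monotonic() - start) * 1000:.1f} ms")
    sock.close()
```

If the connect times sit near 70 ms but time to first chunk is still in the 700-2400 ms range, the gap is server-side rather than network.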

Nathan, let's take this offline.