Very inconsistent inference speeds

While on the web UI (playground) I am able to see these incredible speeds, but in my Python (FastAPI) app, when I query the SambaNova completion API (using the openai library), it just doesn't deliver as expected.

import time

# llm_client is an openai.OpenAI client pointed at the SambaNova endpoint,
# and LLAMA_70 holds the Llama 3.1 70B model name (both defined elsewhere).
start_time = time.perf_counter()
llm_response = llm_client.chat.completions.create(
    model=LLAMA_70,
    messages=[
        {
            "role": "system",
            "content": "You will now roleplay as a friend of mine named Alice that I met right now at the bus stop. You will strictly keep the conversation going by strictly ending with a question back to me.",
        },
        # body.text is the user message from the FastAPI request body
        {"role": "user", "content": body.text},
    ],
)

ai_response = llm_response.choices[0].message.content
elapsed = time.perf_counter() - start_time
print(f"Time taken to generate LLM response: {elapsed:.2f}s")

Python (FastAPI) Logs:

Time taken to generate LLM response: 7.75s
INFO:     127.0.0.1:34002 - "POST /api/ai HTTP/1.1" 200 OK
Time taken to generate LLM response: 3.51s
INFO:     127.0.0.1:56204 - "POST /api/ai HTTP/1.1" 200 OK
Time taken to generate LLM response: 4.27s
INFO:     127.0.0.1:49212 - "POST /api/ai HTTP/1.1" 200 OK
INFO:     127.0.0.1:55440 - "OPTIONS /api/ai HTTP/1.1" 200 OK
Time taken to generate LLM response: 15.49s
INFO:     127.0.0.1:55444 - "POST /api/ai HTTP/1.1" 200 OK
Time taken to generate LLM response: 5.25s
INFO:     127.0.0.1:60168 - "POST /api/ai HTTP/1.1" 200 OK
Time taken to generate LLM response: 2.14s
INFO:     127.0.0.1:47268 - "POST /api/ai HTTP/1.1" 200 OK
Time taken to generate LLM response: 3.10s
INFO:     127.0.0.1:58976 - "POST /api/ai HTTP/1.1" 200 OK

I even checked from bash using curl; the results were still inconsistent.

Meanwhile, on the playground: 0.8s.

I'm really impressed with the speed that SambaNova provides, and being that fast is crucial for my app as well. I would really love to get some insights on this. Am I missing something?

Edit: tested on Llama 3.1 8B, Llama 3.1 70B, Llama 3.2 1B, and Llama 3.2 3B
(although with the 3.2 models, the inconsistency doesn't occur as commonly).


Hello patripper,

Thanks for reaching out. We will look into this and will work on a solution. If you have any additional details, feel free to share, and we’ll keep you updated.
Thank you

Best Regards,
Rohit Vyawahare

I have experienced the same behaviour.


@patripper and @jj1 thank you for participating in the community.

This is being run off the free tier, which is a shared pool that uses batching queues. If you get queued, that can cause a delay. The 15s one is definitely an anomaly that should not occur. Do you have an exact timestamp of that one?
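
(For anyone who wants to check where the wait comes from on their side: streaming the response and logging the time to first token separates any queue wait from the generation itself. This is only a rough sketch, assuming the endpoint honours the OpenAI-style stream=True flag, and reusing the llm_client / LLAMA_70 setup from the snippet above.)

import time

start = time.perf_counter()
first_token_at = None
pieces = []

stream = llm_client.chat.completions.create(
    model=LLAMA_70,
    messages=[{"role": "user", "content": "Say hi in one sentence."}],
    stream=True,
)
for chunk in stream:
    if first_token_at is None:
        # a long gap before the first chunk points at queuing, not generation
        first_token_at = time.perf_counter()
    if chunk.choices and chunk.choices[0].delta.content:
        pieces.append(chunk.choices[0].delta.content)

total = time.perf_counter() - start
print(f"Time to first token: {first_token_at - start:.2f}s, total: {total:.2f}s")
print("".join(pieces))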

-Coby


Hey @coby.adams @rohit.vyawahare, thanks for the prompt replies!

I figured it could be a scaling issue, as I realized this didn't happen when I first tried out the API (it was afternoon IST), but it does happen regularly when I'm working on the app during 7-10 PM IST.

I didn't capture the timestamps yesterday, but since it's a recurring issue, I just ran a test with Llama 3.1 8B that includes the epoch timestamps (a sketch of the timing wrapper follows the logs):

Starting at: 1731505123.8672366
Time taken to generate LLM response: 1.17s
Ended at: 1731505125.0340307
INFO:     127.0.0.1:44114 - "POST /api/ai HTTP/1.1" 200 OK
Starting at: 1731505142.1827333
Time taken to generate LLM response: 2.08s
Ended at: 1731505144.2580693
INFO:     127.0.0.1:45998 - "POST /api/ai HTTP/1.1" 200 OK
Starting at: 1731505161.51673
Time taken to generate LLM response: 5.24s
Ended at: 1731505166.7577014
INFO:     127.0.0.1:49248 - "POST /api/ai HTTP/1.1" 200 OK
Starting at: 1731505191.460784
Time taken to generate LLM response: 13.55s
Ended at: 1731505205.0078735
INFO:     127.0.0.1:39210 - "POST /api/ai HTTP/1.1" 200 OK
Starting at: 1731505353.9976096
Time taken to generate LLM response: 3.59s
Ended at: 1731505357.5896845
INFO:     127.0.0.1:49916 - "POST /api/ai HTTP/1.1" 200 OK
Starting at: 1731505385.800643
Time taken to generate LLM response: 1.32s
Ended at: 1731505387.1212313
INFO:     127.0.0.1:37978 - "POST /api/ai HTTP/1.1" 200 OK
Starting at: 1731505411.3116236
Time taken to generate LLM response: 4.71s
Ended at: 1731505416.0256648
INFO:     127.0.0.1:53706 - "POST /api/ai HTTP/1.1" 200 OK
Starting at: 1731505446.2781637
Time taken to generate LLM response: 5.84s
Ended at: 1731505452.1166391
INFO:     127.0.0.1:55366 - "POST /api/ai HTTP/1.1" 200 OK
Starting at: 1731505486.4222147
Time taken to generate LLM response: 6.91s
Ended at: 1731505493.3284912
INFO:     127.0.0.1:46010 - "POST /api/ai HTTP/1.1" 200 OK
Starting at: 1731508042.2154524
Time taken to generate LLM response: 17.98s
Ended at: 1731508060.190687
INFO:     127.0.0.1:44168 - "POST /api/ai HTTP/1.1" 200 OK

Let me know if you need anything else from my side : )

Although I do think that, at least during the hackathon phase, queuing shouldn't be this long, as it makes it hard to build realtime AI agents (which I think is crucial for building apps with SambaNova, especially for them to shine during/after the hackathon). The purpose of my app is to let users interact with a realtime AI agent with almost no delay.

I was expecting the API to give an experience similar to the playground's actual inference speeds during the hackathon. It feels as if I'm just using an LLM with regular inference speed, like on other providers. To better advocate for the product, it would be great if you could make some changes so we can experience its full potential.

Thank you! : )

PS. I didn't realise I was on my alt account yesterday… : p

@parambirje Let's focus on these two:

Starting at: 1731505191.460784 Wednesday, November 13, 2024 5:39:51.460 AM GMT-08:00
Time taken to generate LLM response: 13.55s
Ended at: 1731505205.0078735 Wednesday, November 13, 2024 5:40:05.007 AM GMT-08:00

and

Starting at: 1731508042.2154524 Wednesday, November 13, 2024 6:27:22.215 AM GMT-08:00
Time taken to generate LLM response: 17.98s
Ended at: 1731508060.190687 Wednesday, November 13, 2024 6:27:40.190 AM GMT-08:00

Sure @coby.adams, let me know once there's an update on these.