Context Length for the Meta-Llama-3.1-405B-Instruct is too small

From Customer:

"Hey there,

I was looking forward to using the SambaNova API for my research, but it seems that the context length is quite limited for the models in my case: for Meta-Llama-3.1-405B-Instruct it is just 4096. If there is a way to access these models with a longer context length, could you please help me out? Also, what is the rate limit for the key?

Thank you for your time."

9 Likes

Teams are working hard to get this improved, and we will share news when we have updates.

7 Likes

As a developer/researcher in the legaltech area, the minimum I need as input is 64k tokens.

I’m working on a system that aims to be more efficient than knowledge graphs in the legal field, and SambaNova would be excellent for this. An output of 4k tokens and up would be ideal for providing the solid answers needed to leverage lawyers’ work.

Thanks guys!

4 Likes

Hello sir,

I would also like to know the inference rate limits of the different models hosted by SambaNova FastAPI. The quick start guide only mentions: “If you need higher rate limits, please reach out to help@sambanova.ai and a team member will discuss options with you.”
https://community.sambanova.ai/t/quick-start-guide/104#p-113-rate-limits-6

It would be nice to know the current rate and token limits for the models, so that developers can set expectations about which AI app functions are suitable for testing.

Thank you so much.
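
While waiting for official numbers, one defensive pattern is to retry on rate-limit responses with exponential backoff. A minimal sketch, assuming the OpenAI-compatible Python client (which raises RateLimitError on HTTP 429); the base URL, key placeholder, and model name are illustrative:

```python
# Retry on rate-limit (HTTP 429) responses with exponential backoff.
# The actual limits are whatever the service enforces for your key.
import time
from openai import OpenAI, RateLimitError

client = OpenAI(base_url="https://api.sambanova.ai/v1", api_key="YOUR_API_KEY")

def chat_with_backoff(messages, model="Meta-Llama-3.1-8B-Instruct", retries=5):
    for attempt in range(retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except RateLimitError:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("still rate limited after retries")
```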

3 Likes

Meijisnack,

I have created an action for someone from the sales staff to reach out to you. Please let me know if you do not hear from them shortly.

To make sure I understand: do you need a single input context length of 64k, or can it be batched into multiple chunks of, say, 8k?

1 Like

Ideally, I need a single input context length of at least 64k tokens. Currently, my minimum working input starts at 128k tokens.

While batching in chunks of 8k is a possibility, it could introduce latency and other complexities that make it less viable than solutions that support longer contexts. Models like Gemini Flash and GPT-4/mini are significantly more efficient in this regard because they can handle longer inputs without fragmentation.
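
To make that trade-off concrete, here is a rough sketch of the chunked-input workaround, assuming the OpenAI-compatible SN Cloud endpoint; the 4-characters-per-token estimate, the 8k budget, the model name, and the summarization prompt are all illustrative rather than exact.

```python
# Rough sketch of splitting a long document into ~8k-token chunks and making
# one request per chunk. The 4 chars/token heuristic is a crude stand-in for
# a real tokenizer; chunk size, model name, and prompt are illustrative.
from openai import OpenAI

client = OpenAI(base_url="https://api.sambanova.ai/v1", api_key="YOUR_API_KEY")

CHARS_PER_TOKEN = 4      # rough heuristic only
CHUNK_TOKENS = 8_000     # per-request input budget under an 8k context

def chunk_text(text: str, chunk_tokens: int = CHUNK_TOKENS) -> list[str]:
    """Split text into pieces of roughly chunk_tokens tokens each."""
    step = chunk_tokens * CHARS_PER_TOKEN
    return [text[i : i + step] for i in range(0, len(text), step)]

def summarize_in_chunks(document: str) -> list[str]:
    """One request per chunk; adds latency and loses cross-chunk context."""
    answers = []
    for chunk in chunk_text(document):
        resp = client.chat.completions.create(
            model="Meta-Llama-3.1-70B-Instruct",
            messages=[{"role": "user", "content": f"Summarize this excerpt:\n{chunk}"}],
        )
        answers.append(resp.choices[0].message.content)
    return answers
```

Every extra round trip adds latency and drops cross-chunk context, which is exactly the fragmentation cost described above.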

2 Likes

Daslav, I will submit an enhancement request for this.

1 Like

SambaNova, it is time to step up from the 8k context and pull away from the pack.

If you (& your open source cohort) want to take on GPT-x and other closed source LLMs, then you have to help us achieve 128k context.

Llama 3.x is proving competitive. And we appreciate the speed. But now, our use cases are calling for context.

If you keep us at 8k indefinitely, then our usage is going to peter out and GPT-x will leave you in the dust.

1 Like

Meta-Llama-3.2-3B-Instruct

Requested generation length 1600 is not possible! The provided prompt is 2756 tokens long, so generating 1600 tokens requires a sequence length of 4356, but the maximum supported sequence length is just 4096!


Please note that even the smallest model has this limit; I'm unable to perform a code benchmark with 2-3 reference files. Thanks.
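
For reference, the arithmetic behind that error (2756 + 1600 = 4356 > 4096) can be guarded against client-side. A minimal sketch, assuming you can count the prompt's tokens with whatever tokenizer you have available:

```python
# Clamp the requested generation length so prompt + output fits inside the
# model's maximum sequence length (4096 in the error above).
MAX_SEQ_LEN = 4096

def clamp_max_tokens(prompt_tokens: int, requested: int,
                     max_seq_len: int = MAX_SEQ_LEN) -> int:
    available = max_seq_len - prompt_tokens
    if available <= 0:
        raise ValueError("prompt alone exceeds the maximum sequence length")
    return min(requested, available)

# The prompt from the error is 2756 tokens, so at most 1340 can be generated.
print(clamp_max_tokens(prompt_tokens=2756, requested=1600))  # -> 1340
```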

1 Like

@nikhilswami1 @stu and @daslav.rm

I know it is not 100 percent what you are looking for, but so far we have expanded context lengths on Llama 3.1 8B and 70B. As we expand further, we will announce it in the release notes as changes are pushed to production.

-Coby

3 Likes

@coby Awesome progress; thanks for listening!

1 Like

A significant update regarding context windows indeed; this works for now.
The reason I requested the smaller 3B and 1B models was for research: to see the limits of token speed and whether they can handle 2-3 basic tasks in one call, effectively reducing total calls (multiplexing, as they call it).
Wondering what an 8-bit quantized Llama 3.1 405B will look like! Another world record, maybe, haha.
Also, an urgent request for the developer tier: my consultancy is not an enterprise, so I'm a bit hesitant to contact sales. Bill me there :wave:.
Thanks
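
To illustrate the multiplexing idea, a hedged sketch of bundling a few small tasks into one request and asking for structured output; the model name, tasks, and JSON shape are illustrative, and the parse can fail if the model adds extra text.

```python
# Sketch of "multiplexing": packing 2-3 small tasks into a single request to
# reduce total API calls. Model name, tasks, and output format are illustrative.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.sambanova.ai/v1", api_key="YOUR_API_KEY")

tasks = [
    "1. Classify the sentiment of: 'The ruling was overturned on appeal.'",
    "2. Extract all dates from: 'Filed 2021-03-04, decided 2022-07-19.'",
    "3. Translate to French: 'The contract is void.'",
]

prompt = (
    "Answer each numbered task separately and return only a JSON object "
    'shaped like {"1": ..., "2": ..., "3": ...}.\n\n' + "\n".join(tasks)
)

resp = client.chat.completions.create(
    model="Meta-Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": prompt}],
)
answers = json.loads(resp.choices[0].message.content)  # may need error handling
print(answers)
```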

1 Like

Hi,

Appreciate your feedback, thanks.
Context lengths have now been increased, and they are also handled automatically behind the scenes.

Llama 3.1 8B model: max sequence length increased from 8k to 16k
Llama 3.1 70B model: max sequence length increased from 8k to 64k

The SN Cloud backend will automatically route requests based on the length of the sequence submitted. Therefore, you no longer need to change the model name to specify different sequence lengths. For example, just call Meta-Llama-3.1-70B-Instruct.

Thanks & regards,
Scott
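
For anyone following along, a minimal sketch of what the automatic routing means in practice, assuming the OpenAI-compatible SN Cloud endpoint: the plain model name is enough, and the backend picks the deployment based on the submitted sequence length. The base URL, key placeholder, and file name are illustrative.

```python
# With automatic routing, only the plain model name is needed; the backend
# selects the appropriate sequence-length deployment from the request size.
from openai import OpenAI

client = OpenAI(base_url="https://api.sambanova.ai/v1", api_key="YOUR_API_KEY")

with open("long_document.txt") as f:   # e.g. tens of thousands of tokens
    long_prompt = f.read()

resp = client.chat.completions.create(
    model="Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": f"Summarize:\n{long_prompt}"}],
)
print(resp.choices[0].message.content)
```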

2 Likes

@nikhilswami1 we do have some pay-as-you-go options that sales can present to you until such time as the developer tier is available.

That is certainly encouraging. Thanks.

But you still haven’t addressed the original question. What is the context length for the crown jewel, the Llama 3.1 405B model?! If the model supports a 128K context window but you’re limiting it to a much smaller size, is this due to technical limitations or financial considerations?

If you cannot offer a bigger context window on the free tier, could you please offer developers a pay-as-you-go option?

2 Likes

+1 same question here.

1 Like

@AI-Developer, @stonebeard2002,

Thanks both for the further questions around context length, always appreciated.

We like to announce things that are live rather than upcoming. There are further improvements to this offering in the pipeline - this is all we can reveal just now.

Further, the dev tier is coming soon. For now, as my colleague @coby.adams mentioned, we can arrange a conversation with the sales team to discuss interim PAYG options as required.

@stonebeard2002 - welcome to the community :slight_smile:

1 Like