I was looking forward to using the SambaNova API for my research, but it seems that the context length is quite limited for the models; in my case, for Meta-Llama-3.1-405B-Instruct it is just 4096. If there is a way to access these models with a longer context length, could you please help me out? Also, what is the rate limit for the key?
As a developer/researcher in the legaltech area, the minimum I need as input is 64k tokens.
I’m working with a system that aims to be more efficient than using knowledge graphs in the legal field, and SambaNova would be excellent for this. An output of 4k tokens or more would be ideal to provide the solid answers needed to support lawyers’ work.
I would also like to know about the inference rate limits of the different models hosted on SambaNova Fast API. The quick start guide only mentions “If you need higher rate limits, please reach out to help@sambanova.ai and a team member will discuss options with you.” https://community.sambanova.ai/t/quick-start-guide/104#p-113-rate-limits-6
It would be nice to know the current rate and token limits for the hosted models, so that developers can set expectations about which AI app functions are suitable for testing.
Ideally, I need a single input context length of at least 64k tokens. Currently, my minimum working input starts at 128k tokens.
While batching in chunks of 8k is a possibility, it could introduce latency and other complexities that might make it less viable compared to solutions that support longer contexts. Models like Gemini Flash and GPT-4/mini are significantly more efficient in this regard because they can handle longer inputs without fragmentation.
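To make the trade-off concrete, here is a minimal sketch of that chunked approach, assuming the OpenAI-compatible endpoint from the quick start guide; the base URL, the rough 4-characters-per-token estimate, and the summarisation prompt are illustrative assumptions, not part of the SambaNova API.

```python
# Sketch only: chunked processing when the document exceeds the context
# window. Assumes an OpenAI-compatible endpoint (per the quick start guide);
# chunk size, token estimate and prompt wording are illustrative.
from openai import OpenAI

client = OpenAI(base_url="https://api.sambanova.ai/v1", api_key="YOUR_KEY")

CHARS_PER_TOKEN = 4      # crude approximation, not the model's tokenizer
CHUNK_TOKENS = 8_000     # the 8k chunk size mentioned above

def chunks(text: str, chunk_tokens: int = CHUNK_TOKENS):
    """Yield character slices that roughly fit the chunk budget."""
    step = chunk_tokens * CHARS_PER_TOKEN
    for start in range(0, len(text), step):
        yield text[start:start + step]

def summarise(document: str) -> list[str]:
    """One request per chunk: every extra call adds latency, as noted above."""
    partials = []
    for piece in chunks(document):
        resp = client.chat.completions.create(
            model="Meta-Llama-3.1-70B-Instruct",
            messages=[{"role": "user",
                       "content": f"Summarise the key legal points:\n\n{piece}"}],
        )
        partials.append(resp.choices[0].message.content)
    return partials
```

Each chunk is an independent request, so cross-chunk references are lost unless the partial answers are stitched together afterwards, which is exactly the fragmentation cost described above.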
Requested generation length 1600 is not possible! The provided prompt is 2756 tokens long, so generating 1600 tokens requires a sequence length of 4356, but the maximum supported sequence length is just 4096!
Please note that the smallest model also has this limit; I'm unable to run a code benchmark with 2-3 reference files. Thanks.
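For anyone hitting the same error, here is a small sketch of capping the requested generation so that prompt plus output stays under the advertised limit; the 4096 figure simply mirrors the error above, and the token estimate is a rough heuristic rather than the model's real tokenizer.

```python
# Sketch: keep prompt tokens + max_tokens under the model's sequence limit.
MAX_SEQ_LEN = 4096   # limit quoted in the error message above
SAFETY_MARGIN = 64   # headroom for special/formatting tokens

def estimate_tokens(text: str) -> int:
    # Very rough heuristic (~4 characters per token); the model's own
    # tokenizer would give an exact count.
    return max(1, len(text) // 4)

def safe_max_tokens(prompt: str, requested: int) -> int:
    available = MAX_SEQ_LEN - estimate_tokens(prompt) - SAFETY_MARGIN
    if available <= 0:
        raise ValueError("Prompt alone exceeds the supported sequence length")
    return min(requested, available)

# A 2756-token prompt leaves at most 4096 - 2756 = 1340 tokens to generate,
# so requesting 1600 fails exactly as in the error above.
```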
I know it is not 100 percent what you are looking for, but we have expanded context lengths so far on Llama 3.1 8B and 70B. As we expand further, we will announce it in the release notes as we push to production.
A significant update regarding context windows indeed. This works for now.
The reason I requested the smaller 3B and 1B models was for research: to test the limits of token speed and to see whether they can handle 2-3 basic tasks in one call, effectively reducing total calls (multiplexing, as they call it; see the sketch below).
Wondering what an 8-bit quantized Llama 3.1 405B will look like! Another world record maybe, haha.
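To illustrate the multiplexing idea mentioned above, here is a rough sketch of bundling a few small tasks into one request and splitting the reply; the endpoint, the 8B model name, and the numbered-answer convention are assumptions for illustration, and whether a small model follows the format reliably is exactly what such an experiment would measure.

```python
# Sketch of "multiplexing": several small tasks in one call to reduce the
# total number of requests. Endpoint, model name and output format are
# illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="https://api.sambanova.ai/v1", api_key="YOUR_KEY")

tasks = [
    "1) Classify the sentiment of: 'The ruling was unexpectedly favourable.'",
    "2) Extract the defendant's name from: 'Smith v. Jones, 2023.'",
    "3) Translate to French: 'The contract is void.'",
]

prompt = ("Answer each numbered task separately. "
          "Prefix each answer with its number on its own line.\n\n"
          + "\n".join(tasks))

resp = client.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct",  # assumed name of the hosted 8B model
    messages=[{"role": "user", "content": prompt}],
)

# Naive split by numbered prefix; relies on the model honouring the format.
for line in resp.choices[0].message.content.splitlines():
    print(line)
```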
Urgent request for a developer tier: my consultancy is not enterprise-scale, so I'm a bit hesitant to contact sales. Bill me there.
Thanks.
Appreciate your feedback, thanks.
Context lengths have now been increased and are automatically handled behind the scenes.
Llama 3.1 8B model: max sequence length increased from 8k to 16k
Llama 3.1 70B model: max sequence length increased from 8k to 64k
The SN Cloud backend will automatically route requests based on the length of the sequence submitted. Therefore, you no longer need to change the model name to specify different sequence lengths. For example, just call Meta-Llama-3.1-70B-Instruct.
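As an illustration, here is a minimal call sketch, assuming the OpenAI-compatible endpoint described in the quick start guide; the routing to the 8k or 64k deployment happens server-side, so the request looks the same either way.

```python
# Minimal sketch: one model name regardless of input length, since the
# SN Cloud backend routes by the submitted sequence length. Endpoint and
# API key handling are assumptions based on the quick start guide.
from openai import OpenAI

client = OpenAI(base_url="https://api.sambanova.ai/v1", api_key="YOUR_KEY")

long_prompt = "..."  # anything up to the 64k window for the 70B model

resp = client.chat.completions.create(
    model="Meta-Llama-3.1-70B-Instruct",  # no -8k / -64k variant needed
    messages=[{"role": "user", "content": long_prompt}],
)
print(resp.choices[0].message.content)
```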
But you still haven’t addressed the original question: what is the context length for the crown jewel, the Llama 3.1 405B model? If the model supports a 128K context window but you are limiting it to a much smaller size, is this due to technical limitations or financial considerations?
If you cannot offer a larger context window on the free tier, could you please offer developers a pay-as-you-go option?
Thanks both for the further questions around context length, always appreciated.
We like to announce things that are live rather than upcoming. There are further improvements to this offering in the pipeline - this is all we can reveal just now.
Further, the dev tier is coming soon. For now, as my colleague @coby.adams mentioned, we can arrange a conversation with the sales team to discuss interim PAYG options as required.