Enhancement to the rate limit message

Dear team,

I think it would be great to see the reason for the rate limit. Right now only a generic rate limit error is returned; since you already maintain the rate limits, you could give back more detail than just this message:

{'error': {'code': None, 'message': 'Rate limit exceeded', 'param': None, 'type': 'rate_limit_exceeded'}}

I think this would enhance the developer experience.
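For illustration, a richer payload could look something like the following (the limit_type, limit, remaining and retry_after_seconds fields are just my assumptions about what would be useful, not the actual API):

{'error': {'code': None,
           'type': 'rate_limit_exceeded',
           'message': 'Rate limit exceeded: 60 requests per minute',
           'limit_type': 'requests_per_minute',
           'limit': 60,
           'remaining': 0,
           'retry_after_seconds': 12,
           'param': None}}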


Just one more comment: this may also be a good candidate for the usage screen on the cloud.


@hello1

Sorry for the delayed response. @seth.kneeland has opened an enhancement request with our product team for this.

-Coby

Thanks, Coby! I have recently come back to my development work with LLMs and started using more of the services.
One question I keep coming back to: if the services are paid, why are there such rate limits?

I think the low limits break the experience of the platform.

When I use an LLM for translation, I might have 10-30,000 calls in total until the job is complete, and after that only about 1-5 minutes of LLM-specific usage, but the initial run might take a few weeks with the current limits.
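(Just to illustrate the scale with made-up numbers, not the actual limits: if a model were capped at, say, 2,000 requests per day, 30,000 calls would already take 15 days, which is where the "few weeks" estimate comes from.)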

@hello1

These are developer-tier rate limits; we do offer higher rate limits in enterprise tiers. May I ask which models and which specific RPM/RPH/RPD you think you will require for each model? What would you need as a maximum TTFT and total latency for each model? What would be the maximum context lengths you would require for the same? Could you build your ideal matrix of these requirements for your application(s)?

-Coby

I'm mostly working with Llama 3.1 405B, as it is the only model that gives me accurate answers.
For the translation function, a 4K context length is sufficient.

The RPM/RPH/RPD figures depend on how the model is used, so let me describe the requirements in words instead. The translation engine extracts language-relevant fields from an HTML page that is structured for static text extraction, so for the extraction itself I don't need an LLM.
The pages typically contain 5-50 text blocks for now (at most 200-300 for complex functions).

Normally I translate into one language, but if it is a global function, it multiplies across all the enabled languages, which is about 100 languages. (The language the browser uses is translated in the first round and returned to the user; all the others are translated in the background for faster use later.)

I built the calls sequentially because of the current rate limits; parallelism doesn't help at such volumes. But if more requests could be handled, it would improve the user experience, since the translation for a screen would complete sooner. I hope these specifications help you understand the requirements. For translation I use LLMs because they can also handle context that M4T models don't use.
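As a rough sketch of the flow I described (translate_block and RateLimitError are stand-ins I made up for illustration, not any vendor's actual SDK):

import time

class RateLimitError(Exception):
    """Stand-in for the service's 'rate_limit_exceeded' error."""

def translate_block(text, target_lang):
    # Stub: a real implementation would call the LLM here.
    return f"[{target_lang}] {text}"

def translate_page(blocks, browser_lang, enabled_langs, retry_delay=5.0):
    # Browser language first so it can be returned to the user right away;
    # the remaining ~100 languages are handled afterwards (in the real
    # system, in the background).
    ordered = [browser_lang] + [l for l in enabled_langs if l != browser_lang]
    results = {}
    for lang in ordered:              # sequential over the languages
        translated = []
        for block in blocks:          # 5-50 text blocks per page
            while True:
                try:
                    translated.append(translate_block(block, lang))
                    break
                except RateLimitError:
                    time.sleep(retry_delay)   # back off, then retry
        results[lang] = translated
    return results

# Example: German (the browser language) comes back first, the rest follow.
page = translate_page(['Home', 'Contact us'], 'de', ['de', 'fr', 'hu'])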


@hello1

I'm discussing this internally. So 3.1 405B is getting you better results than 3.3 70B?

-Coby

Way better! I expected 3.3 to be better, but the 405B results are the precise ones.