[Feature Request] Auto-throttling the embedding generation speed thru the use of x-ratelimit-* headers #381

Closed
0x7c13 opened this issue Mar 23, 2024 · 1 comment
Labels
enhancement New feature or request

Comments


0x7c13 commented Mar 23, 2024

Context / Scenario

I was trying to ingest a large (26 MB) PDF file using a Serverless KM instance locally the other day and found that it took a really long time for the indexing/embedding to complete. I tried profiling the code and realized that the actual extraction step finishes quickly.

The reason it takes so long is that GenerateEmbeddingsHandler calls the ITextEmbeddingGenerator in a sequential foreach loop. We could theoretically convert the existing code to use Parallel.ForEach instead and drastically improve the embedding speed, since the embeddings for the partition files are not logically coupled.

Example:

ConcurrentDictionary<string, DataPipeline.GeneratedFileDetails> newFiles = new();

// Parallel.ForEachAsync (rather than Parallel.ForEach with an async lambda,
// which would produce fire-and-forget async-void delegates) awaits each
// partition's embedding generation concurrently.
await Parallel.ForEachAsync(uploadedFile.GeneratedFiles,
    new ParallelOptions { MaxDegreeOfParallelism = ... },
    async (generatedFile, cancellationToken) =>
    {
        ...
        newFiles.TryAdd(embeddingFileName, embeddingFileNameDetails);
    });

However, although this works for me, it is still not an ideal solution, since both OpenAI and Azure OpenAI have built-in rate limiters that prevent clients from abusing the endpoint.

The point is that even without converting the code to Parallel.ForEach, we could still see 429 errors, because there is no guarantee we stay under the rate limit without knowing the service's current state, especially when multiple KM instances run at the same time and potentially call the embedding API concurrently.

The problem

We could implement our own GenerateEmbeddingsHandler, or even a better ITextEmbeddingGenerator implementation, that does parallel embedding and handles 429 errors through exponential-backoff retries. But this is still not ideal, since we would need to carefully configure the KM instance (or multiple KM instances) with the maximum TPM available for the model or embedding service provider at any given moment.
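For illustration, the exponential-retry approach mentioned above could look roughly like the sketch below (generateAsync is a hypothetical delegate wrapping the actual embedding call, and the backoff numbers are arbitrary):

using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Illustrative only: retry an arbitrary embedding call with exponential backoff on 429.
static async Task<T> WithRetriesAsync<T>(Func<Task<T>> generateAsync, int maxAttempts = 5)
{
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            return await generateAsync().ConfigureAwait(false);
        }
        catch (HttpRequestException ex) when (
            ex.StatusCode == HttpStatusCode.TooManyRequests && attempt < maxAttempts)
        {
            // 429: back off 1s, 2s, 4s, ... before trying again.
            await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt - 1))).ConfigureAwait(false);
        }
    }
}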

Luckily, both the OpenAI service and the Azure OpenAI service return rate-limiting information in the response headers of the Chat and Embedding REST APIs:

FIELD                           | SAMPLE VALUE | DESCRIPTION
x-ratelimit-limit-requests      | 60           | The maximum number of requests permitted before the rate limit is exhausted.
x-ratelimit-limit-tokens        | 150000       | The maximum number of tokens permitted before the rate limit is exhausted.
x-ratelimit-remaining-requests  | 59           | The number of requests remaining before the rate limit is exhausted.
x-ratelimit-remaining-tokens    | 149984       | The number of tokens remaining before the rate limit is exhausted.
x-ratelimit-reset-requests      | 1s           | The time until the request-based rate limit resets to its initial state.
x-ratelimit-reset-tokens        | 6m0s         | The time until the token-based rate limit resets to its initial state.

So, in theory, we could use this per-response information to decide when to scale the embedding speed up or down, making sure we use the service at maximum throughput without abusing it. It would also be extremely useful when multiple KM instances run at the same time: each instance learns the current rate-limit state from its own responses, so we don't need to propagate that knowledge across the distributed KMs.
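To make this concrete, here is a minimal sketch of such a throttle. It is my own illustration, not an existing KM/SK type; the reset header values such as "1s" or "6m0s" would need to be parsed into a TimeSpan before being fed in.

using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical throttle: UpdateFromHeaders is called with values parsed from
// the x-ratelimit-* response headers after each request.
public sealed class RateLimitThrottle
{
    private readonly object _lock = new();
    private long _remainingTokens = long.MaxValue;
    private TimeSpan _resetTokens = TimeSpan.Zero;

    public void UpdateFromHeaders(long remainingTokens, TimeSpan resetTokens)
    {
        lock (this._lock)
        {
            this._remainingTokens = remainingTokens;
            this._resetTokens = resetTokens;
        }
    }

    // Wait before the next request if the remaining token budget is too small
    // to cover it. 'estimatedTokens' is the caller's estimate for the next call.
    public async Task WaitIfNeededAsync(long estimatedTokens, CancellationToken ct = default)
    {
        TimeSpan delay = TimeSpan.Zero;
        lock (this._lock)
        {
            if (this._remainingTokens < estimatedTokens)
            {
                delay = this._resetTokens;
            }
        }

        if (delay > TimeSpan.Zero)
        {
            await Task.Delay(delay, ct).ConfigureAwait(false);
        }
    }
}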

We probably don't need to go this far for the Chat APIs or chat use cases, but it is very applicable and valuable for embedding scenarios.

Proposed solution

Here are the things that would need to be implemented to achieve what I described above, if we decide to do it:

  • Expose the x-ratelimit-* headers in the OpenAIClientCore class in Microsoft.SemanticKernel.Connectors.OpenAI for the Embedding APIs (nice to have for the Chat APIs as well) => this requires changes in the SK repo (but I know @dluc you are the architect for both SK and KM, so I am not going to create a new issue there :)).
  • Surface the above headers/info in OpenAITextEmbeddingGenerationService and AzureOpenAITextEmbeddingGenerationService.
  • From here we could take different approaches:
    • Plan A: implement the rate-limiting logic in the GenerateEmbeddingsAsync API of the TextEmbeddingGenerationService itself, using the x-ratelimit-* information. Basically, it should wait for some time before actually invoking the OpenAI API if there aren't many tokens remaining or the current request rate is too high. Then we could safely convert the existing foreach loop in GenerateEmbeddingsHandler into a Parallel.ForEach loop (a rough sketch follows this list).
    • Plan B: instead of implementing the logic inside the GenerateEmbeddingsAsync API, implement the rate-limiting logic inside GenerateEmbeddingsHandler, keeping the GenerateEmbeddingsAsync API lightweight. This approach requires a rewrite of the embedding logic, ideally converting the foreach loop into a queue-based ingestion loop whose flow speed is controlled by the x-ratelimit-* information.
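
As a rough illustration of Plan A, the handler could combine a throttle like the one sketched earlier with a parallel loop. This is not an actual KM implementation: EstimateTokens, GenerateOneAsync, and the result shape are hypothetical stand-ins.

// Plan A sketch: throttle each parallel embedding call using the rate-limit
// information surfaced from previous responses. Helper names are hypothetical.
var throttle = new RateLimitThrottle();
ConcurrentDictionary<string, DataPipeline.GeneratedFileDetails> newFiles = new();

await Parallel.ForEachAsync(
    uploadedFile.GeneratedFiles,
    new ParallelOptions { MaxDegreeOfParallelism = 4, CancellationToken = cancellationToken },
    async (generatedFile, ct) =>
    {
        // Rough token estimate for this partition; a real implementation would
        // use the tokenizer configured for the embedding model.
        long estimatedTokens = EstimateTokens(generatedFile);
        await throttle.WaitIfNeededAsync(estimatedTokens, ct);

        // Hypothetical call that generates and stores the embedding and returns
        // the x-ratelimit-* values read from the response headers.
        var result = await GenerateOneAsync(generatedFile, ct);
        throttle.UpdateFromHeaders(result.RemainingTokens, result.ResetTokens);

        newFiles.TryAdd(result.EmbeddingFileName, result.Details);
    });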

Importance

would be great to have

@0x7c13 0x7c13 added the enhancement New feature or request label Mar 23, 2024

dluc commented Mar 23, 2024

hi @0x7c13

We could theoretically convert the existing code to use Parallel.ForEach instead and drastically improve the embedding speed, since the embeddings for the partition files are not logically coupled.

there's an open PR here you might be interested in: #147. However, handling 429s with parallel calls would require a centralized rate limiter. I'm not so sure it's worth the complexity of parallelizing if we then have to slow down because of throttling.

I agree about handling the headers returned by OpenAI and Azure, and I think the first step would be improving the SK embedding generation service, so everyone would benefit, including KM. The SK clients internally use the Azure SDK, and there's a chance it's only a few lines of code to implement a retry strategy based on those headers. In scenarios where the app disables any retry strategy, asking for a single attempt, we might need to surface the internal HTTP exception that includes all the details.

On a side note, this might already be possible by injecting a custom HTTP client, configured with handlers to follow the retry headers.
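
A minimal sketch of that idea (assuming the connector accepts an injected HttpClient; the handler name and retry numbers are mine, and this hasn't been validated against the SK connectors):

using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Sketch only: on 429, wait and retry based on the response headers.
// Retrying the same HttpRequestMessage from a DelegatingHandler works for
// buffered request content, such as the JSON payloads these connectors send.
public sealed class RateLimitHeaderHandler : DelegatingHandler
{
    private const int MaxAttempts = 5;

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        for (var attempt = 1; ; attempt++)
        {
            HttpResponseMessage response = await base.SendAsync(request, cancellationToken).ConfigureAwait(false);
            if (response.StatusCode != HttpStatusCode.TooManyRequests || attempt >= MaxAttempts)
            {
                return response;
            }

            // Prefer Retry-After when present; a fuller implementation would also
            // parse x-ratelimit-reset-tokens values such as "1s" or "6m0s".
            TimeSpan delay = response.Headers.RetryAfter?.Delta ?? TimeSpan.FromSeconds(10);
            response.Dispose();
            await Task.Delay(delay, cancellationToken).ConfigureAwait(false);
        }
    }
}

It would be wired up with something like new HttpClient(new RateLimitHeaderHandler { InnerHandler = new HttpClientHandler() }) and passed to the connector where an HttpClient parameter is supported.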

/cc @markwallace-microsoft

@microsoft microsoft locked and limited conversation to collaborators Jun 5, 2024
@dluc dluc converted this issue into discussion #614 Jun 5, 2024
