[Feature Request] Auto-throttling the embedding generation speed thru the use of x-ratelimit-* headers #381

Closed
0x7c13 opened this issue Mar 23, 2024 · 1 comment
Labels
enhancement New feature or request

Comments


0x7c13 commented Mar 23, 2024

Context / Scenario

I was trying to ingest a large (26 MB) PDF file using a Serverless KM instance locally the other day and found that it took a really long time for the indexing/embedding to complete. I tried profiling the code and realized that the actual extraction step finishes quickly.

The reason it takes so long is that GenerateEmbeddingsHandler calls the ITextEmbeddingGenerator in a sequential foreach loop. We could theoretically convert the existing code to use Parallel.ForEach instead and drastically improve the embedding speed, since the embeddings for the partition files are not logically coupled.

Example:

ConcurrentDictionary<string, DataPipeline.GeneratedFileDetails> newFiles = new();

// Parallel.ForEachAsync (rather than Parallel.ForEach with an async lambda,
// which would produce fire-and-forget async-void delegates) awaits each
// partition's embedding generation concurrently.
await Parallel.ForEachAsync(uploadedFile.GeneratedFiles,
    new ParallelOptions { MaxDegreeOfParallelism = ... },
    async (generatedFile, cancellationToken) =>
    {
        ...
        newFiles.TryAdd(embeddingFileName, embeddingFileNameDetails);
    });

However, although this works for me, it is still not an ideal solution, since both OpenAI and Azure OpenAI have built-in rate limiters that prevent clients from abusing the endpoint.

The point is that even without converting the code to Parallel.ForEach, we could still see 429 errors, because there is no guarantee we stay under the rate limit without knowing the service's current state, especially when multiple KM instances run at the same time and potentially call the embedding API concurrently.

The problem

We could implement our own GenerateEmbeddingsHandler, or even a better ITextEmbeddingGenerator implementation, that does parallel embedding and handles 429 errors through exponential-backoff retries. But this is still not ideal, since we would need to carefully configure the KM instance (or multiple KM instances) with the maximum TPM available for the model or embedding service provider at any given moment.
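For illustration, the exponential-retry approach mentioned above could look roughly like the sketch below (generateAsync is a hypothetical delegate wrapping the actual embedding call, and the backoff numbers are arbitrary):

using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Illustrative only: retry an arbitrary embedding call with exponential backoff on 429.
static async Task<T> WithRetriesAsync<T>(Func<Task<T>> generateAsync, int maxAttempts = 5)
{
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            return await generateAsync().ConfigureAwait(false);
        }
        catch (HttpRequestException ex) when (
            ex.StatusCode == HttpStatusCode.TooManyRequests && attempt < maxAttempts)
        {
            // 429: back off 1s, 2s, 4s, ... before trying again.
            await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt - 1))).ConfigureAwait(false);
        }
    }
}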

Luckily, both the OpenAI service and the Azure OpenAI service return rate-limiting information in the response headers of the Chat and Embedding REST APIs:

FIELD                           | SAMPLE VALUE | DESCRIPTION
x-ratelimit-limit-requests      | 60           | The maximum number of requests permitted before the rate limit is exhausted.
x-ratelimit-limit-tokens        | 150000       | The maximum number of tokens permitted before the rate limit is exhausted.
x-ratelimit-remaining-requests  | 59           | The number of requests remaining before the rate limit is exhausted.
x-ratelimit-remaining-tokens    | 149984       | The number of tokens remaining before the rate limit is exhausted.
x-ratelimit-reset-requests      | 1s           | The time until the request-based rate limit resets to its initial state.
x-ratelimit-reset-tokens        | 6m0s         | The time until the token-based rate limit resets to its initial state.

So, in theory, we could use this per-response information to decide when to scale the embedding speed up or down, making sure we use the service at maximum throughput without abusing it. It would also be extremely useful when multiple KM instances run at the same time: each instance learns the current rate-limit state from its own responses, so we don't need to propagate that knowledge across the distributed KMs.
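To make this concrete, here is a minimal sketch of such a throttle. It is my own illustration, not an existing KM/SK type; the reset header values such as "1s" or "6m0s" would need to be parsed into a TimeSpan before being fed in.

using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical throttle: UpdateFromHeaders is called with values parsed from
// the x-ratelimit-* response headers after each request.
public sealed class RateLimitThrottle
{
    private readonly object _lock = new();
    private long _remainingTokens = long.MaxValue;
    private TimeSpan _resetTokens = TimeSpan.Zero;

    public void UpdateFromHeaders(long remainingTokens, TimeSpan resetTokens)
    {
        lock (this._lock)
        {
            this._remainingTokens = remainingTokens;
            this._resetTokens = resetTokens;
        }
    }

    // Wait before the next request if the remaining token budget is too small
    // to cover it. 'estimatedTokens' is the caller's estimate for the next call.
    public async Task WaitIfNeededAsync(long estimatedTokens, CancellationToken ct = default)
    {
        TimeSpan delay = TimeSpan.Zero;
        lock (this._lock)
        {
            if (this._remainingTokens < estimatedTokens)
            {
                delay = this._resetTokens;
            }
        }

        if (delay > TimeSpan.Zero)
        {
            await Task.Delay(delay, ct).ConfigureAwait(false);
        }
    }
}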

We probably don't need to go this far for the Chat APIs or chat use cases, but it is very applicable and valuable for embedding scenarios.

Proposed solution

Here are the things that would need to be implemented to achieve what I described above, if we decide to do it:

  • Expose the x-ratelimit-* headers in the OpenAIClientCore class in Microsoft.SemanticKernel.Connectors.OpenAI for the Embedding APIs (nice to have for the Chat APIs as well) => this requires changes in the SK repo (but I know @dluc you are the architect for both SK and KM, so I am not going to create a new issue there :)).
  • Surface the above headers/info in OpenAITextEmbeddingGenerationService and AzureOpenAITextEmbeddingGenerationService.
  • From here we could take different approaches:
    • Plan A: implement the rate-limiting logic in the GenerateEmbeddingsAsync API of the TextEmbeddingGenerationService itself, using the x-ratelimit-* information. Basically, it should wait for some time before actually invoking the OpenAI API if there aren't many tokens remaining or the current request rate is too high. Then we could safely convert the existing foreach loop in GenerateEmbeddingsHandler into a Parallel.ForEach loop (a rough sketch follows this list).
    • Plan B: instead of implementing the logic inside the GenerateEmbeddingsAsync API, implement the rate-limiting logic inside GenerateEmbeddingsHandler, keeping the GenerateEmbeddingsAsync API lightweight. This approach requires a rewrite of the embedding logic, ideally converting the foreach loop into a queue-based ingestion loop whose flow speed is controlled by the x-ratelimit-* information.
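
As a rough illustration of Plan A, the handler could combine a throttle like the one sketched earlier with a parallel loop. This is not an actual KM implementation: EstimateTokens, GenerateOneAsync, and the result shape are hypothetical stand-ins.

// Plan A sketch: throttle each parallel embedding call using the rate-limit
// information surfaced from previous responses. Helper names are hypothetical.
var throttle = new RateLimitThrottle();
ConcurrentDictionary<string, DataPipeline.GeneratedFileDetails> newFiles = new();

await Parallel.ForEachAsync(
    uploadedFile.GeneratedFiles,
    new ParallelOptions { MaxDegreeOfParallelism = 4, CancellationToken = cancellationToken },
    async (generatedFile, ct) =>
    {
        // Rough token estimate for this partition; a real implementation would
        // use the tokenizer configured for the embedding model.
        long estimatedTokens = EstimateTokens(generatedFile);
        await throttle.WaitIfNeededAsync(estimatedTokens, ct);

        // Hypothetical call that generates and stores the embedding and returns
        // the x-ratelimit-* values read from the response headers.
        var result = await GenerateOneAsync(generatedFile, ct);
        throttle.UpdateFromHeaders(result.RemainingTokens, result.ResetTokens);

        newFiles.TryAdd(result.EmbeddingFileName, result.Details);
    });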

Importance

would be great to have

@0x7c13 0x7c13 added the enhancement New feature or request label Mar 23, 2024

dluc commented Mar 23, 2024

hi @0x7c13

We could theoretically convert the existing code to use Parallel.ForEach instead and drastically improve the embedding speed, since the embeddings for the partition files are not logically coupled.

there's an open PR here you might be interested in: #147. However, handling 429s with parallel calls would require a centralized rate limiter. I'm not so sure it's worth the complexity of parallelizing if we then have to slow down because of throttling.

I agree about handling the headers returned by OpenAI and Azure, and I think the first step would be improving the SK embedding generation service, so everyone would benefit, including KM. The SK clients internally use the Azure SDK, and there's a chance it's only a few lines of code to implement a retry strategy based on those headers. In scenarios where the app disables any retry strategy, asking for a single attempt, we might need to surface the internal HTTP exception that includes all the details.

On a side note, this might already be possible by injecting a custom HTTP client, configured with handlers to follow the retry headers.
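
A minimal sketch of that idea (assuming the connector accepts an injected HttpClient; the handler name and retry numbers are mine, and this hasn't been validated against the SK connectors):

using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Sketch only: on 429, wait and retry based on the response headers.
// Retrying the same HttpRequestMessage from a DelegatingHandler works for
// buffered request content, such as the JSON payloads these connectors send.
public sealed class RateLimitHeaderHandler : DelegatingHandler
{
    private const int MaxAttempts = 5;

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        for (var attempt = 1; ; attempt++)
        {
            HttpResponseMessage response = await base.SendAsync(request, cancellationToken).ConfigureAwait(false);
            if (response.StatusCode != HttpStatusCode.TooManyRequests || attempt >= MaxAttempts)
            {
                return response;
            }

            // Prefer Retry-After when present; a fuller implementation would also
            // parse x-ratelimit-reset-tokens values such as "1s" or "6m0s".
            TimeSpan delay = response.Headers.RetryAfter?.Delta ?? TimeSpan.FromSeconds(10);
            response.Dispose();
            await Task.Delay(delay, cancellationToken).ConfigureAwait(false);
        }
    }
}

It would be wired up with something like new HttpClient(new RateLimitHeaderHandler { InnerHandler = new HttpClientHandler() }) and passed to the connector where an HttpClient parameter is supported.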

/cc @markwallace-microsoft

@microsoft microsoft locked and limited conversation to collaborators Jun 5, 2024
@dluc dluc converted this issue into discussion #614 Jun 5, 2024
