Shared, global RLQS client & buckets cache #34009
base: main
Conversation
…le filter worker threads, and the client interface that the worker threads can call to for unsafe operations. Signed-off-by: Brian Surber <bsurber@google.com>
…lements RateLimitClient for the local worker thread to call. The global client object performs all the thread-unsafe operations against the source-of-truth (safely, by only running them on the main thread) & pushes the results to TLS caches for the local clients to read. Signed-off-by: Brian Surber <bsurber@google.com>
… worker thread's local rl client when write ops are needed (which get passed up to the global client) Signed-off-by: Brian Surber <bsurber@google.com>
…ed resources Signed-off-by: Brian Surber <bsurber@google.com>
…ilter logic, and run through full integration testing. Signed-off-by: Brian Surber <bsurber@google.com>
Single shared rlqs client
Hi @bsurber, welcome and thank you for your contribution. We will try to review your Pull Request as quickly as possible. In the meantime, please take a look at the contribution guidelines if you have not done so already.
@bsurber could you resolve the merge conflict please? I think that is what is preventing CI from working.
/assign @tyxia |
Signed-off-by: bsurber <73970703+bsurber@users.noreply.github.com>
@bsurber please fix code format. You can run the
/wait
Signed-off-by: Brian Surber <bsurber@google.com>
Thank you for working on this! Nice work
We have been discussing this for a while. I'll just add some context here:
The current model is a thread-local model: the RLQS client, quota cache, etc. are per-thread.
The new model (introduced here) is a global model: the RLQS client, quota cache, etc. are per Envoy instance and shared across threads.
The motivation behind the global model is consistency (from the RLQS server's perspective in particular), but it potentially trades contention for that consistency; we should be especially careful about the high-QPS, multi-threaded case.
It will be great to perform the load test before PR is merged. We can kick off the code review though.
Of note, the added load largely won't be on the worker threads, as they only ever touch shared resources to read a pointer from the thread-local cache, increment atomics, and potentially query a shared token bucket (but that's the same in the per-worker-thread model). The only new contention is that added by a) the atomics (so minimal), and b) thread-local storage. Instead, my main concern to test is the added load on the main thread, which has to perform write operations against the cache + source-of-truth when the cache is first initialized for each bucket, when sending RLQS usage reports, and when processing RLQS responses into quota assignments and writing them into the source-of-truth + cache.
Signed-off-by: Brian Surber <bsurber@google.com>
Looks like this needs more test coverage, and also a merge.
…d testing for some pointer-safety checks. Signed-off-by: Brian Surber <bsurber@google.com>
Signed-off-by: bsurber <73970703+bsurber@users.noreply.github.com>
Signed-off-by: Brian Surber <bsurber@google.com>
Ah, still slightly off the coverage limit there. (Edit: Actually, quite far off; I need to remove some defensive coding to follow Envoy style standards.)
Signed-off-by: Brian Surber <bsurber@google.com>
…This and other minor changes improve code coverage. Signed-off-by: Brian Surber <bsurber@google.com>
Commit Message:
Currently the RLQS client & bucket cache used by the rate_limit_quota filter are per-thread. This gives each client visibility into only a small slice of the total traffic seen by the Envoy instance and multiplicatively increases the number of concurrent, managed streams to the RLQS backend.
This PR will merge the bucket caches to a single, shared map that is thread-safe to access and shared via TLS. Unsafe operations (namely creation of a new index in the bucket cache & setting of quota assignments from RLQS responses) are done by the main thread against a single source-of-truth, then pushed out to worker threads (again via pointer swap + TLS).
Local threads will also no longer have access to their own RLQS clients + streams. Instead, management of a single, shared RLQS stream will be done on the main thread, by a global client object. That global client object will handle the asynchronous generation & sending of RLQS UsageReports, as well as the processing of incoming RLQS Responses into actionable quota assignments for the filter worker-threads to pull from the buckets cache.
Additional Description:
The biggest TODO after submission will be supporting the reporting_interval field & handling reporting on different timers if buckets are configured with different intervals.
Risk Level: Medium
Testing: