Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shared, global RLQS client & buckets cache #34009

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

bsurber
Copy link

@bsurber bsurber commented May 7, 2024

Commit Message:
Currently the RLQS client & bucket cache in use by the rate_limit_quota filter is set to be per-thread. This causes each client to only have visibility into a small section of the total traffic seen by the Envoy and multiplicatively increases the number of concurrent, managed streams to the RLQS backend.

This PR will merge the bucket caches to a single, shared map that is thread-safe to access and shared via TLS. Unsafe operations (namely creation of a new index in the bucket cache & setting of quota assignments from RLQS responses) are done by the main thread against a single source-of-truth, then pushed out to worker threads (again via pointer swap + TLS).

Local threads will also no longer have access to their own RLQS clients + streams. Instead, management of a single, shared RLQS stream will be done on the main thread, by a global client object. That global client object will handle the asynchronous generation & sending of RLQS UsageReports, as well as the processing of incoming RLQS Responses into actionable quota assignments for the filter worker-threads to pull from the buckets cache.

Additional Description:
The biggest TODO after submission will be supporting the reporting_interval field & handling reporting on different timers if buckets are configured with different intervals.

Risk Level: Medium

Testing:

  • New unit testing of both global & local client objects
  • New unit testing of filter logic
  • Updates to existing config unit testing
  • New integration testing for all of the moving parts.

bsurber and others added 6 commits May 3, 2024 23:49
…le filter worker threads, and the client interface that the worker threads can call to for unsafe operations.

Signed-off-by: Brian Surber <bsurber@google.com>
…lements RateLimitClient for the local worker thread to call. The global client object performs all the thread-unsafe operations against the source-of-truth (safely, by only running them on the main thread) & pushes the results to TLS caches for the local clients to read.

Signed-off-by: Brian Surber <bsurber@google.com>
… worker thread's local rl client when write ops are needed (which get passed up to the global client)

Signed-off-by: Brian Surber <bsurber@google.com>
…ed resources

Signed-off-by: Brian Surber <bsurber@google.com>
…ilter logic, and run through full integration testing.

Signed-off-by: Brian Surber <bsurber@google.com>
@bsurber bsurber requested a review from yanavlasov as a code owner May 7, 2024 17:39
Copy link

Hi @bsurber, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.

🐱

Caused by: #34009 was opened by bsurber.

see: more, trace.

@phlax
Copy link
Member

phlax commented May 7, 2024

@bsurber could you resolve the merge conflict please - i think that is what is preventing ci from working

@adisuissa
Copy link
Contributor

/assign @tyxia

Signed-off-by: bsurber <73970703+bsurber@users.noreply.github.com>
@yanavlasov
Copy link
Contributor

@bsurber please fix code format. You can run the bazel run //tools/code_format:check_format -- fix or using this diff: https://dev.azure.com/cncf/envoy/_build/results?buildId=169874&view=artifacts&pathAsName=false&type=publishedArtifacts

/wait

Signed-off-by: Brian Surber <bsurber@google.com>
Copy link
Member

@tyxia tyxia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this! Nice work

We have been discussed this for a while. I just add some context here:
Current model is thread local model: RLQS client, quota cache etc are per thread.
The new model (that is introduced here) is global model: RLQS client, quota cache etc are per envoy instance and shared across threads

The motivation behind the global model is consistency (from RLQS server perspective in particular), but it is potentially trading off consistency with contention, especially we should be careful about high QPS multi-thread case.

It will be great to perform the load test before PR is merged. We can kick off the code review though.

@bsurber
Copy link
Author

bsurber commented May 10, 2024

Of note, the added load largely won't be on the worker threads, as they only ever touch shared resources to read a pointer from the thread-local cache, increment atomics, and potentially query a shared tokenbucket (but that's the same in the per-worker-thread model). The only new contention is that added by a) the atomics (so minimal), and b) thread-local-storage.

Instead, my main concern to test is the added load on the main thread, which has to do write operations against the cache + source-of-truth when the cache is first initialized for each bucket, when sending RLQS usage reports, and when processing RLQS responses into quota assignments then writing them into the source-of-truth + cache.

Signed-off-by: Brian Surber <bsurber@google.com>
@ravenblackx
Copy link
Contributor

Looks like this needs more test coverage, and also a merge.
/wait

…d testing for some pointer-safety checks.

Signed-off-by: Brian Surber <bsurber@google.com>
bsurber and others added 2 commits May 15, 2024 11:38
Signed-off-by: bsurber <73970703+bsurber@users.noreply.github.com>
Signed-off-by: Brian Surber <bsurber@google.com>
@bsurber
Copy link
Author

bsurber commented May 16, 2024

Ah, still slightly off the coverage limit there. (Edit: Actually, quite far off, I need to remove some defensive coding to follow Envoy style standards).

Signed-off-by: Brian Surber <bsurber@google.com>
…This and other minor changes improve code coverage.

Signed-off-by: Brian Surber <bsurber@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants