Shared, global RLQS client & buckets cache #34009

bsurber · 2024-05-07T17:39:13Z

Commit Message:
Currently, the RLQS client & bucket cache used by the rate_limit_quota filter are per-thread. As a result, each client only has visibility into a small slice of the total traffic seen by the Envoy instance, and the number of concurrent, managed streams to the RLQS backend increases multiplicatively.

This PR will merge the bucket caches to a single, shared map that is thread-safe to access and shared via TLS. Unsafe operations (namely creation of a new index in the bucket cache & setting of quota assignments from RLQS responses) are done by the main thread against a single source-of-truth, then pushed out to worker threads (again via pointer swap + TLS).
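As a rough illustration of the "single source-of-truth + pointer swap" idea above, here is a minimal standard-library sketch. It is not the code in this PR: `SharedBucketCache`, `Bucket`, `snapshot()` and `addBucket()` are hypothetical names, and a mutex stands in for the thread-local slot update that would actually push the new pointer to workers.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical per-bucket state. Counters are atomics so that readers can
// update them without taking a lock.
struct Bucket {
  std::atomic<uint64_t> allowed{0};
  std::atomic<uint64_t> denied{0};
};

using BucketMap = std::unordered_map<std::string, std::shared_ptr<Bucket>>;

class SharedBucketCache {
public:
  // Reader path: grab the currently published index. The index itself is
  // never mutated after publication, so it is safe to read concurrently.
  std::shared_ptr<const BucketMap> snapshot() const {
    std::lock_guard<std::mutex> lock(mu_);
    return current_;
  }

  // Writer path (assumed to be called from one thread only): copy the index,
  // add the new entry, then swap the published pointer. Existing Bucket
  // objects are shared between the old and new index, so counters carry over.
  void addBucket(const std::string& id) {
    std::lock_guard<std::mutex> lock(mu_);
    if (current_->count(id) != 0) {
      return;
    }
    auto next = std::make_shared<BucketMap>(*current_);
    next->emplace(id, std::make_shared<Bucket>());
    current_ = std::move(next);
  }

private:
  mutable std::mutex mu_;
  std::shared_ptr<const BucketMap> current_ = std::make_shared<BucketMap>();
};
```

A reader that already holds a snapshot keeps seeing the old index until it asks for a new one, which is the trade-off of publishing by pointer swap.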

Local threads will also no longer have access to their own RLQS clients + streams. Instead, management of a single, shared RLQS stream will be done on the main thread, by a global client object. That global client object will handle the asynchronous generation & sending of RLQS UsageReports, as well as the processing of incoming RLQS Responses into actionable quota assignments for the filter worker-threads to pull from the buckets cache.
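To make the usage-report side of the global client more concrete, the sketch below shows one way a single owner could drain per-bucket counters into a report. All names (`GlobalClientSketch`, `BucketCounters`, `BucketUsage`, `buildUsageReport()`) are made up for illustration, and plain structs stand in for the RLQS protobuf messages; the real client also manages the gRPC stream and timers, which are left out here.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical counters that worker threads increment on each request.
struct BucketCounters {
  std::atomic<uint64_t> allowed{0};
  std::atomic<uint64_t> denied{0};
};

// Hypothetical stand-in for one bucket's entry in a usage report.
struct BucketUsage {
  std::string bucket_id;
  uint64_t allowed;
  uint64_t denied;
};

// Assumed to be used from a single (main) thread, so the map needs no lock.
class GlobalClientSketch {
public:
  // Create the counters for a bucket on first use and return the shared
  // object that workers would increment.
  std::shared_ptr<BucketCounters> registerBucket(const std::string& id) {
    auto& slot = buckets_[id];
    if (!slot) {
      slot = std::make_shared<BucketCounters>();
    }
    return slot;
  }

  // Drain every bucket's counters into a report. exchange(0) reads and
  // resets in one atomic step, so increments racing with the report are
  // counted in either this report or the next one, never lost.
  std::vector<BucketUsage> buildUsageReport() {
    std::vector<BucketUsage> report;
    report.reserve(buckets_.size());
    for (auto& entry : buckets_) {
      BucketUsage usage;
      usage.bucket_id = entry.first;
      usage.allowed = entry.second->allowed.exchange(0);
      usage.denied = entry.second->denied.exchange(0);
      report.push_back(usage);
    }
    return report;
  }

private:
  std::map<std::string, std::shared_ptr<BucketCounters>> buckets_;
};
```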

Additional Description:
The biggest TODO after submission will be supporting the reporting_interval field & handling reporting on different timers if buckets are configured with different intervals.

Risk Level: Medium

Testing:

New unit testing of both global & local client objects
New unit testing of filter logic
Updates to existing config unit testing
New integration testing for all of the moving parts.

…lements RateLimitClient for the local worker thread to call. The global client object performs all the thread-unsafe operations against the source-of-truth (safely, by only running them on the main thread) & pushes the results to TLS caches for the local clients to read. Signed-off-by: Brian Surber <bsurber@google.com>

repokitteh-read-only · 2024-05-07T17:39:18Z

Hi @bsurber, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.

🐱

Caused by: #34009 was opened by bsurber.

phlax · 2024-05-07T17:44:07Z

@bsurber could you resolve the merge conflict please? I think that is what is preventing CI from working.

adisuissa · 2024-05-07T19:44:48Z

/assign @tyxia

yanavlasov · 2024-05-08T15:20:24Z

@bsurber please fix the code format. You can run `bazel run //tools/code_format:check_format -- fix` or use this diff: https://dev.azure.com/cncf/envoy/_build/results?buildId=169874&view=artifacts&pathAsName=false&type=publishedArtifacts

/wait

tyxia

Thank you for working on this! Nice work

We have been discussing this for a while. I'll just add some context here:
The current model is the thread-local model: the RLQS client, quota cache, etc. are per thread.
The new model (introduced here) is the global model: the RLQS client, quota cache, etc. are per Envoy instance and shared across threads.

The motivation behind the global model is consistency (from the RLQS server's perspective in particular), but it potentially trades contention for that consistency. We should be especially careful about the high-QPS, multi-threaded case.

It would be great to run a load test before the PR is merged. We can kick off the code review in the meantime, though.

bsurber · 2024-05-10T19:17:56Z

Of note, the added load largely won't be on the worker threads, as they only ever touch shared resources to read a pointer from the thread-local cache, increment atomics, and potentially query a shared tokenbucket (but that's the same in the per-worker-thread model). The only new contention is that added by a) the atomics (so minimal), and b) thread-local-storage.
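For reference, the worker-side hot path described above can be sketched roughly as follows. This is an illustration only: `CachedBucket`, `Action` and `shouldAllow()` are hypothetical names, the token-bucket case is omitted, and defaulting to allow-all before any assignment arrives is an assumption of the sketch, not a statement about the filter's configured behavior.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical blanket assignments; the token-bucket strategy is left out.
enum class Action { AllowAll, DenyAll };

// Hypothetical bucket as seen by a worker through its cached pointer.
struct CachedBucket {
  std::atomic<uint64_t> allowed{0};
  std::atomic<uint64_t> denied{0};
  // Written by the main thread when a new assignment arrives.
  // Starting at AllowAll is an assumption made for this sketch.
  std::atomic<Action> action{Action::AllowAll};
};

// Worker path: one atomic load plus one atomic increment per request.
// No locks are taken and no shared map is modified.
bool shouldAllow(CachedBucket& bucket) {
  const bool allow =
      bucket.action.load(std::memory_order_acquire) == Action::AllowAll;
  if (allow) {
    bucket.allowed.fetch_add(1, std::memory_order_relaxed);
  } else {
    bucket.denied.fetch_add(1, std::memory_order_relaxed);
  }
  return allow;
}
```

Under this shape, contention between workers is limited to the cache lines holding the counters, which matches the "atomics, so minimal" estimate.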

Instead, my main concern to test is the added load on the main thread, which has to do write operations against the cache + source-of-truth when the cache is first initialized for each bucket, when sending RLQS usage reports, and when processing RLQS responses into quota assignments then writing them into the source-of-truth + cache.

ravenblackx · 2024-05-14T15:01:19Z

Looks like this needs more test coverage, and also a merge.
/wait

bsurber · 2024-05-16T19:29:44Z

Ah, still slightly off the coverage limit there. (Edit: Actually, quite far off, I need to remove some defensive coding to follow Envoy style standards).

bsurber and others added 6 commits May 3, 2024 23:49

Create a new bucket cache type, that can be safely used across multip…

5db1b00

…le filter worker threads, and the client interface that the worker threads can call to for unsafe operations. Signed-off-by: Brian Surber <bsurber@google.com>

Update filter logic to read from the local bucket cache & call to the…

190c915

… worker thread's local rl client when write ops are needed (which get passed up to the global client) Signed-off-by: Brian Surber <bsurber@google.com>

Init functions & build dependencies updated to setup the newly requir…

7e3e3f8

…ed resources Signed-off-by: Brian Surber <bsurber@google.com>

Update unit testing to include testing of both client types & local f…

69c2856

…ilter logic, and run through full integration testing. Signed-off-by: Brian Surber <bsurber@google.com>

Merge pull request #1 from bsurber/single-shared-rlqs-client

a288a7a

Single shared rlqs client

bsurber requested a review from yanavlasov as a code owner May 7, 2024 17:39

phlax assigned yanavlasov May 7, 2024

repokitteh-read-only bot assigned tyxia May 7, 2024

Merge branch 'main' into main

f81588e

Signed-off-by: bsurber <73970703+bsurber@users.noreply.github.com>

repokitteh-read-only bot added the waiting label May 8, 2024

Format & deps fix

ff2bfd0

Signed-off-by: Brian Surber <bsurber@google.com>

repokitteh-read-only bot removed the waiting label May 9, 2024

tyxia reviewed May 10, 2024

Fix CI error by adding unreachable return.

d4f0d64

Signed-off-by: Brian Surber <bsurber@google.com>

repokitteh-read-only bot added the waiting label May 14, 2024

Improve test coverage by removing code that is unused+untestable & ad…

ae9201b

…d testing for some pointer-safety checks. Signed-off-by: Brian Surber <bsurber@google.com>

repokitteh-read-only bot removed the waiting label May 14, 2024

bsurber and others added 2 commits May 15, 2024 11:38

Merge branch 'main' into main

9b7af9f

Signed-off-by: bsurber <73970703+bsurber@users.noreply.github.com>

propagate usage shortcuts in merged integration test

7fd5a64

Signed-off-by: Brian Surber <bsurber@google.com>

bsurber added 2 commits May 17, 2024 21:45

Add to error-handling testing to increase coverage

3aca4e2

Signed-off-by: Brian Surber <bsurber@google.com>

Remove defensive coding against future bugs per envoy's style guide. …

1552f31

…This and other minor changes improve code coverage. Signed-off-by: Brian Surber <bsurber@google.com>
