Make trie structure less memory-hogging #33668

Open · wants to merge 18 commits into base: main
Conversation

ravenblackx
Contributor

@ravenblackx commented Apr 18, 2024

Commit Message: Make trie structure less memory-hogging
Additional Description: The previous trie structure used approximately 8KB per unique character of prefix; in a production environment with about 27,000 route prefixes, this amounted to approximately 1.5GB of additional RAM. This version of the trie reduces that memory impact by a factor of more than 100, at the cost of a small increase in insert time and a very small increase in lookup time (still O(1)). For tables larger than about 2000 characters (e.g. 20 entries of 100 characters, or 100 entries of 20 characters), the extra operations are more than paid for by the smaller memory footprint, which causes less CPU-cache eviction.
Trie used for static headers was replaced in #33932 to avoid being subject to the performance impact of this change.
Risk Level: Small. Existing tests still pass, additional tests verify that a new edge case also behaves correctly, and the performance impact is insignificant (slightly worse in cases that were already very fast, better in cases that were slower).
Testing: Added extra unit tests.
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
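The memory figure above can be made concrete with a minimal sketch (hypothetical layout, not the actual Envoy code, assuming a 64-bit platform): a dense trie node that reserves a child pointer for every possible byte value pays for all 256 slots even when only one child is ever populated, and one such node exists per unique prefix character.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical dense trie node: one slot per possible byte value.
template <class Value> struct DenseTrieNode {
  Value value_{};
  std::unique_ptr<DenseTrieNode> entries_[256]; // 256 pointers, used or not
};

// On a 64-bit platform the pointer array alone is 256 * 8 = 2048 bytes,
// before allocator overhead; every unique prefix character allocates one node.
constexpr std::size_t kDenseNodeBytes = sizeof(DenseTrieNode<void*>);
```

The exact per-character figure (8KB in the description) depends on the real node's layout and the allocator, but the fixed 256-slot array is the dominant term either way.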

The benchmark below was generated with both implementations tested side by side. The PR now removes the old 'big' implementation, as there is no longer any call for it.

New benchmark (in PR) results, running as `bazel run -c opt test/common/common:utility_trie_speed_test -- --benchmark-repetitions=10 --benchmark_out_format=csv --benchmark_out=somefile.csv`:

| Benchmark | BigMean | SmallMean | Difference (>100% means BigTrieEntry was faster) |
| --- | --- | --- | --- |
| bmTrieLookupsRequestHeaders | 29.7503 | 45.5528 | 153% |
| bmTrieLookupsResponseHeaders | 21.1678 | 41.4313 | 195% |
| bmTrieLookups/10/Mixed | 280.286 | 303.578 | 108% |
| bmTrieLookups/100/Mixed | 419.074 | 351.339 | 83% |
| bmTrieLookups/1000/Mixed | 1282.69 | 576.878 | 44% |
| bmTrieLookups/10000/Mixed | 1376.65 | 768.84 | 55% |
| bmTrieLookups/10/8 | 4.71941 | 12.7108 | 269% |
| bmTrieLookups/100/8 | 9.93909 | 17.8588 | 179% |
| bmTrieLookups/1000/8 | 19.2406 | 29.3718 | 152% |
| bmTrieLookups/10000/8 | 37.3927 | 34.1951 | 91% |
| bmTrieLookups/10/128 | 511.97 | 1018.77 | 198% |
| bmTrieLookups/100/128 | 1734.64 | 749.413 | 43% |
| bmTrieLookups/1000/128 | 2927.7 | 1083.53 | 37% |
| bmTrieLookups/10000/128 | 4017.93 | 1319.94 | 32% |

Signed-off-by: Raven Black <ravenblack@dropbox.com>
@adisuissa
Contributor

Thanks for starting this!
I know that previous attempts to update this data structure saw an actual time increase (wall time, not big-O), which seemed high for the header-map structure.
IIRC there are some header-map benchmarks. Can you run them as well?

Also adding @jmarantz as he has good suggestions in this area.
/assign @jmarantz @adisuissa

Contributor

@jmarantz left a comment

sweet! I will take a look at the code today.

WDYT of adding a perf benchmark?

Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
…case to compiler

Signed-off-by: Raven Black <ravenblack@dropbox.com>
@jmarantz
Contributor

What do you think about adding a benchmark?

/wait

@ravenblackx
Contributor Author

ravenblackx commented Apr 23, 2024

What do you think about adding a benchmark?

I added the output of two existing benchmarks to the PR description (one of which didn't work before or after the change). The first one seems likely to be what you were looking for, and looks like it's ~a wash.

@jmarantz
Contributor

Would you mind formatting the benchmark results as a table/spreadsheet? One thing that struck me on a quick scan is that headerMapImplIterate/50 got noticeably worse, and that (I assume) would be used whenever we serialize headers to upstreams and downstreams. I realize this is a bit of toil, but it's not too bad if you use --benchmark_format=csv; then you can import that into a spreadsheet.

How can we reason about whether this is a net win?

@jmarantz
Contributor

Also should we fix the other benchmark in a separate PR so it can inform us of the impact? I'm sure there was a reason you picked that benchmark :)

I also want to point out (not having looked at code yet) that I would guess your change may offer a meaningful reduction in memory during operation.

@adisuissa
Contributor

I think that most (if not all) of the tests in the benchmark do not really exercise the trie.
AFAIK the trie is used to quickly access the O(1) headers. Consider adding new targeted benchmark tests that exercise accessing different paths (for example, short and long keys).

@ravenblackx
Contributor Author

One possibility, if this turns out to be a noticeable additional time cost for headers, would be to use distinct TrieEntry templates for headers vs. routes: the memory cost of the headers trie is roughly fixed, relatively small, and (maybe) highly trafficked, whereas the memory cost of the routes trie is, well, up to 1.5GB in a real-world example, and it most likely sees only one lookup per request.

I'll have a go at fixing that other benchmark and maybe adding more and putting together a side-by-side comparison (--benchmark_format=csv is a good tip I didn't know about, thanks!)

It'll take a while because this is mostly a side-task when I'm waiting for other things to build/test, and my current main task doesn't have very long pauses.

@adisuissa
Contributor

One possibility, if this turns out to be a noticeable additional time cost for headers, would be to use distinct TrieEntry templates for headers vs. routes: the memory cost of the headers trie is roughly fixed, relatively small, and (maybe) highly trafficked, whereas the memory cost of the routes trie is, well, up to 1.5GB in a real-world example, and it most likely sees only one lookup per request.

+1
Changing the header-map has a bigger impact radius, so having 2 trie implementations seems safer.

@ravenblackx
Contributor Author

Also should we fix the other benchmark in a separate PR so it can inform us of the impact? I'm sure there was a reason you picked that benchmark :)

I had a go, and fixed the thrown exception, but there's just another segfault after that and I don't really know what that benchmark is even trying to achieve, so I'm abandoning that and making an issue for it (#33794) in case someone else has an interest. I'll just make a standalone benchmark for trie performance.

@jmarantz
Contributor

I'm not entirely sure we should split out the header-map trie. Can we first see what's suboptimal about your trie for those header-map case and try to fix that? Maybe there's a way to have a single good trie implementation.

Failing that, we could split into two alternate trie implementations optimized for different purposes, well documented as to which one new code should choose, depending on whether its usage is more like the header map or more like routes.

In either case, can we have a new benchmark which shows the improvements you have made?

Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
@ravenblackx
Contributor Author

New benchmark (results in the PR description) suggests the new version is faster for everything but lookups of a pretty small number of small keys - I'm guessing because it makes for fewer cache-misses.

@jmarantz
Contributor

Can you show the benchmark results as a table where there's a column for 'before' and a column for 'after'? Like I said I know it's a bit of toil but not too bad if you use csv.

Did you do anything to make the Iterate benchmark faster in the new version rather than slower, as it was earlier? Maybe we are getting some inconsistent behavior on your machine. You can use --benchmark_xxx arguments of various sorts (try --help) to run more iterations and get statistically significant results.

Whenever doing these optimizations I do think it pays to run enough iterations to increase confidence.

Signed-off-by: Raven Black <ravenblack@dropbox.com>
@jmarantz
Contributor

/wait

@ravenblackx
Contributor Author

Benchmarks are now shown side by side. It's a bit weird that headerMapImplIterate is the one benchmark with concerning numbers, because as far as I can tell it doesn't do anything involving a trie in the benchmark's measured code.

@jmarantz
Contributor

jmarantz commented Apr 29, 2024

This looks good for the routes. I am not entirely convinced that we aren't making things a bit worse for header maps, and thus we should have two alternate implementations with documentation explaining when you'd pick which one.

One other minor request: can you add a percent-improvement column, and with negative percents when things get slower? It might be easier to see at a glance what got slower vs faster.

You might also want to use some more advanced --benchmark_xxx options to do a series of interleaved tests and report the std-deviation. It's just different options to pass to the test binary.

Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
@ravenblackx
Contributor Author

This looks good for the routes. I am not entirely convinced that we aren't making things a bit worse for header maps, and thus we should have two alternate implementations with documentation explaining when you'd pick which one.

Funny, you were against that solution previously. :)
I've done it this way because I tend to agree, though I think if the static header map grew just a little, the smaller trie would actually be faster for it too (and in real-world use that may already be the case, because the cache-eviction impact seen in an isolated benchmark may not reflect the real-world performance impact).

You might also want to use some more advanced --benchmark_xxx options to do a series of interleaved tests and report the std-deviation. It's just different options to pass to the test binary.

I made it a mean over 10 executions for the new benchmark; the std-dev was consistently small enough not to be worth reporting.
The old benchmark is essentially irrelevant now, since the change doesn't touch header behavior; I added one column to it with the same execution parameters to make sure it matched the "old" column (i.e. that the sugar-only changes were indeed sugar-only), which it did.

@adisuissa
Contributor

This looks good for the routes. I am not entirely convinced that we aren't making things a bit worse for header maps, and thus we should have two alternate implementations with documentation explaining when you'd pick which one.

Funny, you were against that solution previously. :)

FWIW, it was I who was for introducing this in 2 steps (reducing blast radius).
That said, if there's no impact (or positive impact) for the headers case, then I'd be ok with updating the trie for both.

@jmarantz
Contributor

My preference is to iterate on the performance for the header-map and use this new impl always. If you think we can get there, I'm all for it.

But I just had a hard time interpreting the data shown as proving it was always at least as fast as it was previously.

@ravenblackx
Contributor Author

My preference is to iterate on the performance for the header-map and use this new impl always. If you think we can get there, I'm all for it.

Yeah, I don't think there's a viable way to get the performance of FastTrieEntry with the implementation of SmallTrieEntry at the scale of the static header lookup table. (I do think it would be possible to make the static header lookup table faster and smaller than a trie, but that would be a separate PR, since the point here was to improve the size, and, it turns out, also the performance, of the route-prefix trie.)

But I just had a hard time interpreting the data shown as proving it was always at least as fast as it was previously.

As the benchmark now shows, it is not as fast for small datasets. I guess it might be useful to specifically run the benchmark with a dataset that is the common set of static headers. Not sure where that's defined.

…ix. Eg., for 30-seconds.

Signed-off-by: Raven Black <ravenblack@dropbox.com>
@jmarantz
Contributor

OK, I will look at the current state. Header-map performance is important, so I have a hard time suggesting changes that might make someone's critical path slower.

Contributor

@jmarantz left a comment

nice! thanks for the benchmark. A few comments now that I've started to look at code.

};

/**
* A trie used for faster lookup with lookup time at most equal to the size of the key.
*/
template <class Value> struct TrieLookupTable {
template <template <class> class EntryType, class Value> struct TrieLookupTable {
Contributor

I have never seen this template syntax before. Is this different from

template <class EntryType, class Value>  ...

the use-case seems pretty common to me (e.g. containers and their iterators) so I suspect it can use the simpler form.

Contributor Author

If you do it that way then at the use-sites you have to declare like

TrieLookupTable<WhateverTrieEntry<std::unique_ptr<X>>, std::unique_ptr<X>>

versus this way it's

TrieLookupTable<WhateverTrieEntry, std::unique_ptr<X>>
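A side-by-side sketch of the two declaration styles (hypothetical names, not the PR's code): with a template template parameter the table instantiates the entry template over Value itself, so call sites state the value type once.

```cpp
#include <memory>
#include <string>

template <class Value> struct SketchEntry {
  Value value_{};
};

// Plain type parameter: the caller spells out EntryType<Value> in full.
template <class EntryType, class Value> struct TableByType {
  EntryType root_;
};

// Template template parameter: the caller passes the entry *template* and
// the table applies it to Value internally.
template <template <class> class EntryType, class Value> struct TableByTemplate {
  EntryType<Value> root_;
};

// Use-sites: value type repeated vs. stated once.
using Verbose =
    TableByType<SketchEntry<std::unique_ptr<std::string>>, std::unique_ptr<std::string>>;
using Terse = TableByTemplate<SketchEntry, std::unique_ptr<std::string>>;
```

Both aliases produce an identical layout; the difference is purely how much the use-site has to repeat.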

Contributor

What do you think of this idiom:

template<class ValueType> class DenseEntry {
 public:
  using Value = ValueType;
};
template<class ValueType> class SparseEntry {
 public:
  using Value = ValueType;
};

template<class EntryType> class MyContainer {
  using Value = typename EntryType::Value;
};

MyContainer<SparseEntry<int>> sparse;
MyContainer<DenseEntry<bool>> dense;

I'm fine with what you have; just wanted to put this out there as a taste test. I'm not sure why I can't directly use SparseEntry::ValueType from MyContainer without the 'using' nickname.

Contributor

Just to give another alternative, I slightly prefer using explicit aliases instead of letting the user of TrieLookupTable know the possible types of the first parameter.
What I mean is defining the following:

template <typename T>
using BigTrieLookupTable = TrieLookupTable<BigTrieEntry, T>;

template <typename T>
using SmallTrieLookupTable = TrieLookupTable<SmallTrieEntry, T>;

@@ -636,15 +636,100 @@ class WelfordStandardDeviation {
double m2_{0};
};

template <class Value> struct TrieEntry {
// FastTrieEntry is used for very simple very fast lookups.
Contributor

should we call this SparseTrieEntry and have the new one be DenseTrieEntry or something like that?

Of course we'd usually want fast, but your tests show that the current one is faster for small keys.

Contributor Author

I thought about that and couldn't come up with a way that makes sense - is a SparseTrieEntry suitable for sparse data, or itself sparse?
Maybe BigTrieEntry and ScaledTrieEntry?

source/common/common/utility.h (resolved)
for (auto _ : state) {
UNREFERENCED_PARAMETER(_);
auto v = trie.find(keys[key_selections[key_index++]]);
key_index &= 1023;
Contributor

some comments around this magic would help.
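The mask presumably keeps the loop cycling through a 1024-entry table of precomputed key choices; a hedged sketch of the idiom (hypothetical, not the benchmark's actual setup):

```cpp
#include <cstdint>

// Advancing an index through a table whose size is a power of two:
// `(i + 1) & 1023` is equivalent to `(i + 1) % 1024` but compiles to a
// single AND, so the benchmark loop measures trie lookups rather than a
// modulo. Precomputed pseudo-random selections also keep the branch
// predictor from learning a fixed key order.
inline uint32_t NextKeyIndex(uint32_t key_index) {
  return (key_index + 1) & 1023;
}
```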

@jmarantz
Contributor

jmarantz commented Apr 30, 2024 via email

@ravenblackx
Contributor Author

Maybe not if it fits in a cache line?

The vast majority of vector-based nodes are going to contain vectors of size 1, which (even with the overhead of also containing a start point and the vector's own overhead) would probably still end up smaller than an absl::btree of size 1. I guess I could give it a try after fixing this other stuff.
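For comparison, a minimal sketch of the "stretching vector" node under discussion (hypothetical layout, not the PR's exact code): the node stores a start byte plus a vector spanning only the range of children actually seen, so a size-1 node costs one pointer slot rather than 256.

```cpp
#include <algorithm>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical sparse trie node: children cover only the byte range
// [start_, start_ + entries_.size()), growing on demand.
template <class Value> struct SmallNode {
  Value value_{};
  uint8_t start_{0};
  std::vector<std::unique_ptr<SmallNode>> entries_;

  SmallNode* get(uint8_t c) const {
    if (entries_.empty() || c < start_ || c >= start_ + entries_.size()) {
      return nullptr;
    }
    return entries_[c - start_].get();
  }

  SmallNode& ensure(uint8_t c) {
    if (entries_.empty()) {
      start_ = c;
      entries_.resize(1);
    } else if (c < start_) {
      // Stretch downward: move existing children up by (start_ - c) slots.
      std::vector<std::unique_ptr<SmallNode>> stretched(entries_.size() + (start_ - c));
      std::move(entries_.begin(), entries_.end(), stretched.begin() + (start_ - c));
      entries_ = std::move(stretched);
      start_ = c;
    } else if (c >= start_ + entries_.size()) {
      entries_.resize(c - start_ + 1); // stretch upward
    }
    auto& slot = entries_[c - start_];
    if (slot == nullptr) {
      slot = std::make_unique<SmallNode>();
    }
    return *slot;
  }
};
```

Insertion stretches the vector downward or upward to cover a new byte, which is the insert-time cost the PR description trades for the smaller footprint.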

Signed-off-by: Raven Black <ravenblack@dropbox.com>
@ravenblackx
Contributor Author

ravenblackx commented Apr 30, 2024

Performance tested a btree_map-based TrieEntry; it's similar to the stretching vector, but slightly slower throughout (and significantly slower at the small end).

Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
// performance is not critical, or where the contents may be large.
template <class Value> class BigTrieEntry {
public:
BigTrieEntry* operator[](uint8_t i) const { return entries_[i].get(); }
Contributor

return type should probably be const BigTrieEntry*

template <class Value> class SmallTrieEntry {
public:
Value value_{};
SmallTrieEntry* operator[](uint8_t i) const {
Contributor

return type should probably be const.

Contributor Author

Doesn't work as const because it's used internally in the add+overwrite case where the value is modified.

Contributor

Maybe have 2 methods: a const method that returns a const ptr, and a non-const method that returns a non-const ptr?

Contributor Author

I don't think that even does anything, if both functions have the same behavior? A client that's asking for a const will get the exact same thing from a non-const-returning function as it would if both functions existed.

And it's effectively a private operator, in that the Trie class never exposes a TrieEntry.

If we go with my suggestion of fixing the static header map first then it can be made more explicit that this is private, in that the entry template could be a private member class of the trie template.

Contributor

It's essentially to ensure const-correctness (see https://chromium.googlesource.com/chromium/src/+/main/styleguide/c++/const.md for more details).

Contributor Author

Ah, that makes more sense. I thought the suggestion was const and non-const just on the return value; const X fn() const; plus X fn(); is much more coherent. Will do.
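The agreed shape is the standard const-overload pair; a minimal sketch with hypothetical names (not the PR's exact code): the const overload serves read-only paths such as a const find(), while the non-const overload supports add-with-overwrite.

```cpp
#include <cstdint>
#include <memory>

// Hypothetical entry showing the const/non-const operator[] pair.
template <class Value> class PairedEntry {
public:
  Value value_{};
  // Read path: a const entry hands out only const child pointers.
  const PairedEntry* operator[](uint8_t i) const { return children_[i].get(); }
  // Write path: a mutable entry hands out mutable child pointers.
  PairedEntry* operator[](uint8_t i) { return children_[i].get(); }
  void makeChild(uint8_t i) { children_[i] = std::make_unique<PairedEntry>(); }

private:
  std::unique_ptr<PairedEntry> children_[256];
};
```

Overload resolution picks the const version through const references, so read-only callers cannot mutate the tree.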

Signed-off-by: Raven Black <ravenblack@dropbox.com>
Contributor

@adisuissa left a comment

Thanks!
Just a minor comment.
I'm guessing that runtime guarding it is out of the question, but I just wanted to raise this point as we may be missing how this is going to impact some use-cases.


Signed-off-by: Raven Black <ravenblack@dropbox.com>
@ravenblackx
Contributor Author

ravenblackx commented May 6, 2024

I think it would probably make the most sense to land #33915 first, then #33932; then this one can be modified back to having only one implementation and one layer of template (and naming it back to TrieEntry/TrieLookupTable), because BigTrieEntry will no longer have any customers.
That will also mean this PR touches far fewer files, since all the clients of TrieLookupTable other than the static header map will be able (and appropriate) to use the new one with no changes at the call sites.

@ravenblackx
Contributor Author

I think it would probably make the most sense to land #33915 first, then #33932, [...]

An added bonus of this approach: it will fix the msan/tsan problem of OOM-crashing when running the benchmark, because there will no longer be a BigTrieEntry construct to compare against. That construct is the one causing the OOM, and essentially the thing this PR was intended to fix in the first place. :)

Signed-off-by: Raven Black <ravenblack@dropbox.com>
@@ -687,12 +688,15 @@ template <class Value> class BigTrieEntry {
template <class Value> class SmallTrieEntry {
public:
Value value_{};
SmallTrieEntry* operator[](uint8_t i) const {
const SmallTrieEntry* operator[](uint8_t i) const {
Contributor

If this is never used, what happens if you comment it out?

Contributor Author

Oh yeah, of course it's used because the find() is const, I'm totally wrong.

@ravenblackx marked this pull request as draft May 10, 2024 16:35
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
@ravenblackx marked this pull request as ready for review May 23, 2024 15:35
@ravenblackx
Contributor Author

Now that the static header map search isn't a TrieLookupTable, this PR is much simpler: it's back to not needing to support two implementations, and it no longer needs to touch any client code.

Ready for review again.

@ravenblackx
Contributor Author

/retest
