Make trie structure less memory-hogging #33668

Open · wants to merge 18 commits into base: main
Conversation

ravenblackx
Contributor

@ravenblackx commented Apr 18, 2024

Commit Message: Make trie structure less memory-hogging
Additional Description: The previous trie structure used approximately 8KB per unique character of prefix; in a production environment with about 27,000 route prefixes, this amounted to approximately 1.5GB of additional RAM. This version of the trie reduces that memory impact by a factor of more than 100, at the cost of a small increase in insert time and a very small increase in lookup time (still O(1)). For tables larger than about 2000 characters (e.g. 20 entries of 100 characters, or 100 entries of 20 characters), the extra operations are more than paid for by the smaller memory footprint, which causes less CPU-cache eviction.
Trie used for static headers was replaced in #33932 to avoid being subject to the performance impact of this change.
Risk Level: Small. Existing tests still pass, additional tests verify that a new edge case also behaves correctly, and the performance impact is insignificant (slightly worse in cases that were already very fast, better in cases that were slower).
Testing: Added extra unit tests.
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
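The memory figure above can be made concrete with a minimal sketch (hypothetical layout, not the actual Envoy code, assuming a 64-bit platform): a dense trie node that reserves a child pointer for every possible byte value pays for all 256 slots even when only one child is ever populated, and one such node exists per unique prefix character.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical dense trie node: one slot per possible byte value.
template <class Value> struct DenseTrieNode {
  Value value_{};
  std::unique_ptr<DenseTrieNode> entries_[256]; // 256 pointers, used or not
};

// On a 64-bit platform the pointer array alone is 256 * 8 = 2048 bytes,
// before allocator overhead; every unique prefix character allocates one node.
constexpr std::size_t kDenseNodeBytes = sizeof(DenseTrieNode<void*>);
```

The exact per-character figure (8KB in the description) depends on the real node's layout and the allocator, but the fixed 256-slot array is the dominant term either way.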

The benchmark below was generated with both implementations tested side by side. The PR now removes the old 'big' implementation, as there is no longer any call for it.

New benchmark (in PR) results, running as `bazel run -c opt test/common/common:utility_trie_speed_test -- --benchmark-repetitions=10 --benchmark_out_format=csv --benchmark_out=somefile.csv`:

| Benchmark | BigMean | SmallMean | Difference (>100% means BigTrieEntry was faster) |
| --- | --- | --- | --- |
| bmTrieLookupsRequestHeaders | 29.7503 | 45.5528 | 153% |
| bmTrieLookupsResponseHeaders | 21.1678 | 41.4313 | 195% |
| bmTrieLookups/10/Mixed | 280.286 | 303.578 | 108% |
| bmTrieLookups/100/Mixed | 419.074 | 351.339 | 83% |
| bmTrieLookups/1000/Mixed | 1282.69 | 576.878 | 44% |
| bmTrieLookups/10000/Mixed | 1376.65 | 768.84 | 55% |
| bmTrieLookups/10/8 | 4.71941 | 12.7108 | 269% |
| bmTrieLookups/100/8 | 9.93909 | 17.8588 | 179% |
| bmTrieLookups/1000/8 | 19.2406 | 29.3718 | 152% |
| bmTrieLookups/10000/8 | 37.3927 | 34.1951 | 91% |
| bmTrieLookups/10/128 | 511.97 | 1018.77 | 198% |
| bmTrieLookups/100/128 | 1734.64 | 749.413 | 43% |
| bmTrieLookups/1000/128 | 2927.7 | 1083.53 | 37% |
| bmTrieLookups/10000/128 | 4017.93 | 1319.94 | 32% |

Signed-off-by: Raven Black <ravenblack@dropbox.com>
@adisuissa
Contributor

Thanks for starting this!
I know that previous attempts to update this data structure saw an actual time increase (wall time, not big-O), which seemed high for the header-map structure.
IIRC there are some header-map benchmarks. Can you run them as well?

Also adding @jmarantz as he has good suggestions in this area.
/assign @jmarantz @adisuissa

Contributor

@jmarantz left a comment

sweet! I will take a look at the code today.

WDYT of adding a perf benchmark?

Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
…case to compiler

Signed-off-by: Raven Black <ravenblack@dropbox.com>
@jmarantz
Contributor

What do you think about adding a benchmark?

/wait

@ravenblackx
Contributor Author

ravenblackx commented Apr 23, 2024

What do you think about adding a benchmark?

I added the output of two existing benchmarks to the PR description (one of which didn't work before or after the change). The first one seems likely to be what you were looking for, and looks like it's ~a wash.

@jmarantz
Contributor

Would you mind formatting the benchmark results as a table/spreadsheet? One thing that struck me on a quick scan is that headerMapImplIterate/50 got noticeably worse, and that (I assume) would be used whenever we serialize headers to upstreams and downstreams. I realize this is a bit of toil, but it's not too bad if you use --benchmark_format=csv; then you can import that into a spreadsheet.

How can we reason about whether this is a net win?

@jmarantz
Contributor

Also should we fix the other benchmark in a separate PR so it can inform us of the impact? I'm sure there was a reason you picked that benchmark :)

I also want to point out (not having looked at code yet) that I would guess your change may offer a meaningful reduction in memory during operation.

@adisuissa
Contributor

I think that most (if not all) of the tests in the benchmark do not really exercise the trie.
AFAIK the trie is used to quickly access the O(1) headers. Consider adding new targeted benchmark tests that exercise accessing different paths (for example, short and long keys).

@ravenblackx
Contributor Author

One possibility, if this turns out to be a noticeable additional time cost for headers, would be to use distinct TrieEntry templates for headers vs. routes: the memory cost of the headers trie is roughly fixed, relatively small, and (maybe) highly trafficked, whereas the memory cost of the routes trie is, well, up to 1.5GB in a real-world example, and it most likely sees only one lookup per request.

I'll have a go at fixing that other benchmark and maybe adding more and putting together a side-by-side comparison (--benchmark_format=csv is a good tip I didn't know about, thanks!)

It'll take a while because this is mostly a side-task when I'm waiting for other things to build/test, and my current main task doesn't have very long pauses.

@adisuissa
Contributor

One possibility, if this turns out to be a noticeable additional time cost for headers, would be to use distinct TrieEntry templates for headers vs. routes: the memory cost of the headers trie is roughly fixed, relatively small, and (maybe) highly trafficked, whereas the memory cost of the routes trie is, well, up to 1.5GB in a real-world example, and it most likely sees only one lookup per request.

+1
Changing the header-map has a bigger impact radius, so having 2 trie implementations seems safer.

@ravenblackx
Contributor Author

Also should we fix the other benchmark in a separate PR so it can inform us of the impact? I'm sure there was a reason you picked that benchmark :)

I had a go, and fixed the thrown exception, but there's just another segfault after that and I don't really know what that benchmark is even trying to achieve, so I'm abandoning that and making an issue for it (#33794) in case someone else has an interest. I'll just make a standalone benchmark for trie performance.

@jmarantz
Contributor

I'm not entirely sure we should split out the header-map trie. Can we first see what's suboptimal about your trie for those header-map case and try to fix that? Maybe there's a way to have a single good trie implementation.

Failing that, we could split into two alternate trie implementations optimized for different purposes, well documented as to which one new code should choose, depending on whether its usage is more like the header map or more like routes.

In either case, can we have a new benchmark which shows the improvements you have made?

Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
@ravenblackx
Contributor Author

New benchmark (results in the PR description) suggests the new version is faster for everything but lookups of a pretty small number of small keys - I'm guessing because it makes for fewer cache-misses.

@jmarantz
Contributor

Can you show the benchmark results as a table where there's a column for 'before' and a column for 'after'? Like I said I know it's a bit of toil but not too bad if you use csv.

Did you do anything to make the Iterate benchmark faster in the new version rather than slower, as it was earlier? Maybe we are getting some inconsistent behavior on your machine. You can use --benchmark_xxx arguments of various sorts (try --help) to run more iterations and get statistically significant results.

Whenever doing these optimizations I do think it pays to run enough iterations to increase confidence.

Signed-off-by: Raven Black <ravenblack@dropbox.com>
@jmarantz
Contributor

/wait

@ravenblackx
Contributor Author

Benchmarks are now shown side by side. It's a bit weird that headerMapImplIterate is the one benchmark with concerning numbers, because as far as I can tell it doesn't do anything involving a trie in the benchmark's measured code.

@jmarantz
Contributor

jmarantz commented Apr 29, 2024

This looks good for the routes. I am not entirely convinced that we aren't making things a bit worse for header maps, and thus we should have two alternate implementations with documentation explaining when you'd pick which one.

One other minor request: can you add a percent-improvement column, and with negative percents when things get slower? It might be easier to see at a glance what got slower vs faster.

You might also want to use some more advanced --benchmark_xxx options to do a series of interleaved tests and report the std-deviation. It's just different options to pass to the test binary.

Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
@ravenblackx
Contributor Author

This looks good for the routes. I am not entirely convinced that we aren't making things a bit worse for header maps, and thus we should have two alternate implementations with documentation explaining when you'd pick which one.

Funny, you were against that solution previously. :)
I've done it this way because I tend to agree, though I think if the static header map grew just a little, the smaller trie would actually be faster for it too (and in real-world use that may already be the case, because the cache-eviction impact seen in an isolated benchmark may not reflect the real-world performance impact).

You might also want to use some more advanced --benchmark_xxx options to do a series of interleaved tests and report the std-deviation. It's just different options to pass to the test binary.

I made it a mean over 10 executions for the new benchmark; the std-dev was consistently small enough not to be worth reporting.
The old benchmark is essentially irrelevant now, since the change doesn't touch header behavior; I added one column to it with the same execution parameters to make sure it matched the "old" column (i.e. that the sugar-only changes were indeed sugar-only), which it did.

@adisuissa
Contributor

This looks good for the routes. I am not entirely convinced that we aren't making things a bit worse for header maps, and thus we should have two alternate implementations with documentation explaining when you'd pick which one.

Funny, you were against that solution previously. :)

FWIW, it was I who was for introducing this in 2 steps (reducing blast radius).
That said, if there's no impact (or positive impact) for the headers case, then I'd be ok with updating the trie for both.

@jmarantz
Contributor

My preference is to iterate on the performance for the header-map and use this new impl always. If you think we can get there, I'm all for it.

But I just had a hard time interpreting the data shown as proving it was always at least as fast as it was previously.

@ravenblackx
Contributor Author

My preference is to iterate on the performance for the header-map and use this new impl always. If you think we can get there, I'm all for it.

Yeah, I don't think there's a viable way to get the performance of FastTrieEntry with the implementation of SmallTrieEntry at the scale of the static header lookup table. (I do think it would be possible to make the static header lookup table faster and smaller than a trie, but that would be a separate PR, since the point here was to improve the size, and, it turns out, also the performance, of the route-prefix trie.)

But I just had a hard time interpreting the data shown as proving it was always at least as fast as it was previously.

As the benchmark now shows, it is not as fast for small datasets. I guess it might be useful to specifically run the benchmark with a dataset that is the common set of static headers. Not sure where that's defined.

…ix. Eg., for 30-seconds.

Signed-off-by: Raven Black <ravenblack@dropbox.com>
@jmarantz
Contributor

OK, I will look at the current state. Header-map performance is important, so I have a hard time suggesting changes that might make someone's critical path slower.

Contributor

@jmarantz left a comment

nice! thanks for the benchmark. A few comments now that I've started to look at code.

};

/**
* A trie used for faster lookup with lookup time at most equal to the size of the key.
*/
template <class Value> struct TrieLookupTable {
template <template <class> class EntryType, class Value> struct TrieLookupTable {
Contributor

I have never seen this template syntax before. Is this different from

template <class EntryType, class Value>  ...

the use-case seems pretty common to me (e.g. containers and their iterators) so I suspect it can use the simpler form.

Contributor Author

If you do it that way then at the use-sites you have to declare like

TrieLookupTable<WhateverTrieEntry<std::unique_ptr<X>>, std::unique_ptr<X>>

versus this way it's

TrieLookupTable<WhateverTrieEntry, std::unique_ptr<X>>
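A side-by-side sketch of the two declaration styles (hypothetical names, not the PR's code): with a template template parameter the table instantiates the entry template over Value itself, so call sites state the value type once.

```cpp
#include <memory>
#include <string>

template <class Value> struct SketchEntry {
  Value value_{};
};

// Plain type parameter: the caller spells out EntryType<Value> in full.
template <class EntryType, class Value> struct TableByType {
  EntryType root_;
};

// Template template parameter: the caller passes the entry *template* and
// the table applies it to Value internally.
template <template <class> class EntryType, class Value> struct TableByTemplate {
  EntryType<Value> root_;
};

// Use-sites: value type repeated vs. stated once.
using Verbose =
    TableByType<SketchEntry<std::unique_ptr<std::string>>, std::unique_ptr<std::string>>;
using Terse = TableByTemplate<SketchEntry, std::unique_ptr<std::string>>;
```

Both aliases produce an identical layout; the difference is purely how much the use-site has to repeat.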

Contributor

What do you think of this idiom:

template<class ValueType> class DenseEntry {
 public:
  using Value = ValueType;
};
template<class ValueType> class SparseEntry {
 public:
  using Value = ValueType;
};

template<class EntryType> class MyContainer {
  using Value = typename EntryType::Value;
};

MyContainer<SparseEntry<int>> sparse;
MyContainer<DenseEntry<bool>> dense;

I'm fine with what you have; just wanted to put this out there as a taste test. I'm not sure why I can't directly use SparseEntry::ValueType from MyContainer without the 'using' nickname.

Contributor

Just to give another alternative, I slightly prefer using explicit aliases instead of letting the user of TrieLookupTable know the possible types of the first parameter.
What I mean is defining the following:

template <typename T>
using BigTrieLookupTable = TrieLookupTable<BigTrieEntry, T>;

template <typename T>
using SmallTrieLookupTable = TrieLookupTable<SmallTrieEntry, T>;

@@ -636,15 +636,100 @@ class WelfordStandardDeviation {
double m2_{0};
};

template <class Value> struct TrieEntry {
// FastTrieEntry is used for very simple very fast lookups.
Contributor

should we call this SparseTrieEntry and have the new one be DenseTrieEntry or something like that?

Of course we'd usually want fast, but your tests show that the current one is faster for small keys.

Contributor Author

I thought about that and couldn't come up with a way that makes sense - is a SparseTrieEntry suitable for sparse data, or itself sparse?
Maybe BigTrieEntry and ScaledTrieEntry?

source/common/common/utility.h (resolved)
for (auto _ : state) {
UNREFERENCED_PARAMETER(_);
auto v = trie.find(keys[key_selections[key_index++]]);
key_index &= 1023;
Contributor

some comments around this magic would help.
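The mask presumably keeps the loop cycling through a 1024-entry table of precomputed key choices; a hedged sketch of the idiom (hypothetical, not the benchmark's actual setup):

```cpp
#include <cstdint>

// Advancing an index through a table whose size is a power of two:
// `(i + 1) & 1023` is equivalent to `(i + 1) % 1024` but compiles to a
// single AND, so the benchmark loop measures trie lookups rather than a
// modulo. Precomputed pseudo-random selections also keep the branch
// predictor from learning a fixed key order.
inline uint32_t NextKeyIndex(uint32_t key_index) {
  return (key_index + 1) & 1023;
}
```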

@jmarantz
Contributor

jmarantz commented Apr 30, 2024 via email

@ravenblackx
Contributor Author

Maybe not if it fits in a cache line?

The vast majority of vector-based nodes are going to contain vectors of size 1, which (even with the overhead of also containing a start point and the vector's own overhead) would probably still end up smaller than an absl::btree of size 1. I guess I could give it a try after fixing this other stuff.
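For comparison, a minimal sketch of the "stretching vector" node under discussion (hypothetical layout, not the PR's exact code): the node stores a start byte plus a vector spanning only the range of children actually seen, so a size-1 node costs one pointer slot rather than 256.

```cpp
#include <algorithm>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical sparse trie node: children cover only the byte range
// [start_, start_ + entries_.size()), growing on demand.
template <class Value> struct SmallNode {
  Value value_{};
  uint8_t start_{0};
  std::vector<std::unique_ptr<SmallNode>> entries_;

  SmallNode* get(uint8_t c) const {
    if (entries_.empty() || c < start_ || c >= start_ + entries_.size()) {
      return nullptr;
    }
    return entries_[c - start_].get();
  }

  SmallNode& ensure(uint8_t c) {
    if (entries_.empty()) {
      start_ = c;
      entries_.resize(1);
    } else if (c < start_) {
      // Stretch downward: move existing children up by (start_ - c) slots.
      std::vector<std::unique_ptr<SmallNode>> stretched(entries_.size() + (start_ - c));
      std::move(entries_.begin(), entries_.end(), stretched.begin() + (start_ - c));
      entries_ = std::move(stretched);
      start_ = c;
    } else if (c >= start_ + entries_.size()) {
      entries_.resize(c - start_ + 1); // stretch upward
    }
    auto& slot = entries_[c - start_];
    if (slot == nullptr) {
      slot = std::make_unique<SmallNode>();
    }
    return *slot;
  }
};
```

Insertion stretches the vector downward or upward to cover a new byte, which is the insert-time cost the PR description trades for the smaller footprint.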

Signed-off-by: Raven Black <ravenblack@dropbox.com>
@ravenblackx
Contributor Author

ravenblackx commented Apr 30, 2024

Performance tested a btree_map-based TrieEntry; it's similar to the stretching vector, but slightly slower throughout (and significantly slower at the small end).

Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
// performance is not critical, or where the contents may be large.
template <class Value> class BigTrieEntry {
public:
BigTrieEntry* operator[](uint8_t i) const { return entries_[i].get(); }
Contributor

return type should probably be const BigTrieEntry*

template <class Value> class SmallTrieEntry {
public:
Value value_{};
SmallTrieEntry* operator[](uint8_t i) const {
Contributor

return type should probably be const.

Contributor Author

Doesn't work as const because it's used internally in the add+overwrite case where the value is modified.

Contributor

Maybe have 2 methods: a const method that returns a const ptr, and a non-const method that returns a non-const ptr?

Contributor Author

I don't think that even does anything, if both functions have the same behavior? A client that's asking for a const will get the exact same thing from a non-const-returning function as it would if both functions existed.

And it's effectively a private operator, in that the Trie class never exposes a TrieEntry.

If we go with my suggestion of fixing the static header map first then it can be made more explicit that this is private, in that the entry template could be a private member class of the trie template.

Contributor

It's essentially to ensure const-correctness (see https://chromium.googlesource.com/chromium/src/+/main/styleguide/c++/const.md for more details).

Contributor Author

Ah, that makes more sense. I thought the suggestion was const and non-const just on the return value; const X fn() const; plus X fn(); is much more coherent. Will do.
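The agreed shape is the standard const-overload pair; a minimal sketch with hypothetical names (not the PR's exact code): the const overload serves read-only paths such as a const find(), while the non-const overload supports add-with-overwrite.

```cpp
#include <cstdint>
#include <memory>

// Hypothetical entry showing the const/non-const operator[] pair.
template <class Value> class PairedEntry {
public:
  Value value_{};
  // Read path: a const entry hands out only const child pointers.
  const PairedEntry* operator[](uint8_t i) const { return children_[i].get(); }
  // Write path: a mutable entry hands out mutable child pointers.
  PairedEntry* operator[](uint8_t i) { return children_[i].get(); }
  void makeChild(uint8_t i) { children_[i] = std::make_unique<PairedEntry>(); }

private:
  std::unique_ptr<PairedEntry> children_[256];
};
```

Overload resolution picks the const version through const references, so read-only callers cannot mutate the tree.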

Signed-off-by: Raven Black <ravenblack@dropbox.com>
Contributor

@adisuissa left a comment

Thanks!
Just a minor comment.
I'm guessing that runtime guarding it is out of the question, but I just wanted to raise this point as we may be missing how this is going to impact some use-cases.


Signed-off-by: Raven Black <ravenblack@dropbox.com>
@ravenblackx
Contributor Author

ravenblackx commented May 6, 2024

I think it would probably make the most sense to land #33915 first, then #33932; then this one can be modified back to having only one implementation and one layer of template (and naming it back to TrieEntry/TrieLookupTable), because BigTrieEntry will no longer have any customers.
That will also mean this PR touches far fewer files, since all the clients of TrieLookupTable other than the static header map will be able (and appropriate) to use the new one with no changes at the call sites.

@ravenblackx
Contributor Author

I think it would probably make the most sense to land #33915 first, then #33932, [...]

An added bonus of this approach: it will fix the msan/tsan problem of OOM-crashing when running the benchmark, because there will no longer be a BigTrieEntry construct to compare against. That construct is the one causing the OOM, and essentially the thing this PR was intended to fix in the first place. :)

Signed-off-by: Raven Black <ravenblack@dropbox.com>
@@ -687,12 +688,15 @@ template <class Value> class BigTrieEntry {
template <class Value> class SmallTrieEntry {
public:
Value value_{};
SmallTrieEntry* operator[](uint8_t i) const {
const SmallTrieEntry* operator[](uint8_t i) const {
Contributor

If this is never used, what happens if you comment it out?

Contributor Author

Oh yeah, of course it's used because the find() is const, I'm totally wrong.

@ravenblackx marked this pull request as draft May 10, 2024 16:35
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
@ravenblackx marked this pull request as ready for review May 23, 2024 15:35
@ravenblackx
Contributor Author

Now that the static header map search isn't a TrieLookupTable, this PR is much simpler: it's back to not needing to support two implementations, and it no longer needs to touch any client code.

Ready for review again.

@ravenblackx
Contributor Author

/retest
