Make trie structure less memory-hogging #33668
base: main
Conversation
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Thanks for starting this! Also adding @jmarantz as he has good suggestions in this area.
sweet! I will take a look at the code today.
WDYT of adding a perf benchmark?
Signed-off-by: Raven Black <ravenblack@dropbox.com>
…case to compiler Signed-off-by: Raven Black <ravenblack@dropbox.com>
What do you think about adding a benchmark? /wait
I added the output of two existing benchmarks to the PR description (one of which didn't work before or after the change). The first one seems likely to be what you were looking for, and looks like it's ~a wash.
Would you mind formatting the benchmark results in a table/spreadsheet? One thing that struck me on a quick scan is that headerMapImplIterate/50 got noticeably worse, and that (I assume) would be used whenever we serialize headers to upstreams and downstreams. I realize this is a bit of toil but it's not too bad if you use csv. How can we reason about whether this is a net win?
Also should we fix the other benchmark in a separate PR so it can inform us of the impact? I'm sure there was a reason you picked that benchmark :) I also want to point out (not having looked at code yet) that I would guess your change may offer a meaningful reduction in memory during operation.
I think that most (if not all) of the tests in the benchmark do not really exercise the trie. |
Thinking one possibility if it turns out this is a noticeable additional time-cost for headers might be to use distinct TrieEntry templates for headers vs routes, because the memory cost of the headers trie is ~fixed and relatively small and (maybe) highly-trafficked, vs. the memory cost of the routes one is, well, up to 1.5GB in a real-world example and most likely only experiences one lookup per request. I'll have a go at fixing that other benchmark and maybe adding more and putting together a side-by-side comparison. It'll take a while because this is mostly a side-task when I'm waiting for other things to build/test, and my current main task doesn't have very long pauses.
+1
I had a go, and fixed the thrown exception, but there's just another segfault after that and I don't really know what that benchmark is even trying to achieve, so I'm abandoning that and making an issue for it (#33794) in case someone else has an interest. I'll just make a standalone benchmark for trie performance.
I'm not entirely sure we should split out the header-map trie. Can we first see what's suboptimal about your trie for those header-map cases and try to fix that? Maybe there's a way to have a single good trie implementation. Failing that, we could split into 2 alternate trie implementations optimized for different purposes, and well documented as to which new code should choose, depending on whether they are more like header-map or more like route. In either case, can we have a new benchmark which shows the improvements you have made?
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
New benchmark (results in the PR description) suggests the new version is faster for everything but lookups of a pretty small number of small keys - I'm guessing because it makes for fewer cache-misses.
Can you show the benchmark results as a table where there's a column for 'before' and a column for 'after'? Like I said I know it's a bit of toil but not too bad if you use csv. Did you do anything to make the Iterate benchmark faster in the new version rather than slower, which it was earlier? Maybe we are getting some inconsistent behavior on your machine. Whenever doing these optimizations I do think it pays to run enough iterations to increase confidence.
Signed-off-by: Raven Black <ravenblack@dropbox.com>
/wait |
Benchmarks side-by-sided. Bit weird that
This looks good for the routes. I am not entirely convinced that we aren't making things a bit worse for header-maps and thus should have 2 alternate impls with doc explaining when you'd pick which one. One other minor request: can you add a percent-improvement column, with negative percents when things get slower? It might be easier to see at a glance what got slower vs faster. You might also want to use some more advanced
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Funny, you were against that solution previously. :)
Made it a mean over 10 executions for the new benchmark; the std-dev was consistently small enough to not really be worth reporting.
FWIW, it was I who was for introducing this in 2 steps (reducing blast radius). |
My preference is to iterate on the performance for the header-map and use this new impl always. If you think we can get there, I'm all for it. But I just had a hard time interpreting the data shown as proving it was always at least as fast as it was previously.
Yeah, I don't think there's a viable way to get the performance of FastTrieEntry with the implementation of SmallTrieEntry at the scale of the header static lookup table. (I do think it would be possible to get the header static lookup table to be faster and smaller than a trie, but that would be a separate PR since the point here was more to improve the size, and, it turns out, also performance, of the route-prefix trie.)
As the benchmark now shows, it is not as fast for small datasets. I guess it might be useful to specifically run the benchmark with a dataset that is the common set of static headers. Not sure where that's defined.
…ix. Eg., for 30-seconds. Signed-off-by: Raven Black <ravenblack@dropbox.com>
OK I will look at the current state. Header map performance is important so I have a hard time suggesting changes that might make someone's critical path slower.
nice! thanks for the benchmark. A few comments now that I've started to look at code.
source/common/common/utility.h
Outdated
 };

 /**
  * A trie used for faster lookup with lookup time at most equal to the size of the key.
  */
-template <class Value> struct TrieLookupTable {
+template <template <class> class EntryType, class Value> struct TrieLookupTable {
I have never seen this template syntax before. Is this different from
template <class EntryType, class Value> ...
the use-case seems pretty common to me (e.g. containers and their iterators) so I suspect it can use the simpler form.
If you do it that way then at the use-sites you have to declare like
TrieLookupTable<WhateverTrieEntry<std::unique_ptr<X>>, std::unique_ptr<X>>
versus this way it's
TrieLookupTable<WhateverTrieEntry, std::unique_ptr<X>>
What do you think of this idiom:
template<class ValueType> class DenseEntry {
public:
using Value = ValueType;
};
template<class ValueType> class SparseEntry {
public:
using Value = ValueType;
};
template<class EntryType> class MyContainer {
using Value = typename EntryType::Value;
};
MyContainer<SparseEntry<int>> sparse;
MyContainer<DenseEntry<bool>> dense;
I'm fine with what you have; just wanted to put this out there as a taste test. I'm not sure why I can't directly use SparseEntry::ValueType from MyContainer without the 'using' nickname.
Just to give another alternative, I slightly prefer using explicit aliases instead of letting the user of TrieLookupTable know the possible types of the first parameter.
What I mean is defining the following:
template <typename T>
using BigTrieLookupTable = TrieLookupTable<BigTrieEntry, T>;
template <typename T>
using SmallTrieLookupTable = TrieLookupTable<SmallTrieEntry, T>;
source/common/common/utility.h
Outdated
@@ -636,15 +636,100 @@ class WelfordStandardDeviation {
   double m2_{0};
 };

-template <class Value> struct TrieEntry {
+// FastTrieEntry is used for very simple very fast lookups.
should we call this SparseTrieEntry and have the new one be DenseTrieEntry or something like that?
Of course we'd usually want fast, but your tests show that the current one is faster for small keys.
I thought about that and couldn't come up with a way that makes sense - is a SparseTrieEntry suitable for sparse data, or itself sparse? Maybe BigTrieEntry and ScaledTrieEntry?
  for (auto _ : state) {
    UNREFERENCED_PARAMETER(_);
    auto v = trie.find(keys[key_selections[key_index++]]);
    key_index &= 1023;
some comments around this magic would help.
Maybe not if it fits in a cache line?
…On Tue, Apr 30, 2024, 12:34 PM Raven Black wrote, in source/common/common/utility.h (#33668):

>    Value value_{};
> -  std::array<std::unique_ptr<TrieEntry>, 256> entries_;
> +
> +private:
> +  std::array<std::unique_ptr<FastTrieEntry>, 256> entries_;
> +};
> +
> +// A SmallTrieEntry aims to be a good balance of performant and
> +// space-efficient, by allocating a vector the size of the range of children
> +// the node contains. This should be good for most use-cases.
> +//
> +// For example, a node with children 'a' and 'z' will contain a vector of
> +// size 26, containing two values and 24 nulls. A node with only one

Lookups through a btree would be multiple steps. O(log(num_keys)) is surely worse than O(1) when the 1 is also very simple.
The vast majority of vector-based nodes are going to contain vectors of size 1, which (even with the overhead of also containing a start point and the vector's own overhead) would probably still end up smaller than an absl::btree of size 1. I guess I could give it a try after fixing this other stuff.
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Performance tested a btree_map-based TrieEntry; it's similar to the stretching vector, but slightly slower throughout (and significantly slower at the small end).
Signed-off-by: Raven Black <ravenblack@dropbox.com>
source/common/common/utility.h
Outdated
// performance is not critical, or where the contents may be large.
template <class Value> class BigTrieEntry {
public:
  BigTrieEntry* operator[](uint8_t i) const { return entries_[i].get(); }
return type should probably be const BigTrieEntry*
source/common/common/utility.h
Outdated
template <class Value> class SmallTrieEntry {
public:
  Value value_{};
  SmallTrieEntry* operator[](uint8_t i) const {
return type should probably be const.
Doesn't work as const because it's used internally in the add+overwrite case where the value is modified.
Maybe have 2 methods: a const method that returns a const ptr, and a non-const method that returns a non-const ptr?
I don't think that even does anything, if both functions have the same behavior? A client that's asking for a const will get the exact same thing from a non-const-returning function as it would if both functions existed.
And it's effectively a private operator, in that the Trie class never exposes a TrieEntry.
If we go with my suggestion of fixing the static header map first then it can be made more explicit that this is private, in that the entry template could be a private member class of the trie template.
It's essentially to ensure const-correctness (see https://chromium.googlesource.com/chromium/src/+/main/styleguide/c++/const.md for more details).
Ah, that makes more sense, I thought the suggestion was const and non-const just on the return value; const X fn() const; and X fn(); is much more coherent. Will do.
Thanks!
Just a minor comment.
I'm guessing that runtime guarding it is out of the question, but I just wanted to raise this point as we may be missing how this is going to impact some use-cases.
source/common/common/utility.h
Outdated
 };

 /**
  * A trie used for faster lookup with lookup time at most equal to the size of the key.
  */
-template <class Value> struct TrieLookupTable {
+template <template <class> class EntryType, class Value> struct TrieLookupTable {
Just to give another alternative, I slightly prefer using explicit aliases instead of letting the user of TrieLookupTable know the possible types of the first parameter.
What I mean is defining the following:
template <typename T>
using BigTrieLookupTable = TrieLookupTable<BigTrieEntry, T>;
template <typename T>
using SmallTrieLookupTable = TrieLookupTable<SmallTrieEntry, T>;
I think it would probably make the most sense to land #33915 first, then #33932, then this one can be modified back to only having one version and one layer of template (and naming it back to TrieEntry/TrieLookupTable) because BigTrieEntry will no longer have any customers.
Added bonus of this approach, it will fix the msan/tsan problem of OOM-crashing when running the benchmark, because there will no longer be a BigTrieEntry construct to compare, which is the one causing the OOM, which is essentially the thing this PR was intended to fix in the first place. :)
Signed-off-by: Raven Black <ravenblack@dropbox.com>
source/common/common/utility.h
Outdated
@@ -687,12 +688,15 @@ template <class Value> class BigTrieEntry {
 template <class Value> class SmallTrieEntry {
 public:
   Value value_{};
-  SmallTrieEntry* operator[](uint8_t i) const {
+  const SmallTrieEntry* operator[](uint8_t i) const {
If this is never used, what happens if you comment it out?
Oh yeah, of course it's used because the find() is const, I'm totally wrong.
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Now that the static header map search isn't a TrieLookupTable, this PR is much simpler because it's back to not needing to support two implementations (and also no longer needs to touch any client-code). Ready for review again.
/retest |
Commit Message: Make trie structure less memory-hogging
Additional Description: The previous trie structure uses approximately 8kb per unique character of prefix; in a production environment with about 27000 route prefixes, this amounted to consuming approximately 1.5GB of additional RAM. This version of the trie reduces that memory impact by a factor of over 100, at the cost of a small increase in insert-time and a very small increase in lookup-time (still O(1)). The increase in operations is more than paid for by the smaller memory footprint = less memory-cache destruction, for tables larger than 2000 characters (e.g. 20 entries 100 characters long, 100 entries 20 characters long, etc.)
Trie used for static headers was replaced in #33932 to avoid being subject to the performance impact of this change.
Risk Level: Small. Existing tests still pass, additional tests verify some new edge-case also behaves fine, performance impact is insignificant (a bit worse in cases where it was already very fast, better in cases where it was less fast).
Testing: Added extra unit tests.
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
Benchmark below was generated with both implementations being tested side-by-side. PR now removes the old 'big' implementation as there is no longer any call for it.
New benchmark (in PR) results, running as bazel run -c opt test/common/common:utility_trie_speed_test -- --benchmark_repetitions=10 --benchmark_out_format=csv --benchmark_out=somefile.csv: