"Compiled" static header maps instead of big trie #33932

Merged: 33 commits into envoyproxy:main from headers_compiled, May 23, 2024

Conversation


@ravenblackx commented May 2, 2024

Commit Message: "Compiled" static header maps instead of big trie
Additional Description: This is slightly slower for "misses" but significantly faster for "hits". Given that the majority of headers are expected to be matches for the static table, this should be a big win. (Each faster hit pays for 30 slower misses!) This structure also uses significantly less memory than the big trie.
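The core idea (dispatch on key length first, so most non-matching keys are eliminated before any characters are compared) can be sketched roughly as below. This is a hypothetical simplification for illustration, not the actual `CompiledStringMap` from this PR; in particular, the real structure replaces the linear scan with branches on the key positions that best distinguish the remaining candidates.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical simplification of the "compiled" map idea: bucket keys by
// length up front, so a lookup first jumps to the right length bucket,
// immediately eliminating every key of a different length.
template <class Value> class LengthDispatchMap {
public:
  void add(std::string key, Value value) {
    if (key.size() >= buckets_.size()) {
      buckets_.resize(key.size() + 1);
    }
    buckets_[key.size()].emplace_back(std::move(key), std::move(value));
  }

  // Returns the mapped value, or a default-constructed Value on a miss.
  Value find(std::string_view key) const {
    if (key.size() >= buckets_.size()) {
      return Value{};
    }
    // Stand-in for the per-length branch nodes of the real structure.
    for (const auto& [k, v] : buckets_[key.size()]) {
      if (k == key) {
        return v;
      }
    }
    return Value{};
  }

private:
  std::vector<std::vector<std::pair<std::string, Value>>> buckets_;
};
```

A miss on a length with no keys costs only a bounds check plus an empty-bucket scan, which is consistent with misses staying cheap while hits skip most of the candidate set.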

Performance benchmarks: `bazel run -c opt test/common/http:header_map_impl_speed_test -- --benchmark_filter="bm*" --benchmark_repetitions=25 --benchmark_display_aggregates_only`

| Benchmark | before | after | beforeStdDev | afterStdDev | change |
| --- | --- | --- | --- | --- | --- |
| bmHeaderMapImplRequestStaticLookupHits | 47.2ns | 16.4ns | 0.629 | 0.378 | -65.3% |
| bmHeaderMapImplResponseStaticLookupHits | 34.7ns | 14.3ns | 0.571 | 0.085 | -58.8% |
| bmHeaderMapImplRequestStaticLookupMisses | 6.89ns | 6.83ns | 0.044 | 0.034 | -0.01% |
| bmHeaderMapImplResponseStaticLookupMisses | 6.40ns | 7.31ns | 0.028 | 0.057 | +14.2% |

Risk Level: Could be high, since headers are parsed millions of times per second; but there are also a lot of existing test cases.
Testing: Added tests, existing tests involving headers will also provide coverage.
Docs Changes: n/a
Release Notes: Not yet
Platform Specific Features: n/a

Signed-off-by: Raven Black <ravenblack@dropbox.com>

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #33932 was opened by ravenblackx.


Signed-off-by: Raven Black <ravenblack@dropbox.com>
@ravenblackx marked this pull request as ready for review May 7, 2024 22:59
@ravenblackx
Contributor Author

/retest

Signed-off-by: Raven Black <ravenblack@dropbox.com>
@jmarantz
Contributor

/wait (for CI green)

@ravenblackx
Contributor Author

/retest

Contributor

@jmarantz left a comment

this looks amazing though I don't fully understand it yet.

It turns out I had some comments from a while back that I hadn't flushed, so combining those with what I found just now and flushing.

```cpp
 * @param key the key to look up.
 */
Value find(const absl::string_view& key) const {
  if (key.size() >= table_.size() || table_[key.size()] == nullptr) {
```
Contributor

can we avoid the duplicate lookup of `table_[key.size()]`? It's probably optimized away, but I am never certain.

I think for the compiler to optimize it away it would have to see inside `table_.size()` to ensure it could not mutate `key`. That seems obvious to humans, but I don't know how smart compilers are about that. If the compiler can't see inside the `table_.size()` implementation (i.e. it's not inlined), it can't know.

Contributor Author

I tried taking it to a local variable, but that has to be after key.size() being too large has already been checked, which makes it two if-branches, which from benchmarking appears to make both the hit paths ~5% slower.

It's not like these are actual table lookups, they're just offsets, so even if it wasn't optimized away it might still be faster than taking an additional copy.

Contributor

why does it have to be after the first check? Can't you have the first line of the function be the assignment to a temp?

Contributor Author

`table_[key.size()]` is an out-of-bounds crash if `key.size() >= table_.size()`.

Contributor

```cpp
const size_t key_size = key.size();
if (key_size >= table_.size() || table_[key_size] == nullptr) {
  ...
```

```cpp
using FindFn = std::function<Value(const absl::string_view&)>;

public:
  using KV = std::pair<std::string, Value>;
```
Contributor

Scanning through the code, I'm wondering if there's some string copying going on here that we might be able to avoid.

Can we have a string-store held in the class, and have the KV that we manipulate in various sub-vectors have either a string_view or a string&?

Contributor Author

I made the input a vector of string_views which eliminates the string copying during compile() except the copy taken for a leaf node (which is kept in case at some point the table wants to be initialized from a transient source rather than global singleton-constant-likes).

```cpp
 * for the string node.
 */
template <class Value> class CompiledStringMap {
  using FindFn = std::function<Value(const absl::string_view&)>;
```
Contributor

standard best practice is to pass string_view as value rather than as a ref. I think it's really close but this was the decision made by smart people some time ago :)

Contributor Author

Added a comment about references being faster for this use-case.

```cpp
std::vector<KV> next_contents;
next_contents.reserve(it - start);
std::copy(start, it, std::back_inserter(next_contents));
nodes[i - best.min] = createEqualLengthNode(next_contents);
```
Contributor

this recursive lambda generation looks really cool, but I'm wondering if it's the most efficient thing we can do. As a thought experiment, I'm wondering what the code would look like if you had to write this in C. I assume it would be possible (if a little more verbose). Would it be faster?

Contributor Author

If I were doing it in C it would be almost the same. I don't know how much overhead there is in std::function, but for the C structure of this I'd use something like a struct containing a function pointer and a union of the captured data that each of the node types takes, and that function pointer would take a pointer to that same structure, which seems like it's probably about the same as what a lambda is doing.

It might be possible to optimize further with more manual constructs, but I think a 60% speedup is probably good enough, and it could have another pass later if someone wants to try to do better.
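For illustration, the C-style layout described above (a function pointer plus a union of per-node-type data, with dispatch receiving the node itself) might look roughly like this. The names and node types here are hypothetical sketches, not code from the PR:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical C-style node: dispatch through a function pointer that
// receives the node, with per-node-type data in a union instead of a
// lambda capture.
struct Node;
using FindFn = const char* (*)(const Node*, const char* key, size_t len);

struct Node {
  FindFn find;
  union {
    struct {
      const char* key;   // the one remaining candidate key
      const char* value; // its mapped value
    } leaf;
    struct {
      const Node* const* branches; // indexed by (c - min)
      unsigned char index;         // which character position to branch on
      unsigned char min;           // lowest branch character
      unsigned char max;           // highest branch character
    } branch;
  } data;
};

// Leaf: one final memcmp against the surviving candidate. Assumes length
// dispatch has already guaranteed the lengths match.
const char* findLeaf(const Node* n, const char* key, size_t len) {
  return std::memcmp(n->data.leaf.key, key, len) == 0 ? n->data.leaf.value : nullptr;
}

// Branch: jump on one character position, then recurse through the child's
// own function pointer.
const char* findBranch(const Node* n, const char* key, size_t len) {
  unsigned char c = key[n->data.branch.index];
  if (c < n->data.branch.min || c > n->data.branch.max) {
    return nullptr;
  }
  const Node* next = n->data.branch.branches[c - n->data.branch.min];
  return next ? next->find(next, key, len) : nullptr;
}
```

This makes the comparison concrete: the function pointer plus union carries roughly the same state a lambda capture would, so the overhead difference comes down to std::function's type erasure versus a raw call through the pointer.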

```cpp
};
/**
 * Construct the lookup table. This is a somewhat slow multi-pass
 * operation - using this structure is not recommended unless the
```
Contributor

should this note be in the class comment?

For a moment I thought you were saying that you could use this class, but not call 'compile' and instead fall back to an interpret-mode, but I don't think that's what you are saying :)

Contributor Author

Moved.

```cpp
auto it = initial.begin();
for (size_t i = 0; i <= longest; i++) {
  auto start = it;
  while (it != initial.end() && it->first.size() == i) {
```
Contributor

I am guessing we don't consider compiling to be important to be fast. But this nested loop seems like it might be slower than needed.

How about this for an algo:

```cpp
absl::flat_hash_map<int, std::vector<std::reference_wrapper<const KV>>> size_to_values_map;
for (const KV& pair : initial) {
  size_to_values_map[pair.first.size()].push_back(pair);
}
```

I realize this creates a data structure which doesn't match the ones you are using, but it seems like a good way to get the elements organized by size. After that I'm not sure how this would fit into your algo.

I'm having a little trouble following the algorithm also. What causes it to re-examine the entries that don't match size 'i'?

Maybe some more comment would help, or breaking down your compile flow into a few helper methods that might be more self-documenting.

Contributor Author

It was already not really a nested loop, in that the two indexes were moving "in parallel" (i.e. it wasn't n*m, it was n+m), but I've changed it to make that a bit clearer. Though still not much clearer since the std::find_if still looks like it might be an m-length lookup. But it's not!
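The n+m (rather than n*m) behavior is easier to see in a standalone sketch: with the input sorted by key length, each `std::find_if` resumes exactly where the previous group ended, so no element is visited twice. `groupByLength` and the `KV` alias here are hypothetical stand-ins for the compile() internals:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using KV = std::pair<std::string, int>;

// Hypothetical sketch: group keys by length in one forward pass.
std::vector<std::vector<KV>> groupByLength(std::vector<KV> initial) {
  std::sort(initial.begin(), initial.end(),
            [](const KV& a, const KV& b) { return a.first.size() < b.first.size(); });
  std::vector<std::vector<KV>> groups;
  auto range_start = initial.begin();
  while (range_start != initial.end()) {
    // Find the first key whose length differs; everything in between
    // is one same-length group.
    auto range_end = std::find_if(range_start, initial.end(),
                                  [len = range_start->first.size()](const KV& e) {
                                    return e.first.size() != len;
                                  });
    groups.emplace_back(range_start, range_end);
    range_start = range_end; // Resume after the group; nothing is revisited.
  }
  return groups;
}
```

Although the `find_if` inside the loop looks like a nested scan, the two iterators only ever move forward over the input once.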

@jmarantz
Contributor

/wait

Signed-off-by: Raven Black <ravenblack@dropbox.com>
@ravenblackx
Contributor Author

/retest

Contributor

@jmarantz left a comment

I'm still reading the detail on the code, but I think @adisuissa should review also.


```cpp
// Find the first key whose length differs from the current key length.
// Everything in between is keys with the same length.
auto range_end =
    std::find_if(range_start, initial.end(), [l = range_start->first.size()](const KV& e) {
```
Contributor

I'm not a fan of 'l' as a variable name :)

can we have 'last'?

Contributor Author

len. :)

@adisuissa self-assigned this May 13, 2024
Signed-off-by: Raven Black <ravenblack@dropbox.com>
Contributor

@jmarantz left a comment

this is getting a lot more readable though I admit I still don't deeply understand the algorithm.

Left a few more superficial notes.

```cpp
 * (typically nullptr) if the key was not present.
 * @param key the key to look up.
 */
Value find(absl::string_view key) const {
```
Contributor

you found passing by const ref was faster in another case, but not here?

Contributor Author

Mostly left this one for the sake of leaving the external API unchanged. Feels more reasonable to buck the style guide for internal implementation than for API.

```cpp
struct IndexSplitInfo {
  uint8_t index, min, max, count;
} best{0, 0, 0, 0};
for (size_t i = 0; i < node_contents[0].first.size(); i++) {
```
Contributor

would this be faster?

```cpp
for (size_t i = 0, n = node_contents[0].first.size(); i < n; i++) {
```

Contributor Author

I don't think it would (it's just reading a value from an address), and we don't care about the microoptimization performance here anyway.

Contributor

to make this code more understandable, consider naming:

```cpp
const size_t each_key_length = node_contents[0].first.size(); // or something similar
```

consider renaming: `i` -> `key_char_idx`, `j` -> `key_idx`, `v` -> `ch`/`val`

```cpp
if (!hits[v]) {
  hits[v] = true;
  info.count++;
  info.min = std::min(v, info.min);
```
Contributor

I'm wondering if it would be faster to write this as:

```cpp
if (v < info.min) {
  info.min = v;
} else if (v > info.max) {
  info.max = v;
}
```

It feels to me like what you have is a little more readable, but may have an extra branch. And it will always be executing two writes into `info` when only 0 or 1 are needed. Not sure if the compiler would do that optimization.

Contributor Author

We shouldn't really care about microoptimizing the speed of any function that isn't find().

Contributor

sure. I was having trouble figuring out what's find() and what's not.

Contributor Author

Hopefully that's a lot clearer now that there's no lambdas. Now only find() is find(). :)

Signed-off-by: Raven Black <ravenblack@dropbox.com>
Contributor

@adisuissa left a comment

Thanks for working on this! Very interesting data-structure!

Comment on lines 39 to 48
```cpp
 * This structure first jumps to matching length, eliminating in this
 * example case apple and pineapple.
 * Then the "best split" node is on
 * `x-prefix-banana`
 *             ^
 * so the first node has 3 non-miss branches, n, b and r for that position.
 * Down that n branch, the "best split" is on
 * `x-prefix-banana`
 *               ^
 * which has two branches, n or k.
```
Contributor

IIUC the "best split" can be anywhere in the string. I think modifying the example to the following, will clarify how this should work.

```cpp
 * `x-prefix-banana`
 * `x-prefix-banara`
 * `x-prefix-apple`
 * `x-prefix-pineapple`
 * `x-prefix-barana`
 * `x-prefix-banaka`
```

then the best split would have been on the character before last.

Contributor Author

It can't be anywhere in the string, only at an index where there's a difference. The example is specifically showing that 3 branches would be "best" over 2 branches. I don't understand what your example clarifies that the existing example doesn't.

Contributor

My comment was looking at it from a "trie" perspective.
When I first read the comment, I thought that in each branch, the structure only keeps the "suffix" (after the branch point). I think that updating the example to a case where the branching happens due to a "best-split" that is not the first split, will reduce that confusion to future readers.

Contributor Author

Hopefully the md file explanation covers this sufficiently (though it doesn't do the second split earlier in the string).

If you'd like me to rearrange the strings in the new example so that it does the "last" index split first and the earlier one second I can do that, but I think the diagram having numbers in it probably makes it clear enough. (And late first would be just as potentially confusing as early first, and having three sequential splits would be getting so long as to be harder to read...)
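As a standalone illustration of the "best split" choice discussed above (a hypothetical helper, not the PR's actual code): among same-length keys, picking the character position with the most distinct values means a single branch eliminates as many candidates as possible.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: for equal-length, non-empty keys, return the index
// with the most distinct characters across the keys.
size_t bestSplitIndex(const std::vector<std::string>& keys) {
  size_t best_index = 0, best_count = 0;
  for (size_t i = 0; i < keys[0].size(); i++) {
    std::array<bool, 256> seen{};
    size_t count = 0;
    for (const std::string& key : keys) {
      unsigned char c = key[i];
      if (!seen[c]) {
        seen[c] = true;
        count++;
      }
    }
    if (count > best_count) {
      best_count = count;
      best_index = i;
    }
  }
  return best_index;
}
```

A split can only happen at an index where the keys actually differ; positions where every key shares the same character score 1 and are never chosen over a genuine branching point.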

Signed-off-by: Raven Black <ravenblack@dropbox.com>
@jmarantz
Contributor

wdyt of adding a diagram showing the algorithm in action, as an md file or image or whatever?

Signed-off-by: Raven Black <ravenblack@dropbox.com>
@ravenblackx
Contributor Author

> wdyt of adding a diagram showing the algorithm in action, as an md file or image or whatever?

Done.

@ravenblackx
Contributor Author

/retest

Contributor

@adisuissa left a comment

Thanks for adding the doc, it explains the data-structure very well!
The code LGTM. The only thing I suggest considering is whether the use of this data-structure should be behind a startup-flag (bootstrap option). I'll leave it to other maintainers to decide.

in the leaf nodes, which could potentially make it larger due to, e.g. containing
`x-envoy-` multiple times where a regular trie only has one 'branch' for that.
However, it also contains fewer nodes due to not having a node for every character,
and compared to the prior fixed-size-node 256-branch trie, each node is a lot smaller;
Contributor

just a minor comment (no AI):
I think the original trie can be converted to use the same "shrunk-range" (only the needed range of characters) technique, so in terms of memory I think this data-structure is more expensive.
That said, from the CPU perf point of view, I tend to agree that this data-structure should perform better, although it is not easy for me to say which of the changes in the data-structure (using per-length buckets, splitting not at the prefix vs. condensing a trie) provides the most benefit.

Contributor Author

Doing just shrunk-range was actually the original PR from which this one forked off as a prerequisite; it made performance a bit worse (more so than this version) for the miss case and made little difference for the hits (worse for short strings, better for long strings). I think this version is also still better memory-wise, because every pointer is 64 bits, and this trie has far fewer pointers.

Consider the case of even just one entry, x-envoy-banana - in this model it's the root length node and a leaf node, so it's a vector with 13 nullptrs and one pointer (because the length node doesn't do the minimizing thing), a length, and one leaf node containing a value and a string.

In a regular "shrunk" trie, this is 14 nodes each containing a default value (a pointer) and a pointer to the next node, and a vector of chars one character long and a length - so that's 48 pointers, which is significantly larger than the "short path" trie. Even if you cut out the common "x-envoy-" prefix entirely and assume that's shared among so many headers that we can amortize it entirely, it's still 18 pointers just for the 'banana' part of the sequence, which is about the same as the "compiled" trie for the whole thing (which also theoretically shares "common prefixes"!)

Splitting at a more optimal branch point is definitely the main performance enhancer for hits. (It's similar to what gperf does.)

Contributor

Yes, you are correct.
I keep forgetting that this is not a compressed-prefix-trie, but one that has per-character branching.

Contributor Author

Ah yeah, a compressed-prefix trie (assuming this is a name for "a branch is per-character but also matches the longest string common to all its children") would probably be nearly as fast as the "branch at best position" trie, and a similar size, maybe slightly smaller, since it would also only be a few hops to completion. But it would involve more string-compares.
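For comparison, the compressed-prefix (radix) lookup described above, where each node matches the longest string common to all its children, could be sketched like this. This is a hypothetical minimal version for illustration, not a proposed implementation:

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical radix-trie node: a shared prefix, per-character child edges,
// and an optional value at this node.
struct RadixNode {
  std::string prefix;                               // common to all children
  std::vector<std::pair<char, RadixNode>> children; // branch on next character
  const char* value = nullptr;                      // nullptr if no key ends here
};

// Compare the node's whole prefix at once (the "more string-compares" cost),
// then branch on a single character and recurse.
const char* radixFind(const RadixNode& node, std::string_view key) {
  if (key.size() < node.prefix.size() ||
      key.substr(0, node.prefix.size()) != node.prefix) {
    return nullptr;
  }
  key.remove_prefix(node.prefix.size());
  if (key.empty()) {
    return node.value;
  }
  for (const auto& [c, child] : node.children) {
    if (c == key[0]) {
      return radixFind(child, key.substr(1));
    }
  }
  return nullptr;
}
```

As the discussion notes, this also reaches a leaf in a few hops, but every hop pays a substring comparison rather than a single-character branch.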

@ravenblackx
Contributor Author

+alyssawilk for senior maintainer pass.

Contributor

@alyssawilk left a comment

LGTM. My largest request was going to be jmarantz review but thankfully you both already have that covered :-)

@jmarantz
Contributor

actually I'd love to take one more quick pass; just been slammed. Give me till EOD today?

I agree this is fantastic.

Contributor

@jmarantz left a comment

lgtm!

This is amazing work. I am looking forward to seeing how fast this is in production!

@alyssawilk merged commit c7ce725 into envoyproxy:main May 23, 2024
53 checks passed
@ravenblackx deleted the headers_compiled branch May 23, 2024 15:10
cancecen pushed a commit to cancecen/envoy that referenced this pull request May 23, 2024

Signed-off-by: Raven Black <ravenblack@dropbox.com>
Signed-off-by: Can Cecen <ccecen@netflix.com>