Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xxx regex cache #1770

Draft
wants to merge 11 commits into
base: master
Choose a base branch
from
Draft

xxx regex cache #1770

wants to merge 11 commits into from

Conversation

melvinw
Copy link
Collaborator

@melvinw melvinw commented Dec 19, 2023

not ready for review. needs some cleanup

@melvinw
Copy link
Collaborator Author

melvinw commented Dec 19, 2023

let's see if this blows anything up.

i want to compare this to std::list, too

@melvinw melvinw force-pushed the regcomp-cache branch 8 times, most recently from bd8e1de to 5698121 Compare December 21, 2023 03:51
Copy link
Contributor

@andychu andychu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks reasonable, I would like to see the before/after numbers on the 3 workloads we have (parse_help, synthetic with #regexes > cache size, and #regexes < cache size)

And I think we should have unit tests in libc_test.cc, which means moving the declaration into the header

I made a comment about the data layout ... reducing indirection by making CacheEntry a value.

That does make shifting more expensive, though now I kinda wonder about make_heap and operator >

The struct size on 64-bit machines would be 8+8+4 = 20 * (N=100) = 2000 bytes, which doesn't seem horrible

cpp/libc.cc Outdated
}

size_t capacity_;
std::vector<CacheEntry*> access_list_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it can be vector<CacheEntry>

And then the CacheEntry contains a pointer to string, pointer to regex_t, and int hash

Then I feel like there's more locality when doing the find_if.

But then it costs more when we do erase() I suppose, because instead of a pointer, it's a struct with 2 pointers and an int.

cpp/libc.cc Outdated
CacheEntry* TakeEntry(BigStr* pat) {
auto it = std::find_if(access_list_.begin(), access_list_.end(),
[pat](CacheEntry* entry) {
return hash(pat) == entry->pat_hash_ &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to pull out hash(pat) from the loop

Even though it doesn't hash every time, it still checks if it's hashed every time

@andychu
Copy link
Contributor

andychu commented Dec 22, 2023

Overall I think this is fine, I think the main reason to look at any more optimizations (Dict<K, V> or make_heap()) would be if the (# regex > cache size) case exhibits a big slowdown, like 50% slower than not having a cache at all

@melvinw melvinw changed the title xxx xxx regex cache Apr 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants