Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split Inst enum into BytesInst and UnicodeInst enums #761

Closed
wants to merge 1 commit into from

Conversation

marmeladema
Copy link
Contributor

While trying to understand better the internals of the crate, I stumbled upon this comment:

/// Regrettably, a regex program either contains Unicode codepoint
/// instructions (Char and Ranges) or it contains byte instructions (Bytes).
/// A regex program can never contain both.
///
/// It would be worth investigating splitting this into two distinct types and
/// then figuring out how to make the matching engines polymorphic over those
/// types without sacrificing performance.
///
/// Other than the benefit of moving invariants into the type system, another
/// benefit is the decreased size. If we remove the `Char` and `Ranges`
/// instructions from the `Inst` enum, then its size shrinks from 40 bytes to
/// 24 bytes. (This is because of the removal of a `Vec` in the `Ranges`
/// variant.) Given that byte based machines are typically much bigger than
/// their Unicode analogues (because they can decode UTF-8 directly), this ends
/// up being a pretty significant savings.

This is what this PR proposes to implement by making Program generic over the instruction type: Program<I>.
All tests seem should pass but benchmarks have not been updated yet.

cc @BurntSushi, I hope this will be useful 馃憤

@marmeladema
Copy link
Contributor Author

Well, tests are passing with rust 1.51.0 but not with 1.28.0 because of lifetime inference issues. This should be fixable if the PR is worthwhile.

@BurntSushi
Copy link
Member

BurntSushi commented Apr 14, 2021

@marmeladema I apologize for leaving that comment and suggesting that I would be okay with a PR like this. :-( In the future, for bigger changes, I do strongly advise that you file an issue first.

I really appreciate the effort towards improving memory usage here. But I'm going to close this out. The primary reason is that #656 is in progress right now, and part of that effort involves a rewrite of the NFA compiler. This comes with a new representation for Inst. In that representation, currently, there is no "Unicode" variant. Everything is just bytes. But, there are also new variants that have Box<[...]> in them, which inflates their size anyway. (I say "currently" because I may add back the Unicode variant. It will be guided by code complexity concerns in addition to, primarily, performance concerns.)

While the existence of #656 doesn't mean we can't also accept improvements in the current code, I think this particular change is quite complex. I regret being so flippant with making everything polymorphic in my comment. In particular, while this undoubtedly improves memory usage, it's quite likely to also increase compile times and bloat binaries. So even if #656 weren't a factor here, there are still possible reasons not to do this I think.

Anyway, thanks again and I regret that you put so much effort into a PR that I'm not going to consider. Would definitely recommend filing an issue first. Or if you want to reach out on Zulip, that would be great too.

@BurntSushi BurntSushi closed this Apr 14, 2021
@marmeladema
Copy link
Contributor Author

@BurntSushi thank you for your detailed answer 馃憤 regex-automata looks very promising

It is true that I have spent quite some time on this, but the real goal was to understand better the architecture of the codebase and learn about the internals of the crate. I think it helped me a lot to do this even if not merged in the end, so all is not lost.

The actual reason of why I am digging into this in the first place is that I am maintaining a production system which uses a lot of regexes and it seems that they can take a substantial amount of memory.

I'll try to reach out to you on zulip to better explain my use cases if that's ok.

@BurntSushi
Copy link
Member

@marmeladema Sounds good!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants