Split `Inst` enum into `BytesInst` and `UnicodeInst` enums #761

marmeladema · 2021-04-14T00:52:02Z

While trying to better understand the internals of the crate, I stumbled upon this comment:

/// Regrettably, a regex program either contains Unicode codepoint
/// instructions (Char and Ranges) or it contains byte instructions (Bytes).
/// A regex program can never contain both.
///
/// It would be worth investigating splitting this into two distinct types and
/// then figuring out how to make the matching engines polymorphic over those
/// types without sacrificing performance.
///
/// Other than the benefit of moving invariants into the type system, another
/// benefit is the decreased size. If we remove the `Char` and `Ranges`
/// instructions from the `Inst` enum, then its size shrinks from 40 bytes to
/// 24 bytes. (This is because of the removal of a `Vec` in the `Ranges`
/// variant.) Given that byte based machines are typically much bigger than
/// their Unicode analogues (because they can decode UTF-8 directly), this ends
/// up being a pretty significant savings.

This PR proposes to implement that by making `Program` generic over the instruction type: `Program<I>`.
All tests seem to pass, but the benchmarks have not been updated yet.
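
For readers following along, here is a minimal sketch of the idea, not the PR's actual code. The variant fields, the `Instruction` trait, and `match_count` are assumptions made up for illustration; only the names `Inst`, `BytesInst`, `UnicodeInst`, and `Program<I>` come from the discussion. It shows how a generic `Program<I>` can hold only one kind of instruction by construction, and how dropping the `Vec`-carrying variant shrinks the byte-only enum. Exact sizes depend on the target and compiler, so the sketch prints them rather than assuming the 40 and 24 bytes quoted above.

```rust
use std::mem::size_of;

// Hypothetical, simplified stand-in for the combined enum.
#[allow(dead_code)]
enum Inst {
    Match,
    Bytes { goto: usize, start: u8, end: u8 },
    Char { goto: usize, c: char },
    // The `Vec` makes this the largest variant.
    Ranges { goto: usize, ranges: Vec<(char, char)> },
}

// Byte-only instructions (assumed shape).
#[allow(dead_code)]
enum BytesInst {
    Match,
    Bytes { goto: usize, start: u8, end: u8 },
}

// Codepoint-only instructions (assumed shape).
#[allow(dead_code)]
enum UnicodeInst {
    Match,
    Char { goto: usize, c: char },
    Ranges { goto: usize, ranges: Vec<(char, char)> },
}

// Hypothetical trait: whatever a matching engine needs from an instruction.
trait Instruction {
    fn is_match(&self) -> bool;
}

impl Instruction for BytesInst {
    fn is_match(&self) -> bool {
        matches!(self, BytesInst::Match)
    }
}

impl Instruction for UnicodeInst {
    fn is_match(&self) -> bool {
        matches!(self, UnicodeInst::Match)
    }
}

// A program is generic over its instruction type, so it cannot mix the two.
struct Program<I> {
    insts: Vec<I>,
}

impl<I: Instruction> Program<I> {
    // Counts the match instructions; stands in for real engine logic.
    fn match_count(&self) -> usize {
        self.insts.iter().filter(|i| i.is_match()).count()
    }
}

fn main() {
    let bytes = Program {
        insts: vec![
            BytesInst::Bytes { goto: 1, start: b'a', end: b'z' },
            BytesInst::Match,
        ],
    };
    println!("match instructions: {}", bytes.match_count());
    println!("Inst:        {} bytes", size_of::<Inst>());
    println!("BytesInst:   {} bytes", size_of::<BytesInst>());
    println!("UnicodeInst: {} bytes", size_of::<UnicodeInst>());
}
```

An enum is at least as large as its largest variant, which is why the byte-only enum comes out smaller once `Ranges` is gone.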

cc @BurntSushi, I hope this will be useful 👍

marmeladema · 2021-04-14T01:02:57Z

Well, tests are passing with Rust 1.51.0 but not with 1.28.0, because of lifetime inference issues. This should be fixable if the PR is worthwhile.

BurntSushi · 2021-04-14T11:58:43Z

@marmeladema I apologize for leaving that comment and suggesting that I would be okay with a PR like this. :-( In the future, for bigger changes, I do strongly advise that you file an issue first.

I really appreciate the effort towards improving memory usage here. But I'm going to close this out. The primary reason is that #656 is in progress right now, and part of that effort involves a rewrite of the NFA compiler. This comes with a new representation for Inst. In that representation, currently, there is no "Unicode" variant. Everything is just bytes. But, there are also new variants that have Box<[...]> in them, which inflates their size anyway. (I say "currently" because I may add back the Unicode variant. It will be guided by code complexity concerns in addition to, primarily, performance concerns.)
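
As a rough illustration of the size point, assuming a made-up variant shape (the new NFA's real instruction type is not shown in this thread): a `Box<[T]>` is a pointer plus a length, so any enum carrying one is at least two machine words wide, whatever the other variants look like.

```rust
use std::mem::size_of;

// Hypothetical instruction type; `Sparse` and its field are illustrative only.
#[allow(dead_code)]
enum SparseInst {
    Match,
    // A boxed slice of (start, end, goto) transitions.
    Sparse { ranges: Box<[(u8, u8, usize)]> },
}

fn main() {
    // A boxed slice is a fat pointer: data pointer plus element count.
    println!("Box<[T]>:   {} bytes", size_of::<Box<[(u8, u8, usize)]>>());
    println!("SparseInst: {} bytes", size_of::<SparseInst>());
}
```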

While the existence of #656 doesn't mean we can't also accept improvements in the current code, I think this particular change is quite complex. I regret being so flippant with making everything polymorphic in my comment. In particular, while this undoubtedly improves memory usage, it's quite likely to also increase compile times and bloat binaries. So even if #656 weren't a factor here, there are still possible reasons not to do this I think.
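
The compile-time and binary-size concern can be sketched as follows. All names here (`Step`, `ByteStep`, `CharStep`, and the two functions) are hypothetical and unrelated to the crate's code. A generic function is typically compiled once per concrete type it is used with, while a trait-object version is compiled once and pays for a dynamic call instead.

```rust
// Hypothetical trait standing in for "something a matching engine steps over".
trait Step {
    fn width(&self) -> usize;
}

struct ByteStep;
struct CharStep(char);

impl Step for ByteStep {
    fn width(&self) -> usize {
        1
    }
}

impl Step for CharStep {
    fn width(&self) -> usize {
        self.0.len_utf8()
    }
}

// Monomorphized: the compiler typically emits one copy per `S` in use,
// which is where the extra compile time and code size come from.
fn total_width_generic<S: Step>(steps: &[S]) -> usize {
    steps.iter().map(|s| s.width()).sum()
}

// Dynamically dispatched: a single copy, with a vtable call per step.
fn total_width_dyn(steps: &[&dyn Step]) -> usize {
    steps.iter().map(|s| s.width()).sum()
}

fn main() {
    let bytes = [ByteStep, ByteStep];
    let chars = [CharStep('a'), CharStep('é')];
    // Two distinct instantiations of the same generic function.
    println!("{}", total_width_generic(&bytes));
    println!("{}", total_width_generic(&chars));

    let b = ByteStep;
    let c = CharStep('é');
    let mixed: [&dyn Step; 2] = [&b, &c];
    println!("{}", total_width_dyn(&mixed));
}
```

In a regex engine the generic code would be the matching loops themselves, so every engine made polymorphic over the instruction type would be duplicated in the same way.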

Anyway, thanks again and I regret that you put so much effort into a PR that I'm not going to consider. Would definitely recommend filing an issue first. Or if you want to reach out on Zulip, that would be great too.

marmeladema · 2021-04-14T12:51:54Z

@BurntSushi thank you for your detailed answer 👍 regex-automata looks very promising

It is true that I have spent quite some time on this, but the real goal was to better understand the architecture of the codebase and learn about the internals of the crate. I think doing this helped me a lot even if it is not merged in the end, so all is not lost.

The actual reason I am digging into this in the first place is that I maintain a production system that uses a lot of regexes, and they seem to take a substantial amount of memory.

I'll try to reach out to you on Zulip to better explain my use cases, if that's OK.

BurntSushi · 2021-04-14T13:04:07Z

@marmeladema Sounds good!

marmeladema force-pushed the split-inst branch from 2102a6c to b45cfc4 on April 14, 2021 00:57

Split Inst enum into BytesInst and UnicodeInst enums

a095a00

marmeladema force-pushed the split-inst branch from b45cfc4 to a095a00 on April 14, 2021 00:59

BurntSushi closed this Apr 14, 2021
