Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature proposal: find lexicographic bounds for an FSM #316

Open
Udzu opened this issue Feb 16, 2021 · 1 comment
Open

Feature proposal: find lexicographic bounds for an FSM #316

Udzu opened this issue Feb 16, 2021 · 1 comment

Comments

@Udzu
Copy link

Udzu commented Feb 16, 2021

Some data storage systems store records in name order, making it efficient to retrieve records with names in a given range. When filtering such records by regular expression, it might therefore be useful to have a way to generate a range in which all the results are guaranteed lie, to use as a pre-filter.

For example, any match of the regular expression (moomin|troll)+ must lie in the range ("moomin", "trolltrolm"). Note that the last character of the upper bound has been 'rounded up' to ensure that every match, no matter how long, lies before it.

This should be fairly straightforward to implement for a DFA, and only slightly more complex for a NFA (assuming it's been trimmed of any states that do not lead to a match). The interface will need to specify the maximum length of the bounds. The ordering can be lexicographic order on the byte encodings, which in the case of UTF8 also conveniently matches the ordering on Unicode code points. Still, there are a couple of potential UTF8 niggles:

  1. the maximum length might lie inside a multi-byte character
  2. rounding a truncated upper bound to the next byte might result in an invalid byte sequence

The easiest approach is probably to remain encoding-agnostic and simply ignore this, returning byte sequence bounds and letting the client library deal with converting those to valid strings.

@katef
Copy link
Owner

katef commented Mar 4, 2021

This is really interesting, I'm not familiar with the theory here. Can you explain how you find the string trolltrolm from the minimal DFA for your example please?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants