Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: fix complete literal optimization issue #1000

Merged
merged 1 commit into from May 25, 2023
Merged

bug: fix complete literal optimization issue #1000

merged 1 commit into from May 25, 2023

Conversation

BurntSushi
Copy link
Member

This bug results in some regexes reporting matches at every position even when it should't. The bug happens because the internal literal optimizer winds up using an "empty" searcher that reports a match at every position. This is technically correct whenever the literal searcher is used as a prefilter (although slower than necessary), but an optimization added later enabled the searcher to run on its own and not as a prefilter. i.e., Without the confirm step by the regex engine. In that context, the "empty" searcher is totally incorrect.

So this bug corresponds to a code path where the "empty" literal searcher is used, but is also in a case where the literal searcher is used directly to find matches and not as a prefilter. I believe at least the following are required to trigger this path:

  • The literals extracted need to be "complete." That is, the language described by the regex is small and finite.
  • There needs to be at least 26 distinct starting bytes among all of the elements in the language described by the regex.
  • There needs to be fewer than 26 distinct ending bytes among all of the elements in the language described by the regex.
  • Possibly other criteria...

The actual fix is to change the code that selects the literal searcher. Indeed, there was even a comment in the erroneous code saying that the path was impossible, but of course, it isn't. We change that path to return None, as it should have long ago. This in turn results in the case outlined above not using a literal searcher and just the regex engine.

Fixes #999

This bug results in some regexes reporting matches at every position
even when it should't. The bug happens because the internal literal
optimizer winds up using an "empty" searcher that reports a match at
every position. This is technically correct whenever the literal
searcher is used as a prefilter (although slower than necessary), but an
optimization added later enabled the searcher to run on its own and not
as a prefilter. i.e., Without the confirm step by the regex engine. In
that context, the "empty" searcher is totally incorrect.

So this bug corresponds to a code path where the "empty" literal
searcher is used, but is also in a case where the literal searcher is
used directly to find matches and not as a prefilter. I believe at
least the following are required to trigger this path:

* The literals extracted need to be "complete." That is, the language
described by the regex is small and finite.
* There needs to be at least 26 distinct starting bytes among all of
the elements in the language described by the regex.
* There needs to be fewer than 26 distinct ending bytes among all of
the elements in the language described by the regex.
* Possibly other criteria...

The actual fix is to change the code that selects the literal searcher.
Indeed, there was even a comment in the erroneous code saying that the
path was impossible, but of course, it isn't. We change that path to
return None, as it should have long ago. This in turn results in the
case outlined above not using a literal searcher and just the regex
engine.

Fixes #999
@BurntSushi BurntSushi merged commit 40d883d into master May 25, 2023
10 checks passed
@BurntSushi BurntSushi deleted the ag/fix-999 branch May 25, 2023 17:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

regex with many literal alternates can erroneously match the empty string
1 participant