Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suboptimal output for some common cases #24

Open
RReverser opened this issue Nov 21, 2018 · 4 comments
Open

Suboptimal output for some common cases #24

RReverser opened this issue Nov 21, 2018 · 4 comments

Comments

@RReverser
Copy link

I know that currently regexpu-core operates on low-level range set representation, but thought I'd write down my thoughts on some output optimisations.

It looks like output could be better optimised for no-op transforms and other common cases. Some examples from README:

  1. No-op transform:
rewritePattern('.');
// → '[\\0-\\t\\x0B\\f\\x0E-\\u2027\\u202A-\\uFFFF]'

Given the absense of flags and corresponding transforms, this could be better preserved as just ..

  1. Dot-all transform:
rewritePattern('.', 's', {
  'dotAllFlag': true
});
// → '[\\0-\\uFFFF]'

Doesn't save as much as previous example, but could output [^] instead. This would also allow easily transforming just s while preserving the u flag if one wants that.

rewritePattern('.', 'su', {
  'dotAllFlag': true
});
// → '(?:[\\0-\\uD7FF\\uE000-\\uFFFF]|[\\uD800-\\uDBFF][\\uDC00-\\uDFFF]|[\\uD800-\\uDBFF](?![\\uDC00-\\uDFFF])|(?:[^\\uD800-\\uDBFF]|^)[\\uDC00-\\uDFFF])'

I think this could be optimised further as well, but, while trying to come up with a shorter version, found what looks like a bug in current output - #23 - so I guess it's better to fix that one first.

@mathiasbynens
Copy link
Owner

Yeah, regexpu doesn’t have an optimizer pass. For common regular expressions, where things like . are combined with other patterns that expand into character classes, I’m not sure if the savings are worth it post-gzip. Still, I’d certainly consider PRs that add an optimizer pass.

@mathiasbynens
Copy link
Owner

Note that [^] doesn’t work in IE < 9, which was originally supported by regexpu-core and regenerate. We could use something like [\s\S] instead.

@nicolo-ribaudo
Copy link
Collaborator

nicolo-ribaudo commented Sep 11, 2019

I'm trying to optimize the /./u case, while keeping the exactly the current behavior. I have come up with a new regexp, but I would need someone to confirm that it is correct:

Old: (?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])
New: (?:(?![\uD800-\uDFFF]).|(?:.|^)[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF]))

Let's consider one by one the different top-level alternatives separated by |. For simplicity, we first expand any non-top-level | (i.e. the last one), and split them accross different lines:

(?:  [\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]
|    [\uD800-\uDBFF][\uDC00-\uDFFF]
|    [\uD800-\uDBFF](?![\uDC00-\uDFFF])
|    [^\uD800-\uDBFF][\uDC00-\uDFFF]
|    ^[\uDC00-\uDFFF]
)
  • (1st) It is exactly what it matched by . ([\0-\t\x0B\f\x0E-\u2027\u202A-\uFFFF]) without any flags, except for \uD800-\uDFFF. We can rewrite it in a shorter form by matching ". but not if the prev thing is followed by that excluded range":
    Result (1st): (?![\uD800-\uDFFF]).
    In subsequent alternatives, we can only match characters in the \uD800-\uDFFF range, because otherwise they would have been matched by the first alternative.
  • (2nd and 4th) Let's write them together: [\uD800-\uDBFF][\uDC00-\uDFFF]|[^\uD800-\uDBFF][\uDC00-\uDFFF].
    Since the second char is the same, we can collect it: (?:[\uD800-\uDBFF]|[^\uD800-\uDBFF])[\uDC00-\uDFFF]. The part inside the non-capturing group matches everything (because it has both the range and its negation), but keep in mind that it only mathes everything in the \uD800-\uDFFF range, because of the first part of the whole regexp.
    This range is a subset of ., so it can be safely replaced.
    Result: .[\uDC00-\uDFFF]
  • (5th) The last result is really similar to the 5th: we can easily merge them to
    Result (2nd, 4th, 5th): (?:.|^)[\uDC00-\uDFFF]
  • I couldn't find any shorter form of the 3rd

By putting those minimized forms together, we get:

(?:  (?![\uD800-\uDFFF]).
|    (?:.|^)[\uDC00-\uDFFF]
|    [\uD800-\uDBFF](?![\uDC00-\uDFFF])
)

/./us can be optimized similarly, except for using [\s\S] instead of ..

Is it correct?

@mathiasbynens
Copy link
Owner

@nicolo-ribaudo The old and new versions behave differently for inputs like '\n\uDC00' (newline followed by any lone surrogate).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants