Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escape function returns regex that is incompatible with 'u' flag #197

Closed
raydog opened this issue Jun 28, 2017 · 9 comments · Fixed by #321
Closed

Escape function returns regex that is incompatible with 'u' flag #197

raydog opened this issue Jun 28, 2017 · 9 comments · Fixed by #321

Comments

@raydog
Copy link

raydog commented Jun 28, 2017

The XRegExp.escape() method documentation currently says:

The result can safely be used at any point within a regex that uses any flags.

However, the escape method emits a bad pattern for unicode regexps. This is because whitespace is currently being escaped (Ie: " " becomes "\ ") and unicode regexps don't support that. This currently fails in Node 6 / 8, and AFAICT (since the documentation is fairly dense...) ES2015 Annex B.1.4 seems to say that the regex 'u' flag should disable the \{any character} -> [{any character}] behaviors that this whitespace escaping thing was depending on.

So this code:

var XRegExp = require('XRegExp');
var re = new XRegExp(XRegExp.escape(" "), "u");

Will throw this error:

/Users/ray/patt/node_modules/XRegExp/xregexp-all.js:3377
        new RegExp(generated.pattern, generated.flags),
        ^

SyntaxError: Invalid regular expression: /\ /: Invalid escape
    at RegExp (<anonymous>)
    at XRegExp (.../node_modules/XRegExp/xregexp-all.js:3377:9)

So either the documentation should be updated, or .escape() should stop escaping whitespace. (Maybe only escape it / stop escaping it with a flag?)

@mathiasbynens
Copy link
Collaborator

Annex B.1.4 does not apply to Unicode RegExps per https://tc39.github.io/ecma262/#sec-regular-expressions-patterns:

However, none of these extensions change the syntax of Unicode patterns recognized when parsing with the [U] parameter present on the goal symbol.

AFAICT, the relevant spec bit is here: https://tc39.github.io/ecma262/#sec-forbidden-extensions

The RegExp pattern grammars in 21.2.1 and B.1.4 must not be extended to recognize any of the source characters A-Z or a-z as IdentityEscape[U] when the U grammar parameter is present.

Where is /\ /u disallowed?

@raydog
Copy link
Author

raydog commented Jun 28, 2017

I'm probably wrong about the section then, since that's just the section that was referenced in the commit where I thought that functionality was removed. (After a closer look, I think it was just a refactor that touched the line.)

But V8 does very specifically disallow the /\ /u thing:

https://chromium.googlesource.com/v8/v8.git/+/e918c4ec464456a374098049ca22eac2107f6223

specifically, the line:

// With /u, no identity escapes except for syntax characters
// are allowed. Otherwise, all identity escapes are allowed.
if (unicode()) {
  return ReportError(CStrVector("Invalid escape"));
}

@raydog
Copy link
Author

raydog commented Jun 28, 2017

Maybe this is disallowed in 21.2.1 by https://tc39.github.io/ecma262/#prod-IdentityEscape ? AFAICT, the bit that would have allowed \{any source-compatible character} is tagged as [~U], so not available in unicode regexps.

@mathiasbynens
Copy link
Collaborator

mathiasbynens commented Jun 29, 2017

You’re absolutely right — https://tc39.github.io/ecma262/#prod-`IdentityEscape` disallows it. (I somehow misread SyntaxCharacter as SourceCharacter before, which makes no sense.)

BTW the change that introduced this behavior happened back when the spec was still a Word document, so there’s no matching commit 🤦‍♂️ Here’s some background: https://bugs.ecmascript.org/show_bug.cgi?id=3157#c9

@slevithan
Copy link
Owner

slevithan commented Jan 2, 2018

@raydog What change would you propose here? I'm not sure how to best address this, since not escaping whitespace would mean the resulting string can't be safely embedded in an XRegExp regex that uses flag x.

(XRegExp.escape was created years before /u was introduced.)

@raydog
Copy link
Author

raydog commented Jan 2, 2018

@slevithan Perhaps, similar to some of the other functions in the library, this function can accept a second flags parameter, and then the function would escape assuming a regexp in that syntax? The "no-argument" case can default to the existing "x" behavior.

Alternatively, a second method can be added for escaping unicode regexps... I think I like the first option more tho.

@jameschensmith
Copy link

Just an update on this ticket. This is still failing on the current Node.js LTS (14). I ran into this while converting a Perl script (which used quotemeta) to JS. Are there any plans to address this, either in this library or in ES?

@slevithan
Copy link
Owner

@jameschensmith This is now fixed in e22a52b and will be in the next release (v5.0.0 -- no timeline yet). The fix should work in all cases (including with flags x and u) and does not require providing an additional argument to XRegExp.escape.

josephfrazier added a commit that referenced this issue Feb 7, 2021
Changes include:

* BREAKING: Handle ES2018 capture names: #247
* BREAKING: Enable `namespacing` feature by default: #316
* BREAKING: Remove Unicode Blocks addon: 4860122
* restore perf tweak that made a meaningful difference in regex construction perf tests: 5f18261
* XRegExp.exec: preserve groups obj that comes from native ES2018 named capture: c4a83e7
* Make XRegExp.exec set groups prop to undefined if there are no named captures (closes #320): 7fea476
* Support optional 'Script=' prefix (from ES2018 syntax) for Unicode script tokens (#225): bb35ead
* XRegExp.matchRecursive: Add delimiter and pos info when unbalanced delimiters are found (closes #293): 9660b90
* XRegExp.escape: Escape whitespace in a way that works with ES6 flag u (fixes #197): e22a52b

To generate this commit, I adapted the steps at #205 (comment)

Here's a fuller list of changes that can be needed with new releases:

> * Version number
>   * Update version number and year in headers, config files, README.
>   * Update version number in `XRegExp.version`.
> * Publish
>   * Publish new git tag. E.g.:
>     * `git tag -a v3.1.0 -m "Release 3.1.0"`.
>     * `git push origin v3.1.0`.
>   * `npm publish`.
josephfrazier added a commit that referenced this issue Feb 8, 2021
Changes include:

* BREAKING: Handle ES2018 capture names: #247
* BREAKING: Enable `namespacing` feature by default: #316
* BREAKING: Remove Unicode Blocks addon: 4860122
* restore perf tweak that made a meaningful difference in regex construction perf tests: 5f18261
* XRegExp.exec: preserve groups obj that comes from native ES2018 named capture: c4a83e7
* Make XRegExp.exec set groups prop to undefined if there are no named captures (closes #320): 7fea476
* Support optional 'Script=' prefix (from ES2018 syntax) for Unicode script tokens (#225): bb35ead
* XRegExp.matchRecursive: Add delimiter and pos info when unbalanced delimiters are found (closes #293): 9660b90
* XRegExp.escape: Escape whitespace in a way that works with ES6 flag u (fixes #197): e22a52b

To generate this commit, I adapted the steps at #205 (comment)

Here's a fuller list of changes that can be needed with new releases:

> * Version number
>   * Update version number and year in headers, config files, README.
>   * Update version number in `XRegExp.version`.
> * Publish
>   * Publish new git tag. E.g.:
>     * `git tag -a v3.1.0 -m "Release 3.1.0"`.
>     * `git push origin v3.1.0`.
>   * `npm publish`.
@josephfrazier
Copy link
Collaborator

@raydog and @jameschensmith, I just released v5.0.0, so you should be able to try it and confirm things are now working.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants