Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move named backreferences to a groups object #175

Closed
slevithan opened this issue Apr 17, 2017 · 12 comments
Closed

Move named backreferences to a groups object #175

slevithan opened this issue Apr 17, 2017 · 12 comments

Comments

@slevithan
Copy link
Owner

Right now, looping over named backreferences after executing a regex with named groups is unnecessarily difficult, since XRegExp augments match arrays with the named backreferences as direct properties. It would be nice to give access to a clean object with named groups.

Proposal:

  • Add a new option "namespacing" to XRegExp.install/uninstall/isInstalled.
  • This option would be off by default in the initial version when it is introduced. In the subsequent major version (to enable a breaking change), the "namespacing" option will be on by default, and the ability to turn it off might be removed.
  • When enabled, named backreferences are no longer added as direct properties of match arrays. Rather, they are added to a groups object on result arrays.
    • For String.prototype.replace replacement functions (where named backreferences are added as properties of the first String argument), maybe leave things as they are (no groups object). If https://github.com/tc39/proposal-regexp-named-groups moves beyond the proposal stage, then XRegExp should follow wherever ECMAScript is headed.
  • When enabled, length and __proto__ should no longer be reserved words for named captures. Using __proto__ as a capture name would still lead to undesired results, so don’t do that.

As a workaround for now, here's how you can get a clean object with named groups:

function groups(regex, match) {
    var o = {};
    var x = regex.xregexp;
    (x && x.captureNames || []).forEach(function(name) {
        if (name) {
            o[name] = match[name];
        }
    });
    return o;
}

// Use it
var regex = XRegExp('(?s)(?<char>.)');
var match = XRegExp.exec('test', regex);
groups(regex, match);
// -> {char: 't'}

Moved here from #45 (comment) since that was a long thread.

@slevithan slevithan changed the title Moved named backreferences on result to groups object Move named backreferences to a groups object Apr 26, 2017
@dploeger
Copy link

dploeger commented Dec 13, 2017

I like that, as this would make the type definitions for things like Typescript much better. I currently have to use an "any-cast", which is very bad practice in Typescript:

let pattern = XRegExp('a(?<groupName>.)c')
let match = XRegExp.exec('abc', pattern)
console.log((match as any).groupName)

(because "groupName" is not a defined property in the "RegExpExecArray" type, used by the type definitions for XRegExp over at definitelyTyped) and extending it with dynamic properties is nearly impossible or inducing any-types, again)

@josephfrazier
Copy link
Collaborator

josephfrazier commented Jan 11, 2018

  • When enabled, length and __proto__ should no longer be reserved words for named captures. Using __proto__ as a capture name would still lead to undesired results, so don’t do that.

What's the reasoning for allowing __proto__, if it would still lead to undesired results? I haven't thought too deeply about this issue yet, so I might be missing something, but it seems a little strange on its face.

@dploeger
Copy link

Coming to think about it: wouldn't it be better to redo the whole stuff and not return an augmented array, but an object with the keys:

  • matched: boolean, wether there were matches at all
  • matches: the real, unmodified matches array
  • groups: an object with the capture group names as keys and their matches as values

As for proto: yes, the methods should perhaps disallow certain keywords and throw.

@slevithan
Copy link
Owner Author

slevithan commented Jan 12, 2018

What's the reasoning for allowing __proto__, if it would still lead to undesired results? I haven't thought too deeply about this issue yet, so I might be missing something, but it seems a little strange on its face.

The intention was to use this as an opportunity to simplify the rules of XRegExp's regex syntax. Fewer syntax rules makes it easier to describe, learn, and emulate in other tools. length is a reasonable word to use for a group name, so unintended name collisions are entirely plausible. But no one is going to name a capturing group __proto__ without knowing what they're doing, so this isn't a real world problem. The reason I made __proto__ an error at the regex syntax level was because length was already blocked, and if you already have one banned word there is less complexity added by expanding it to more words.

That there would be a "valid" pattern that you shouldn't use (at least in JS-land) because of undesired results has some precedent. E.g. patterns that can lead to catastrophic backtracking are not blocked at the syntax level by any regex engine.

That said, I don't feel strongly about this, and I'd be fine with it if __proto__ remained blocked when groups is introduced.

Coming to think about it: wouldn't it be better to redo the whole stuff and not return an augmented array, but an object with the keys: [matched, matches, groups]

That seems reasonable. An alternative name for matches that avoids being a bit misleading (since the match object represents a single regex match) is backrefs. And you could offer more convenient access to backrefs[0] via a value string property. Additional existing properties you'd want to include are index and input. These words weren't previously blocked at the syntax level because if you use them as group names, the backreference values will overwrite the values that would otherwise be stored, which would most likely be what you want. A possible additional key is regexp, which could be a reference to the regex object used to generate the match object.

The downside of this (in addition to the breaking change itself) would be divergence from the standard JS RegExp match object spec, making XRegExp a little harder to learn and adjust to.

@dploeger
Copy link

The downside of this (in addition to the breaking change itself) would be divergence from the standard JS RegExp match object spec, making XRegExp a little harder to learn and adjust to.

I actually found it a bit harder to learn, because of the augmented objects. I actually expected an "XRegExp"-object because of the broader functionality. 😁

@josephfrazier
Copy link
Collaborator

josephfrazier commented Jan 14, 2018

The downside of this (in addition to the breaking change itself) would be divergence from the standard JS RegExp match object spec, making XRegExp a little harder to learn and adjust to.

I'm still not too informed/opinionated on this issue generally, but it looks like https://github.com/tc39/proposal-regexp-named-groups is stage 3 now, so I'm definitely in favor of matching that as closely as possible, as described in the first comment in this thread:

For String.prototype.replace replacement functions (where named backreferences are added as properties of the first String argument), maybe leave things as they are (no groups object). If tc39/proposal-regexp-named-groups moves beyond the proposal stage, then XRegExp should follow wherever ECMAScript is headed.

@dploeger
Copy link

I'm still not too informed/opinionated on this issue generally, but it looks like https://github.com/tc39/proposal-regexp-named-groups is stage 3 now, so I'm definitely in favor of matching that as closely as possible, as described in the first comment in this thread:

Yes, definitely! That is the proper way to go.

@josephfrazier
Copy link
Collaborator

Great, I've started working on an implementation, and will likely have a WIP PR open later today.

What's the reasoning for allowing __proto__, if it would still lead to undesired results?

As it turns out, I was mistaken about this causing undesired results, since the groups object is supposed to have a null prototype, so using __proto__ should be just fine. See here for details: tc39/proposal-regexp-named-groups#12 (comment)

@slevithan
Copy link
Owner Author

Implemented by @josephfrazier in #210, now landed.

@yonguelink
Copy link

This should probably be documented somewhere, I had to dig quite a bit to find this issue and understand what was going on in the PR. I'm glad this was implemented tho!

@josephfrazier
Copy link
Collaborator

FWIW, there's a little discussion about the website update process in #205 (comment)

@slevithan
Copy link
Owner Author

The namespacing option is now on by default as of #316.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants