Named capture group syntax: (?<name>exp) #955

Closed
01mf02 opened this issue Feb 8, 2023 · 20 comments

@01mf02
Contributor

01mf02 commented Feb 8, 2023

Would it be possible to support in addition to the existing syntax (?P<name>exp) for named capture groups also the syntax (?<name>exp)?

My use case for this is that I am currently writing a jq clone called jaq. Recently, I have added support for regular expressions to jaq using the regex crate, which works very well. However, because jq supports only the (?<name>exp) syntax (because of the oniguruma library) and jaq only the (?P<name>exp) syntax (because of the regex crate), it is currently impossible to write regexes with named capture groups that are valid in both jaq and jq.

Apart from this, the (?<name>exp) syntax seems reasonably popular, so apart from my special use case, it might make sense to add support for this syntax. :)

@BurntSushi
Member

To get this out of the way: this would be a backwards compatible change because currently all forms of (?<name>exp) are invalid syntax due to < being interpreted as a flag. Since < is of course not a flag, it always fails.

I think the original reason why I didn't do this was that 1) I wanted to match RE2 syntax and 2) I didn't want two ways of doing the same thing. But I do think my stance has softened somewhat over time on this. Another related example is that I plan to relax the escaping rules in order to shrink at least the surface-level syntax differences with other regex engines. (For example, right now \/ is forbidden.)

I'll also say though that what you're facing here is a surface level problem. There are assuredly many other differences between this regex engine and Oniguruma. How will you deal with those? If compatibility is your ultimate goal, then you probably just need to use Oniguruma itself. Or do you see this more as a "let's get some compatibility, but not all of it" sort of situation? The problem there is that there may be many incompatibilities that are totally silent. (I don't have an Oniguruma environment that I can easily test with at the moment.)

I'll note that RE2 specifically only implements (?P<name>exp) syntax and there is this comment in the parser:

  // Check for named captures, first introduced in Python's regexp library.
  // As usual, there are three slightly different syntaxes:
  //
  //   (?P<name>expr)   the original, introduced by Python
  //   (?<name>expr)    the .NET alteration, adopted by Perl 5.10
  //   (?'name'expr)    another .NET alteration, adopted by Perl 5.10
  //
  // Perl 5.10 gave in and implemented the Python version too,
  // but they claim that the last two are the preferred forms.
  // PCRE and languages based on it (specifically, PHP and Ruby)
  // support all three as well.  EcmaScript 4 uses only the Python form.
  //
  // In both the open source world (via Code Search) and the
  // Google source tree, (?P<expr>name) is the dominant form,
  // so that's the one we implement.  One is enough.

I am quite sympathetic to this line of reasoning personally. And chasing this sort of "let's just keep adding alternative forms of everything until we capture all the different ways other regex engines do things" will lead us into undesirable territory.

I also wonder whether you could easily work around this by looking for a (?< and replacing it with a (?P<. You would need to deal with escapes, but I think that might be it? I don't think you'd need to write a full parser. I might be wrong though, I haven't given this a lot of thought.

I'm undecided on this personally. @junyer do you have any thoughts here?

@junyer

junyer commented Feb 8, 2023

It sounds like you and I (and @rsc) are aligned here at least philosophically. And now speaking pragmatically, adding support for (?<name>exp) – or anything else – to RE2 shouldn't happen without initiating a three-phase commit protocol with the Go regexp package, RE2/J et cetera. I won't presume to speak for the Rust regex crate, of course, but various Google-related projects won't ever support this unless someone herds those cats successfully... and that someone is very unlikely to be me.

@rsc

rsc commented Feb 8, 2023

I still basically agree with what I wrote in the RE2 comment long ago. I could change my mind given evidence of (1) significant usage of .NET forms or (2) significant environments that only support the .NET forms. It sounds like jq might be one such environment. Reading the other link, maybe Java or Boost has (?<name>...) without (?P<name>...)? It's unclear to me.

On the surface syntax issue and \/, RE2 and Go follow the general convention originally set by egrep of backslash-letter being special (so you must know what it means or reject it) and backslash-punctuation always being literal punctuation. So \/ and \_ fall out of that rule without being handled explicitly. The code in RE2 looks like:

  if (c < Runeself && !isalpha(c) && !isdigit(c)) {
    // Escaped non-word characters are always themselves.
    // PCRE is not quite so rigorous: it accepts things like
    // \q, but we don't.  We once rejected \_, but too many
    // programs and people insist on using it, so allow \_.
    *rp = c;
    return true;
  }

@01mf02
Contributor Author

01mf02 commented Feb 8, 2023

Thanks for your very detailed answers.

> I'll also say though that what you're facing here is a surface level problem. There are assuredly many other differences between this regex engine and Oniguruma. How will you deal with those? If compatibility is your ultimate goal, then you probably just need to use Oniguruma itself. Or do you see this more as a "let's get some compatibility, but not all of it" sort of situation? The problem there is that there may be many incompatibilities that are totally silent. (I don't have an Oniguruma environment that I can easily test with at the moment.)

Yes, I see the situation as "let's get some compatibility, but not all of it". (Using Oniguruma from Rust is not really an option for me, all the more because I already have an implementation of regexes using the regex crate.) At least most regexes that I have seen in jq snippets in the wild are fairly simple, so I believe that regex should interpret them the way a jq user expects it to. By far the largest problem, however, is named capture groups, because some jq functions crucially depend on them, in particular capture. Without the named capture group syntax, it is not possible to use capture the same way in jq and jaq.

> I am quite sympathetic to this line of reasoning personally. And chasing this sort of "let's just keep adding alternative forms of everything until we capture all the different ways other regex engines do things" will lead us into undesirable territory.

I agree in principle. However, when searching for "regex named capture group", all of the first four matches mention the (?< syntax, whereas only one site (the first one) additionally (not exclusively) mentions the existence of (?P<.
This at least suggests that there might not be such a strong consensus towards (?P< as the one and only syntax to rule them all.

> I also wonder whether you could easily work around this by looking for a (?< and replacing it with a (?P<. You would need to deal with escapes, but I think that might be it? I don't think you'd need to write a full parser. I might be wrong though, I haven't given this a lot of thought.

There might be a lot of tricky cases to handle. Consider:

  • [(?<]
  • \[(?<
  • \\[(?<]
  • \(?<
  • \\(?<
  • ...

Given that I am not a regex expert, I would not trust myself to get this right.
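
For readers weighing this workaround, here is a minimal sketch of such a rewrite in Rust. It is an illustration under stated assumptions, not code from jaq or the regex crate: the helper names `rewrite_named_groups` and `is_angle_group` are hypothetical, character classes are assumed to nest, a `]` directly after `[` or `[^` is assumed to be literal, and engine-specific constructs such as comments or verbose mode are ignored. The look-behind openers (?<= and (?<! are deliberately left untouched.

```rust
/// Rewrites `(?<name>` to `(?P<name>` outside of escapes and character
/// classes. Hypothetical helper for illustration only.
fn rewrite_named_groups(pattern: &str) -> String {
    let chars: Vec<char> = pattern.chars().collect();
    let mut out = String::with_capacity(pattern.len() + 8);
    let mut class_depth = 0usize; // > 0 while inside `[...]`
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        match c {
            // An escape copies the next character verbatim.
            '\\' => {
                out.push(c);
                if let Some(&next) = chars.get(i + 1) {
                    out.push(next);
                    i += 1;
                }
            }
            // Entering a (possibly nested) character class.
            '[' => {
                class_depth += 1;
                out.push(c);
                // Copy an optional negation, then a leading `]`,
                // which is assumed to be a literal here.
                if chars.get(i + 1) == Some(&'^') {
                    out.push('^');
                    i += 1;
                }
                if chars.get(i + 1) == Some(&']') {
                    out.push(']');
                    i += 1;
                }
            }
            // Leaving a character class.
            ']' if class_depth > 0 => {
                class_depth -= 1;
                out.push(c);
            }
            // A group opener outside of any class: insert the `P`.
            '(' if class_depth == 0 && is_angle_group(&chars[i..]) => {
                out.push_str("(?P<");
                i += 2; // also skip the `?` and `<` just emitted
            }
            _ => out.push(c),
        }
        i += 1;
    }
    out
}

/// True if `rest` starts with `(?<` followed by something other than
/// `=` or `!` (which would make it a look-behind, not a named group).
fn is_angle_group(rest: &[char]) -> bool {
    rest.len() >= 4 && rest[1] == '?' && rest[2] == '<' && rest[3] != '=' && rest[3] != '!'
}
```

On the cases listed above, this sketch leaves [(?<], \\[(?<] and \(?< unchanged, and inserts the P when a complete group opener follows \[ or \\. Whether it covers every corner of a given engine's syntax is exactly the open question raised in this comment.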

> And now speaking pragmatically, adding support for (?<name>exp) – or anything else – to RE2 shouldn't happen without initiating a three-phase commit protocol with the Go regexp package, RE2/J et cetera.

Why is there such a need for synchronisation? Is there some kind of agreement between the Rust regex crate and RE2 to implement precisely the same syntax?

Would it perhaps be possible to have some opt-in option, for example in ParserBuilder, to enable parsing (?< syntax?

@01mf02
Contributor Author

01mf02 commented Feb 8, 2023

> I could change my mind given evidence of (1) significant usage of .NET forms or (2) significant environments that only support the .NET forms. It sounds like jq might be one such environment. Reading the other link, maybe Java or Boost has (?<name>...) without (?P<name>...)? It's unclear to me.

Regarding Java, I read at least three sites, all of which exclusively mentioned the (?< syntax.

For Boost, the documentation says that the Perl syntax is the default behaviour, and details that this supports (?< and (?'. Again, no mention of (?P<.

@BurntSushi
Member

> Why is there such a need for synchronisation? Is there some kind of agreement between the Rust regex crate and RE2 to implement precisely the same syntax?

To clarify here, @junyer and @rsc are RE2 maintainers, and RE2, RE2/J and Go's regexp package are all maintained by folks at Google. So those packages I think generally try to stay very strictly aligned.

There is no synchronization promise with those three and the regex crate though. The regex crate does actually have some substantial differences (like the escaping strategy, although I expect that to change in the direction of RE2's) and also support for character class set operations and nested classes and probably a few other minor things. Still though, I value their input and "consistent with RE2" is, overall, something I value. But not over everything else.

@01mf02
Contributor Author

01mf02 commented Feb 9, 2023

I see. Thanks for clarifying your synchronisation policy.

Just on the side: I believe that implementing the (?< syntax implies changing only one line in the code, namely replacing if self.bump_if("?P<") by if self.bump_if("?P<") || self.bump_if("?<"). I would gladly volunteer to submit a PR with this change where I would also write a few tests for the new behaviour. But of course only if you agree that this feature is worth having.
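
To make the proposed one-line change concrete, here is a toy sketch of a parser with a `bump_if` helper. Only the name `bump_if` and the two prefixes come from the comment above; the `Parser` struct, its fields and `parse_group_name` are hypothetical stand-ins rather than the regex-syntax parser, and validation of the group name is omitted.

```rust
/// Toy parser state: the pattern and a byte offset into it.
/// Hypothetical stand-in, not the regex-syntax parser.
struct Parser<'a> {
    pattern: &'a str,
    pos: usize,
}

impl<'a> Parser<'a> {
    fn new(pattern: &'a str) -> Self {
        Parser { pattern, pos: 0 }
    }

    /// If the remaining input starts with `prefix`, consume it and
    /// return true. Otherwise leave the position unchanged.
    fn bump_if(&mut self, prefix: &str) -> bool {
        if self.pattern[self.pos..].starts_with(prefix) {
            self.pos += prefix.len();
            true
        } else {
            false
        }
    }

    /// Parses a named group opener at the current position and returns
    /// the name. Accepts both `(?P<name>` and `(?<name>`.
    fn parse_group_name(&mut self) -> Option<String> {
        if !self.bump_if("(") {
            return None;
        }
        // The proposed change: try the Python form first, then the bare form.
        if self.bump_if("?P<") || self.bump_if("?<") {
            let rest = &self.pattern[self.pos..];
            let end = rest.find('>')?;
            let name = rest[..end].to_string();
            self.pos += end + 1; // skip the name and the closing `>`
            Some(name)
        } else {
            None
        }
    }
}
```

Both `Parser::new("(?P<year>x)")` and `Parser::new("(?<year>x)")` yield the name `year` here, while `(?:x)` yields `None`. Note that this toy version forgets which spelling was used.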

If I can do anything else to convince you of the utility of supporting (?<, please let me know. As an aside, I also checked that JavaScript supports only (?<. Furthermore, of two of the most popular regex websites, https://regexr.com/ supports only the (?< syntax and https://regex101.com/ supports both (?< and (?P<. From my research, I have gained the impression that the (?P< syntax is more the exception than the norm.

@01mf02
Contributor Author

01mf02 commented Feb 9, 2023

I found an interesting bit of history from the Python project that explains, among other things, how the (? syntax came about. It goes on to explain:

> Python supports several of Perl's extensions and adds an extension syntax to Perl's extension syntax. If the first character after the question mark is a P, you know that it's an extension that's specific to Python.

So the P in (?P< stands for a Python-specific extension.
In that sense, it reminds me of browser-specific extensions such as -moz-animation, which was later standardised and turned into just animation.
I suppose that in the same way that people dropped the -moz- prefix, people dropped the P from (?P< as named capture groups proved to be useful beyond Python.
Continuing to allow the P in the syntax may be justified for compatibility reasons (just like -moz-animation is still accepted in some browsers). At the same time, it would be great to also have a way to express named capture groups without the capital P, which perpetuates the idea that they are a Python-specific extension (which they ceased to be a long time ago).

@BurntSushi
Member

@01mf02 The history and original reason for the (?P syntax is indeed interesting, but I think it has almost exactly zero weight on my decision here. Here are the things that matter to me, in no particular order:

  1. Consistency with other regex engines, especially RE2, given the common ancestry.
  2. Keeping the syntax "simple," for some definition of "simple." Having two different syntaxes for accomplishing the same thing is a negative IMO. Basically, what this results in in my experience is that someone learns one syntax, then sees the other syntax and wonders, "wait was I doing it the wrong way? should I switch? what's the difference between them?" We can of course mitigate such things by answering such questions in the docs, but it is remarkably difficult to make such a thing discoverable. It's certainly not something you want to plaster across the introduction, so it tends to get buried in the syntax details. Which is fine... But people are going to get confused. As with other things in this list, I do not value this above everything else. It's just something I consider.
  3. Making the syntax flexible enough to fit into other environments. This is a net positive because it means there's more knowledge transfer from past experience and things tend to "just work" more often than not. I think this is basically what describes your use case here.
  4. There is an overall downside of trying to "make the syntax match other regex engines," because basically other than regex engines that closely and strictly follow an existing specification, no two regex engines behave the same. And so trying to "just make things work" is a long path that doesn't really have an end. I don't think there is a positive or negative here, but it's something to consider.

I think (1) and (2) are where I am at the moment. Unfortunately, there are no real objective criteria to evaluate here.

I am overall leaning towards doing this.

> Just on the side: I believe that implementing the (?< syntax implies changing only one line in the code, namely replacing if self.bump_if("?P<") by if self.bump_if("?P<") || self.bump_if("?<"). I would gladly volunteer to submit a PR with this change where I would also write a few tests for the new behaviour. But of course only if you agree that this feature is worth having.

I agree that the patch here is likely quite simple, but it is probably not this simple. Whether (?P<name>expr) or (?<name>expr) is used or not needs to show up in the AST somewhere. So there may be some type definition changes here, and potentially even a breaking change for the regex-syntax crate. (Which is okay. I don't like to do it too often, but I am planning to do one soon.)

@rsc

rsc commented Feb 9, 2023

Have we identified any regexp implementations other than Oniguruma that don't implement (?P<name>...)?

Also, is the suggestion to allow both (?<name>...) and (?'name'...) or just the first?

@BurntSushi
Member

BurntSushi commented Feb 10, 2023

Not sure about (?'name'...) but I found these with some quick searching:

I think that's all I could find at the moment. I think the closest thing to a consensus among non-RE2 engines is "support both (?P<name>...) and (?<name>...)." That seemed like the most common thing, but it's not ubiquitous. A lot of engines support one or the other too. The "support both" is perhaps inflated a bit by the ubiquity of PCRE, which is used as the default regex engine in at least a few places (PHP and Julia come to mind).

@BurntSushi
Member

> Also, is the suggestion to allow both (?<name>...) and (?'name'...) or just the first?

I think the suggestion on the table is just the first, as that's what is used by Oniguruma in the context of jq scripts.

The (?'name'...) syntax is one that I very rarely see. I don't think there are any regex engines (that I can recall in my search) that only support (?'name'...).

@c-git

c-git commented Feb 10, 2023

I hope this comment doesn't distract too much but I really appreciate how @BurntSushi addresses issues raised, explaining his reasoning and so on. I learn so much from just following along and it usually causes me to think about considerations I might have otherwise missed. I just want to say thank you, I really appreciate the time you put into your responses.

@01mf02
Contributor Author

01mf02 commented Feb 10, 2023

I second @c-git in that I also value your very detailed responses, @BurntSushi.

And of course I'm happy to read that you are leaning towards implementing my suggestion.
I second your observation that (?'name'...) is something that you very rarely see. I've probably seen this syntax more often in documentation than used in actual code. So I am for not implementing this, also to keep the syntax "simple".

> Unbelievably, I can't find any authoritative reference for Boost's regex library about what kind of named capture support it has, but some examples in the wild suggest it at least supports (?<name>...) syntax.

The documentation of the boost regex module mentions named capture groups only in the Perl syntax flavour, which says that ?< and ?' are supported. No mention of ?P< here.

> I agree that the patch here is likely quite simple, but it is probably not this simple. Whether (?P<name>expr) or (?<name>expr) is used or not needs to show up in the AST somewhere. So there may be some type definition changes here, and potentially even a breaking change for the regex-syntax crate. (Which is okay. I don't like to do it too often, but I am planning to do one soon.)

Ah, I see. I suppose it is for round-tripping? If you wish, I could tackle this. I understand that the ast::Group::CaptureName variant would either need to be extended by some bit that indicates the presence of P (would you consider a boolean?), or a new variant (something like CapturePName) could be introduced. What do you think about this?

@BurntSushi
Member

> Ah, I see. I suppose it is for round-tripping? If you wish, I could tackle this. I understand that the ast::Group::CaptureName variant would either need to be extended by some bit that indicates the presence of P (would you consider a boolean?), or a new variant (something like CapturePName) could be introduced. What do you think about this?

Yes, round-tripping. The point of the AST is that it exhaustively describes the syntax as it is. Lowering it into something simpler and easier to analyze happens in a second pass. (You'll need to make what is likely a trivial change to the AST->HIR translator, also inside of regex-syntax, to accommodate your changes to the AST.)

I think a new variant for GroupKind seems okay? So rename the existing CaptureName to CapturePName and introduce a new CaptureName variant.
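
The variant-based design suggested here can be sketched as follows. These are simplified, hypothetical types for illustration, not the definitions in regex-syntax, and the shape that was eventually merged may differ. The variant names `CaptureName` and `CapturePName` come from the discussion above; `CaptureIndex`, `NonCapturing`, `opening` and `capture_name` are assumptions made for the example. The point is that the AST keeps the two spellings apart so the original pattern can be printed back exactly, while lowering treats them identically.

```rust
/// Simplified, hypothetical stand-in for the kinds of group in an AST.
#[derive(Debug, Clone, PartialEq, Eq)]
enum GroupKind {
    /// `(expr)`
    CaptureIndex,
    /// `(?P<name>expr)`
    CapturePName(String),
    /// `(?<name>expr)`
    CaptureName(String),
    /// `(?:expr)`
    NonCapturing,
}

impl GroupKind {
    /// Round-tripping: print the opener exactly as it was written.
    fn opening(&self) -> String {
        match self {
            GroupKind::CaptureIndex => "(".to_string(),
            GroupKind::CapturePName(name) => format!("(?P<{}>", name),
            GroupKind::CaptureName(name) => format!("(?<{}>", name),
            GroupKind::NonCapturing => "(?:".to_string(),
        }
    }

    /// Lowering: both named spellings collapse to the same capture name.
    fn capture_name(&self) -> Option<&str> {
        match self {
            GroupKind::CapturePName(n) | GroupKind::CaptureName(n) => Some(n.as_str()),
            _ => None,
        }
    }
}
```

A boolean flag on a single named variant, the other option raised in the quoted comment, would carry the same information; the difference is mainly in how pattern matches over the enum read.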

@rsc

rsc commented Feb 10, 2023

Talked to @junyer a bit, and I think this change makes sense to do in RE2 and Go as well. I filed golang/go#58458, and assuming it goes through we'll update RE2 and Go in about a month.

@BurntSushi
Member

@rsc SGTM! If y'all add support for it then I definitely will as well. We might not line up timing wise, but I think that's okay!

@01mf02
Contributor Author

01mf02 commented Feb 10, 2023

Great! I'm very happy that we seem to have reached a consensus on this issue. :) I have opened a PR with my proposed changes. Have a good weekend!

BurntSushi pushed a commit that referenced this issue Feb 10, 2023
It turns out that both '(?P<name>...)' and '(?<name>...)' are rather
common among regex engines. There are several that support just one or
the other. Until this commit, the regex crate only supported the former,
along with RE2, RE2/J and Go's regexp package. There are also
several regex engines that only support the latter, such as Onigmo,
Oniguruma, Java, Ruby, Boost, .NET and JavaScript. To decrease friction,
and because there is relatively little cost to doing so, we elect to
support both.

It looks like perhaps RE2 and Go's regexp package will go the same
route, but it isn't fully decided yet:
golang/go#58458

Closes #955
BurntSushi added a commit that referenced this issue Apr 20, 2023
1.8.0 (2023-04-20)
==================
This is a sizeable release that will soon be followed by another sizeable
release. Combined, the two will close over 40 existing issues and PRs.

This first release, despite its size, essentially represents preparatory work
for the second release, which will be even bigger. Namely, this release:

* Increases the MSRV to Rust 1.60.0, which was released about 1 year ago.
* Upgrades its dependency on `aho-corasick` to the recently released 1.0
version.
* Upgrades its dependency on `regex-syntax` to the simultaneously released
`0.7` version. The changes to `regex-syntax` principally revolve around a
rewrite of its literal extraction code and a number of simplifications and
optimizations to its high-level intermediate representation (HIR).

The second release, which will follow shortly after the release above, will
contain a soup-to-nuts rewrite of every regex engine. This will be done by
bringing [`regex-automata`](https://github.com/BurntSushi/regex-automata) into
this repository, and then changing the `regex` crate to be nothing but an API
shim layer on top of `regex-automata`'s API.

These tandem releases are the culmination of about 3
years of on-and-off work that [began in earnest in March
2020](#656).

Because of the scale of changes involved in these releases, I would love to
hear about your experience. Especially if you notice undocumented changes in
behavior or performance changes (positive *or* negative).

Most changes in the first release are listed below. For more details, please
see the commit log, which reflects a linear and decently documented history
of all changes.

New features:

* [FEATURE #501](#501):
Permit many more characters to be escaped, even if they have no significance.
More specifically, any ASCII character except for `[0-9A-Za-z<>]` can now be
escaped. Also, a new routine, `is_escapeable_character`, has been added to
`regex-syntax` to query whether a character is escapeable or not.
* [FEATURE #547](#547):
Add `Regex::captures_at`. This fills a hole in the API, but doesn't otherwise
introduce any new expressive power.
* [FEATURE #595](#595):
Capture group names are now Unicode-aware. They can now begin with either a `_`
or any "alphabetic" codepoint. After the first codepoint, subsequent codepoints
can be any sequence of alpha-numeric codepoints, along with `_`, `.`, `[` and
`]`. Note that replacement syntax has not changed.
* [FEATURE #810](#810):
Add `Match::is_empty` and `Match::len` APIs.
* [FEATURE #905](#905):
Add an `impl Default for RegexSet`, with the default being the empty set.
* [FEATURE #908](#908):
A new method, `Regex::static_captures_len`, has been added which returns the
number of capture groups in the pattern if and only if every possible match
always contains the same number of matching groups.
* [FEATURE #955](#955):
Named captures can now be written as `(?<name>re)` in addition to
`(?P<name>re)`.
* FEATURE: `regex-syntax` now supports empty character classes.
* FEATURE: `regex-syntax` now has an optional `std` feature. (This will come
to `regex` in the second release.)
* FEATURE: The `Hir` type in `regex-syntax` has had a number of simplifications
made to it.
* FEATURE: `regex-syntax` has support for a new `R` flag for enabling CRLF
mode. This will be supported in `regex` proper in the second release.
* FEATURE: `regex-syntax` now has proper support for "regex that never
matches" via `Hir::fail()`.
* FEATURE: The `hir::literal` module of `regex-syntax` has been completely
re-worked. It now has more documentation, examples and advice.
* FEATURE: The `allow_invalid_utf8` option in `regex-syntax` has been renamed
to `utf8`, and the meaning of the boolean has been flipped.
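
The naming rule from the FEATURE #595 entry above can be sketched with the standard library alone. This is only an illustration of the rule as worded in the changelog, not the code used by `regex-syntax`. The function name `is_valid_capture_name` is hypothetical, and it assumes that Rust's `char::is_alphabetic` and `char::is_alphanumeric` are a reasonable stand-in for the "alphabetic" and "alpha-numeric" codepoints the entry mentions.

```rust
/// Sketch of the capture group naming rule described in the changelog entry
/// for FEATURE #595. Hypothetical helper, not part of the `regex` crate.
fn is_valid_capture_name(name: &str) -> bool {
    let mut chars = name.chars();
    // The first codepoint must be `_` or an alphabetic codepoint.
    // An empty name is rejected here as well.
    match chars.next() {
        Some(c) if c == '_' || c.is_alphabetic() => {}
        _ => return false,
    }
    // Every later codepoint must be alphanumeric or one of `_`, `.`, `[`, `]`.
    chars.all(|c| c.is_alphanumeric() || matches!(c, '_' | '.' | '[' | ']'))
}
```

Under these assumptions, names such as `name`, `_x1`, `a.b[0]` and `größe` are accepted, while the empty string, `1abc`, `.a` and `a-b` are rejected.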

Performance improvements:

* PERF: The upgrade to `aho-corasick 1.0` may improve performance in some
cases. It's difficult to characterize exactly which patterns this might impact,
but if there are a small number of longish (>= 4 bytes) prefix literals, then
it might be faster than before.

Bug fixes:

* [BUG #514](#514):
Improve `Debug` impl for `Match` so that it doesn't show the entire haystack.
* BUGS [#516](#516),
[#731](#731):
Fix a number of issues with printing `Hir` values as regex patterns.
* [BUG #610](#610):
Add explicit example of `foo|bar` in the regex syntax docs.
* [BUG #625](#625):
Clarify that `SetMatches::len` does not (regrettably) refer to the number of
matches in the set.
* [BUG #660](#660):
Clarify "verbose mode" in regex syntax documentation.
* BUG [#738](#738),
[#950](#950):
Fix `CaptureLocations::get` so that it never panics.
* [BUG #747](#747):
Clarify documentation for `Regex::shortest_match`.
* [BUG #835](#835):
Fix `\p{Sc}` so that it is equivalent to `\p{Currency_Symbol}`.
* [BUG #846](#846):
Add more clarifying documentation to the `CompiledTooBig` error variant.
* [BUG #854](#854):
Clarify that `regex::Regex` searches as if the haystack is a sequence of
Unicode scalar values.
* [BUG #884](#884):
Replace `__Nonexhaustive` variants with `#[non_exhaustive]` attribute.
* [BUG #893](#893):
Optimize case folding since it can get quite slow in some pathological cases.
* [BUG #895](#895):
Reject `(?-u:\W)` in `regex::Regex` APIs.
* [BUG #942](#942):
Add a missing `void` keyword to indicate "no parameters" in C API.
* [BUG #965](#965):
Fix `\p{Lc}` so that it is equivalent to `\p{Cased_Letter}`.
* [BUG #975](#975):
Clarify documentation for `\pX` syntax.
crapStone added a commit to Calciumdibromid/CaBr2 that referenced this issue May 2, 2023
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [regex](https://github.com/rust-lang/regex) | dependencies | minor | `1.7.3` -> `1.8.1` |

---

### Release Notes

<details>
<summary>rust-lang/regex</summary>

### [`v1.8.1`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#181-2023-04-21)

This is a patch release that fixes a bug where a regex match could be reported
where none was found. Specifically, the bug occurs when a pattern contains some
literal prefixes that could be extracted *and* an optional word boundary in the
prefix.

Bug fixes:

-   [BUG #981](rust-lang/regex#981):
    Fix a bug where a word boundary could interact with prefix literal
    optimizations and lead to a false positive match.

</details>

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).

Co-authored-by: cabr2-bot <cabr2.help@gmail.com>
Co-authored-by: crapStone <crapstone01@gmail.com>
Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1874
Reviewed-by: crapStone <crapstone@noreply.codeberg.org>
Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
@BurntSushi
Member

This ended up being a very effective feature request. It caused RE2, Go's regexp package and this crate to all start supporting (?<name>expr) syntax in addition to (?P<name>expr). Nicely done @01mf02!

@01mf02
Contributor Author

01mf02 commented Aug 14, 2023

Thanks, @BurntSushi!
