Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reorder subregex patterns to match valid semver strings?d #826

Open
sebastien-rosset opened this issue Apr 22, 2022 · 6 comments
Open

Reorder subregex patterns to match valid semver strings?d #826

sebastien-rosset opened this issue Apr 22, 2022 · 6 comments

Comments

@sebastien-rosset
Copy link

sebastien-rosset commented Apr 22, 2022

This is a proposal to slightly change the order of the prerelease subregex expressions in the documented regex. This will make it possible to find semver strings inside a body of text.

These strings are valid according to semver v2:

1.20.2----
1.20.2-alpha-0.0.1.123.308
1.20.2-alpha-0.0.1.123-308
1.20.2-alpha-0.0.1.123----308

With the start and end of string anchors, all of these strings match the documented regex: ^(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)(?:-(?P<prerelease>(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\+(?P<buildmetadata>[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$.

Suppose these semver strings are inside a larger body of text. For example, the text is:

service abc version 1.20.2-alpha-0.0.1.123-308 is starting.
service abc version 1.20.2-alpha-0.0.1.123-308 stopped.

Using a regex, it's possible to iteratively find all occurrences of the semver strings in the body of text. One way to do this is to remove the anchors from the documented regex. However, in that case, the regex find will yield an incomplete value:

1.20.2-alpha-0.0.1.123-308         => regex match is '1.20.2-alpha-0.0.1.123'
1.20.2-alpha-0.0.1.123----308    => regex match is '1.20.2-alpha-0.0.1.123'

The problem occurs because when matching 0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*, there are three sub-regex patterns and the engine matches from left to right, without trying to find the longest match. Hence the match will stop at 123.

Reordering the subregex patterns works: (?<major>0|[1-9]\d*)\.(?<minor>0|[1-9]\d*)\.(?<patch>0|[1-9]\d*)(?:-(?<prerelease>(?:\d*[a-zA-Z-][0-9a-zA-Z-]*|[1-9]\d*|0)(?:\.(?:\d*[a-zA-Z-][0-9a-zA-Z-]*|[1-9]\d*|0))*))?(?:\+(?<buildmetadata>[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?

As a side benefit, the regex seems to be a bit more efficient because the engine does not have to backtrack.

@sebastien-rosset
Copy link
Author

Performance wasn't my goal, but the updated regex seems to have slightly better performance:

package main

import (
        "regexp"
        "testing"
)

var rOriginal *regexp.Regexp = regexp.MustCompile(`^(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)(?:-(?P<prerelease>(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\+(?P<buildmetadata>[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$`)
var rUpdated *regexp.Regexp = regexp.MustCompile(`^(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)(?:-(?P<prerelease>(?:\d*[a-zA-Z-][0-9a-zA-Z-]*|0|[1-9]\d*)(?:\.(?:\d*[a-zA-Z-][0-9a-zA-Z-]*|0|[1-9]\d*))*))?(?:\+(?P<buildmetadata>[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$`)


func BenchmarkSemverRegexOriginal(b *testing.B) {
        for n := 0; n < b.N; n++ {
                rOriginal.MatchString("1.20.2-alpha-0.0.1.123-308")
        }
}

func BenchmarkSemverRegexUpdated(b *testing.B) {
        for n := 0; n < b.N; n++ {
                rUpdated.MatchString("1.20.2-alpha-0.0.1.123-308")
        }
}
go test -bench=.
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz
BenchmarkSemverRegexOriginal-8   	  725662	      1651 ns/op
BenchmarkSemverRegexUpdated-8    	  874554	      1280 ns/op
PASS
ok  	_/tmp/test	2.357s

@jwdonahue
Copy link
Contributor

jwdonahue commented May 3, 2022

Does your regex work correctly against all of the test strings?

https://regex101.com/r/Ly7O1x/3/

The official regex matches a SemVer string as defined by the spec and rejects those that don't match, including non-SemVer prefix and postfix content that might be on the line. I don't see where you've eliminated back tracking, but my regex foo is a tad rusty. We did put in some effort to reduce backtracking, but I seem to recall we could not eliminate it.

Your benchmark code needs some work. For one thing, it lacks a warm-up cycle, and for another, it should iterate many times over each of those loops, but randomly select their order on each pass (or some other arrangement that eliminates warm-up and ordering effects). I don't know much about go, but care should be taken to measure CPU time spent executing the regex bits, not time spent suspended by the operating environment in other code. On modern systems, that usually involves system level perf counters for the respective execution threads. I would also expect to see a lot more test strings. It's easy to optimize a regex for just one test string, eliminate the regex!


I would certainly support reordering (provided it works for all of our test strings), but not removing the anchors. Our tests depend on those, but reordering would reduce the likelihood of someone introducing errors, if/when they need to remove the anchors for their specific application.

Nice catch btw.

@ictus4u
Copy link

ictus4u commented Jun 6, 2022

I was about to make a similar proposal and got here.

I've found that reordering the conditionals in the non-anchors version seems to improve the performance slightly.

I changed the internal order of major, minor, patch and prerelease.

Original:

^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$

Modified:

^([1-9]\d*|0)\.([1-9]\d*|0)\.([1-9]\d*|0)(?:-((?:\d*[a-zA-Z-][0-9a-zA-Z-]*|[1-9]\d*|0)(?:\.(?:\d*[a-zA-Z-][0-9a-zA-Z-]*|[1-9]\d*|0))*))?(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$

It seems to work correctly against the test suite and reports 2406 steps vs 2518 in the original with PCRE and PCRE2.
Please, review https://regex101.com/r/DnUaMG/2

Original: https://regex101.com/r/vkijKf/1/

@ictus4u
Copy link

ictus4u commented Jun 6, 2022

My modified with-anchors version https://regex101.com/r/FFKHyH/1 contains the same reorder proposed by @sebastien-rosset (it passes the tests) in the prerelease plus my changes in major, minor and patch.

^(?P<major>[1-9]\d*|0)\.(?P<minor>[1-9]\d*|0)\.(?P<patch>[1-9]\d*|0)(?:-(?P<prerelease>(?:\d*[a-zA-Z-][0-9a-zA-Z-]*|[1-9]\d*|0)(?:\.(?:\d*[a-zA-Z-][0-9a-zA-Z-]*|[1-9]\d*|0))*))?(?:\+(?P<buildmetadata>[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$

Same 2406 steps result.

@jwdonahue
Copy link
Contributor

Fewer steps is good. Not sure it's worth the effort of changing the FAQ for a 4.5% performance increase on something that is rarely used against more than a handful of strings at one time. I wonder if the improvement still holds, if you remove the degenerate forms from the oracles?

My primary motivation for pushing for a standard SemVer regex was driven by my finding several hundred incorrect SemVer implementations on GitHub alone, most of which were regex's. So we went looking for a regex that was clear (as regex's go), correct, and not prone to infinite backtracking or timeouts. So basically, "the one true, recognizably correct, reasonably efficient implementation" of a SemVer regex.

I am not against updating the FAQ for this, but if we do, let's make sure it's the last time we ever modify the regex bits. We need a much larger set of oracles taken from the real world.

@jwdonahue
Copy link
Contributor

See also #788. Apparently the \d* accepts non-ascii digits, which is definitely out of spec, but maybe not unacceptable. I definitely have code that would be broken (I avoid using regex's in production code) if the spec allowed characters outside the range of [0-9][A-Z][a-z].-+.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants
@jwdonahue @sebastien-rosset @ictus4u and others