New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reorder subregex patterns to match valid semver strings?d #826
Comments
Performance wasn't my goal, but the updated regex seems to have slightly better performance: package main
import (
"regexp"
"testing"
)
var rOriginal *regexp.Regexp = regexp.MustCompile(`^(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)(?:-(?P<prerelease>(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\+(?P<buildmetadata>[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$`)
var rUpdated *regexp.Regexp = regexp.MustCompile(`^(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)(?:-(?P<prerelease>(?:\d*[a-zA-Z-][0-9a-zA-Z-]*|0|[1-9]\d*)(?:\.(?:\d*[a-zA-Z-][0-9a-zA-Z-]*|0|[1-9]\d*))*))?(?:\+(?P<buildmetadata>[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$`)
func BenchmarkSemverRegexOriginal(b *testing.B) {
for n := 0; n < b.N; n++ {
rOriginal.MatchString("1.20.2-alpha-0.0.1.123-308")
}
}
func BenchmarkSemverRegexUpdated(b *testing.B) {
for n := 0; n < b.N; n++ {
rUpdated.MatchString("1.20.2-alpha-0.0.1.123-308")
}
} go test -bench=.
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz
BenchmarkSemverRegexOriginal-8 725662 1651 ns/op
BenchmarkSemverRegexUpdated-8 874554 1280 ns/op
PASS
ok _/tmp/test 2.357s |
Does your regex work correctly against all of the test strings? https://regex101.com/r/Ly7O1x/3/ The official regex matches a SemVer string as defined by the spec and rejects those that don't match, including non-SemVer prefix and postfix content that might be on the line. I don't see where you've eliminated back tracking, but my regex foo is a tad rusty. We did put in some effort to reduce backtracking, but I seem to recall we could not eliminate it. Your benchmark code needs some work. For one thing, it lacks a warm-up cycle, and for another, it should iterate many times over each of those loops, but randomly select their order on each pass (or some other arrangement that eliminates warm-up and ordering effects). I don't know much about go, but care should be taken to measure CPU time spent executing the regex bits, not time spent suspended by the operating environment in other code. On modern systems, that usually involves system level perf counters for the respective execution threads. I would also expect to see a lot more test strings. It's easy to optimize a regex for just one test string, eliminate the regex! I would certainly support reordering (provided it works for all of our test strings), but not removing the anchors. Our tests depend on those, but reordering would reduce the likelihood of someone introducing errors, if/when they need to remove the anchors for their specific application. Nice catch btw. |
I was about to make a similar proposal and got here. I've found that reordering the conditionals in the non-anchors version seems to improve the performance slightly. I changed the internal order of Original:
Modified: ^([1-9]\d*|0)\.([1-9]\d*|0)\.([1-9]\d*|0)(?:-((?:\d*[a-zA-Z-][0-9a-zA-Z-]*|[1-9]\d*|0)(?:\.(?:\d*[a-zA-Z-][0-9a-zA-Z-]*|[1-9]\d*|0))*))?(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$ It seems to work correctly against the test suite and reports 2406 steps vs 2518 in the original with PCRE and PCRE2. Original: https://regex101.com/r/vkijKf/1/ |
My modified with-anchors version https://regex101.com/r/FFKHyH/1 contains the same reorder proposed by @sebastien-rosset (it passes the tests) in the
Same 2406 steps result. |
Fewer steps is good. Not sure it's worth the effort of changing the FAQ for a 4.5% performance increase on something that is rarely used against more than a handful of strings at one time. I wonder if the improvement still holds, if you remove the degenerate forms from the oracles? My primary motivation for pushing for a standard SemVer regex was driven by my finding several hundred incorrect SemVer implementations on GitHub alone, most of which were regex's. So we went looking for a regex that was clear (as regex's go), correct, and not prone to infinite backtracking or timeouts. So basically, "the one true, recognizably correct, reasonably efficient implementation" of a SemVer regex. I am not against updating the FAQ for this, but if we do, let's make sure it's the last time we ever modify the regex bits. We need a much larger set of oracles taken from the real world. |
See also #788. Apparently the \d* accepts non-ascii digits, which is definitely out of spec, but maybe not unacceptable. I definitely have code that would be broken (I avoid using regex's in production code) if the spec allowed characters outside the range of [0-9][A-Z][a-z].-+. |
This is a proposal to slightly change the order of the prerelease subregex expressions in the documented regex. This will make it possible to find semver strings inside a body of text.
These strings are valid according to semver v2:
With the start and end of string anchors, all of these strings match the documented regex:
^(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)(?:-(?P<prerelease>(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\+(?P<buildmetadata>[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$
.Suppose these semver strings are inside a larger body of text. For example, the text is:
Using a regex, it's possible to iteratively find all occurrences of the semver strings in the body of text. One way to do this is to remove the anchors from the documented regex. However, in that case, the regex
find
will yield an incomplete value:The problem occurs because when matching
0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*
, there are three sub-regex patterns and the engine matches from left to right, without trying to find the longest match. Hence the match will stop at123
.Reordering the subregex patterns works:
(?<major>0|[1-9]\d*)\.(?<minor>0|[1-9]\d*)\.(?<patch>0|[1-9]\d*)(?:-(?<prerelease>(?:\d*[a-zA-Z-][0-9a-zA-Z-]*|[1-9]\d*|0)(?:\.(?:\d*[a-zA-Z-][0-9a-zA-Z-]*|[1-9]\d*|0))*))?(?:\+(?<buildmetadata>[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?
As a side benefit, the regex seems to be a bit more efficient because the engine does not have to backtrack.
The text was updated successfully, but these errors were encountered: