Reorder subregex patterns to match valid semver strings?d #826

sebastien-rosset · 2022-04-22T04:58:41Z

This is a proposal to slightly change the order of the prerelease subregex expressions in the documented regex. This will make it possible to find semver strings inside a body of text.

These strings are valid according to semver v2:

1.20.2----
1.20.2-alpha-0.0.1.123.308
1.20.2-alpha-0.0.1.123-308
1.20.2-alpha-0.0.1.123----308

Suppose these semver strings are inside a larger body of text. For example, the text is:

service abc version 1.20.2-alpha-0.0.1.123-308 is starting.
service abc version 1.20.2-alpha-0.0.1.123-308 stopped.

Using a regex, it's possible to iteratively find all occurrences of the semver strings in the body of text. One way to do this is to remove the anchors from the documented regex. However, in that case, the regex find will yield an incomplete value:

1.20.2-alpha-0.0.1.123-308         => regex match is '1.20.2-alpha-0.0.1.123'
1.20.2-alpha-0.0.1.123----308    => regex match is '1.20.2-alpha-0.0.1.123'

The problem occurs because when matching 0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*, there are three sub-regex patterns and the engine matches from left to right, without trying to find the longest match. Hence the match will stop at 123.

As a side benefit, the regex seems to be a bit more efficient because the engine does not have to backtrack.

The text was updated successfully, but these errors were encountered:

sebastien-rosset · 2022-04-22T06:10:59Z

Performance wasn't my goal, but the updated regex seems to have slightly better performance:

package main

import (
        "regexp"
        "testing"
)

var rOriginal *regexp.Regexp = regexp.MustCompile(`^(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)(?:-(?P<prerelease>(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\+(?P<buildmetadata>[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$`)
var rUpdated *regexp.Regexp = regexp.MustCompile(`^(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)(?:-(?P<prerelease>(?:\d*[a-zA-Z-][0-9a-zA-Z-]*|0|[1-9]\d*)(?:\.(?:\d*[a-zA-Z-][0-9a-zA-Z-]*|0|[1-9]\d*))*))?(?:\+(?P<buildmetadata>[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$`)


func BenchmarkSemverRegexOriginal(b *testing.B) {
        for n := 0; n < b.N; n++ {
                rOriginal.MatchString("1.20.2-alpha-0.0.1.123-308")
        }
}

func BenchmarkSemverRegexUpdated(b *testing.B) {
        for n := 0; n < b.N; n++ {
                rUpdated.MatchString("1.20.2-alpha-0.0.1.123-308")
        }
}

go test -bench=.
goos: linux
goarch: amd64
cpu: Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz
BenchmarkSemverRegexOriginal-8   	  725662	      1651 ns/op
BenchmarkSemverRegexUpdated-8    	  874554	      1280 ns/op
PASS
ok  	_/tmp/test	2.357s

jwdonahue · 2022-05-03T04:13:03Z

Does your regex work correctly against all of the test strings?

https://regex101.com/r/Ly7O1x/3/

The official regex matches a SemVer string as defined by the spec and rejects those that don't match, including non-SemVer prefix and postfix content that might be on the line. I don't see where you've eliminated back tracking, but my regex foo is a tad rusty. We did put in some effort to reduce backtracking, but I seem to recall we could not eliminate it.

Your benchmark code needs some work. For one thing, it lacks a warm-up cycle, and for another, it should iterate many times over each of those loops, but randomly select their order on each pass (or some other arrangement that eliminates warm-up and ordering effects). I don't know much about go, but care should be taken to measure CPU time spent executing the regex bits, not time spent suspended by the operating environment in other code. On modern systems, that usually involves system level perf counters for the respective execution threads. I would also expect to see a lot more test strings. It's easy to optimize a regex for just one test string, eliminate the regex!

I would certainly support reordering (provided it works for all of our test strings), but not removing the anchors. Our tests depend on those, but reordering would reduce the likelihood of someone introducing errors, if/when they need to remove the anchors for their specific application.

Nice catch btw.

ictus4u · 2022-06-06T08:20:00Z

I was about to make a similar proposal and got here.

I've found that reordering the conditionals in the non-anchors version seems to improve the performance slightly.

I changed the internal order of major, minor, patch and prerelease.

Original:

^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$

Modified:

^([1-9]\d*|0)\.([1-9]\d*|0)\.([1-9]\d*|0)(?:-((?:\d*[a-zA-Z-][0-9a-zA-Z-]*|[1-9]\d*|0)(?:\.(?:\d*[a-zA-Z-][0-9a-zA-Z-]*|[1-9]\d*|0))*))?(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$

It seems to work correctly against the test suite and reports 2406 steps vs 2518 in the original with PCRE and PCRE2.
Please, review https://regex101.com/r/DnUaMG/2

Original: https://regex101.com/r/vkijKf/1/

ictus4u · 2022-06-06T09:39:28Z

My modified with-anchors version https://regex101.com/r/FFKHyH/1 contains the same reorder proposed by @sebastien-rosset (it passes the tests) in the prerelease plus my changes in major, minor and patch.

^(?P<major>[1-9]\d*|0)\.(?P<minor>[1-9]\d*|0)\.(?P<patch>[1-9]\d*|0)(?:-(?P<prerelease>(?:\d*[a-zA-Z-][0-9a-zA-Z-]*|[1-9]\d*|0)(?:\.(?:\d*[a-zA-Z-][0-9a-zA-Z-]*|[1-9]\d*|0))*))?(?:\+(?P<buildmetadata>[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$

Same 2406 steps result.

jwdonahue · 2023-05-03T22:58:18Z

Fewer steps is good. Not sure it's worth the effort of changing the FAQ for a 4.5% performance increase on something that is rarely used against more than a handful of strings at one time. I wonder if the improvement still holds, if you remove the degenerate forms from the oracles?

My primary motivation for pushing for a standard SemVer regex was driven by my finding several hundred incorrect SemVer implementations on GitHub alone, most of which were regex's. So we went looking for a regex that was clear (as regex's go), correct, and not prone to infinite backtracking or timeouts. So basically, "the one true, recognizably correct, reasonably efficient implementation" of a SemVer regex.

I am not against updating the FAQ for this, but if we do, let's make sure it's the last time we ever modify the regex bits. We need a much larger set of oracles taken from the real world.

jwdonahue · 2023-05-03T23:24:01Z

See also #788. Apparently the \d* accepts non-ascii digits, which is definitely out of spec, but maybe not unacceptable. I definitely have code that would be broken (I avoid using regex's in production code) if the spec allowed characters outside the range of [0-9][A-Z][a-z].-+.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reorder subregex patterns to match valid semver strings?d #826

Reorder subregex patterns to match valid semver strings?d #826

sebastien-rosset commented Apr 22, 2022 •

edited

sebastien-rosset commented Apr 22, 2022

jwdonahue commented May 3, 2022 •

edited

ictus4u commented Jun 6, 2022 •

edited

ictus4u commented Jun 6, 2022 •

edited

jwdonahue commented May 3, 2023

jwdonahue commented May 3, 2023

Reorder subregex patterns to match valid semver strings?d #826

Reorder subregex patterns to match valid semver strings?d #826

Comments

sebastien-rosset commented Apr 22, 2022 • edited

sebastien-rosset commented Apr 22, 2022

jwdonahue commented May 3, 2022 • edited

ictus4u commented Jun 6, 2022 • edited

ictus4u commented Jun 6, 2022 • edited

jwdonahue commented May 3, 2023

jwdonahue commented May 3, 2023

sebastien-rosset commented Apr 22, 2022 •

edited

jwdonahue commented May 3, 2022 •

edited

ictus4u commented Jun 6, 2022 •

edited

ictus4u commented Jun 6, 2022 •

edited