Large regression in word boundary regexes #567

rctcwyvrn · 2022-07-12T00:25:47Z

I was running the benchmarker for this regex, #"<(\w*)\b[^>]*>(.*?)<\/\1>"#, which uses \b to match the end of an HTML tag name, and noticed it was running really slowly.

ed842cb

Running
- htmlAll 11.8ms

main

Running
- htmlAll 3.08s

Some amount of regression was expected with the implementation of the new word breaking algorithm, but a roughly 260x slowdown (11.8ms to 3.08s) seems unacceptable. A quick profile shows that ~99% of the time is spent in AssertFunction, with 90% of that being String._wordIndex(after:) and 10% being Set.insert.
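For anyone who wants to poke at this outside the benchmarker, here is a rough sketch of how the pattern could be exercised and timed with the public `Regex` API. The input string and the `countMatches(in:)` helper are made up for illustration; the actual htmlAll benchmark input is not shown in this thread, so timings from this sketch will not match the numbers above.

```swift
import Foundation

// The pattern from the report; \b asserts a word boundary right after the tag name.
let pattern = #"<(\w*)\b[^>]*>(.*?)<\/\1>"#

// Hypothetical input: two simple elements repeated 100 times.
// This is NOT the benchmark's real HTML corpus.
let html = String(repeating: #"<div class="a">hello</div><p>text</p>"#, count: 100)

// Hypothetical helper: compiles the pattern at run time and counts all matches.
func countMatches(in input: String) throws -> Int {
    let regex = try Regex(pattern)
    return input.matches(of: regex).count
}

// Wall-clock timing around a single run; a real benchmark would repeat this.
let start = Date()
let count = try countMatches(in: html)
let elapsed = Date().timeIntervalSince(start)
print("matches: \(count), elapsed: \(elapsed)s")
```

Running the same file against ed842cb and against main should show whether the gap reproduces on a small input, assuming the slow path is hit on every \b assertion.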

cc @Azoy @milseman


milseman · 2022-07-12T17:18:39Z

@Azoy is this because the SPI is inefficient, or any thoughts on what to do here?

Azoy · 2022-07-12T19:18:04Z

Yeah, the current implementation of String.isOnWordBoundary in this repo is really inefficient, and I was fully expecting perf to be pretty bad. Once _nearestWordIndex(atOrBelow:) is fixed, I think this operation will get considerably faster.
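For comparison, a simple (non-Unicode-aware) word boundary check only needs to look at the two characters around a position. The sketch below is purely illustrative: `isWordCharacter` and `isOnSimpleWordBoundary` are hypothetical names, and this is not how String.isOnWordBoundary in this repo works, since that follows the Unicode word breaking rules.

```swift
// Simplified illustration only: treats letters, numbers, and "_" as word characters,
// similar to classic \b semantics. This is NOT the Unicode word breaking algorithm.
func isWordCharacter(_ c: Character) -> Bool {
    c.isLetter || c.isNumber || c == "_"
}

// Hypothetical helper: a position is a boundary when exactly one of its
// neighbours is a word character. The string's start and end count as non-word.
func isOnSimpleWordBoundary(_ s: String, at index: String.Index) -> Bool {
    let before = index > s.startIndex ? isWordCharacter(s[s.index(before: index)]) : false
    let after = index < s.endIndex ? isWordCharacter(s[index]) : false
    return before != after
}

let tag = "<div class>"
let afterName = tag.index(tag.startIndex, offsetBy: 4)   // between "v" and " "
let insideName = tag.index(tag.startIndex, offsetBy: 2)  // between "d" and "i"
print(isOnSimpleWordBoundary(tag, at: afterName))   // true
print(isOnSimpleWordBoundary(tag, at: insideName))  // false
```

Each call here is constant work per position, so the profile above (time dominated by String._wordIndex(after:) and Set.insert) suggests the current path does a lot more per assertion than the lookup strictly needs.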
