Large regression in word boundary regexes #567

Open
rctcwyvrn opened this issue Jul 12, 2022 · 2 comments

@rctcwyvrn
Contributor

rctcwyvrn commented Jul 12, 2022

I was running the benchmarker for this regex, #"<(\w*)\b[^>]*>(.*?)<\/\1>"#, which uses \b to match the end of an HTML tag name, and noticed it was running really slowly.
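
For reference, a minimal sketch of what this pattern matches, assuming Swift 5.7 or later, where a `Regex` can be built from a string at run time. The sample input is made up for illustration and is not the benchmark's data.

```swift
// Sketch only: the sample input below is invented, not the benchmark's data.
let pattern = #"<(\w*)\b[^>]*>(.*?)<\/\1>"#
let regex = try Regex(pattern)

let input = "<b>bold</b> plain <i class=\"x\">italic</i>"

// Collect the text of every full match. The \b after (\w*) asserts a word
// boundary at the end of the tag name, and \1 requires the closing tag to
// repeat that name.
let matches = input.matches(of: regex).map { String(input[$0.range]) }
print(matches)
```

Each match covers an opening tag, its contents, and the closing tag with the same name.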

ed842cb

Running
- htmlAll 11.8ms

main

Running
- htmlAll 3.08s

Some regression was expected with the new word breaking algorithm, but a slowdown of roughly 260x (11.8ms to 3.08s) seems unacceptable. A quick profile shows that ~99% of the time is spent in AssertFunction, with 90% of that in String._wordIndex(after:) and 10% in Set.insert.
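
A rough way to reproduce this kind of measurement outside the benchmarker, assuming a synthetic input and plain wall-clock timing with Foundation's `Date`. The input shape and size are hypothetical, and this is not how the repo's benchmarker takes its numbers.

```swift
import Foundation

// Hypothetical input: many small tag pairs standing in for the real benchmark data.
let input = String(repeating: "<p class=\"a\">some text</p>\n", count: 1_000)
let regex = try Regex(#"<(\w*)\b[^>]*>(.*?)<\/\1>"#)

// Time a single pass that finds every match in the input.
let start = Date()
let count = input.matches(of: regex).count
let elapsed = Date().timeIntervalSince(start)

print("matches: \(count), elapsed: \(elapsed)s")
```

Building the same file against both revisions should give a ballpark comparison, though absolute times will differ from the figures quoted above.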

cc @Azoy @milseman

@milseman
Collaborator

@Azoy is this because the SPI is inefficient, or any thoughts on what to do here?

@Azoy
Member

Azoy commented Jul 12, 2022

Yeah, the current implementation of String.isOnWordBoundary in this repo is really inefficient, and I was fully expecting perf to be pretty bad. Once _nearestWordIndex(atOrBelow:) is fixed, I think this operation will get considerably faster.
