Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Fixed greedy matching bug #2032

Merged
merged 13 commits into from Jul 13, 2020

Conversation

RunDevelopment
Copy link
Member

This fixes #1492 and all issues caused by it.

Instead of the old rematching algorithm, which only allowed for at most one match, the new one matches until a certain point is reached (called rematchReach).

The idea is that greedy matches if the change the token stream, won't change the entire token stream but only a small part of it. I call this the reach of the rematching. The idea is to rematch everything within that small part of the token stream.

While being correct, this can also be a lot slower than the old implementation because greedy tokens can expand the current rematching reach which might lead to the whole text being match again and again until the text is matched correctly. So one should keep in mind that greedy now works correctly but also got more expensive.

Notes

I refactored parts of the Core code.

This also fixes another bug greedy matching had. Greedy matching could (in certain edge cases) generate a token stream with adjacent strings. This is a problem because non-greedy matching won't work correctly then.
This was caused by !strarr[k - 1].greedy here. I removed it because it doesn't make any sense to me.
Because this was the only usage of the .greedy property of the Token class, I removed it entirely.

@mAAdhaTTah
Copy link
Member

Yeah, I'd strongly prefer @Golmote take a look at this. It also may be worth getting actual measurements on how much this impacts performance.

tests/core/greedy.js Outdated Show resolved Hide resolved
@RunDevelopment
Copy link
Member Author

I was thinking about adding a benchmark which would compare the local Prism version against the one of Prism's master branch.

Suggestion for what language/code?

@mAAdhaTTah
Copy link
Member

DangerJS would work for this: https://danger.systems/js/

Ideally, I'd like to end up on GitHub actions and drop Travis, but it's still in beta.

@RunDevelopment
Copy link
Member Author

That would be awesome but we still have to implement the benchmark ourselves.

@mAAdhaTTah
Copy link
Member

Only one I know offhand is this: https://github.com/bestiejs/benchmark.js

@Golmote
Copy link
Contributor

Golmote commented Dec 27, 2019

The code looks OK to me. Greedy matching has always been subject to a performance issue I guess. But I can't think of a way to leverage that without rewriting everything from scratch haha.

A benchmark would be nice tho indeed, to get the proper measure of it. 👍

@RunDevelopment
Copy link
Member Author

RunDevelopment commented May 19, 2020

I re-implemented this after #1909. This actually increased the performance quite a bit. Pre #1909, it was on average about 1.3x slower (if I remember correctly) than the non-fixed version.

The average is now 1.06x slower ranging from 1.12x faster to 1.54x slower.
I'm quite happy with this result and think this is ready to get merged.

@Golmote, could you please give this another look?

Benchmark
Found 9 cases with 27 files in total.
Test 2 candidates with Prism.tokenize
Estimated duration: 9m 0s

------------------------------------------------------------

c

  https://raw.githubusercontent.com/git/git/master/mergesort.c (1 kB)
  | RunDevelopment@greedy-fix     0.08ms ±  1%  176smp      
  | PrismJS@master                0.09ms ±  0%  178smp 1.01x
  https://raw.githubusercontent.com/git/git/master/mergesort.h (1 kB)
  | RunDevelopment@greedy-fix     0.02ms ±  1%  179smp 1.00x
  | PrismJS@master                0.02ms ±  0%  181smp      
  https://raw.githubusercontent.com/git/git/master/remote.c (58 kB)
  | RunDevelopment@greedy-fix     2.96ms ±  1%  164smp      
  | PrismJS@master                3.01ms ±  1%  154smp 1.02x
  https://raw.githubusercontent.com/git/git/master/remote.h (10 kB)
  | RunDevelopment@greedy-fix     0.38ms ±  1%  172smp 1.03x
  | PrismJS@master                0.37ms ±  1%  180smp      

------------------------------------------------------------

css

  ../../style.css (7 kB)
  | RunDevelopment@greedy-fix     0.94ms ±  1%  174smp      
  | PrismJS@master                0.96ms ±  1%  171smp 1.02x

------------------------------------------------------------

css!+css-extras (css)

  ../../style.css (7 kB)
  | RunDevelopment@greedy-fix     1.45ms ±  1%  173smp      
  | PrismJS@master                1.49ms ±  1%  170smp 1.03x

------------------------------------------------------------

javascript

  ../../components.json (25 kB)
  | RunDevelopment@greedy-fix     2.04ms ±  1%  174smp      
  | PrismJS@master                2.07ms ±  1%  170smp 1.01x
  ../../package-lock.json (190 kB)
  | RunDevelopment@greedy-fix    11.73ms ±  1%  135smp      
  | PrismJS@master               12.96ms ±  1%  124smp 1.10x
  ../../scripts/utopia.js (11 kB)
  | RunDevelopment@greedy-fix     1.30ms ±  1%  171smp 1.05x
  | PrismJS@master                1.24ms ±  1%  178smp      
  https://cdnjs.cloudflare.com/ajax/libs/prism/1.20.0/prism.js (29 kB)
  | RunDevelopment@greedy-fix     3.00ms ±  0%  169smp      
  | PrismJS@master                3.37ms ±  0%  169smp 1.12x
  https://cdnjs.cloudflare.com/ajax/libs/prism/1.20.0/prism.min.js (14 kB)
  | RunDevelopment@greedy-fix     2.90ms ±  0%  174smp 1.54x
  | PrismJS@master                1.88ms ±  0%  177smp      
  https://code.jquery.com/jquery-3.4.1.js (274 kB)
  | RunDevelopment@greedy-fix    29.31ms ±  0%  111smp 1.01x
  | PrismJS@master               28.89ms ±  0%  112smp      
  https://code.jquery.com/jquery-3.4.1.min.js (86 kB)
  | RunDevelopment@greedy-fix    16.59ms ±  0%  117smp 1.04x
  | PrismJS@master               15.93ms ±  0%  121smp      

------------------------------------------------------------

json

  ../../components.json (25 kB)
  | RunDevelopment@greedy-fix     1.63ms ±  1%  159smp 1.23x
  | PrismJS@master                1.33ms ±  0%  180smp      
  ../../package-lock.json (190 kB)
  | RunDevelopment@greedy-fix     7.81ms ±  1%  138smp 1.09x
  | PrismJS@master                7.15ms ±  1%  149smp      

------------------------------------------------------------

markup

  ../../download.html (4 kB)
  | RunDevelopment@greedy-fix     0.33ms ±  1%  177smp      
  | PrismJS@master                0.34ms ±  1%  174smp 1.02x
  ../../index.html (19 kB)
  | RunDevelopment@greedy-fix     1.79ms ±  1%  167smp 1.01x
  | PrismJS@master                1.78ms ±  1%  168smp      
  https://github.com/PrismJS/prism (154 kB)
  | RunDevelopment@greedy-fix    17.07ms ±  1%  114smp 1.07x
  | PrismJS@master               15.98ms ±  1%  122smp      

------------------------------------------------------------

markup!+css+javascript (markup)

  ../../download.html (4 kB)
  | RunDevelopment@greedy-fix     0.61ms ±  0%  172smp 1.01x
  | PrismJS@master                0.60ms ±  1%  174smp      
  ../../index.html (19 kB)
  | RunDevelopment@greedy-fix     2.43ms ±  1%  163smp 1.02x
  | PrismJS@master                2.39ms ±  1%  176smp      
  https://github.com/PrismJS/prism (154 kB)
  | RunDevelopment@greedy-fix    18.99ms ±  1%  129smp 1.02x
  | PrismJS@master               18.62ms ±  1%  131smp      

------------------------------------------------------------

ruby

  https://raw.githubusercontent.com/rails/rails/master/actionview/lib/action_view/base.rb (12 kB)
  | RunDevelopment@greedy-fix     0.52ms ±  1%  166smp 1.16x
  | PrismJS@master                0.45ms ±  0%  184smp      
  https://raw.githubusercontent.com/rails/rails/master/actionview/lib/action_view/layouts.rb (16 kB)
  | RunDevelopment@greedy-fix     0.64ms ±  1%  170smp 1.16x
  | PrismJS@master                0.56ms ±  0%  185smp      
  https://raw.githubusercontent.com/rails/rails/master/actionview/lib/action_view/template.rb (14 kB)
  | RunDevelopment@greedy-fix     0.82ms ±  1%  177smp 1.12x
  | PrismJS@master                0.73ms ±  0%  181smp      

------------------------------------------------------------

rust

  https://raw.githubusercontent.com/rust-lang/regex/master/src/compile.rs (40 kB)
  | RunDevelopment@greedy-fix     2.38ms ±  1%  163smp 1.15x
  | PrismJS@master                2.06ms ±  0%  176smp      
  https://raw.githubusercontent.com/rust-lang/regex/master/src/lib.rs (28 kB)
  | RunDevelopment@greedy-fix     0.38ms ±  1%  176smp      
  | PrismJS@master                0.40ms ±  0%  183smp 1.05x
  https://raw.githubusercontent.com/rust-lang/regex/master/src/utf8.rs (9 kB)
  | RunDevelopment@greedy-fix     0.62ms ±  1%  171smp 1.33x
  | PrismJS@master                0.47ms ±  0%  183smp      

------------------------------------------------------------

summary
                             best  worst  relative
  RunDevelopment@greedy-fix     9     18     1.06x
  PrismJS@master               18      9     1.00x

A small explanation on why we see this performance range (1.12x faster to 1.54x slower):

The basic idea behind this fix is that when a greedy token changes the token stream, it can only affect other tokens within a specific range. Within that range, we then have to rematch and that can extend the range further causing even more rematching (this can be thought of as the changes propagating through the token stream).
In the worst-worst-case, we might have to rematch the whole token stream but that's what greedy matching is. I can't do anything about this worst-case but it's very unlikely that this will ever happen anyway.

This explains both the speedup and the slowdown.
The fix can be faster because there might not be any changes to be made within the current range. In that case, we can quickly quit the rematching. The old version always has to find one match, the fixed version might get away with zero.
But because the reach can be extended any number of times, we have to do a lot more matching than the old version to be correct. You can see this with prism.js/prism.min.js. The minified file now takes a lot longer to tokenize because comments, strings, templates, and regexes are much more likely to cause rematching there because it's all one line. In un-minified code, strings and regexes can only cause rematching within their line, so it's a lot more local and the rematching reach isn't extended as often.

@yairEO
Copy link

yairEO commented Jul 6, 2020

it has been almost a year, what's going on? I need this fix...

@RunDevelopment
Copy link
Member Author

@yairEO The main problem is that there's nobody to review it. This is a fundamental change to the core matching algorithm, so it better correct. I had to reimplement this after #1909, so I was reluctant to merge this even after @Golmote's approval (that was given before #1909).
But you're right. We should finally get this merged.

@mAAdhaTTah @Golmote
If it's ok, I'll merge this within the next couple of days, so it can be part of the next release. I re-reviewed it and I think it's ready as-is.

@RunDevelopment RunDevelopment merged commit 4028520 into PrismJS:master Jul 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Greedy matching bug
4 participants