perf: replace lookahead by lookaheadCharCode #10371
Conversation
```diff
@@ -170,7 +170,7 @@ export default class StatementParser extends ExpressionParser {
       case tt._for:
         return this.parseForStatement(node);
       case tt._function:
-        if (this.lookahead().type === tt.dot) break;
+        if (this.lookaheadCharCode() === charCodes.dot) break;
```
This is the critical path, as the `function` keyword frequency is high.
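For reference, the idea behind the new helper can be sketched in isolation like this (a simplified standalone version, not Babel's actual implementation; the regex mirrors the whitespace-and-comment skipping pattern used by acorn and Babel):

```javascript
// Standalone sketch of the lookaheadCharCode idea: skip whitespace and
// comments with a global regex anchored via lastIndex, then return the
// char code of the next significant character without tokenizing it.
const skipWhiteSpace = /(?:\s|\/\/.*|\/\*[^]*?\*\/)*/g;

function lookaheadCharCode(input, pos) {
  skipWhiteSpace.lastIndex = pos;
  const skip = skipWhiteSpace.exec(input);
  // The pattern matches the empty string, so exec always succeeds here.
  return input.charCodeAt(pos + skip[0].length);
}

// "function.sent" can be distinguished from "function foo" by a single
// char-code comparison, no tokenizer state save/restore needed:
lookaheadCharCode("  .sent", 0) === 46; // charCodes.dot
```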
Build successful! You can test your changes in the REPL here: https://babeljs.io/repl/build/11444/
```js
    // $FlowIgnore: The skipWhiteSpace ensures to match any string
    return thisTokEnd + skip[0].length;
  }
```
@JLHwung In most cases this will not improve the performance. Using a regex for this purpose may have the opposite effect. I looked at the lexer code and... well... I understand why you do this, but you should only need to use the current index, like `this.input.charCodeAt(this.index)`. Refactor the lexer into a while loop and you get everything like this for free, without any need for regular expressions. You only need to do `index + 1` if you hit `0x20`. If you need the length, store the length before and after the iteration in a local variable. Doing it like this also gets rid of some overhead like `[0]` and `.length`. For this case you simply use a local variable at the start of the function, `let length = 0`, and inside the loop you do `length++`. And your return will be `tokEnd + length`.
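The suggested loop might look like this (a standalone sketch limited to the `0x20` case mentioned above; `skipSpaces` is a hypothetical name):

```javascript
// Sketch of the while-loop alternative: count consecutive 0x20 (space)
// characters starting at tokEnd, with no regex and no intermediate match
// object. The length lives in a local variable, as suggested.
function skipSpaces(input, tokEnd) {
  let length = 0;
  while (input.charCodeAt(tokEnd + length) === 0x20) {
    length++;
  }
  return tokEnd + length; // index of the first non-space character
}

skipSpaces("foo   bar", 3); // → 6, the index of "b"
```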
Besides whitespace, the `skipWhiteSpace` regex also skips line comments (`//`) and block comments (`/* */`). We could do better by removing the regex and explicitly coding a finite automaton for that, but it may be worth it only if it is identified as our bottleneck.
The `skipWhiteSpace` regex can achieve 3M ops/sec according to jsperf, roughly 1000 cycles on a 3GHz processor. As JavaScript is a high-level language, V8 has done a good job.
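A hand-coded version of that finite automaton might look roughly like this (a standalone sketch, not Babel's code; it handles only the common whitespace characters and the two comment forms):

```javascript
// Explicit loop replacing the skipWhiteSpace regex: advance past spaces,
// tabs, line breaks, line comments and block comments, and return the
// index of the first significant character.
function skipWhiteSpaceLoop(input, pos) {
  while (pos < input.length) {
    const ch = input.charCodeAt(pos);
    if (ch === 0x20 || ch === 0x09 || ch === 0x0a || ch === 0x0d) {
      pos++;
    } else if (ch === 0x2f /* "/" */) {
      const next = input.charCodeAt(pos + 1);
      if (next === 0x2f) {
        // line comment: skip to the end of the line
        pos += 2;
        while (pos < input.length && input.charCodeAt(pos) !== 0x0a) pos++;
      } else if (next === 0x2a) {
        // block comment: skip to the closing "*/"
        const end = input.indexOf("*/", pos + 2);
        pos = end === -1 ? input.length : end + 2;
      } else {
        break; // a lone "/" is a real token (division or regex start)
      }
    } else {
      break;
    }
  }
  return pos;
}

skipWhiteSpaceLoop("  /* hi */ foo", 0); // → 11, the index of "f"
```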
You shouldn't blindly trust jsperf. A regex can be super fast if done in a single operation/task, but in the Babel parser this one serves multiple purposes, and that slows things down. In the long run a while loop is the best solution when it comes to a parser, even if you super-optimize the regular expression.
I'm just giving you a friendly tip. Everyone codes differently :) But to illustrate this, run a benchmark against Acorn. You will see the Babel parser is 1x slower (estimated). The reason, I guess, is that even the small things haven't been optimized when adding new stuff, and changing the small things as you do in this case will have a larger impact than you think. In a positive way. Also when it comes to memory usage, and larger files.
I guess the Babel parser gets into trouble while parsing files larger than 2 MB.
You can also run a benchmark against my parser, Meriyah. I guess the Babel parser is 4-5x slower (estimated). And you can parse 50 MB files with almost the same performance. That's because I use a while loop in the lexer and no regexp :)
```diff
     return (
-      !lineBreak.test(this.input.slice(pos, next)) &&
+      !lineBreak.test(this.input.slice(this.state.pos, next)) &&
```
@JLHwung You don't gain performance here. Try replacing `!lineBreak.test(this.input.slice(this.state.pos, next))` with something like this:

```js
const { input } = this;
const { pos } = this.state;
// scan for line terminators directly instead of regex-testing a slice
for (let i = pos; i < next; i++) {
  const ch = input.charCodeAt(i);
  // 0x0a = LF, 0x0d = CR; (ch ^ 0x2028) <= 1 covers U+2028 and U+2029
  if (ch === 0x0a || ch === 0x0d || (ch ^ 0x2028) <= 1) return false;
}
return (
  input.slice(next, next + 8) === "function" &&
  (next + 8 === input.length || !isIdentifierChar(input.charCodeAt(next + 8)))
);
```

Not ideal either, but an improvement :) Eventually you can use a table lookup.
This part of the revision is not actually meant to improve performance. I refactored the similar code into a shared routine, `nextTokenStart`.
`input.slice(pos, next)` is a string, so we have to search for a line break inside it, but I know what you mean. `lineBreak.test` is pretty fast and achieves 20M ops/sec according to jsperf. I think we may consider optimizing it only when it is the bottleneck.

> Eventually you can use a table lookup

It is a good idea given that `isIdentifierChar` is a critical execution path. Recently V8 has also implemented a table lookup for identifier character queries. I will consider refactoring the identifier part in another PR.
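The table-lookup idea can be sketched like this (illustrative only; `isIdentifierCharAscii` is a hypothetical name, and the real identifier rules also cover Unicode ranges, which would need a slow-path fallback):

```javascript
// Precompute one flag per ASCII char code so that the hot path of
// isIdentifierChar is a single typed-array read instead of branching.
const isIdChar = new Uint8Array(128);
for (let c = 0; c < 128; c++) {
  if (/[A-Za-z0-9$_]/.test(String.fromCharCode(c))) isIdChar[c] = 1;
}

function isIdentifierCharAscii(code) {
  // Codes >= 128 would fall through to a full Unicode check in a real lexer.
  return code < 128 && isIdChar[code] === 1;
}

isIdentifierCharAscii(0x61); // "a" → true
isIdentifierCharAscii(0x28); // "(" → false
```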
`isIdentifierChar` can be super-optimized, but only if you use a direct lookup for it, similar to what I did here. But I recently found a faster way to scan for identifiers than what V8 does, which includes a direct table lookup without bitmasks etc.
Btw, I couldn't get the Babel parser to run in my benchmark, but where do I find a benchmark with it online? And try running this benchmark and see how Meriyah does vs Acorn.
> And try run this benchmark and see how Meriyah does it vs Acorn.

I like this benchmark website! And yes! Meriyah is almost twice as fast (warm JIT) as acorn in our benchmark suites, while babel is only half of acorn 😢.

> I couldn't get babel parser to run in my benchmark

I couldn't find the source of your benchmark website; if it is open sourced, I can see if there is anything I can do to help get the babel parser running.

> where do I find a benchmark with it online

AFAIK we don't have an online benchmark.
The source for the benchmark is located here, and the website is in the root folder.
I estimate 14 days of hard work and you could replicate the Babel parser if you wrote it from scratch. Another 6-8 days to get all plugins working. And the Babel parser should perform the same as Meriyah :)
@JLHwung Meriyah's REPL is located here, in case of interest. It was inspired by Babel's REPL, because I found that the REPL page was loading very slowly too :)
@JLHwung You mentioned a lookup table for identifier scanning. I would say that the V8 solution isn't as fast as it could be either, but what you can do is use a table lookup for the token kinds. Then you know that keywords can only be lowercase letters, and that no keyword starts with the letter "u". With this knowledge you can optimize the identifier scanning.
If you use a while-loop iteration and check for IdentifierPart chars, you shouldn't need to check what's between the "k" and the "d" in "keyword". You would only need to check what comes after the "keyword". That's faster than what V8 does, and you skip unnecessary branching.
I just implemented this in my own lexer refactoring, seen here.
I only mention it because you mentioned it first, and it could be a good optimization trick for Babel :)
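The trick described above might be sketched like this (a hypothetical standalone helper, not Meriyah's or Babel's actual code):

```javascript
// If the source matches a known keyword string at `pos`, only the single
// character *after* it decides whether this is really the keyword or a
// longer identifier ("function" vs "functional"): no need to re-check
// the characters in between.
function matchesKeyword(input, pos, keyword) {
  if (!input.startsWith(keyword, pos)) return false;
  const after = input.charCodeAt(pos + keyword.length);
  // NaN (end of input) fails every range below, which is what we want.
  const isIdPart =
    (after >= 0x61 && after <= 0x7a) || // a-z
    (after >= 0x41 && after <= 0x5a) || // A-Z
    (after >= 0x30 && after <= 0x39) || // 0-9
    after === 0x24 || after === 0x5f;   // $ and _
  return !isIdPart;
}

matchesKeyword("function foo", 0, "function"); // → true
matchesKeyword("functional", 0, "function");   // → false
```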
I love this PR
👍. As we increase the number of util/helper functions, it might be nice to explain via comments why/when we should use particular utils, as their names might not always give enough context.
Introduced a `tokenizer.lookaheadCharCode()` method, which simply skips the whitespace and comments until the next non-whitespace charcode. The idea is borrowed from acorn.

`tokenizer.lookahead()` is expensive in the way that it caches the whole state and restores it after `nextToken`. However, most of the time `tokenizer.lookahead` is used only to compare the `type` of the next token, therefore we can replace `tokenizer.lookahead()` by `tokenizer.lookaheadCharCode()`, which does not interfere with the tokenizer state.

Logic difference

The replacement is logically equivalent when the character maps to only one possible token type.
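For example, with a single-character token the check reduces to one char-code comparison (a standalone sketch; `startsWithParen` is a hypothetical helper, and the constants mirror the `charcodes` values used in the diff):

```javascript
// A "(" can only ever start a tt.parenL token, so comparing the raw
// char code is equivalent to tokenizing the next token.
const LEFT_PAREN = 40; // "("
const DOT = 46;        // "."

function startsWithParen(input, pos) {
  return input.charCodeAt(pos) === LEFT_PAREN;
}

startsWithParen("(a, b)", 0); // → true

// A "." is ambiguous: char code 46 may start tt.dot or tt.ellipsis
// ("..."), so context must rule the ellipsis case out.
"...rest".charCodeAt(0) === DOT; // → true for both "." and "..."
```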
It is okay to replace `tt.parenL` with `charcode === 40`.

It will introduce a logic difference when replacing `tt.dot` with `charcode === 46`, since it may be part of `tt.ellipsis`. However, `tt.ellipsis` should not immediately follow the `function` keyword and would bail in `parseExpression`. The same applies when following `import` and `null`.

Benchmark report

Environment
CPU: Intel Core i7-6820HQ
RAM: 16 GB 2133 MHz LPDDR3
Kernel: Darwin 18.7.0
Node.js: v12.9.0
Control group versions
acorn: v7.0.0
babel-parser: v7.5.5