
perf: replace lookahead by lookaheadCharCode #10371

Merged
merged 5 commits into from Oct 8, 2019
2 changes: 1 addition & 1 deletion packages/babel-parser/src/parser/expression.js
@@ -590,7 +590,7 @@ export default class ExpressionParser extends LValParser {
} else if (this.match(tt.questionDot)) {
this.expectPlugin("optionalChaining");
state.optionalChainMember = true;
if (noCalls && this.lookahead().type === tt.parenL) {
if (noCalls && this.lookaheadCharCode() === charCodes.leftParenthesis) {
state.stop = true;
return base;
}
45 changes: 17 additions & 28 deletions packages/babel-parser/src/parser/statement.js
@@ -8,7 +8,7 @@ import {
isIdentifierStart,
keywordRelationalOperator,
} from "../util/identifier";
import { lineBreak, skipWhiteSpace } from "../util/whitespace";
import { lineBreak } from "../util/whitespace";
import * as charCodes from "charcodes";
import {
BIND_CLASS,
@@ -105,10 +105,7 @@ export default class StatementParser extends ExpressionParser {
if (!this.isContextual("let")) {
return false;
}
skipWhiteSpace.lastIndex = this.state.pos;
const skip = skipWhiteSpace.exec(this.input);
// $FlowIgnore
const next = this.state.pos + skip[0].length;
const next = this.nextTokenStart();
const nextCh = this.input.charCodeAt(next);
// For ambiguous cases, determine if a LexicalDeclaration (or only a
// Statement) is allowed here. If context is not empty then only a Statement
@@ -170,7 +167,7 @@
case tt._for:
return this.parseForStatement(node);
case tt._function:
if (this.lookahead().type === tt.dot) break;
if (this.lookaheadCharCode() === charCodes.dot) break;
Contributor Author:

This is the critical path as function keyword frequency is high.
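The difference can be sketched outside the parser. This is a simplified standalone model, not Babel's actual implementation, and the skipWhiteSpace regex below is reproduced as an assumption about what util/whitespace exports: a full lookahead() clones parser state and tokenizes the whole next token, while lookaheadCharCode() only skips trivia and reads one char code.

```javascript
// Hypothetical standalone model of nextTokenStart/lookaheadCharCode.
// skipWhiteSpace matches any run of whitespace and comments (assumed
// to mirror babel-parser's util/whitespace export).
const skipWhiteSpace = /(?:\s|\/\/.*|\/\*[^]*?\*\/)*/g;

function nextTokenStart(input, pos) {
  skipWhiteSpace.lastIndex = pos;
  const skip = skipWhiteSpace.exec(input);
  return pos + skip[0].length;
}

function lookaheadCharCode(input, pos) {
  return input.charCodeAt(nextTokenStart(input, pos));
}

const dot = 46; // charCodes.dot

// After consuming `function` (pos = 8), a single charCodeAt call is
// enough to distinguish `function.sent` from a function declaration:
console.log(lookaheadCharCode("function . sent", 8) === dot); // true
console.log(lookaheadCharCode("function foo() {}", 8) === dot); // false
```

A char-code peek is only safe when a single character is unambiguous, which is why the PR keeps the full lookahead() in places where the token type or value matters.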

if (context) {
if (this.state.strict) {
this.raise(
@@ -223,8 +220,11 @@
return this.parseEmptyStatement(node);
case tt._export:
case tt._import: {
const nextToken = this.lookahead();
if (nextToken.type === tt.parenL || nextToken.type === tt.dot) {
const nextTokenCharCode = this.lookaheadCharCode();
if (
nextTokenCharCode === charCodes.leftParenthesis ||
nextTokenCharCode === charCodes.dot
) {
break;
}

@@ -1738,11 +1738,11 @@
maybeParseExportDeclaration(node: N.Node): boolean {
if (this.shouldParseExportDeclaration()) {
if (this.isContextual("async")) {
const next = this.lookahead();
const next = this.nextTokenStart();

// export async;
if (next.type !== tt._function) {
this.unexpected(next.start, `Unexpected token, expected "function"`);
if (!this.isUnparsedContextual(next, "function")) {
this.unexpected(next, `Unexpected token, expected "function"`);
}
}

@@ -1757,21 +1757,10 @@

isAsyncFunction(): boolean {
if (!this.isContextual("async")) return false;

const { pos } = this.state;

skipWhiteSpace.lastIndex = pos;
const skip = skipWhiteSpace.exec(this.input);

if (!skip || !skip.length) return false;

const next = pos + skip[0].length;

const next = this.nextTokenStart();
return (
!lineBreak.test(this.input.slice(pos, next)) &&
this.input.slice(next, next + 8) === "function" &&
(next + 8 === this.length ||
!isIdentifierChar(this.input.charCodeAt(next + 8)))
!lineBreak.test(this.input.slice(this.state.pos, next)) &&
@KFlash Aug 29, 2019:


@JLHwung You don't gain performance here. Try replacing !lineBreak.test(this.input.slice(this.state.pos, next)) with this:

    const { input } = this;
    const { pos } = this.state;
    const nextChar = input.charCodeAt(pos);
    return (
      !(nextChar === 0x0a || nextChar === 0x0d || (nextChar ^ 0x2028) <= 1) &&
      input.slice(next, next + 8) === "function" &&
      (next + 8 === this.length ||
        !isIdentifierChar(input.charCodeAt(next + 8)))
    );

Not ideal either, but an improvement :) Eventually you can use a table lookup.

Contributor Author:

This part of the revisions is not actually meant to improve performance. I refactored the similar code into a shared routine, nextTokenStart.

input.slice(pos, next) is a string, so we have to search for a line break inside it, but I know what you mean.

lineBreak.test is pretty fast and achieves 20M ops/sec according to jsperf. I think we should consider optimizing it only when it is the bottleneck.

Eventually you can use a table lookup

It is a good idea, given that isIdentifierChar is a critical execution path. Recently V8 has also implemented a table lookup for identifier-character queries. I will consider refactoring the identifier part in another PR.
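The table-lookup idea mentioned here can be sketched as follows. This is a hypothetical illustration, not V8's or Babel's code: precompute one flag per ASCII code point so the hot path becomes a single array read, with non-ASCII code points left to a slower fallback.

```javascript
// Hypothetical ASCII-only flag table for identifier-part characters.
const IS_ID_CHAR = new Uint8Array(128);
for (let c = 0; c < 128; c++) {
  const isIdChar =
    (c >= 48 && c <= 57) || // 0-9
    (c >= 65 && c <= 90) || // A-Z
    (c >= 97 && c <= 122) || // a-z
    c === 36 || // $
    c === 95; // _
  IS_ID_CHAR[c] = isIdChar ? 1 : 0;
}

function isAsciiIdentifierChar(code) {
  // One array read instead of several range comparisons.
  return IS_ID_CHAR[code] === 1;
}

console.log(isAsciiIdentifierChar("a".charCodeAt(0))); // true
console.log(isAsciiIdentifierChar("(".charCodeAt(0))); // false
```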


isIdentifierChar can be super optimized, but only by using a direct lookup for it, similar to what I did here. I recently found an even faster way to scan identifiers than what V8 does, including a direct table lookup without bitmasks etc.

Btw, I couldn't get the babel parser to run in my benchmark; where do I find a benchmark with it online? And try running this benchmark to see how Meriyah does vs Acorn.

Contributor Author @JLHwung Aug 29, 2019:

And try run this benchmark and see how Meriyah does it vs Acorn.

I like this benchmark website! And yes, Meriyah is almost twice as fast (warm JIT) as Acorn in our benchmark suites, while Babel is only half as fast as Acorn 😢.

I couldn't get babel parser to run in my benchmark

I couldn't find the source of your benchmark website; if it is open sourced, I can see if there is anything I can do to help get the babel parser running.

where do I find a benchmark with it online

AFAIK we don't have an online benchmark.


The source for the benchmark is located here and the website is in the root folder.

I estimate 14 days of hard work to replicate the Babel parser from scratch, plus another 6-8 days to get all plugins working. And then the Babel parser should perform the same as Meriyah :)


@JLHwung Meriyah's REPL is located here in case of interest. It was inspired by Babel's REPL, because I found that the REPL page loads very slowly too :)

@KFlash Aug 30, 2019:


@JLHwung You mentioned a lookup table for identifier scanning. I would say the V8 solution isn't as fast as it could be either, but what you can do is use a lookup table for the token kinds. Then you know that keywords can only be lowercase letters, and that no keyword starts with the letter "u". With this knowledge you can optimize the identifier scanning.
If you use a while-loop iteration that checks for IdentifierPart chars, you shouldn't need to check what's between the "k" and the "d" in "keyword"; you would only need to check what comes after the keyword. That's faster than what V8 does, and you skip unnecessary branching.

I just implemented this in my own lexer refactoring, seen here.

I only mentioned it because you mentioned it first, and it could be a good optimization trick for Babel :)
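The trick described above might be sketched roughly like this (a hypothetical illustration, not Meriyah's or Babel's actual lexer): dispatch on the first character, scan the whole word with a plain while loop, and only then compare against the few keyword candidates for that first character.

```javascript
// Hypothetical first-char dispatch: a keyword can only begin with a
// lowercase ASCII letter, so most words are ruled out with one lookup.
const KEYWORDS_BY_FIRST_CHAR = new Map([
  [102 /* f */, ["for", "function", "finally", "false"]],
  [105 /* i */, ["if", "in", "instanceof", "import"]],
  // ...remaining keywords omitted for brevity
]);

function isIdentifierPart(code) {
  return (
    (code >= 97 && code <= 122) ||
    (code >= 65 && code <= 90) ||
    (code >= 48 && code <= 57) ||
    code === 36 ||
    code === 95
  );
}

function scanWord(input, pos) {
  const start = pos;
  // Plain while-loop identifier scan, no regex.
  while (pos < input.length && isIdentifierPart(input.charCodeAt(pos))) pos++;
  const word = input.slice(start, pos);
  const candidates = KEYWORDS_BY_FIRST_CHAR.get(input.charCodeAt(start));
  const isKeyword = candidates !== undefined && candidates.includes(word);
  return { word, isKeyword, end: pos };
}

console.log(scanWord("function foo", 0).isKeyword); // true
console.log(scanWord("fun()", 0).isKeyword); // false
```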

this.isUnparsedContextual(next, "function")
);
}

@@ -1833,10 +1822,10 @@
return false;
}

const lookahead = this.lookahead();
const next = this.nextTokenStart();
return (
lookahead.type === tt.comma ||
(lookahead.type === tt.name && lookahead.value === "from")
this.input.charCodeAt(next) === charCodes.comma ||
this.isUnparsedContextual(next, "from")
);
}
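The isUnparsedContextual helper this calls (added in util.js in this PR) can be modeled standalone like this; a simplified sketch where the isIdentifierChar stub covers ASCII only. The word must match at the given offset and must not be followed by another identifier character, so `from` matches while `fromage` does not.

```javascript
// Simplified standalone model; the real isIdentifierChar also handles
// non-ASCII identifier characters.
function isIdentifierChar(code) {
  return (
    (code >= 48 && code <= 57) ||
    (code >= 65 && code <= 90) ||
    (code >= 97 && code <= 122) ||
    code === 36 ||
    code === 95
  );
}

function isUnparsedContextual(input, nameStart, name) {
  const nameEnd = nameStart + name.length;
  return (
    input.slice(nameStart, nameEnd) === name &&
    (nameEnd === input.length ||
      !isIdentifierChar(input.charCodeAt(nameEnd)))
  );
}

console.log(isUnparsedContextual("import foo from 'a'", 11, "from")); // true
console.log(isUnparsedContextual("import x fromage", 9, "from")); // false
```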

26 changes: 22 additions & 4 deletions packages/babel-parser/src/parser/util.js
@@ -4,6 +4,8 @@ import { types as tt, type TokenType } from "../tokenizer/types";
import Tokenizer from "../tokenizer";
import type { Node } from "../types";
import { lineBreak, skipWhiteSpace } from "../util/whitespace";
import { isIdentifierChar } from "../util/identifier";
import * as charCodes from "charcodes";

const literal = /^('|")((?:\\?.)*?)\1/;

@@ -26,8 +28,15 @@ export default class UtilParser extends Tokenizer {
}

isLookaheadRelational(op: "<" | ">"): boolean {
const l = this.lookahead();
return l.type === tt.relational && l.value === op;
const next = this.nextTokenStart();
if (this.input.charAt(next) === op) {
if (next + 1 === this.input.length) {
return true;
}
const afterNext = this.input.charCodeAt(next + 1);
return afterNext !== op.charCodeAt(0) && afterNext !== charCodes.equalsTo;
}
return false;
}

// TODO
@@ -60,9 +69,18 @@
);
}

isUnparsedContextual(nameStart: number, name: string): boolean {
const nameEnd = nameStart + name.length;
return (
this.input.slice(nameStart, nameEnd) === name &&
(nameEnd === this.input.length ||
!isIdentifierChar(this.input.charCodeAt(nameEnd)))
);
}

isLookaheadContextual(name: string): boolean {
const l = this.lookahead();
return l.type === tt.name && l.value === name;
const next = this.nextTokenStart();
return this.isUnparsedContextual(next, name);
}

// Consumes contextual keyword if possible.
9 changes: 7 additions & 2 deletions packages/babel-parser/src/plugins/typescript/index.js
@@ -19,6 +19,7 @@ import {
BIND_CLASS,
} from "../../util/scopeflags";
import TypeScriptScopeHandler from "./scope";
import * as charCodes from "charcodes";

type TsModifier =
| "readonly"
@@ -657,7 +658,10 @@ export default (superClass: Class<Parser>): Class<Parser> =>
: this.match(tt._null)
? "TSNullKeyword"
: keywordTypeFromName(this.state.value);
if (type !== undefined && this.lookahead().type !== tt.dot) {
if (
type !== undefined &&
this.lookaheadCharCode() !== charCodes.dot
) {
const node: N.TsKeywordType = this.startNode();
this.next();
return this.finishNode(node, type);
@@ -1203,7 +1207,8 @@ export default (superClass: Class<Parser>): Class<Parser> =>

tsIsExternalModuleReference(): boolean {
return (
this.isContextual("require") && this.lookahead().type === tt.parenL
this.isContextual("require") &&
this.lookaheadCharCode() === charCodes.leftParenthesis
);
}

29 changes: 15 additions & 14 deletions packages/babel-parser/src/tokenizer/index.js
@@ -13,6 +13,7 @@ import {
lineBreakG,
isNewLine,
isWhitespace,
skipWhiteSpace,
} from "../util/whitespace";
import State from "./state";

@@ -168,6 +169,18 @@ export default class Tokenizer extends LocationParser {
return curr;
}

nextTokenStart(): number {
const thisTokEnd = this.state.pos;
skipWhiteSpace.lastIndex = thisTokEnd;
const skip = skipWhiteSpace.exec(this.input);
// $FlowIgnore: The skipWhiteSpace ensures to match any string
return thisTokEnd + skip[0].length;
}


@JLHwung In most cases this will not improve performance; using a regex for this purpose may have the opposite effect. I looked at the lexer code and... well... I understand why you do this, but you should only need to use the current index, like this.input.charCodeAt(this.index). Refactor the lexer into a while loop and you get all of this for free without any need for regular expressions. You only need to do index + 1 if you hit 0x20. If you need the length, store it in a local variable before and after the iteration. Doing it like this also gets rid of some overhead like [0] and .length. For this case you simply declare a local variable let length = 0 at the start of the function, do length++ inside the loop, and return thisTokEnd + length.
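The while-loop alternative described above might look like the sketch below. Note that it skips plain whitespace only; unlike Babel's skipWhiteSpace regex it does not skip comments.

```javascript
// Sketch of the suggested regex-free whitespace skip.
function nextTokenStartLoop(input, thisTokEnd) {
  let length = 0;
  let ch = input.charCodeAt(thisTokEnd);
  // 0x20 space, 0x09 tab, 0x0a line feed, 0x0d carriage return.
  // charCodeAt past the end returns NaN, which exits the loop.
  while (ch === 0x20 || ch === 0x09 || ch === 0x0a || ch === 0x0d) {
    length++;
    ch = input.charCodeAt(thisTokEnd + length);
  }
  return thisTokEnd + length;
}

console.log(nextTokenStartLoop("let   x = 1;", 3)); // 6
```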

Contributor Author @JLHwung Aug 29, 2019:


Besides whitespace, the skipWhiteSpace regex also skips line comments (//) and block comments (/* */). We could do better by removing the regex and explicitly coding a finite automaton for this, but it may be worth it only if it is identified as our bottleneck.

The skipWhiteSpace regex achieves 3M ops/sec according to jsperf, roughly 1000 cycles on a 3GHz processor. As JavaScript is a high-level language, V8 has done a good job.
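The point about comments can be demonstrated directly. The regex below is assumed to mirror babel-parser's skipWhiteSpace export from util/whitespace; a naive space-skipping loop would stop at the first `/`.

```javascript
// Assumed shape of babel-parser's skipWhiteSpace regex.
const skipWhiteSpace = /(?:\s|\/\/.*|\/\*[^]*?\*\/)*/g;

const input = "async /* block */ // line\n  function f() {}";
const tokEnd = "async".length;
skipWhiteSpace.lastIndex = tokEnd;
const skipped = skipWhiteSpace.exec(input)[0].length;

// Block comment, line comment, newline, and indentation are all
// consumed in a single match:
console.log(input.slice(tokEnd + skipped, tokEnd + skipped + 8)); // prints: function
```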

@KFlash Aug 29, 2019:


You shouldn't blindly trust jsperf. A regex can be super fast when it does a single task, but in the case of the Babel parser this one serves multiple purposes and slows things down. In the long run a while loop is the best solution when it comes to a parser, even if you super-optimize the regular expression.
I'm just giving you a friendly tip; everyone codes differently :) But to illustrate this, run a benchmark against Acorn. You'll see the Babel parser is 1x slower (estimated). The reason, I guess, is that even the small things haven't been optimized when adding new stuff, and changing the small things as you do in this case will have a larger impact than you think, in a positive way. The same goes for memory usage and for larger files.
I guess the babel parser gets into trouble when parsing files larger than 2 MB.
You can also run a benchmark against my parser, Meriyah. I guess the Babel parser is 4-5x slower (estimated). And Meriyah can parse 50 MB files with almost the same performance. That's because I use a while loop in the lexer and no regexp :)

lookaheadCharCode(): number {
return this.input.charCodeAt(this.nextTokenStart());
}

// Toggle strict mode. Re-reads the next number or string to please
// pedantic tests (`"use strict"; 010;` should fail).

@@ -267,13 +280,7 @@
const startLoc = this.state.curPosition();
let ch = this.input.charCodeAt((this.state.pos += startSkip));
if (this.state.pos < this.length) {
while (
ch !== charCodes.lineFeed &&
ch !== charCodes.carriageReturn &&
ch !== charCodes.lineSeparator &&
ch !== charCodes.paragraphSeparator &&
++this.state.pos < this.length
) {
while (!isNewLine(ch) && ++this.state.pos < this.length) {
ch = this.input.charCodeAt(this.state.pos);
}
}
@@ -439,13 +446,7 @@
let ch = this.input.charCodeAt(this.state.pos);
if (ch !== charCodes.exclamationMark) return false;

while (
ch !== charCodes.lineFeed &&
ch !== charCodes.carriageReturn &&
ch !== charCodes.lineSeparator &&
ch !== charCodes.paragraphSeparator &&
++this.state.pos < this.length
) {
while (!isNewLine(ch) && ++this.state.pos < this.length) {
ch = this.input.charCodeAt(this.state.pos);
}
