
fix(babel-parser): avoid state.clone() to clone the whole token store #11029

Merged
merged 5 commits into from Jan 20, 2020

Conversation

@3cp (Contributor) commented Jan 17, 2020

Q                       A
Fixed Issues?           No
Patch: Bug Fix?         Yes
Major: Breaking Change? No
Minor: New Feature?     No
Tests Added + Pass?     Yes
Documentation PR Link
Any Dependency Changes? No
License                 MIT

This fixes a performance issue on large input when the option {tokens: true} is turned on together with the typescript plugin, which calls state.clone() quite a few times.

To reproduce the performance issue, create a Node.js project with the following two files:

package.json

{
  "dependencies": {
    "@babel/parser": "latest",
    "typescript": "latest"
  }
}

index.js

var parser = require('@babel/parser');
var fs = require('fs');

var code = fs.readFileSync(require.resolve('typescript'), 'utf8');

var start = new Date().getTime();
var result = parser.parse(code, {
  plugins: ['typescript'],
  tokens: true
});
var spend = new Date().getTime() - start;
console.log('spend ' + (spend/1000).toFixed(2) + 's');
console.log('tokens length: ' + result.tokens.length);

We parse the typescript package's main file (7.8 MB) to demonstrate the performance issue.
After npm i, run node index.js

On my machine:

spend 18.28s
tokens length: 897894

For comparison, the other two results:

  1. remove tokens: true
     spend 3.51s
  2. or remove plugins: ['typescript']
     spend 2.70s
     tokens length: 897894

Why the huge difference?

Because babel-parser's typescript plugin calls the state.clone() API quite a few times (the flow plugin also uses the clone API).
But the State class has a token store, state.tokens, to track all parsed tokens. This means every time state.clone() is called, it duplicates the huge token store state.tokens.

This is not only a performance issue; it also causes a memory spike.
On my Mac, reading roughly from Activity Monitor, the Node.js process goes up to around 800 MB (vs 160 MB when the tokens option is off).
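The cost is easy to see without Babel at all. A standalone sketch (hypothetical code, not Babel's actual source) of a State that keeps its own token store: every clone() pays for a copy of the whole array, so cloning gets slower as the input grows.

```javascript
// Hypothetical sketch: a parser State that owns the token store,
// so clone() must shallow-copy the whole array every time.
class State {
  constructor() {
    this.pos = 0;
    this.tokens = []; // grows linearly with input when { tokens: true }
  }
  clone() {
    const copy = new State();
    copy.pos = this.pos;
    copy.tokens = this.tokens.slice(); // shallow-copies the WHOLE store
    return copy;
  }
}

const state = new State();
for (let i = 0; i < 200000; i++) {
  state.tokens.push({ type: "name", value: "x" + i });
}

const start = Date.now();
for (let i = 0; i < 50; i++) state.clone(); // 50 clones copy 10M entries
console.log("50 clones took " + (Date.now() - start) + "ms");
```

With the tokens option off, this.tokens stays empty and the same 50 clones are essentially free, which matches the timings above.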

The fix

The fix is very simple: move the token store up from State to the Tokenizer itself, so the tokens aren't cloned along with copied states. A cleanup step is added to deal with the try-catch parsing branches in the typescript/flow plugins.

This fix restores the running time of the code sample above to the 2~3 second range.
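The shape of the fix can be sketched as follows (names here are hypothetical, not Babel's actual internals): the token store lives on the tokenizer, so cloning parser state is cheap, and a failed speculative parse truncates the store back to its length at the checkpoint.

```javascript
// Illustrative sketch of the fix: the token store is owned by the
// tokenizer, outside the cloneable state.
class Tokenizer {
  constructor() {
    this.tokens = [];         // moved up out of the cloneable state
    this.state = { pos: 0 };
  }
  cloneState() {
    return { ...this.state }; // no token copy here any more
  }
  tryParse(fn) {
    const oldState = this.cloneState();
    const oldTokensLength = this.tokens.length;
    try {
      return fn();
    } catch (err) {
      this.state = oldState;
      this.tokens.length = oldTokensLength; // drop tokens from the failed branch
      return null;
    }
  }
}

const tokenizer = new Tokenizer();
tokenizer.tokens.push("async");
tokenizer.tryParse(() => {
  tokenizer.tokens.push("<", "T", ">");
  throw new Error("not type arguments after all");
});
console.log(tokenizer.tokens); // [ 'async' ] — no leftovers from the failed branch
```

The truncation step is what keeps the shared store consistent when a speculative branch is abandoned.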

@JLHwung (Contributor) left a comment

I like this PR!

@JLHwung JLHwung added pkg: parser PR: Performance 🏃‍♀️ A type of pull request used for our changelog categories labels Jan 17, 2020
@nicolo-ribaudo (Member) left a comment

Consider this example:

const code = `
async < T > (x);
`;

const out = parser.parse(code, {
  plugins: ["typescript"],
  tokens: true,
});

console.log(JSON.stringify(out.tokens, null, 2));

All the tokens are now duplicated. There is a reason why tokens is in state and why we clone arrays when cloning state: it's to avoid ending up with invalid tokens when we try to parse something, then go back and try to parse something else.

You can check the first commit which introduced the tokens array: at that time it was even explicitly copied.
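The duplication can be reproduced without Babel at all. A minimal sketch (hypothetical code, not the parser's internals) of backtracking over a shared token store with no rollback: the failed attempt's tokens stay behind, so the retry records everything twice.

```javascript
// Hypothetical sketch: a shared token store plus backtracking,
// with no rollback of tokens pushed by the failed attempt.
const tokens = [];
function tokenize(src) {
  for (const t of src.split(/\s+/).filter(Boolean)) tokens.push(t);
}

tokenize("async < T >"); // first attempt: maybe a type-parameter list...
// ...that attempt fails, so the parser backtracks and re-reads the
// same source as a comparison expression:
tokenize("async < T >");

console.log(tokens); // every token appears twice
```

Cloning the array along with the state (or truncating it on rollback, as this PR ends up doing) is what prevents this.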

No new test was added, as existing tests covered the tokens parser option.

Could you please add a test with tokens and typescript, like the example I provided? 😅

@3cp (Contributor, Author) commented Jan 17, 2020

@nicolo-ribaudo thx, I now see the duplicated tokens with my fix.

I don't quite understand. From what I saw, the token store is only populated when options.tokens is set to true, and I only see mutation happening in the Tokenizer. That means for the majority of use cases, where options.tokens is false, the token store is always empty. Parsing does not rely on it.

There must be somewhere in the typescript plugin that mutates the token store (only when it is not empty), but I cannot find the relevant code. Can you point me to where the token store is touched by the typescript plugin?

Also, I see only the typescript and flow plugins use tryParse() (which clones state too) and state.clone(). I could avoid the two plugins to bypass this issue for now.

Nevertheless, we need to find a way to avoid cloning the huge token store array (currently a shallow copy). This doesn't scale with input code size.

@3cp (Contributor, Author) commented Jan 18, 2020

@nicolo-ribaudo I see now what the typescript plugin does.
It is not mutating the tokens directly; it keeps the old state, tries to parse something, and if that fails, restores the old state and tries to parse something else.
We can do better than this, but it requires a bit of redesign to avoid the expensive clone.

@3cp 3cp changed the title fix(babel-parser): avoid state.clone() to clone the whole token store WIP fix(babel-parser): avoid state.clone() to clone the whole token store Jan 18, 2020
3cp added a commit to dumberjs/modify-code that referenced this pull request Jan 18, 2020
This is to give modify-code a chance to bypass a babel performance issue babel/babel#11029.
@3cp 3cp changed the title WIP fix(babel-parser): avoid state.clone() to clone the whole token store fix(babel-parser): avoid state.clone() to clone the whole token store Jan 18, 2020
@3cp (Contributor, Author) commented Jan 18, 2020

@nicolo-ribaudo please review the added tests (output.json generated by the latest master) and the tiny trick to maintain a clean token store.

Fixed the performance issue on large input when turned on option {tokens: true} and typescript plugin which uses quite a few state.clone().
The output.json is generated by old master to make sure no regression.
@JLHwung JLHwung self-requested a review January 19, 2020 04:02
@nicolo-ribaudo (Member) left a comment

Thanks!

@nicolo-ribaudo (Member) commented
Out of curiosity, could you run the benchmark you posted in your PR description using this PR, to see how much it improved performance?

@3cp (Contributor, Author) commented Jan 20, 2020

Good idea!

#latest 7.8.3
spend 17.06s
tokens length: 897894

#patched
spend 3.27s
tokens length: 897894

@3cp (Contributor, Author) commented Jan 20, 2020

I ran a rough benchmark, averaging 10 runs:

.length: avg 2.55s
.splice: avg 2.65s

There might be something else (my system load) making splice slightly worse. I would say we leave that optimization out, as there's no obvious benefit even for huge input (the 7.8 MB file).
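For reference, the two truncation strategies compared above produce identical array contents (a standalone sketch, not Babel's code); the difference is that splice() also allocates and returns an array for the removed tail, which may explain the slightly worse timing.

```javascript
// Truncating via .length: drops the tail in place, allocating nothing.
const a = [1, 2, 3, 4, 5];
a.length = 3;
console.log(a); // [ 1, 2, 3 ]

// Truncating via .splice(): same result, but the removed elements are
// collected into a new array and returned.
const b = [1, 2, 3, 4, 5];
const removed = b.splice(3);
console.log(b);       // [ 1, 2, 3 ]
console.log(removed); // [ 4, 5 ]
```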

@github-actions github-actions bot added the outdated A closed issue/PR that is archived due to age. Recommended to make a new issue label Apr 20, 2020
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 20, 2020