
Prepare @babel/core for asynchronicity #10507

Merged
merged 3 commits into from Jan 10, 2020

Conversation

nicolo-ribaudo
Member

@nicolo-ribaudo nicolo-ribaudo commented Sep 27, 2019

Q                         A
Fixed Issues?             First step for #9888
Patch: Bug Fix?
Major: Breaking Change?
Minor: New Feature?
Tests Added + Pass?       Yes
Documentation PR Link
Any Dependency Changes?
License                   MIT

In this PR, I am refactoring @babel/core to use gensync. It's a tool by @loganfsmyth which allows defining functions as generators that can then be run either synchronously or asynchronously, pushing control to the edges of the program.

For example, it allows doing this:

const gensync = require("gensync");
const fs = require("fs-for-gensync"); // hypothetical gensync-aware fs wrapper

const readJSON = gensync(function* (path) {
  const contents = yield* fs.readFile(path, "utf8"); // Sync or async? It doesn't matter
  return JSON.parse(contents);
});

readJSON.sync("file.json"); // internally calls fs.readFileSync(path, encoding)
readJSON.async("file.json"); // internally calls fs.readFile(path, encoding, cb)

By doing so, we will be able to easily allow some parts of Babel to be asynchronous. I identified 4 possible parts:

  • .pre functions of plugins
  • .post functions of plugins
  • Configuration functions (i.e. in babel.config.js)
  • Configuration file loading

The latter is necessary because, starting from Node v12.11.0, if a user has "type": "module" in their package.json we won't be able to load babel.config.js using require, but will need to use import(). (nodejs/modules#389)

Since JavaScript doesn't have native algebraic effects[1], which would have allowed me to use yield only in those 4 places, we have to use yield* through the whole call stack. The last two parts are deep inside @babel/core, and that's why yield* spreads almost everywhere.

This PR already makes I/O actions asynchronous when using the async functions.

This PR aligns the API of all the functions which could interact with the disk: now parse, transform, transformFile, transformFromAst, loadPartialConfig and loadOptions all have three versions:

  • fooSync, which synchronously returns the result;
  • fooAsync, which returns a promise;
  • foo, which accepts a callback. For compatibility reasons, some of these behave synchronously when the callback is not provided.
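The relationship between the three flavors, including the compatibility behaviour of foo without a callback, can be sketched with a toy factory (makeApi and the fooSync/fooAsync/foo names here are illustrative, not @babel/core internals):

```javascript
// Sketch: given a sync and an async implementation, derive the three
// public flavors described above.
function makeApi({ sync, async: asyncImpl }) {
  return {
    fooSync: sync,       // returns the result directly
    fooAsync: asyncImpl, // returns a promise
    foo(input, callback) {
      if (callback === undefined) {
        // Compatibility: behave synchronously when no callback is given.
        return sync(input);
      }
      asyncImpl(input).then(
        result => process.nextTick(callback, null, result),
        err => process.nextTick(callback, err)
      );
    },
  };
}

const api = makeApi({
  sync: n => n + 1,
  async: n => Promise.resolve(n + 1),
});
```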

TODO:

[1] I learned exactly what they are while preparing this PR


Performance differences: Not significant.
In order to test this PR, I compiled every file inside packages/*/src (405 files) both synchronously and asynchronously, using the following script. It's always run using node demo.js --sync && node demo.js --sync && node demo.js --sync && node demo.js --sync && node demo.js --async && node demo.js --async && node demo.js --async && node demo.js --async.

"use strict";

const load = name => require(`./packages/babel-${name}`);

const babel = load("core");
const fs = require("fs");

function walkSync(dir, filelist) {
  const files = fs.readdirSync(dir);
  filelist = filelist || [];
  files.forEach(file => {
    if (fs.statSync(dir + "/" + file).isDirectory()) {
      filelist = walkSync(dir + "/" + file, filelist);
    } else {
      filelist.push(dir + "/" + file);
    }
  });
  return filelist;
}

const files = [];

const packages = fs.readdirSync("packages");
for (const name of packages) {
  if (name === "README.md" || name.startsWith("babel-runtime")) continue;
  files.push.apply(files, walkSync("packages/" + name + "/src"));
}


const start = process.hrtime();
global.results = {};

if (process.argv.includes("--sync")) {
  process.env.SYNC_OR_ASYNC = "sync";
  for (const file of files) {
    global.results[file] = babel.transformFileSync(file);
  }
  console.log("Sync: " + process.hrtime(start));
} else {
  process.env.SYNC_OR_ASYNC = "async";
  let i = 0;
  for (const file of files) {
    i++;
    babel.transformFile(file, (err, out) => {
      if (err) {
        console.log(err);
        process.exit(1);
      }
      global.results[file] = out;
  
      if (--i === 0) console.log("Async: " + process.hrtime(start));
    });
  }
}
Results

NODE_ENV is always "test"

  • master branch, with api.cache.never():

    Sync: 5,886560734
    Sync: 5,885043757
    Sync: 5,867874524
    Sync: 5,803315282
    Async: 5,949791895
    Async: 5,789683786
    Async: 5,806937770
    Async: 5,853401118
    
  • core-async branch, with api.cache.never():

    Sync: 5,960720491
    Sync: 5,894450713
    Sync: 5,891051295
    Sync: 5,946484727
    Async: 5,913818398
    Async: 5,929368030
    Async: 6,11599499
    Async: 6,58814871
    
  • master branch, with api.cache.using(() => process.env.SYNC_OR_ASYNC):

    Sync: 5,883534231
    Sync: 4,840278891
    Sync: 4,994791415
    Sync: 4,897095093
    Async: 4,874467892
    Async: 4,824938870
    Async: 4,826639093
    Async: 4,851304427
    
  • core-async branch, with api.cache.using(() => process.env.SYNC_OR_ASYNC):

    Sync: 4,866956272
    Sync: 4,820806372
    Sync: 4,857302419
    Sync: 4,783584388
    Async: 5,869067030
    Async: 5,931280575
    Async: 6,17525481
    Async: 5,993042102
    

Conclusion: When running synchronously, the performance is almost the same. On the other hand, when running asynchronously this branch is consistently a bit slower than master. I think it's because gensync waits one tick for each yielded value, just as await does in async functions.
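That tick-per-value cost can be observed in isolation: awaiting even a plain, already-available value defers the continuation to a later microtask.

```javascript
// Each `await` defers the rest of the function by (at least) one
// microtask tick, even when the awaited value is not a promise.
const order = [];

(async function () {
  order.push("before await");
  await 0; // not a promise, but the continuation is still deferred
  order.push("after await");
})();

order.push("sync code after the call");

// Timers run after all microtasks have flushed:
setTimeout(() => {
  console.log(order.join(" | "));
  // before await | sync code after the call | after await
}, 0);
```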

@nicolo-ribaudo nicolo-ribaudo added PR: Internal 🏠 A type of pull request used for our changelog categories pkg: core labels Sep 27, 2019
@buildsize

buildsize bot commented Sep 27, 2019

File name                Previous Size  New Size  Change
babel-preset-env.js      2.77 MB        2.83 MB   59.72 KB (2%)
babel-preset-env.min.js  1.67 MB        1.69 MB   22.47 KB (1%)
babel.js                 2.95 MB        3.01 MB   59.73 KB (2%)
babel.min.js             1.63 MB        1.65 MB   22.48 KB (1%)

@nicolo-ribaudo
Member Author

nicolo-ribaudo@3f640b2 (not included in this PR) shows how with the new architecture it is straightforward to allow some parts of Babel to be async.

@@ -45,10 +47,10 @@ type SimpleContext = {
   caller: CallerMetadata | void,
 };

-export default function loadFullConfig(
+export default gensync<[any], ResolvedConfig | null>(function* loadFullConfig(
Member

Why the any instead of mixed?

Member Author

It's because I get errors about objects not being compatible with mixed:

const config = loadConfig.sync({
  babelrc: false,
  configFile: false,
  plugins: [blockHoistPlugin],
});

export function loadOptions(opts: {}): Object | null {
  const config = loadFullConfig.sync(opts);

Member

Huh, seems like a Flow bug, but fair enough.


const cache = new CacheConfigurator(data);

const value = yield* handler(arg, cache);
Member

One big complexity here is that, if this is async, you can end up with multiple calls to handler running in parallel. In my opinion, we should only ever allow one call to handler at a time for a given arg.

Member Author

I'm not sure about how they could run in parallel. If it is async, handler's effects will be yielded and no other code will be run until they are completed, including other handler calls.

Member

For instance, if you do

var doThing = makeWeakCache(function*(arg) {
  yield* delay(100);
  return {};
});

var obj = {};
const [r1, r2] = yield* gensync.all([doThing(obj), doThing(obj)]);

console.log(r1 === r2);

would be true in sync code and false in async code with your current implementation, because the first call would run and populate the cache with the r1 object, but then the second one would finish and replace the callCache data with r2. So really the second call needs to see that there is already a call in progress and wait for it to finish before checking whether the cache entry matches.

Member

Was this explanation enough to make my concern clearer?

Member Author

Yes, sorry, I just didn't have time to update it! There is a reason this PR is marked as "draft" 😛

Member Author

TODO

Member Author

@loganfsmyth I was implementing this, and I'm not sure that it is something that we want, since it can cause a few problems:

  1. If a function is called recursively (like a plugin loading another plugin via inherits), it will cause a deadlock: the outer function needs the inner one's return value, but the inner one won't run until the outer one is finished.

  2. It will make async calls to Babel slower. Suppose that we have this code:

    var maybeDelay = gensync({
      sync: () => {},
      async: t => delay(t),
    });
    
    var doThing = gensync(makeWeakCache(function*(arg) {
      yield* maybeDelay(100);
      return {};
    }));
    
    doThing.async();
    await delay(50);
    doThing.sync();

    here doThing.sync() runs while doThing.async is paused by the delay call. If we can't call doThing.sync until doThing.async is finished, the only option we have is to throw an error, because it can't wait.

    maybeDelay might seem like a function crafted only to create this example, without real-world applications, but it is exactly like any fs.* operation.

I think that randomly throwing errors caused by race conditions is too high a cost to introduce only for performance reasons. I think that we have two options:

  1. Leave it as-is, allowing two different async calls to happen in parallel
  2. Always invalidate the cache when switching between sync/async. I already had to do this for cache.invalidate calls, which can be either sync or async, and it shouldn't be observable.

Member

The big issue for me is that I could really see this being a noticeable perf hit for something that starts compiling a bunch of files in parallel, like @babel/cli. If you do something like

var results = Promise.all(files.map(filename => babel.transformFileAsync(filename)));

then you'll likely end up calling every plugin and preset function for every single file, because config resolving will all be kicked off in parallel, and each file's process will kick off an operation that won't get the cached value of the previous file's compilation, because the operation will have started before the previous one finished. I haven't actually tested that though, so if you want to verify that, it could help the discussion. Maybe I'm missing something.

If a function is called recursively (like a plugin loading another plugin via inherits), it will cause a deadlock: the outer function needs the inner one's return value, but the inner one won't run until the outer one is finished.

We should probably disallow that anyway, if we don't already, since it would be an unresolvable problem in sync code too. I agree that recursion could be a concern, but I'm not sure it's one that wouldn't already be an issue in sync cases in general too.

If we can't call doThing.sync until doThing.async is finished, the only option we have is to throw an error, because it can't wait.

I think it's probably fine to have entirely separate caching for sync and async calls to avoid this conflict. In fact we likely explicitly want that, because plugins could return different results depending on whether the system is sync or async.
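Entirely separate per-mode caches could be as small as this (a sketch with assumed names, not the PR's code):

```javascript
// Sketch: one WeakMap per execution mode, so a sync call can never
// observe or clobber an entry produced by an async call.
function makeModeKeyedCaches() {
  const caches = { sync: new WeakMap(), async: new WeakMap() };
  return mode => caches[mode];
}

const getCache = makeModeKeyedCaches();
```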

Member

Sounds like you need a mutex.


if (cachedValue) {
for (const { value, valid } of cachedValue) {
if (valid(data)) return value;
Member

Do you think we should consider allowing this to be async too? We're currently stuck with the one statSync because it is sync-only. We shouldn't expose gensync to consumers, but we could repeat the .then check here too and allow returning a promise, if we wanted.

Not needed now and easy to expand later, but curious for your thoughts.

Member Author

I didn't change it because I still have to finish refactoring the caching logic, but I will do it 👍

Member

Cool!

@loganfsmyth
Member

This is exactly the usage I had in mind when designing gensync, so I'm happy to hear that it seems to have worked pretty smoothly overall.

packages/babel-core/package.json (resolved)
@@ -12,7 +15,8 @@ import {
} from "./config-chain";
import type { UnloadedDescriptor } from "./config-descriptors";
import traverse from "@babel/traverse";
import { makeWeakCache, type CacheConfigurator } from "./caching";
import { makeWeakCache, type CacheConfigurator } from "./caching-a";
import { makeWeakCache as makeWeakCacheSyncSync } from "./caching";
Member

makeWeakCacheSyncSync => makeWeakCacheSync?

Member Author

I still have to refactor it, but there is a difference: Sync allows an async cache invalidation function with a sync cached function, while with SyncSync both are synchronous.

});
}

export const isAsync = gensync<[any], any>({
Member

Is there a way we can avoid any? I tried really hard to never use any in the core codebase, which is why I keep mentioning it.

Member Author

This can definitely be removed, I don't know why I used it


type MaybePromise<T> = T | Promise<T>;

export default function forward<ActionArgs: mixed[], ActionReturn, Return>(
Member

It would be good to have some comments in these, since I don't think it's very clear what their expected usecases would be.


export function isThenable(val: mixed): boolean %checks {
return (
/*:: val instanceof Promise && */
Member

Tricky!



Comment on lines 165 to 180
* 1. If there is a valid cache associated to the current "arg" parameter,
* a. RETURN the cached value
* 2. If there is a ConfigLock associated to the current "arg" parameter,
* a. Wait for that lock to be released
* b. GOTO step 1
* 3. If there is a FinishLock associated to the current "arg" parameter representing a valid cache,
* a. Wait for that lock to be released
* b. RETURN the value associated with that lock
* 4. Acquire two locks: ConfigLock and FinishLock
* 5. Start executing the cached function
* 6. When the cache is configured, release ConfigLock
* 7. Wait for the function to finish executing
* 8. Release FinishLock
* 9. Send the function result to anyone waiting on FinishLock
* 10. Store the result in the cache
* 11. RETURN the result
Member Author

@loganfsmyth What do you think about this algorithm, to solve the concurrency issue?
It is suboptimal in cases like the following one (it will take 2s instead of 1s), but I don't think that they are common:

const fn = makeStrongCache(function* (arg) {
  yield* wait(1000);
  cache.never();
}).async;

fn(); fn();

I initially tried to force cache.* to be called synchronously (before any yield), but then found out that it doesn't even work for our internal code: https://github.com/babel/babel/pull/10507/files#diff-83f25d98662c0f8e8ad7cc7d1ed06f8eR32
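Restricted to the async side, and ignoring cache validation and the ConfigLock, the waiting behaviour of the numbered steps above could be sketched like this (makeAsyncWeakCache is a made-up name, not the PR's implementation):

```javascript
// Sketch: make a second async call for the same `arg` wait on the
// in-flight one instead of re-running the handler (FinishLock only;
// the ConfigLock / cache-validation parts are omitted).
function makeAsyncWeakCache(handler) {
  const results = new WeakMap();  // arg -> settled value   (steps 1, 10)
  const inFlight = new WeakMap(); // arg -> pending promise (steps 3-4)

  return async function cached(arg) {
    if (results.has(arg)) return results.get(arg);   // step 1: cache hit
    if (inFlight.has(arg)) return inFlight.get(arg); // step 3: wait on lock

    const promise = Promise.resolve()
      .then(() => handler(arg)) // step 5: run the cached function
      .then(value => {
        results.set(arg, value); // step 10: store the result
        inFlight.delete(arg);    // step 8: release the lock
        return value;            // steps 9, 11
      });
    inFlight.set(arg, promise);  // step 4: acquire the lock
    return promise;
  };
}
```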

Member Author

I tested the performance of this change, and there is a problem.
I have compiled every src file of this repo four times (405 × 4 files), and here are the results:

        cache.forever()   cache.never()
async   15.372567052 s    470.434076412 s
sync    15.862861386 s    19.743430078 s

As you can see, there is a big problem with cache.never() + async.
It is related to re-compiling the same file. If I compile every file once instead of four times, it doesn't have this big difference:

cache.never()
async 10.391877456 s
sync 6.149023702 s

If I compile the same file multiple times with cache.never(), the time initially grows linearly (as expected), and then it slows down:

#       async             sync
100     2.308958374 s     2.207794832 s
200     4.242482485 s     3.747946110 s
400     10.356814312 s    6.574130764 s
800     50.817692998 s    11.663716923 s
1600    466.833293529 s   22.983728867 s

As you can see, the numbers for 1 × 1600 files are very similar to the numbers for 405 × 4 files (first table), and the numbers for 1 × 400 files are very similar to 405 × 1 (second table).
This shows that it doesn't matter whether we are always transpiling the same file or not: when compiling more than 400 files, the async version becomes super slow. My guess is that it is caused by all the promises getting created, but I know neither how to verify it nor whether it can be fixed.

Member Author

@loganfsmyth We can solve this big performance problem if we disallow the cache to be configured asynchronously. Synchronously configuring the cache with an async validator would still be allowed:

// waitFor transforms a "Promise<T> | T" into a "Gensync<T>"

makeWeakCache(function* (arg) {
  // Allowed, because cache.using is called synchronously

  const val = yield* waitFor(
    cache.using(async () => {
      await wait(100);
      return process.env.FOO;
    })
  );
});


makeWeakCache(function* (arg) {
  // Disallowed, because cache.using is called asynchronously

  yield* waitFor(wait(100));
  const val = cache.using(() => process.env.FOO);
});
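The waitFor helper assumed above could be shaped roughly like this, written here as a plain sync/async pair (the real code would wrap such a pair, e.g. with gensync's { sync, async } form): in async mode the value is awaited; in sync mode a promise is a hard error.

```javascript
// Sketch of the assumed waitFor helper (not the PR's actual code).
function isThenable(value) {
  return (
    !!value &&
    (typeof value === "object" || typeof value === "function") &&
    typeof value.then === "function"
  );
}

const waitFor = {
  sync(value) {
    if (isThenable(value)) {
      throw new Error("Cannot await a promise while running synchronously.");
    }
    return value;
  },
  async(value) {
    return Promise.resolve(value);
  },
};
```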

In the future, this change will be observable to our users because they won't be able to use await before configuring the cache.

In order to implement this behaviour I had to modify gensync's internals, because I couldn't find a way, from the outside, to know whether a generator has been suspended. I will open a PR soon, because I edited it in my node_modules and I fear losing those changes, but if you have a better solution feel free to propose it!

Member Author

These are the new numbers with cache.never():

#       async            sync             sync (master)
100     2.219376366 s    2.256803449 s    2.143184428 s
200     3.714403922 s    3.641993385 s    3.591130842 s
400     6.338499966 s    6.638144645 s    6.299714871 s
800     11.772867713 s   12.198001927 s   11.539516664 s
1600    21.811404623 s   22.632490371 s   21.436522896 s

@nicolo-ribaudo
Member Author

Ready for review!

@nicolo-ribaudo nicolo-ribaudo marked this pull request as ready for review October 7, 2019 13:40
@nicolo-ribaudo
Member Author

nicolo-ribaudo commented Oct 16, 2019

Note that node 13.0.0 will be released soon (nodejs/node#29504), and v13.1.0 will have unflagged modules (https://twitter.com/jasnell/status/1184265267951493120?s=20), around the beginning of November.

As soon as this PR is merged, I will open another PR to make it possible to use js config files in packages using es modules.

Contributor

@JLHwung JLHwung left a comment

Nicely crafted!

packages/babel-core/src/gensync-utils/fs.js (resolved)
@github-actions github-actions bot added the outdated A closed issue/PR that is archived due to age. Recommended to make a new issue label Apr 11, 2020
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 11, 2020