Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat!: parse yaml with custom nom-based parser written in rust #4807

Draft
wants to merge 94 commits into
base: master
Choose a base branch
from

Conversation

paul-soporan
Copy link
Member

@paul-soporan paul-soporan commented Sep 1, 2022

What's the problem this PR addresses?

This PR is part of the effort of solving #1463 and creating a better successor to https://github.com/paul-soporan/enhanced-yaml.

This PR only reimplements the parser. Preserving comments and styling will be a follow-up.


Currently. we parse:

  • YAML using js-yaml
  • v1 lockfiles (which are almost YAML) using a custom parser written using PegJS

Originally, everything was parsed using the custom peg-based parser, but it was very slow, so parsing via js-yaml was introduced in #183.

Unfortunately, js-yaml is a very complex project that includes some features that we don't intend to use / support in the YAML files we handle, meaning that it would be very complicated to fork to support our needs, which currently are:

  • Blazing fast parsing speed, since the configuration and the lockfile are parsed on almost every yarn invocation
  • Preserving comments and styling - [Bug] Preserve comments and styling in YAML configuration #1463
  • Supporting legacy lockfiles in the same package, which are a weird superset of YAML (and sometimes even valid YAML, but parsed differently)
  • Keeping the bundle size as small as possible, meaning that we can't include 2 different parsers (a fast one and a slow one that preserves comments)
  • If not native JS, it has to be compiled to WASM due to our portability requirements. We can't use native node addons to speed up native code.

How did you fix it?

Before implementing the current parser, I did a quick test of existing parsers to see if I could reuse any of them.
This is a quick comparison of 100 lockfile parses, compared to js-yaml:

- enhanced-yaml (uses yaml): 7.5x slower
- serde_yaml (compiled to wasm): 2.7x slower
- existing peg parser: 4.4x slower

Because of this, I decided to take a look at the nom Rust crate, a parser combinator framework that makes it very easy to implement parsers in a similar way to Peg.
To quote @merceyz, nom is a "performant, maintained, and compile-time checked "alternative" to Peg.js".

nom has the advantage of being a zero-cost abstraction implemented in a language full of zero-cost abstractions, which makes it incredibly fast and able to compete even with handwritten parsers.

At the moment, the parser implementation mostly matches the PegJS one (but without the legacy lockfie parsing).

After optimizing the compiled WASM binary as much as possible, I've arrived at the following numbers:

Raw benchmarks

See #4807 (comment) for updated numbers.

Single lockfile parse

  • nom: 37ms (winner)
  • js-yaml@3 w/ FAILSAFE_SCHEMA: `39ms
  • js-yaml@4 w/ FAILSAFE_SCHEMA: `46ms

nom manages to beat js-yaml in this case even with the WASM boundary overhead.

10 lockfile parses

  • nom: 309ms
  • js-yaml@3 w/ FAILSAFE_SCHEMA: 245ms (winner)
  • js-yaml@4 w/ FAILSAFE_SCHEMA: 304ms

Here nom matches js-yaml@4 (which has apparently regressed quite a lot) and is not much slower than js-yaml@3 (only 1.26x 1.06x slower at parsing 10 lockfiles, which isn't even something that happens in practice).

Bundle size

Doesn't regress too much: 2.68 MB -> 2.72 MB 2.71 MB. The binary is also mostly glue code and other stuff that can be shared between multiple parsers, which means that once we reimplement the other parsers too, the delta will be even smaller.

Should also be noted that I'm optimizing everything for speed (-O3 in rustc and -O4 in wasm-opt).

Boot time

WASM files are generally loaded very fast, so it shouldn't affect boot time. Also, once we reimplement all parsers using nom, they'll all be loaded together so it should bring the load time down since currently we also need to load the remaining peg parsers too and they don't share any code.

Concrete benchmark

Here I'm testing how much it takes for yarn exec echo foo to execute. This takes into account:

  • the boot time of the parser
  • the time it takes to parse the configuration files
  • the time it takes to parse the lockfiele

Before:

Benchmark 1: YARN_IGNORE_PATH=1 node packages/yarnpkg-cli/bundles/yarn.js exec echo foo
  Time (mean ± σ):     416.4 ms ±   9.7 ms    [User: 541.8 ms, System: 44.6 ms]
  Range (min … max):   404.3 ms … 433.9 ms    10 runs  

After:

Benchmark 1: YARN_IGNORE_PATH=1 node packages/yarnpkg-cli/bundles/yarn.js exec echo foo
  Time (mean ± σ):     416.0 ms ±   4.1 ms    [User: 556.4 ms, System: 44.3 ms]
  Range (min … max):   410.4 ms … 421.5 ms    10 runs

Result: Unchanged.

Conclusion

As you can see, nom is amazing and I intend to reimplement both the existing shell parser and the resolutions one too if everything goes according to plan.

What's left to do

  • Ensure that it works on big-endian systems
  • Implement support for parsing legacy lockfiles; should probably be gated behind an option so that it doesn't affect the performance of parsing regular YAML files
  • Implement support for these since they're very useful:
    • flow sequences ([foo, bar])
    • flow mappings ({ foo: bar })
    • single-quoted scalars
    • foo # Aside comments
    • block scalars?
  • Possibly implement support for anchors and aliases; not sure how useful they are; should probably do a GitHub search to see how many people are using them inside configuration files
  • Experiment with a few more optimizations
  • Add a lot of tests to make sure that we don't regress anything we don't intend to regress

Checklist

  • I have set the packages that need to be released for my changes to be effective.
  • I will check that all automated PR checks pass before the PR gets reviewed.

@paul-soporan
Copy link
Member Author

paul-soporan commented Sep 3, 2022

A few more notes:

  • compiling the parser to a native node addon using napi makes it 40% faster than the WASM version; unfortunately we can't use it due to our portability requirements
  • generally nom parsers that work on &[u8] are faster than ones that work on &str, I need to test whether it would improve things in our case
  • it looks like wasm-bindgen doesn't support BE architectures at all - all of the glue code assumes LE
  • rustc supports Profile Guided Optimizations (PGOs) natively; not sure whether they work for WASM targets too, but it might be interesting to explore

@paul-soporan
Copy link
Member Author

Update: I changed the parser to work on &[u8] instead of &str and it's now 20% faster than before.

For a single lockfile parse it's about 30% faster than js-yaml while for 10 lockfile parses it's about 6% slower.

The bundle size also went down from 2.72 MB to 2.71 MB.

@belgattitude
Copy link

Impressive @paul-soporan 👀

@paul-soporan paul-soporan added infra: pending update A bot will merge master into this PR and removed infra: pending update A bot will merge master into this PR labels Jul 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants