feat!: parse yaml with custom nom-based parser written in rust #4807

paul-soporan · 2022-09-01T23:56:56Z

What's the problem this PR addresses?

This PR is part of the effort to solve #1463 and to create a better successor to https://github.com/paul-soporan/enhanced-yaml.

This PR only reimplements the parser. Preserving comments and styling will be a follow-up.

Currently, we parse:

- YAML using js-yaml
- v1 lockfiles (which are almost YAML) using a custom parser written using PegJS

Originally, everything was parsed using the custom peg-based parser, but it was very slow, so parsing via js-yaml was introduced in #183.

Unfortunately, js-yaml is a very complex project that includes some features that we don't intend to use / support in the YAML files we handle, meaning that it would be very complicated to fork to support our needs, which currently are:

- Blazing fast parsing speed, since the configuration and the lockfile are parsed on almost every yarn invocation
  - js-yaml has unmatched parsing speed in the JS ecosystem, no other package I could find even came close
    - more details at [Bug] Preserve comments and styling in YAML configuration #1463 (comment)
- Preserving comments and styling - [Bug] Preserve comments and styling in YAML configuration #1463
- Supporting legacy lockfiles in the same package, which are a weird superset of YAML (and sometimes even valid YAML, but parsed differently)
- Keeping the bundle size as small as possible, meaning that we can't include 2 different parsers (a fast one and a slow one that preserves comments)
- If not native JS, it has to be compiled to WASM due to our portability requirements. We can't use native node addons to speed up native code.

How did you fix it?

Before implementing the current parser, I did a quick test of existing parsers to see if I could reuse any of them.
This is a quick comparison of 100 lockfile parses, compared to js-yaml:

- enhanced-yaml (uses yaml): 7.5x slower
- serde_yaml (compiled to wasm): 2.7x slower
- existing peg parser: 4.4x slower

Because of this, I decided to take a look at the nom Rust crate, a parser combinator framework that makes it very easy to implement parsers in a similar way to Peg.
To quote @merceyz, nom is a "performant, maintained, and compile-time checked "alternative" to Peg.js".

nom has the advantage of being a zero-cost abstraction implemented in a language full of zero-cost abstractions, which makes it incredibly fast and able to compete even with handwritten parsers.
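To make the parser-combinator idea concrete, here is a minimal, dependency-free sketch of the style. It is hand-rolled and is not nom's API or this PR's code: `PResult`, `tag`, `take_while1`, `separated_pair`, and `key_value` are illustrative names defined below (some deliberately echo common combinator names), and the `key: value` grammar is a hypothetical toy, far smaller than YAML. The point is only that small parsers are plain functions that compose into bigger ones, with the compiler checking the types of every composition.

```rust
// A hand-rolled sketch of the parser-combinator style; this is NOT nom's API.
// Every parser is a function from input to (remaining input, output).
type PResult<'a, O> = Result<(&'a str, O), String>;

/// Matches a literal prefix and returns it.
fn tag<'a>(expected: &'static str) -> impl Fn(&'a str) -> PResult<'a, &'a str> {
    move |input: &'a str| match input.strip_prefix(expected) {
        Some(rest) => Ok((rest, &input[..expected.len()])),
        None => Err(format!("expected {:?}", expected)),
    }
}

/// Consumes one or more characters while `pred` holds.
fn take_while1<'a>(pred: impl Fn(char) -> bool) -> impl Fn(&'a str) -> PResult<'a, &'a str> {
    move |input: &'a str| {
        let end = input.find(|c: char| !pred(c)).unwrap_or(input.len());
        if end == 0 {
            Err("expected at least one matching character".to_string())
        } else {
            Ok((&input[end..], &input[..end]))
        }
    }
}

/// Runs three parsers in sequence and keeps the first and third outputs.
fn separated_pair<'a, A, B, C>(
    first: impl Fn(&'a str) -> PResult<'a, A>,
    sep: impl Fn(&'a str) -> PResult<'a, B>,
    second: impl Fn(&'a str) -> PResult<'a, C>,
) -> impl Fn(&'a str) -> PResult<'a, (A, C)> {
    move |input: &'a str| {
        let (rest, a) = first(input)?;
        let (rest, _) = sep(rest)?;
        let (rest, c) = second(rest)?;
        Ok((rest, (a, c)))
    }
}

/// Parses a simplified `key: value` line (toy grammar, assumed for illustration).
fn key_value(input: &str) -> PResult<'_, (&str, &str)> {
    separated_pair(
        take_while1(|c| c.is_ascii_alphanumeric() || c == '-' || c == '_'),
        tag(": "),
        take_while1(|c| c != '\n'),
    )(input)
}
```

Under these assumptions, `key_value("name: yarn\nrest")` yields the pair `("name", "yarn")` and leaves `"\nrest"` unconsumed, while `key_value(": oops")` returns an error because the key parser requires at least one character.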

At the moment, the parser implementation mostly matches the PegJS one (but without the legacy lockfile parsing).

After optimizing the compiled WASM binary as much as possible, I've arrived at the following numbers:

Raw benchmarks

See #4807 (comment) for updated numbers.

Single lockfile parse

nom: 37ms (winner)
js-yaml@3 w/ FAILSAFE_SCHEMA: 39ms
js-yaml@4 w/ FAILSAFE_SCHEMA: 46ms

nom manages to beat js-yaml in this case even with the WASM boundary overhead.

10 lockfile parses

~~nom: 309ms~~
js-yaml@3 w/ FAILSAFE_SCHEMA: 245ms (winner)
js-yaml@4 w/ FAILSAFE_SCHEMA: 304ms

Here nom matches js-yaml@4 (which has apparently regressed quite a lot) and is not much slower than js-yaml@3 (only ~~1.26x~~ 1.06x slower at parsing 10 lockfiles, which isn't even something that happens in practice).
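For reference, the slowdown factors quoted here are plain ratios of wall-clock times. A minimal sketch of that arithmetic, where `slowdown` is a hypothetical helper name and not part of the PR:

```rust
/// Returns how many times slower `candidate_ms` is than `baseline_ms`.
/// Values above 1.0 mean the candidate is slower than the baseline.
fn slowdown(candidate_ms: f64, baseline_ms: f64) -> f64 {
    candidate_ms / baseline_ms
}
```

With the numbers listed, `slowdown(309.0, 245.0)` is about 1.26, which matches the struck-through figure. By the same arithmetic, the updated 1.06x would correspond to roughly 260 ms for nom; that timing is an inference from the ratio, not a number stated in this description.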

Bundle size

Doesn't regress too much: 2.68 MB -> ~~2.72 MB~~ 2.71 MB. The binary is also mostly glue code and other stuff that can be shared between multiple parsers, which means that once we reimplement the other parsers too, the delta will be even smaller.

It should also be noted that I'm optimizing everything for speed (opt-level 3 in rustc and -O4 in wasm-opt).

Boot time

WASM files generally load very fast, so this shouldn't affect boot time. Also, once we reimplement all parsers using nom, they'll all be loaded together, which should bring the load time down: currently we also need to load the remaining peg parsers, and they don't share any code.

Concrete benchmark

Here I'm testing how long it takes for `yarn exec echo foo` to execute. This takes into account:

- the boot time of the parser
- the time it takes to parse the configuration files
- the time it takes to parse the lockfile

Before:

Benchmark 1: YARN_IGNORE_PATH=1 node packages/yarnpkg-cli/bundles/yarn.js exec echo foo
  Time (mean ± σ):     416.4 ms ±   9.7 ms    [User: 541.8 ms, System: 44.6 ms]
  Range (min … max):   404.3 ms … 433.9 ms    10 runs

After:

Benchmark 1: YARN_IGNORE_PATH=1 node packages/yarnpkg-cli/bundles/yarn.js exec echo foo
  Time (mean ± σ):     416.0 ms ±   4.1 ms    [User: 556.4 ms, System: 44.3 ms]
  Range (min … max):   410.4 ms … 421.5 ms    10 runs

Result: Unchanged.

Conclusion

As you can see, nom is amazing, and if everything goes according to plan I intend to reimplement the existing shell parser and the resolutions parser with it too.

What's left to do

Checklist

I have read the Contributing Guide.

I have set the packages that need to be released for my changes to be effective.

I will check that all automated PR checks pass before the PR gets reviewed.

paul-soporan · 2022-09-03T00:13:48Z

A few more notes:

- compiling the parser to a native node addon using napi makes it 40% faster than the WASM version; unfortunately we can't use it due to our portability requirements
- generally nom parsers that work on &[u8] are faster than ones that work on &str, I need to test whether it would improve things in our case
- it looks like wasm-bindgen doesn't support BE architectures at all - all of the glue code assumes LE
- rustc supports Profile Guided Optimizations (PGOs) natively; not sure whether they work for WASM targets too, but it might be interesting to explore
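To illustrate why glue code that assumes little-endian byte order is a portability concern, here is a tiny sketch that decodes the same four bytes with both byte orders. The helpers `read_len_le` and `read_len_be` are hypothetical and unrelated to wasm-bindgen's actual code; they only show that the two interpretations disagree, which is the kind of mismatch a big-endian host would hit if the glue reads memory with the host's native order.

```rust
/// Decodes a 4-byte length prefix as little-endian (least significant byte first).
fn read_len_le(buf: [u8; 4]) -> u32 {
    u32::from_le_bytes(buf)
}

/// Decodes the same 4 bytes as big-endian (most significant byte first).
fn read_len_be(buf: [u8; 4]) -> u32 {
    u32::from_be_bytes(buf)
}
```

For the bytes `[5, 0, 0, 0]`, the little-endian reading is 5, while the big-endian reading is `0x0500_0000`.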

paul-soporan · 2022-09-03T22:20:22Z

Update: I changed the parser to work on &[u8] instead of &str and it's now 20% faster than before.
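As a rough picture of what working on `&[u8]` instead of `&str` means, here is a small, dependency-free sketch. It is not the PR's code and doesn't use nom; `split_key_value` and `as_text` are hypothetical names. One plausible reason byte-oriented parsing is cheaper (an assumption, not something measured here) is that scanning can compare raw bytes and defer UTF-8 validation to the slices that are actually turned into strings.

```rust
/// Splits a `key: value` line on a byte slice (hypothetical helper, not from the PR).
fn split_key_value(line: &[u8]) -> Option<(&[u8], &[u8])> {
    // Find the first ':' followed by a space, comparing raw bytes only.
    let pos = line.windows(2).position(|w| w == b": ")?;
    Some((&line[..pos], &line[pos + 2..]))
}

/// Validates and converts a parsed byte slice to text only when a string is needed.
fn as_text(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}
```

For example, `split_key_value(b"version: 3.2.0")` returns the byte slices for `version` and `3.2.0`, and `as_text` turns each into a `&str` on demand; input without a `": "` separator yields `None`.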

For a single lockfile parse it's about 30% faster than js-yaml while for 10 lockfile parses it's about 6% slower.

The bundle size also went down from 2.72 MB to 2.71 MB.

belgattitude · 2022-09-06T20:01:39Z

Impressive @paul-soporan 👀

paul-soporan added 5 commits September 2, 2022 01:40

feat!: parse yaml with custom nom-based parser written in rust

7a01f1c

chore: dedupe

8fa276c

fix: fix parseSyml return type

981cc44

fix: use any

73ffe86

build: enable more verbose logging

74e8327

paul-soporan added 2 commits September 3, 2022 04:03

refactor: extract input type to make it easier to change it

728167a

perf: make parser work on &[u8] instead of &str

2c06b23

paul-soporan added 5 commits September 4, 2022 01:28

Merge remote-tracking branch 'origin/master' into paul/feat/yaml-pars…

0c64066

…er-via-nom

fix: use fold_many1 because fold_many0 swallows errors

3baf40c

feat: better errors via nom-supreme

759b0f3

refactor: remove duplicate import

f6adeb2

feat: implement support for flow sequences

82b96bc

paul-soporan added 15 commits September 7, 2022 21:38

feat: allow any expression at the top level

478d75a

test: add a few basic tests

87b9174

feat: implement support for empty double quoted scalars

2aaf99e

feat: implement support for single-quoted scalars

8fcae78

feat: implement support for empty single quoted scalars

e50e542

feat: implement support for flow mappings

fed02e8

feat: allow line endings in flow mappings

54d3ed0

feat: allow line endings in flow sequences

c019213

feat: allow flow collections to contain each other

1f42a37

feat: allow trailing commas inside flow sequences

3b8b874

test: add some tests for block nodes

abaf066

feat: allow flow collections to contain compact mappings

fce56ce

perf: don't use the json! macro

5feef5f

refactor: don't use unnecessary to_owned

366bd35

refactor: reuse flow_expression

7439c57

paul-soporan added 25 commits October 26, 2022 22:32

feat: support leading ":" in scalars

29fa974

refactor: tweaks

668b342

fix: improve colon handling

91ec305

fix: support implicit plain scalar keys ending in ":"

95e7edd

test: tweaks

399e4f9

refactor: don't use serde_json::Value

c9bdff9

perf: use a Cow

949bcf7

perf: use Cow in IndexMap too

d3454f6

refactor: use IndexMap::from

21391f4

perf: use FNV hasher for HashMaps

ac8eed4

chore: update serde_json

4a1d1ae

chore: re-enable preserve_order

e535fb1

perf: use Fx hasher

f4142eb

perf: use cow_replace

2795d3f

perf: use Cow in escaped_transform

46fa264

style: use consistent quoting in Cargo.toml

586d131

perf: use latest wasm-opt

312def7

Merge branch 'master' into paul/feat/yaml-parser-via-nom

9e89ac2

chore: revert unneeded change

9e41bfa

chore: update rust toolchain

58e173c

chore: update binaryen

bbc9a3b

chore: update deps

ce6537d

Merge remote-tracking branch 'origin/master' into paul/feat/yaml-pars…

baaed60

…er-via-nom

chore: update rust toolchain and binaryen

30abaf5

chore: update wasm-pack

2077b93

paul-soporan added infra: pending update A bot will merge master into this PR and removed infra: pending update A bot will merge master into this PR labels Jul 3, 2023

paul-soporan added 3 commits July 3, 2023 04:52

Merge branch 'master' into paul/feat/yaml-parser-via-nom

bbcd942

chore: fix lint

800f084

Merge branch 'master' into paul/feat/yaml-parser-via-nom

423a62c
