Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add large blob storage to Cache #7198

Merged
merged 24 commits into from Nov 23, 2021
Merged

Add large blob storage to Cache #7198

merged 24 commits into from Nov 23, 2021

Conversation

lettertwo
Copy link
Member

@lettertwo lettertwo commented Oct 27, 2021

New methods on Cache

  • hasLargeBlob(key: string): Promise<boolean>
  • getLargeBlob(key: string): Promise<Buffer>
  • setLargeBlob(key: string, contents: Buffer | string): Promise<void>

In FSCache, these are basically aliases for has(), getBlob() and setBlob(), but they use a modified key to differentiate large blobs from other cached values.

In LMDBCache, they interact with the configured filesystem rather than the LMDB store.

Caching RequestGraph as a large blob

In testing #6922 with a very large Parcel project, rebuilds would hard crash while attempting to deserialize the cached RequestGraph:

Assertion failed: (length <= imp::kMaxLength && "too large buffer"), function NewBuffer, file ../node_modules/nan/nan.h, line 890.

It seems this is because lmdb-store decodes binary data from the database into a Node Buffer which has a hard-coded upper limit on the size of the buffer for various reasons.

To work around this limit, the RequestGraph is now cached using these new large blob methods.

Caching other graphs as large blobs

Other graphs (AssetGraph, BundleGraph) are also cached as large blobs by updating the RequestTracker getRequestResult() and writeToCache() methods to use large blob caching, under the assumption that only graph requests are made with cache keys (which seems to be the mechanism by which RequestTracker decides to cache a result).

Streams

Large assets are already threaded through Parcel as streams, and now the LMDBCache also streams those values to and from the underlying FS. (Previously, streams were buffered and then written to the LMDB store).

Fallback to FS caching for large values

In addition to the new methods that allow explicitly caching large blobs, LMDBCache has also been updated to automatically fallback to FS cache when a value will exceed the max Node buffer size. Auto fallback has been removed in favor of marking streams as large blobs and using the cache {get, set}Stream methods, with the blob methods becoming the default (previously, the stream methods were always used, even if the asset was not streamed).

Questions

  • What other large entities should be treated this way? AssetGraph and BundleGraph?

All the graphs are now treated as large blobs

  • What about large assets like video files?

large assets are now streamed in and out of the cache

@height
Copy link

height bot commented Oct 27, 2021

Link Height tasks by mentioning a task ID in the pull request title or commit messages, or description and comments with the keyword link (e.g. "Link T-123").

💡Tip: You can also use "Close T-X" to automatically close a task when the pull request is merged.

@parcel-benchmark
Copy link

parcel-benchmark commented Oct 27, 2021

Benchmark Results

Kitchen Sink ✅

Timings

Description Time Difference
Cold 1.79s -8.00ms
Cached 283.00ms +19.00ms ⚠️

Cold Bundles

Bundle Size Difference Time Difference
dist/legacy/parcel.7cdb0fad.webp 102.94kb +0.00b 552.00ms +30.00ms ⚠️
dist/modern/parcel.7cdb0fad.webp 102.94kb +0.00b 553.00ms +32.00ms ⚠️

Cached Bundles

Bundle Size Difference Time Difference
dist/legacy/parcel.7cdb0fad.webp 102.94kb +0.00b 521.00ms -34.00ms 🚀

React HackerNews ✅

Timings

Description Time Difference
Cold 9.08s -75.00ms
Cached 420.00ms +21.00ms

Cold Bundles

No bundle changes detected.

Cached Bundles

No bundle changes detected.

AtlasKit Editor ✅

Timings

Description Time Difference
Cold 1.00m -208.00ms
Cached 1.41s -7.00ms

Cold Bundles

No bundle changes detected.

Cached Bundles

No bundle changes detected.

Three.js ✅

Timings

Description Time Difference
Cold 6.36s +56.00ms
Cached 368.00ms +52.00ms ⚠️

Cold Bundles

No bundle changes detected.

Cached Bundles

No bundle changes detected.

Click here to view a detailed benchmark overview.

@thebriando
Copy link
Member

Is there any perf difference between caching the graph to LMDB vs writing it to FS?

@lettertwo
Copy link
Member Author

Is there any perf difference between caching the graph to LMDB vs writing it to FS?

Nothing noticeable at the macro level (build times are comparable), but maybe I can capture some startup/shutdown metrics.

@lettertwo lettertwo marked this pull request as ready for review October 29, 2021 23:17
@lettertwo
Copy link
Member Author

lettertwo commented Oct 29, 2021

Ran some profiling builds to see how this change might effect performance:

Cold Build Warm Build Cold Start Warm Start
serialize -165 ms +438 ms +461 ms -261 ms
deserialize +242 ms +382 ms +118 ms -98 ms
init +1 ms -71 ms +1 ms -97 ms
build -924 ms -2 ms +1298 ms +78 ms
save - - -6 ms +11 ms
shutdown +115 ms -162 ms -24 ms +365 ms
✨ Built in -0.75 s -0.30 s +0.95 s +0.07 s

The above shows the difference between the same build with and without caching the request graph as a large blob. They seem to be pretty comparable overall, with the differences between them looking similar to the differences between any two builds regardless of caching mechanism.

For context: This build typically runs in about ~135s in prod and ~70s in dev.

@mischnic
Copy link
Member

mischnic commented Nov 2, 2021

LMDBCache will always write to the "real" FS(= NodeFS) anyway, so do we actually need to make the FS configurable? When would you want to use LMDB on the real FS but write the large blobs to a MemoryFS?

Also currently the change to the LMDBCache constructor would be a breaking change

@mischnic
Copy link
Member

mischnic commented Nov 2, 2021

What about large assets like video files?

For that, a stream-based API is better anyway to reduce memory usage (this is why transformers, packagers and optimizers have a stream-based API in the first place). That would at the least require (s)etLargeStream methods in the cache. Not sure how straight forward it is to detected in core which method to use for large enough assets and bundles.

`LMDBCache` is only used when the FS is `NodeFS` anyway.
@lettertwo
Copy link
Member Author

LMDBCache will always write to the "real" FS(= NodeFS) anyway, so do we actually need to make the FS configurable? When would you want to use LMDB on the real FS but write the large blobs to a MemoryFS?

Also currently the change to the LMDBCache constructor would be a breaking change

I see what you mean about not needing to configure the FS for LMDBCache (it is only used when the FS is NodeFS anyway). I updated the PR to not change the LMDBCache constructor API.

@lettertwo
Copy link
Member Author

What about large assets like video files?

For that, a stream-based API is better anyway to reduce memory usage (this is why transformers, packagers and optimizers have a stream-based API in the first place). That would at the least require (s)etLargeStream methods in the cache. Not sure how straight forward it is to detected in core which method to use for large enough assets and bundles.

👍 For the purposes of this PR (solving the Node buffer size limit), would it make sense to just assert that any buffer being written to LMDBCache is under that size?

@lettertwo
Copy link
Member Author

👍 For the purposes of this PR (solving the Node buffer size limit), would it make sense to just assert that any buffer being written to LMDBCache is under that size?

I've opted instead to auto fallback to using FS cache when any blob being written to LMDBCache might exceed the Node buffer size.

This is kind of a punt on solving for large media, but currently, large media is already buffered completely into memory to be cached by LMDBCache, so this will at least let things continue to work if something very large is streamed into cache, though it doesn't deliver the memory savings that a proper stream API would allow.

lettertwo and others added 3 commits November 3, 2021 17:33
* v2:
  Fix RangeError in `not export` error with other file type (#7295)
  Apply sourcemap in @parcel/transformer-typescript-tsc (#7287)
  Fix side effects glob matching (#7288)
  Fix changelog headings
  v2.0.1
  Changelog for v2.0.1
  Resolve GLSL relative to the importer, not the asset (#7263)
  fix: add @parcel/diagnostic as dependency of @parcel/transformer-typescript-types (#7248)
  Fixed missing "Parcel" export member in Module "@parcel/core" (#7250)
Before, all assets were streamed from cache regardless of size, but by
marking assets with content streams as large blobs when being written to
the cache, we can default to reading the assets into memory from cache,
and only stream the assets that were marked as large blobs.
@lettertwo
Copy link
Member Author

I've opted instead to auto fallback to using FS cache when any blob being written to LMDBCache might exceed the Node buffer size.

I reverted this fallback behavior. Now, LMDBCache uses FS read/write streams instead of converting between stream and buffer, and I added some isLargeBlob signals for cases where streams are being written to cache, which is then used to read them from cache as streams.

So, large assets should now be streamed to/from cache rather than being buffered in memory.

* upstream/v2:
  Upgrade Flow to 0.164.0 (#7297)
  Chore: fix typo initialise -> initialize in core/Parcel.js (#7309)
  Register symbols even without scopehoisting (#7222)
Copy link
Member

@devongovett devongovett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Going to perf test it later

packages/core/cache/src/LMDBCache.js Outdated Show resolved Hide resolved
This prevents `FSCache.has` and `FSCache.get` from finding large blobs,
as large blobs should be retrieved via `FSCache.getLargeBlob`
or `FSCache.getStream`.
@devongovett
Copy link
Member

Tested and saw no perf regression anymore.

@devongovett devongovett merged commit cce9e0c into v2 Nov 23, 2021
@devongovett devongovett deleted the lettertwo/cache-large-blob branch November 23, 2021 02:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants