Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add test crate to compile DataFusion with wasm-pack #7633

Merged
merged 8 commits into from Sep 25, 2023

Conversation

jonmmease
Copy link
Contributor

@jonmmease jonmmease commented Sep 23, 2023

Which issue does this PR close?

Closes #177

As a follow on to #7625 and inspired by @alamb comment in #7624 (comment), this PR adds a wasmtest crate that depends on the DataFusion crates that are current WASM-compatible and uses wasm-pack to compile it. This wasm-pack build is added as a GitHub action to make sure that we don't inadvertently break the WASM compatibility of crates that are currently compatible.

I also added a datafusion/wasmtest/datafusion-wasm-app directory containing a small standalone app that loads and runs the wasmtest crate. This isn't used in CI, but makes it easy to play with WASM support during development.

I also added some notes to the README in wasmtest regarding the difficulties I've run into when trying to get the remaining DataFusion crates compiling to WASM. In particular, the native dependencies of the parquet crate seem to be the issue.

I put the wasmtest crate in datafusion/wasmtest since I saw the sqllogictest crate was in there and this seemed like a related idea, but happy to move it somewhere else.

Is there anything I should do to make sure wasmtest doesn't get published during releases?

@jonmmease jonmmease marked this pull request as draft September 23, 2023 13:39
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jonmmease -- I tried out the instructions and they worked great!

The CI run also looks good: https://github.com/apache/arrow-datafusion/actions/runs/6284314634/job/17065561427?pr=7633

Though it does take a non trivial amount of time 🤔

[INFO]: Installing wasm-bindgen...
[INFO]: Optimizing wasm binaries with `wasm-opt`...
[INFO]: License key is set in Cargo.toml but no LICENSE file(s) were found; Please add the LICENSE file(s) to your project directory
[INFO]: :-) Done in 11m 48s
[INFO]: :-) Your wasm pkg is ready to publish at /__w/arrow-datafusion/arrow-datafusion/datafusion/wasmtest/pkg.

Screenshot 2023-09-25 at 7 37 02 AM

Is there anything I should do to make sure wasmtest doesn't get published during releases?

This is probably best answered by @andygrove -- the instructions for doing so are here https://github.com/apache/arrow-datafusion/blob/main/dev/release/README.md#publish-on-cratesio and since involve publishing each crate separately I think things will be ok.

I wonder if we could move datafusion/wasmtest to datafusion-examples/wasmtest ? That would make this both easier to discover for others who wanted to use DataFusion in WASM as well as make it clearer this shouldn't be distributed with DataFusion

We could move this as a follow on PR

datafusion/wasmtest/README.md Show resolved Hide resolved
- `datafusion-physical-expr`
- `datafusion-sql`

The difficulty with getting the remaining DataFusion crates compiled to WASM is that they have non-optional dependencies on the [`parquet`](https://docs.rs/crate/parquet/) crate with its default features enabled. Several of the default parquet crate features require native dependencies that are not compatible with WASM, in particular the `lz4` and `zstd` features. If we can arrange our feature flags to make it possible to depend on parquet with these features disabled, then it should be possible to compile the core `datafusion` crate to WASM as well.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tustvold do you have any thoughts about finagling the parquet crate's dependencies so it can compile, by default, on wasm? Should we perhaps change datafusion to disable the parquet default features?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC it is the compression codecs that have issues with WASM, disabling these by default I think would be surprising for users. Further I'm not sure how useful parquet support would be given that only InMemory object_store is supported on WASM, although I may have some time to look into this over the next couple of days

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't think we'd want DataFusion's default build to disable the default parquet features, but if we could arrange things so that depending on the datafusion core crate with default-features=false would either remove the parquet dependency all together, or disable the default parquet features, then I think we could get things at least compiling for wasm.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another possibility is to make parquet support itself entirely optional given that not all DataFusion users want / need such support. I filed #7653 to track that

I also filed #7652 to track the idea of compiling all the crates for wasm.

I gathered all WASM related tickets I can find here: #7651

@alamb
Copy link
Contributor

alamb commented Sep 25, 2023

I wonder if running DataFusion in the browser via WASM would be a good feature to add to the documentation

Perhaps https://arrow.apache.org/datafusion/user-guide/introduction.html#use-cases or a section in LIBRARY USER GUIDE 🤔

Also, Maybe it would be worth adding a ticket to track getting all of datafusion running in the browser 🤔

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@jonmmease
Copy link
Contributor Author

Thanks for the review and for trying it out @alamb!

I think moving the wasmtest crate to an example makes sense. I'll take a look at doing that later today.

Though it does take a non trivial amount of time 🤔

One thing I can do is change wasm-pack build to wasm-pack build --dev, which disables the use of wasm-opt which takes a bit of time to optimize the resulting wasm binary.

Also, Maybe it would be worth adding a ticket to track getting all of datafusion running in the browser 🤔

I tried disabling the default parquet features locally, and I did get things compiling. But unfortunately I wasn't able to successfully construct a SessionContext without panicing. I didn't get too far into investigating this, but I think one of this issues is that std::time::Instant::now() isn't compatible with WASM by default (See https://internals.rust-lang.org/t/is-std-instant-on-webassembly-possible/18913), though there are replacements available in other crates.

It would be great to get to the point of being able to use all of DataFusion in the browser, similarly to DuckDB-Wasm. But my sense is that there's still a fair bit of work required.

@jonmmease
Copy link
Contributor Author

jonmmease commented Sep 25, 2023

I think moving the wasmtest crate to an example makes sense. I'll take a look at doing that later today.

I don't see a clear way to do this after all. The wasmtest example needs to be a full crate, but the existing examples are all standalone *.rs files. So I think I'll leave it for now.

I added --dev to wasm-pack build, so we'll see how much that helps with CI build time.

Update: It helped a lot! Down to under 2 minutes.
Screenshot 2023-09-25 at 1 12 18 PM

@alamb
Copy link
Contributor

alamb commented Sep 25, 2023

2 min looks good 😍

Screenshot 2023-09-25 at 1 59 47 PM

I plan to merge this later this afternoon after I file a follow on ticket about proper WASM support to capture some information from this PR. Thanks @jonmmease and @tustvold

- `datafusion-physical-expr`
- `datafusion-sql`

The difficulty with getting the remaining DataFusion crates compiled to WASM is that they have non-optional dependencies on the [`parquet`](https://docs.rs/crate/parquet/) crate with its default features enabled. Several of the default parquet crate features require native dependencies that are not compatible with WASM, in particular the `lz4` and `zstd` features. If we can arrange our feature flags to make it possible to depend on parquet with these features disabled, then it should be possible to compile the core `datafusion` crate to WASM as well.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another possibility is to make parquet support itself entirely optional given that not all DataFusion users want / need such support. I filed #7653 to track that

I also filed #7652 to track the idea of compiling all the crates for wasm.

I gathered all WASM related tickets I can find here: #7651

@alamb alamb merged commit 6ead16e into apache:main Sep 25, 2023
23 checks passed
@alamb
Copy link
Contributor

alamb commented Sep 25, 2023

Thanks again @jonmmease

Ted-Jiang pushed a commit to Ted-Jiang/arrow-datafusion that referenced this pull request Oct 7, 2023
* Add test crate to compile DataFusion with wasm-pack

* Add datafusion-wasm-app to exercise wasm build

* prettier

* Add license headers

* tomlfmt

* tomlfmt

* Update datafusion/wasmtest/README.md

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* --dev for faster build

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataFusion does not support wasm32-unknown-unknown target
3 participants