
Investigate WASM as a HAL executable format #2863

Open
benvanik opened this issue Aug 13, 2020 · 30 comments
Assignees
Labels
enhancement ➕ New feature or request hal/cpu Runtime Host/CPU-based HAL backend next-gen ✨ Forward-looking work built on IREE's foundation runtime Relating to the IREE runtime library

Comments

@benvanik
Collaborator

benvanik commented Aug 13, 2020

There's a bunch of reasons to be interested in wasm as a distribution format for executables. This issue will track notes on the feasibility of this approach, how it could be implemented in IREE, and some open questions.

At a high level we can treat each HAL executable as a WASM binary with multiple entry points (exports) as we do with dylibs for the LLVM AOT backend, store the wasm binary embedded in the module as we do with all other executables, and have a host-derived HAL backend that uses a wasm runtime library to load/cache/invoke the wasm. With this approach we can likely reuse the current LLVM AOT compiler target with different target/linker options almost verbatim.
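To make the executable shape concrete, here's a rough illustration (not actual IREE-generated code - names and signatures are made up) of what a HAL executable could look like as C compiled to a wasm module, e.g. with clang --target=wasm32 -nostdlib -Wl,--no-entry. Each export is one dispatch entry point taking the workgroup id plus pointers (offsets into the single linear memory) for its bindings:

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical entry point #1: one tile of a matmul.
__attribute__((export_name("dispatch_matmul")))
void dispatch_matmul(uint32_t wg_x, uint32_t wg_y, uint32_t wg_z,
                     const float* lhs, const float* rhs, float* out) {
  // ... compute the tile assigned to workgroup (wg_x, wg_y, wg_z) ...
}

// Hypothetical entry point #2: one tile of an elementwise op.
__attribute__((export_name("dispatch_relu")))
void dispatch_relu(uint32_t wg_x, uint32_t wg_y, uint32_t wg_z,
                   const float* in, float* out, uint32_t tile_size) {
  size_t base = (size_t)wg_x * tile_size;
  for (size_t i = 0; i < tile_size; ++i) {
    out[base + i] = in[base + i] > 0.0f ? in[base + i] : 0.0f;
  }
}
```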

IREE Implementation

Compiler

We can reuse the existing LLVM target backend with changes only to how we set up the compiler and serialize the resulting binary - LLVMAOTTarget and LLVMIRTarget are two examples that already exist.

  • Add a new LLVMWASMTarget based on LLVMAOTTarget that configures LLVM, links the result into a wasm module, and embeds it in an executable
  • Likely no need for a custom flatbuffer schema as the .wasm module contains all of the information we need (exports list, etc)
    • May still be useful if we need to configure/choose between wasm engines and define whether certain features are used (SIMD, wasm64, etc)

Work to link multiple executables together (#1587) such that we ideally end up with a single executable per module with multiple entry points will be very useful here to reduce overhead (only one wasm runtime needed, etc).

Runtime

A majority of the runtime work is identical to the existing dylib, llvmjit, and vmla HAL drivers. All of these share code in iree/hal/host/ for things like work scheduling.

  • Add a new wasm backend based on llvmjit or dylib
    • May want to nest like iree/hal/wasm/[runtime]/, as we'll definitely be ending up with multiple runtimes (at least JavaScriptCore for iOS and something for embedded, likely)
  • iree::hal::ExecutableCache can be implemented to support offline preparation, caching of intermediate (bitcode/native) binaries, etc
  • The iree::hal::HostExecutable subclass can hold the handles to the wasm runtime and the exported symbols
    • Modules can be initialized once directly from the provided buffers (that are mmapped or otherwise already in memory)
  • Dispatches are prepared via PrepareDispatch once per dispatch and as demonstrated here can have state that is shared across all tiles
    • This amortizes arg marshaling into the runtime as all tiles receive the same args (besides workgroup xyz)
  • Dispatches (per tile) are invoked with the shared args (buffer bindings/push constants) and the workgroup xyz of the tile as here
    • Multiple tiles from the same dispatch are dispatched concurrently from multiple threads
    • We can implement shared memory semantics in that the wasm memory space can have a region shared across tiles and use atomic ops to safely work across them; otherwise we should rely on the stack to keep each tile independent (this matches GPU behavior, so something we need to be solving there anyway)

A custom iree::hal::Allocator will be required as wasm runtimes can only access a single contiguous memory address range and we need to suballocate within that if we want to ensure zero-copy behavior. This most closely aligns with the DEVICE_LOCAL|HOST_VISIBLE memory type in that the device here (the wasm runtime) can cheaply manipulate the memory and that HOST_LOCAL|DEVICE_VISIBLE memory would require a copy to use. There's likely some other gotchas here to play with. See open questions below. See the WAMR example.
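As a sketch of what that allocator could look like (hypothetical names, not an existing IREE API): a buffer is just an offset+length into the single linear memory, and host-side mapping is pointer arithmetic over the memory base (which engines typically expose, e.g. wasm_memory_data()/wasm_memory_data_size() in the wasm-c-api), so both sides touch the same bytes without copies:

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical suballocator over the one wasm linear memory.
typedef struct {
  uint8_t* base;    // host pointer to the start of the wasm linear memory
  size_t capacity;  // current memory size in bytes
  size_t offset;    // bump offset (a real allocator would be smarter)
} wasm_heap_t;

typedef struct {
  size_t offset;  // what the wasm-side code sees (pointer into linear memory)
  size_t length;
} wasm_buffer_t;

static int wasm_heap_allocate(wasm_heap_t* heap, size_t size, wasm_buffer_t* out) {
  size_t aligned = (heap->offset + 63) & ~(size_t)63;  // 64-byte alignment
  if (aligned + size > heap->capacity) return -1;      // would need memory.grow
  out->offset = aligned;
  out->length = size;
  heap->offset = aligned + size;
  return 0;
}

// DEVICE_LOCAL|HOST_VISIBLE: mapping on the host is just base + offset.
static void* wasm_buffer_map(wasm_heap_t* heap, wasm_buffer_t buffer) {
  return heap->base + buffer.offset;
}
```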

Toys

WASM Runtime Notes

There are a few big WASM-specific runtimes with various levels of build-time, runtime, and architecture support. This excludes any that simply wrap v8/JSC, as we don't need arbitrary JS, WASI, or other complex bridging layers - we just need access to the global memory and export table. Directly using system libraries (such as JavaScriptCore) may be the only option in some environments, while platforms that allow allocating executable pages give us a lot more freedom.

There's a lot of runtimes: https://github.com/appcypher/awesome-wasm-runtimes
Many are experimental or specialized (blockchain wasm VMs, etc). I've listed the most popular/relevant/still-active ones here and excluded any (such as WAVM) that require LLVM in their deployment.

v8

Much more full-featured than we need, but also one of the fastest/most ubiquitous runtimes. Not sure if there's a minimal build that only includes what's required for wasm - the runtime+jit+etc can be several MB.

JavaScriptCore

The only real option (besides interpreted) on iOS. Supports WebAssembly on device and in the simulator. Can't find a signal as to when SIMD will be supported (likely after the first spec MVP is published).

See open questions below; unclear how JIT is supported on appstore releases.

wasmtime

One of the bigger/more complete runtimes. Currently only targets x86-64 and aarch64 (on linux). They claim new backends are planned, but the timeline is unclear.

Wasmer

WAMR

Focused on breadth of architectures and small size, looking pretty similar to our needs: x86-32/64, armv7, aarch64, mips, with/without MMU. Recompiles the WASM to a custom target-specific AOT format, which can be done either automatically or offline. Would work well with our pipeline cache model (translate and cache the AOT binary, load that via mmap for execution).

wasm3

A pure interpreter using a custom bitcode format and a threaded interpreter loop (like IREE's VM). (Mostly) pure C with no executable page requirement, so it'll run just about anywhere. Take the performance breakdown with a big grain of salt (it's from the beginning of the year).

  • Less developed than WAMR
  • Interpreter only (using fast threaded dispatch, so pretty good, but ~= WAMR)
  • No SIMD or other useful optimizations (mutable globals, bulk memory ops)

Open Questions

SIMD spec op availability

SIMD spec: https://github.com/WebAssembly/simd/blob/master/proposals/simd/SIMD.md
We should confirm we have access to the core NN-friendly ops that are required. There's a proposal to add integer dotprod instructions but it looks like @bjacob commented on the spec here noting that the most useful dotprod op form is still missing: WebAssembly/simd#127 (comment)

iOS

It's extremely difficult to tell but it seems like JavaScriptCore on iOS when used by an application can JIT and load WebAssembly. Whether this requires special entitlements is unclear (oh Apple). Recent issues indicate that the global context has a WebAssembly object that works and that the iOS simulator supports it as well (https://trac.webkit.org/changeset/264801/webkit). Workarounds that involve using WebKit (WKWebView) are a no-go as they run JSC out of process, cannot share memory, and can only marshal strings across the API.

Multiple memory spaces

WASM was defined to support multiple memory spaces (linear regions of addressable memory) - think x86 segments (what's old is new again!). This is interesting to us as the actual fixed-size heap required for wasm can then be fixed to a maximum of our shared memory size (accessible from multiple concurrent invocations) and buffers can be passed in/out via other memories.

Unfortunately this isn't supported in MVP (or AFAICT any current runtime), though the multi-memory spec proposal is active and extends things to support what we'd need.

Without this we must ensure that all buffers are allocated from the single wasm memory region. This is not difficult to accomplish (via a custom iree::hal::Allocator) and since the same behavior is needed for GPUs it's possible we can share code (something like VMA, if not VMA itself). The scheduling code we emit in the compiler for allocations can help here as the same behavior we'll want for devices with discrete memory (out-of-process CPU/GPU/TPU) we'll want for WASM, so for example the ringbuffer used for most transients can be allocated directly from wasm memory.

wasm64

Though provisionally spec'd, 64-bit wasm is not really gaining traction just yet. The major limitation without it is a 4GB address space (or possibly smaller depending on the runtime, which may use some bits for tracking). multi-memory would alleviate some of the pressure here as we could add multiple chunks, but at the point that we are streaming through GBs of data in a single dispatch we've probably got other issues. Since SPIR-V also has 32-bit limitations I think this is fine.

@benvanik benvanik added enhancement ➕ New feature or request runtime Relating to the IREE runtime library labels Aug 13, 2020
@benvanik benvanik added this to Ideas in Runtime Development via automation Aug 13, 2020
@syrusakbary

syrusakbary commented Aug 14, 2020

That's a good analysis @benvanik! (sorry for sneaking into the issue, I just found it after searching for mentions of Wasmer on GitHub)

On the Wasmer side we just landed a big refactor that added a lot of interesting new features for Wasmer 1.0:

  • Support for SIMD in cranelift (and LLVM)
  • Support for Multivalue returns
  • Includes offline compilation
  • Full support for the standard wasm-c-api. See examples here
  • Super fast startup times when using the Native engine ~ 20ns startup time for "heavy" Wasm modules (vs 300ms in wasmtime or 200ms in v8)
  • Support for multiple compilation strategies (JIT or AOT). Note: This is going to be crucial for targeting iOS (we will have an example working soon!).
  • Super fast compilation times
  • Full support for almost any chipset (not only x86_64 or ARM, but any other supported by LLVM).
  • Headless mode: when you precompile the modules with the AOT approach, the compilers are no longer needed and you can run or embed Wasmer in the most lightweight way possible (that is: only the runtime/vm, but no compilers attached)

We are also thinking about a C++ layer on top of the wasm-c-api to make the integration even easier.

On the plus side, Dart (also from Google) and Swift have also integrated Wasmer into their codebases :)

Note: wasmtime development might be affected by this https://twitter.com/tschneidereit/status/1293868141953667074

@benvanik
Collaborator Author

Thanks for the notes @syrusakbary! It's often hard to distill progress and it's really helpful to have a contributor give a brain dump like that :)

It looks like https://github.com/wasmerio/wasmer/tree/master/lib/c-api is much more recent than https://github.com/wasmerio/wasmer-c-api, which is great! I didn't catch that and have updated the notes above. Not even sure how I found the other link :)

Having working iOS examples would, I think, be very interesting for a lot of people - most of my searching around the net yielded a lot of "can I use wasm on iOS?" and a bunch of shrugs, so having something to point to would really help get people motivated to try out wasmer.

Do you happen to know mechanically how wasmer uses LLVM? (as in, static library, shared library, shell exec on tool binaries, etc) One concern we have with a library using LLVM is that we need to track LLVM head very closely and LLVM doesn't have the most well-defined API - keeping things building in sync without ending up compiling in 2 versions of LLVM (and the pain associated/preventing such) is a worry :) I see some references to Inkwell but am not familiar with it.

To support additional architectures, is there more needed than ensuring they are supported in config? https://github.com/wasmerio/wasmer/blob/master/lib/compiler-llvm/src/config.rs#L148-L172 (not sure if there are hidden dependencies on features only available in certain configurations, etc)

Thanks again for chiming in!

@stellaraccident
Collaborator

I started a PR to create a new codegen target for WASM on the IREE side, but didn't get far beyond getting it to build. I was going to start with just doing naive, scalar codegen to get things plumbed, then was thinking of proceeding to figure out how to create externs for specific high-level operations that we want to provide manually and possibly making a quasi-compiled VMLA-like thing.

@stellaraccident
Collaborator

Just pulling down the wasmer C-API, it looks like it properly anonymizes the symbols for its LLVM dep. The shared library is 6.4MB on x64 linux. So seems workable out of the box without needing to worry about LLVM version conflicts.

@benvanik
Collaborator Author

Nice!

@syrusakbary

syrusakbary commented Aug 15, 2020

To support additional architectures, is there more needed than ensuring they are supported in config?

Just adding them in config.rs along with the architecture feature in the cargo config should suffice :)

@bhack

bhack commented Aug 17, 2020

/cc @abrown if interested

@abrown

abrown commented Aug 17, 2020

@bhack thanks for the cc, I never would have seen this 😄. A couple of thoughts:

  • in Wasmtime, SIMD support for the old x86-64 backend is complete (spec tests pass) but there is ongoing work to port that to the new backend; from watching the aarch64 SIMD work in the new backend, I would say @jgouly and @akirilov-arm are almost done but they might have a more precise update.
  • in WAMR, @xwang98 could tell you more but I thought we discussed at some point adding SIMD support
  • along a different line of thought: you might be interested in looking at wasi-nn--it's a WASI API for accessing ML functionality from Wasm and could accept encodings like IREE. I'm investigating what it would take to implement it using OpenVINO IR and I would appreciate your thoughts on whether the API needs changes before it can accept other IR formats (e.g. IREE).

/cc @mingqiusun, @jlb6740, @rrwinterton

@bhack

bhack commented Aug 17, 2020

I think you could be also quite interested in the status of Flexible vectors /cc @nicolasvasilache WebAssembly/flexible-vectors#7

@stellaraccident
Collaborator

It's really fun to see the interest in this direction, and I'm wondering if we want to host some kind of more collaborative group discussion about ways forward?

From the IREE side, a good integration with WebAssembly is something we've had our eyes on from the beginning and (aside from some early prototypes) it was something that we were holding off on until the MLIR-based tooling was further along (purely out of a desire to get that level of the stack right before forking off in this direction). The potential benefits to deployability, portability, and security are what can bridge the gap for ML systems between compilers and runtimes, allowing us to have the best of both worlds.

In any case, this issue represents our belief that the time is right for our project to go there.

Speak up if there is any interest in a broader discussion on this.

@abrown

abrown commented Aug 17, 2020

Speak up if there is any interest in a broader discussion on this

I think that's a great idea--send me a link and I'll be there.

@bhack

bhack commented Aug 17, 2020

gap for ML systems between compilers and runtimes, allowing us to have the best of both worlds.

@stellaraccident Do you will also the exploration avantgrade for TFRT on this topic?

@stellaraccident
Collaborator

@bhack Maybe there were some typos? Not quite parsing the question...

@bhack

bhack commented Aug 17, 2020

What I meant about runtimes.. has TFRT its own plan on this topic? Or this could be explored by IREE and eventually contributed back.

@stellaraccident
Collaborator

What I meant about runtimes.. has TFRT its own plan on this topic? Or this could be explored by IREE and eventually contributed back.

IREE is still positioned pretty well to be a delegate under TFRT to compile and execute subgraphs that conform to its limitations. It hasn't been on either project's critical path to do that work (aside from a POC integration I did last year to convince myself that it was feasible), but we try not to lose line of sight to the option.

Afaik, TFRT isn't really targeting solutions at this level at present, but I generally wouldn't be surprised if systems that need portable execution and distribution end up finding WebAssembly to be a reasonable way to achieve that. Balancing that, of course, is that for HPC code, it still seems a bit early and in need of some more bake time (i.e. fixed width SIMD is not fully launched and still has gaps with respect to what is needed to achieve the best performance, and most people are looking towards scalable vectors as the next tier). Definitely interesting times...

@bhack

bhack commented Aug 18, 2020

Yes, interesting times, especially regarding the portability impact on in-browser and out-of-browser runtimes: https://blog.tensorflow.org/2020/03/introducing-webassembly-backend-for-tensorflow-js.html

@benvanik
Collaborator Author

benvanik commented Aug 18, 2020

Such a proposal is exactly what we would like to prevent from happening :)
It's (very roughly) equivalent to someone proposing image maps today even though we have canvas and javascript: a limited-scope solution to a specific problem in time that can be done in much more principled ways. We'd much rather see investment in closing the gap between native and WASM performance on the CPU by way of a handful of SIMD ops added to the WASM SIMD spec, which benefits everyone, than a new spec that will inherit and cement the issues of current ML runtimes.

@stellaraccident
Collaborator

stellaraccident commented Aug 18, 2020

For our work, we'll be focusing on compilers and making appropriately low-level representations and tools performant, portable and secure. Fixed function/op-based solutions still have a place in ML for the time being, but they come with significant challenges that we are no longer willing to accept -- and we'd like the lower level tooling to grow to fill the gap. Predictions are pointless, so I won't make any estimates as to when the switch happens :) But that's the tack we're taking -- and I don't think we're talking about years of work.

@bhack

bhack commented Aug 18, 2020

It's always nice to understand in which direction the different forces are pushing 😉

@bhack

bhack commented Aug 21, 2020

I've mentioned this thread in two github tickets/threads related to the next W3C machine learning virtual workshop in September so you can find the reference here if you want to comment.

@benvanik
Collaborator Author

Tentative Q1 target for this.

@benvanik benvanik added the next-gen ✨ Forward-looking work built on IREE's foundation label Nov 22, 2020
@benvanik benvanik moved this from To do to In progress in WebAssembly HAL Backend Nov 28, 2020
@benvanik benvanik moved this from In progress to To do in WebAssembly HAL Backend Nov 28, 2020
@benvanik benvanik added the hal/cpu Runtime Host/CPU-based HAL backend label Nov 28, 2020
@benvanik benvanik moved this from To do to In progress in WebAssembly HAL Backend Nov 28, 2020
@bhack

bhack commented Feb 18, 2021

@ScottTodd
Collaborator

I have a functional WASM HAL backend for IREE at #5096 using WAMR that can run MNIST, BERT, and our other supported models. It's very slow right now, probably due to how it naïvely allocates/copies memory. When trying to clean up that memory allocation, we ran into a blocking issue trying to map between WAMR's memory allocation APIs and how IREE models drivers/devices/executables though: #5137. A few options for moving forward are mentioned at the bottom of that issue.

@Cypher1

Cypher1 commented Jul 16, 2021

Hi, just wondering what the status of this bug is and what your dependencies/needs are with respect to multi-memory & SIMD. I have done a little digging and I think the following summarises the status of some projects mentioned above:

State:

  • v8: SIMD seems available (as of this month Shipping progress WebAssembly/simd#480 )
  • JavaScriptCore: Couldn't find info on its readiness for SIMD and/or iOS
  • wasmtime: Last mention of android I can find is wasmtime release 0.19.0 (2020-07-14) which states only partial support.
  • WAMR: SIMD support seems to be available but only for x86 (WAMR_BUILD_SIMD)
  • wasm3: no known change, no known simd support or bulk memory optimisations.

Questions:
R.E.: v8

  • How large a runtime would be acceptable for your use cases? I believe there are efforts to decrease its size but would love to know what the target would need to be.

Thanks in advance, really cool project.

@xwang98

xwang98 commented Jul 16, 2021

@Cypher1 WAMR supports SIMD for AArch64 now.

commit: bytecodealliance/wasm-micro-runtime@46db353

@syrusakbary

Nice work @xwang98!

Small addition: Wasmer also fully supports SIMD since 2.0 (and multi memory, reference types, and even runs on Android!)

@benvanik
Collaborator Author

benvanik commented Jul 16, 2021

@Cypher1 hi! I think our only major blocker now is the memory issue (#5137) - which we believe is mostly just the APIs exposed by all the engines assuming they allocate and own the memory instance for each instantiated module. Multi-memory would be nice as it would let us partition the local scratch memory from the shared bulk storage memory, but I think if the engines allowed independent memory creation and assignment we could make things work even without multi-memory. In browser land (where we'd also love to run with wasm) we'd want to be able to use a SharedArrayBuffer across multiple loaded wasm modules.

SIMD reaching maturity is exciting! Our resident SIMD+GEMM guru @bjacob strongly feels like we need a few more instructions to reach reasonable performance (WebAssembly/simd#127 (comment) + WebAssembly/relaxed-simd#9) - it looks like one dot went in but I'm not sure about the details there. If we can get past the blocking memory issue then having something working that we could measure and test with new instructions would make for easier progress on any such additions to the instruction set. The motivator is that a proper dot instruction can yield 3-4x performance improvements in GEMM and would be worth just about any effort to get implemented given how much GEMM dominates most (non-classic vision) ML models (in speech/translation/sequence-to-sequence text models GEMM is often 90%+ of the total execution time!).
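For a sense of what "one dot went in" buys us - a sketch only, assuming clang's wasm_simd128.h intrinsic names and -msimd128; the i8 form discussed in relaxed-simd would be the bigger win for quantized GEMM - the i32x4.dot_i16x8_s instruction lets the int16 dot-product inner loop look roughly like this:

```c
#include <stdint.h>
#include <wasm_simd128.h>

int32_t dot_i16(const int16_t* a, const int16_t* b, int n) {
  v128_t acc = wasm_i32x4_splat(0);
  int i = 0;
  for (; i + 8 <= n; i += 8) {
    v128_t va = wasm_v128_load(&a[i]);
    v128_t vb = wasm_v128_load(&b[i]);
    // i32x4.dot_i16x8_s: multiply 8 i16 lane pairs, add adjacent products into 4 i32 lanes.
    acc = wasm_i32x4_add(acc, wasm_i32x4_dot_i16x8(va, vb));
  }
  int32_t sum = wasm_i32x4_extract_lane(acc, 0) + wasm_i32x4_extract_lane(acc, 1) +
                wasm_i32x4_extract_lane(acc, 2) + wasm_i32x4_extract_lane(acc, 3);
  for (; i < n; ++i) sum += (int32_t)a[i] * b[i];  // scalar tail
  return sum;
}
```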

We'd love to have v8 wired up as well as the other engines (wasmtime/wamr/etc). Our main issues are around build system/toolchain complexity - any dep we added to the core code would need to be something that we could make work with both cmake and bazel and build across the major platforms (mac/win/linux). This is one reason why we were investigating the embedded engines to start - no/optional JITs that have more platform-specific behavior, no alternative languages/toolchains (rust), no custom build systems (gn), etc. A good alternative that would be worth exploring to work around this is putting each engine in its own shared object that keeps it out of the main build (like a plugin/extension) and we are fairly well setup to handle that code-wise with just a few tweaks. Ideally we'd be able to use the wasm-c-api for everything when it properly supports multi-memory/independent memory allocation/importing memory/etc as in #5137 - then we could just build the engine using its own build system/toolchain and load it at runtime with no complex dependency management/toolchain/build goo.
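To be concrete about the wasm-c-api shape we're after, here's a rough sketch only - exact signatures (vec vs. array arguments, ownership) differ between engines and header revisions, and the memory-import part assumes a module built with -Wl,--import-memory and a single exported entry point:

```c
#include <wasm.h>  // the standard wasm-c-api header most engines ship

void run_sketch(wasm_byte_vec_t* wasm_bytes) {
  wasm_engine_t* engine = wasm_engine_new();
  wasm_store_t* store = wasm_store_new(engine);
  wasm_module_t* module = wasm_module_new(store, wasm_bytes);

  // Create the linear memory ourselves so the HAL allocator can own it and
  // (ideally) share it across module instances and worker threads.
  wasm_limits_t limits = {.min = 256 /*pages*/, .max = 4096};
  wasm_memorytype_t* memory_type = wasm_memorytype_new(&limits);
  wasm_memory_t* memory = wasm_memory_new(store, memory_type);

  // Instantiate with the memory as the module's (assumed) single import.
  wasm_extern_t* import_list[] = {wasm_memory_as_extern(memory)};
  wasm_extern_vec_t imports = WASM_ARRAY_VEC(import_list);
  wasm_instance_t* instance =
      wasm_instance_new(store, module, &imports, /*trap=*/NULL);

  // Look up an exported entry point (index 0 assumed for brevity) and invoke
  // one workgroup/tile with its xyz id.
  wasm_extern_vec_t exports;
  wasm_instance_exports(instance, &exports);
  wasm_func_t* entry = wasm_extern_as_func(exports.data[0]);
  wasm_val_t arg_values[] = {WASM_I32_VAL(0), WASM_I32_VAL(0), WASM_I32_VAL(0)};
  wasm_val_vec_t args = WASM_ARRAY_VEC(arg_values);
  wasm_val_vec_t results = WASM_EMPTY_VEC;
  wasm_func_call(entry, &args, &results);
  // ... cleanup elided ...
}
```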

We are still really excited about getting wasm working - both standalone (android/ios/desktop/etc) and on the web - and would be happy to refresh our prototypes with new APIs or try out proposals!

@benvanik
Collaborator Author

benvanik commented Jul 16, 2021

Oh, the other thing we need to investigate is the best approach to multithreading in the various engines. Ideally we'd be able to load a module and then call into it concurrently from multiple threads (by assigning the wasm stack pointer to unique thread-specific locations). That lets us have N threads without needing to instantiate the same module N times. Worst case we do a full N threads * M modules load, but it'd be much better if we didn't have to. I believe at least one engine we looked into stored invocation state on the module instance, preventing this from being possible without fully reloading the module. It's the equivalent of having to dlopen the same shared object and get a unique instance of it for every thread you wanted to call functions from, which is not great :)

@benvanik
Collaborator Author

benvanik commented Jul 16, 2021

I realized I never sketched it out, but here's what we want to do:

  • dynamically load one or more wasm modules containing small kernel functions
    • load/unload of modules happen while the threads and memory are allocated
  • allocate a growable block of memory we can suballocate buffers from
  • have N threads that have a list of commands to run
    • thread count (today) is fixed, and we can reserve stacks for each from the shared memory
    • each command has a kernel to run from one of the wasm modules on one or more buffers suballocated from the shared memory
    • multiple threads may be working on the same buffer regions (either exclusively or via atomic operations/etc)

If you substitute wasm_module_t with dlopen'ed ELF and wasm_memory_t with malloc/free you have what we currently do today. I think it maps well to wasm land, so long as we can use the particular engines in this decomposed way.

Assuming that fixed-size memory is better for certain engines (JIT'ed bounds checks etc) multi-memory may let us keep the stacks in a fixed-size wasm_memory_t with the buffers in a growable one. A less optimal (but totally workable) option would be that we would have to have a wasm_instance_t per module per thread - depending on how expensive those are for each engine (hopefully not much).
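Host-side, that decomposition looks roughly like the sketch below. All types and functions here are hypothetical stand-ins for whichever engine API we end up using - the point is modules + one shared memory + N persistent threads, not the specific calls:

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-ins: ~an exported wasm function, ~wasm_memory_t.
typedef struct kernel_fn_t kernel_fn_t;
typedef struct shared_memory_t shared_memory_t;

// One command: run one exported kernel over a grid of workgroups, on buffers
// that are just offsets into the shared wasm memory.
typedef struct {
  kernel_fn_t* kernel;
  uint32_t workgroup_count_x, workgroup_count_y, workgroup_count_z;
  uint32_t binding_offsets[4];
} command_t;

// Provided by the engine glue (hypothetical): call an exported kernel with a
// workgroup id and binding offsets, using the given per-thread stack region.
void invoke_kernel(kernel_fn_t* kernel, shared_memory_t* memory,
                   uint32_t stack_offset, uint32_t x, uint32_t y, uint32_t z,
                   const uint32_t binding_offsets[4]);

// Each of the N persistent worker threads runs something like this, with its
// own stack_offset carved from the shared memory at startup. (How tiles get
// distributed across workers is elided; here one worker runs everything.)
void worker_run(shared_memory_t* memory, uint32_t stack_offset,
                const command_t* commands, size_t command_count) {
  for (size_t i = 0; i < command_count; ++i) {
    const command_t* cmd = &commands[i];
    for (uint32_t z = 0; z < cmd->workgroup_count_z; ++z) {
      for (uint32_t y = 0; y < cmd->workgroup_count_y; ++y) {
        for (uint32_t x = 0; x < cmd->workgroup_count_x; ++x) {
          invoke_kernel(cmd->kernel, memory, stack_offset, x, y, z,
                        cmd->binding_offsets);
        }
      }
    }
  }
}
```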
