distributing CLI tools as self-contained binaries #32

Closed
boneskull opened this issue Jun 11, 2019 · 20 comments

Comments

@boneskull
Collaborator

@ahmadnassri brought this up at JSConf EU. I would like to invite him to a meeting to chat about it, as I think it falls within our scope.

@bengl
Member

bengl commented Jun 11, 2019

Here is some prior art from forever ago: https://github.com/bmeck/noda-loader

@joyeecheung
Member

joyeecheung commented Jun 11, 2019

I believe that, at the moment, doing something like this requires the user to hack the entry point somehow (e.g. using _third_party_main.js, as in the https://github.com/bmeck/noda-loader mentioned above) and build the binary from source. I am working on a way to customize the entry point from the embedder API (nodejs/node#23265) so that it's at least covered by our cctests (WIP). That might be a reasonable first step to get rid of this hack (it still requires compiling C++, but it only needs to link to the library instead of compiling everything from scratch).

One thing we need to take into account is the integration of the V8 snapshot and code cache. Ideally the toolchain for this should support snapshot and code cache somehow so that the bundle can start up as fast as possible. Integration of user-land snapshot and code cache builders may be a prerequisite for the toolchain; I can imagine the design of those APIs affecting the design of the APIs required by the toolchain. For example, the V8 snapshot API imposes many restrictions on the scripts executed before the snapshot is captured, which implies that building the bundle needs distinct phases when snapshots are supported.

@sam-github

I think that there would be willingness on the part of core to make changes to accommodate the building of an app bundler, but someone would have to step up to do the work on the bundler, and make some specific requests of core (or PRs!) that would make the bundling work easier.

Besides snapshot and code cache, I think that patching in a VFS would also be a common requirement, because at the least require (and also fs.open for bundled assets?) would have to redirect to files bundled into the executable. Or, @joyeecheung, do I misunderstand what you mean by code cache? Is it the lib/**/*.js files we compile into node? I think it's different, a V8 optimization.
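To make the VFS idea concrete, here is a minimal sketch of a require-like loader that resolves modules from an in-memory map instead of the disk. The names bundledFiles and requireFromBundle and the virtual paths are hypothetical, chosen for illustration; this is not how Node.js core or any existing bundler implements it, and it ignores node_modules lookup, built-in modules and fs.open entirely.

```javascript
'use strict';
const path = require('node:path');
const vm = require('node:vm');

// Hypothetical bundle contents: absolute virtual path -> source text.
const bundledFiles = new Map([
  ['/app/main.js', "const greet = require('./greet.js'); module.exports = greet('world');"],
  ['/app/greet.js', 'module.exports = (name) => `hello ${name}`;'],
]);

const moduleCache = new Map();

// Load a module from the in-memory map instead of the real file system.
function requireFromBundle(filename) {
  if (moduleCache.has(filename)) return moduleCache.get(filename).exports;
  const source = bundledFiles.get(filename);
  if (source === undefined) throw new Error(`Not in bundle: ${filename}`);

  const module = { exports: {} };
  moduleCache.set(filename, module);

  // Relative specifiers are resolved against the virtual directory.
  const dirname = path.posix.dirname(filename);
  const localRequire = (specifier) =>
    requireFromBundle(path.posix.resolve(dirname, specifier));

  // Same wrapper shape as CommonJS, compiled without touching the disk.
  const wrapper = vm.runInThisContext(
    `(function (exports, require, module, __filename, __dirname) {${source}\n})`,
    { filename }
  );
  wrapper(module.exports, localRequire, module, filename, dirname);
  return module.exports;
}

const result = requireFromBundle('/app/main.js');
console.log(result);
```

A real implementation would also need to decide how bundled assets opened through fs are served, which is the open question raised above.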

Personally, I'd be 👍 on even going so far as having a standard runtime feature like a node --bundle /some/app -o app.exe option, but that would take a really robust user-land implementation first (with features added incrementally to core to make building it easier).

This might also interact with webpack. I think I've heard some people suggest it as a way to build self-contained CLI apps, but I don't know enough about it to understand whether that is actually possible, or perhaps I've even misremembered.

@joyeecheung
Member

joyeecheung commented Jun 11, 2019

@sam-github By code cache I meant the binary blobs passed into the V8 API along with the source code when compiling scripts. Support for this may be more transparent than for the snapshot (e.g. we can just use an internal cache map in C++ land for internalBinding('contextify').compileFunction()). The resolution from specifier to the actual source code can then be separated from this, either at build time or at run time.
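The same mechanism is visible from user land through the public vm module, which may help illustrate what "code cache" means here. The sketch below uses vm.Script with its cachedData option and createCachedData(); it only stands in for the internal compileFunction() path mentioned above, and the file name bundled-app.js is a made-up label.

```javascript
'use strict';
const vm = require('node:vm');

const source = 'function add(a, b) { return a + b; } add(2, 3);';

// "Build time": compile once and serialize V8's code cache for this source.
const buildScript = new vm.Script(source, { filename: 'bundled-app.js' });
const cacheBlob = buildScript.createCachedData(); // a Buffer

// "Run time": hand V8 the source together with the cached blob.
const runScript = new vm.Script(source, {
  filename: 'bundled-app.js',
  cachedData: cacheBlob,
});

// V8 sets cachedDataRejected to true when the blob does not match
// (for example a different source, V8 version, or flags).
const cacheAccepted = runScript.cachedDataRejected === false;
const value = runScript.runInThisContext();

console.log({ cacheBytes: cacheBlob.length, cacheAccepted, value });
```

Note that the source text is still passed at run time; the cache is supplied alongside it rather than replacing it.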

@refack

refack commented Jun 11, 2019

FTR we currently run several optimizations on the stdlib code.

  1. We transcode the .js files to const uint16/8_t vars[] in tools/js2c.py and embed them into the binary. This part is also available for a single user file, like the _third_party_main.js entry point, with
    ./configure ... --link-module=gaga.js.
  2. We pass that code to V8 to parse, then serialize the V8 "code cache" and embed that into the binary, so that parsing and compilation of JS code is minimal. That's in tools/code_cache.
  3. Optionally we serialize the complete state of V8 as a "snapshot" and embed that into the binary. That's in tools/snapshot.
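As a rough illustration of step 1, the sketch below turns JavaScript source text into a C++ array definition that a build could compile into the binary. It is a simplified stand-in, not the actual tools/js2c.py: the function name toCArray and the identifier gaga_raw are hypothetical, and it only emits the 8-bit form from UTF-8 bytes, whereas the real tool also has the 16-bit variant mentioned above.

```javascript
'use strict';

// Turn JavaScript source text into a C++ array definition.
// Simplified: always emits uint8_t from the UTF-8 encoding of the source.
function toCArray(identifier, source) {
  const bytes = Buffer.from(source, 'utf8');
  const rows = [];
  // Emit 12 byte values per line to keep the generated file readable.
  for (let i = 0; i < bytes.length; i += 12) {
    rows.push('  ' + Array.from(bytes.subarray(i, i + 12)).join(', '));
  }
  return `static const uint8_t ${identifier}[] = {\n${rows.join(',\n')}\n};\n`;
}

const generated = toCArray('gaga_raw', "console.log('hi');");
console.log(generated);
```

The generated text would then be compiled and linked like any other C++ source, which is why this route needs a C++ toolchain.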

@ahmadnassri

happy to chat more and provide more context in next meeting 👍 thank you for opening the issue @boneskull

@ahmadnassri

some context:

as a CLI developer, I'd like to be able to ensure my users are getting pre-compiled instances of my application, with minimal overhead (only bundle the needed internal dependencies for my app) and with a controllable target distribution (i.e. a pre-compiled binary, so I don't have to rely on the user's environment and compatibility)

examples in the wild that attempt to address this:

https://github.com/zeit/pkg
https://github.com/nexe/nexe
https://github.com/igorklopov/enclose

as per @joyeecheung, all of these (since I last inspected them) rely on hacking the entry point, then proceed to compile node itself as usual, with targeted distributions for different environments (win, darwin, linux, etc ...)

I'm not an expert on binary packaging and distribution in this context, but I'll attempt to further articulate my thinking:

the main point to make here perhaps is: while these tools are "getting the job done" ... they are actually compiling and distributing node itself, while bundling some node_modules and hacking an entry point...

vs.

Idea 1: node's official binary can somehow wrap an app's folder with knowledge of its entry point (i.e. keeping the app development process separate from the compiling and packaging process)

then with an additional command, re-package a secondary binary with the app's logic included?

Idea 2: provide build tooling and configuration that app developers can customize and optimize for best use of their CLI app ... including the ability to only include relevant APIs to the CLI ... e.g. I don't need http and crypto, turn those off, just need fs, and few other internals ...

@boneskull
Collaborator Author

here’s a new one https://blog.cribl.io/2019/07/08/going-native/

@ledbit

ledbit commented Jul 8, 2019

As the author of the blog post that @boneskull posted above I'll list some of our requirements and what I think should not be done as part of native bundling work:

Native bundling should be:

  1. responsible for embedding the application into the binary
  2. able to ensure that the bundled application behaves exactly the same as when running node app.js (not yet possible with js2bin)
  3. able to embed/bundle precompiled .js (snapshots or anything that can make the load time faster)
  4. able to support native add-ons (this would require platform-specific recompilation)

Native bundling should not be:

  1. responsible for packing the application or its dependencies (use webpack, rollup etc. to get a single .js file)
  2. responsible for packing any data files (ship those as part of the application's package/archive)

@ledbit

ledbit commented Jul 9, 2019

@joyeecheung, @refack - a couple of questions around snapshots:

  1. are there any restrictions in what .js can be snapshot?
  2. is it fair to assume that there are no real space savings between .js and a snapshot (but there are startup cost savings)

I've looked at tools/code_cache and gen/node_code_cache.cc and it seems possible to do something similar to js2bin along the lines of:

  1. have a static array with magic content with predefined size, say 1MB
  2. add it to code_cache
  code_cache.emplace("magic_main",
    std::make_unique<v8::ScriptCompiler::CachedData>(
      magic_main,
      static_cast<int>(arraysize(magic_main)), policy
    )
  );
  3. then allow users to be able to run something like node --embed bundledApp.js --output=MyApp, which would:
    a. compile bundledApp.js
    b. read current node and replace the magic array with the content from (a)
    c. write that out to a different file

When executing MyApp there would be logic to detect whether magic_main was present and, if so, pass control to it.
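The replace-the-placeholder steps above can be sketched with plain buffers. Everything here is an assumption made for illustration: the marker string, the 256-byte slot, the 4-byte length prefix and the function names are invented, a small in-memory buffer stands in for the node executable, and this is not js2bin's actual on-disk format.

```javascript
'use strict';

// Hypothetical marker and slot size; a real build would reserve far more space.
const MARKER = Buffer.from('<<MAGIC_MAIN_PLACEHOLDER>>');
const SLOT_SIZE = 256;

// Stand-in for a prebuilt binary: some bytes, the reserved slot, more bytes.
function makeFakeBinary() {
  const slot = Buffer.alloc(SLOT_SIZE, 0);
  MARKER.copy(slot);
  return Buffer.concat([Buffer.from('HEADER'), slot, Buffer.from('TRAILER')]);
}

// Overwrite the reserved slot with a 4-byte length prefix plus the payload.
function embed(binary, payload) {
  const offset = binary.indexOf(MARKER);
  if (offset === -1) throw new Error('placeholder not found');
  if (payload.length + 4 > SLOT_SIZE) throw new Error('payload too large for slot');
  const out = Buffer.from(binary); // copy, so the input stays untouched
  out.fill(0, offset, offset + SLOT_SIZE);
  out.writeUInt32LE(payload.length, offset);
  payload.copy(out, offset + 4);
  return out;
}

// At startup: return the embedded payload, or null if the marker is still there.
function readEmbedded(binary, offset) {
  if (binary.subarray(offset, offset + MARKER.length).equals(MARKER)) return null;
  const length = binary.readUInt32LE(offset);
  return binary.subarray(offset + 4, offset + 4 + length);
}

const stock = makeFakeBinary();
const slotOffset = stock.indexOf(MARKER);
const app = embed(stock, Buffer.from("console.log('bundled');"));

console.log(readEmbedded(stock, slotOffset)); // null
console.log(readEmbedded(app, slotOffset).toString());
```

Because the slot has a fixed size, the output file is exactly as large as the input, which is the size cost of this approach: the reserved space is paid for whether or not the application fills it.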

Thoughts/ideas?

@mhdawson
Member

mhdawson commented Jul 9, 2019

@ledbit quick question: are different Linux variants supported (e.g. Power or s390x)?

@ledbit

ledbit commented Jul 10, 2019

Right now the list of supported platforms (in js2bin) covers the most common platforms for which I was able to easily get CI/CD going, but the same method should work anywhere Node.js compiles. I'd recommend moving this convo to the js2bin repo.

@boneskull
Collaborator Author

Is addressing shortcomings in the "embeddable API" a prerequisite to an ergonomic way to bundle self-contained binaries?

@joyeecheung I'm ignorant about the state of things on that front, so I don't understand from your comment whether exposing more/better embeddable APIs is needed to get to "code cache & snapshots", or whether "code cache & snapshots" is wholly a performance improvement, or if it would also positively impact DX?

How would "embedded Node.js", code cache & snapshots be in line with or orthogonal to @ledbit's strategy? In other words, if these were available, would @ledbit change js2bin's approach to leverage them?

@boneskull
Collaborator Author

This landed: https://bellard.org/quickjs/

Interesting, but also just a JS runtime. It apparently compiles to binary executables.

@joyeecheung
Member

joyeecheung commented Jul 11, 2019

are there any restrictions in what .js can be snapshot?

There are several restrictions. In Node.js core these are enforced with assertions, though in user land the restrictions may be relaxed a bit. For example:

  • There should be no asynchronous tasks in flight, whether in the microtask queue, the nextTick queue, or the libuv queue. It's not impossible to restore from a pending queue in the snapshot, but it would be very tricky, as there is a lot of state that must be kept consistent. It's easier to hack on this if you are both the distributor and the author of the app, since then you'll know more about how to restore things and avoid subtle bugs. It's harder for Node.js core: it has no idea what the user is going to do with it or what to pay attention to, so it needs to pay attention to everything.
  • Command line arguments and environment variables should not be accessed when building Node core's environment-independent snapshot. For user-land snapshots this restriction may not be entirely necessary, but it depends on whether the user application wants to accept configuration from those and how difficult it is to keep different sets of configurations consistent between build time and run time.
  • Any external references, e.g. addresses of C++ functions that you pass into V8's templates and interceptors, must be known and registered with V8 when loading the snapshot, so that it can properly re-bind them when deserializing the JS heap from the snapshot. For Node.js core, this means either finding them by eye and hard-coding them into some C++ source code that pushes these addresses into v8::Isolate::CreateParams::external_references, or generating this code with scripts that look for bindings in the code base. Bundlers probably have to figure out a way to automatically generate code that registers these bindings before compiling the bundle.

is it fair to assume that there are no real space savings between .js and a snapshot (but there are startup cost savings)

There would only be a space overhead, in exchange for startup performance. The JS engine cannot just discard the original source code because it still needs to lazily recompile things and let users see the source for debugging purposes.

@joyeecheung
Member

joyeecheung commented Jul 11, 2019

When executing MyApp there would be logic to detect whether magic_main was present and, if so, pass control to it.

You are describing how _third_party_main.js currently works. The user has to recompile Node.js itself with the original source code plus a custom _third_party_main.js placed inside lib/ under the source directory (which Node.js magically knows to pick up if it exists and use as entry point), as Node.js does not itself come with a C++ compiler to read current node and replace the magic array with the content from (a) and write that out to a different file.

To glue the user code and Node.js core together into one binary, you need something that generates an executable from one binary and a bunch of other text files. For a C++ compiler on, say, Linux, that's just statically linking a library with another object compiled from some source code (which may come from something like js2c and some glue code that uses the embedder API), then writing them out as ELF. If we want to offer the whole thing in our toolchain, the most plausible option that I can think of is some kind of SDK that already includes a C++ compiler.

@joyeecheung
Member

joyeecheung commented Jul 11, 2019

..exposing more/better embeddable APIs is needed to get to "code cache & snapshots", or whether "code cache & snapshots" is wholly a performance improvement, or if it would also positively impact DX?

As mentioned in #32 (comment) above, you effectively need a C++ compiler to generate an executable like this. What Node.js core could do is make it less painful: say, instead of having to build Node.js core, you can just statically link to some prebuilt Node.js library. Instead of hacking around the entry point and monkey patching fs, you can start Node.js with an embedder API that allows you to hook into the entry point (with code that knows how to find and execute user code) and a virtual file system. libnode is currently far away from being able to enable that (we can't even test it in our cctests).

It's less of a big deal to include code cache customization in such embedder APIs, but it would be trickier for snapshots for the reasons mentioned in #32 (comment). We could make these optional in the embedder API for bundling, but until we get these done it would be better to keep that API experimental because breaking changes may be frequent.

@ledbit

ledbit commented Jul 14, 2019

You are describing how _third_party_main.js currently works. The user has to recompile Node.js itself with the original source code plus a custom _third_party_main.js placed inside lib/ under the source directory (which Node.js magically knows to pick up if it exists and use as entry point), as Node.js does not itself come with a C++ compiler to ...

Correct (almost). Now, from a user/developer point of view, what I am trying to avoid is requiring the C++ toolchain. The approach that I took in js2bin is to embed some placeholder content into node at build time, which then gets replaced at app build/bundle time. Besides node size, do you see any other issues with this approach?

in original comment

have a static array with magic content with predefined size, say 1MB

@ruyadorno
Member

This idea has moved to its own initiative: the Node.js SEA Team, spawned by a mixed cohort of new and long-time Node.js collaborators 🎉

Make sure to follow activity in that repo in case anyone is still interested in distributing CLI tools as self-contained binaries.

@abdennour

Since Node 20.x.x, this has been available: https://nodejs.org/api/single-executable-applications.html
