Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Javascript backend - run node package install once per workspace #20826

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

riisi
Copy link
Contributor

@riisi riisi commented Apr 21, 2024

Fixes #19236 and fixes #19235.

I've been poking around at this backend for a while trying to understand it. I think I'm most of the way there...
I wanted to raise this PR now for feedback - I think it's fairly complete as is - but I'm not sure if I have any blind spots in both understanding the original design intent and how it's being used.

There is a lot more functionality than I expected going in - huge props to @tobni for that - it would be great to fix the remaining issues and get some docs out so it can be used more & we can get more feedback.

Current process

For each first party package, the backend runs 'npm install' on its first party dependency packages. This is a recursive process.

Each npm install command is executed in the context of that package, generating a 'node_modules' directory containing 1st and 3rd party deps (in separate directories). Source code (from javascript_sources targets) is then copied in.
The output of node_modules from any 1st party dependency installs is merged to form node_modules for the package.

Problems

Per #19235, newer versions of package managers generate integrity files in each node_modules directories. These cause conflicts on merge.

Secondly, and more critically, it seems (I've not dug too much into this, but some past discussion here), that each npm install is effectively installing all transient deps.

E.g. in the case of a tree like package-A -> package-B -> package-C -> package-D ... I think while the intent is for package-A install to only install direct dependents, it's installing all transitive, so there's a lot more IO than intended. Effectively, it is merging ((package-A + trans deps) + (package-B + trans deps) + (package-c + trans deps)...))

New approach

Perform one install for the package in the workspace context.
The install command is npm install --workspace package-a.

The install command needs to have visibility of the root package.json, which contains the 'workspaces' configuration, as well as any dependent package.json. Currently this is all package.json in the project.

The output is a single node_modules directory at the project level (assuming default package manager config - hoisting options could be configured differently).

Advantages

Simpler and is similar to how the user would use the package manager natively to install in a mono-repo setup.
The backend is simpler to troubleshoot as there is only one sandbox per install.
We delegate more of the package resolution process to the package managers.
Performance should be theoretically better given the issue above - but I've not tested. If anyone has anything handy for testing, please let me know.

Disadvantages

We are not caching each module install within Pants, which I'm guessing was the original intent. But I'm not sure we're gaining much from this approach currently.

@riisi riisi added bug backend: JavaScript JavaScript backend-related issues labels Apr 21, 2024
Copy link
Contributor

@tobni tobni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Happy to see contributions to the JS backend.

  1. There's some unused fields / code that should be removed following the removal of the rule calls / package caches,
  2. There ought to be at least one test case that support the implementation. I suspect all tests pass because of how much heavy lifting the package managers already do. At least exposing the integrity file case as passing would be a good case to cover.
  3. Did you consider Install entire node project when creating node_modules digest, subset in sandboxes #19236? I'm fine with this change as is because it is a path forward, but I'm curious if you spent any thought on it.

This README is mainly for understanding the current backend functionality for PR review purposes.
Plan to migrate much of this content to user docs.

## Definitions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only skimmed through this README, LMK if you want it thoroughly checked.

### Dependencies between packages

In the example above, for an installation of 'package-a' to also install 'package-c', the following things must in place:
- Both packages must be in a workspace - i.e. the directories added to a common parent package.json
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is indeed true, but I think the backend is fairly close to being able to serve as a workspace manager as well. Whether that is complexity pants is ready to take on: I doubt.


No import, package.json dep:
- node_modules missing source code for 1st party deps, everything else present
- Probably in this case, it would be nice to generate a warning, e.g. "Package-a source code has a dependency on package-b but package-b is not present in package-a's listed dependencies in package.json".
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, but I'd also expect linters to detect unused deps more effectively. There's also a plethora of plugin-style packages that hijack instead of introducing explicit exports.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true - that would work.


If any of the package.json files in the workspace is modified, this causes a reinstall, as the process for 'Preparing configured default npm version.', (and also the subsequent installation process) has these files as an input.

It may be possible to limit this by only including the package.json files for packages' dependencies.
Copy link
Contributor

@tobni tobni Apr 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not when you also install the entirety of the workspace each time.

If pants want to cache these results and also support the fields presence in package.json, we'd need to parse and re-write relevant subsets of that file to the processes that require package.json contents, as that is the "pants way".

src/python/pants/backend/javascript/nodejs_project_test.py Outdated Show resolved Hide resolved
src/python/pants/backend/javascript/nodejs_project.py Outdated Show resolved Hide resolved
@huonw
Copy link
Contributor

huonw commented Apr 21, 2024

Fixes #19236 and #19235.

(Totally minor, but FYI GitHub's auto-closing doesn't handle "and" like this. I've adjusted it to repeat the fixes. 😄 )

@riisi
Copy link
Contributor Author

riisi commented Apr 21, 2024

Thanks for reviewing so quickly! It will probably take me a while to go back through everything.

  1. Did you consider Install entire node project when creating node_modules digest, subset in sandboxes #19236? I'm fine with this change as is because it is a path forward, but I'm curious if you spent any thought on it.

Only briefly. I've just been trying to find a path forward mainly while trying to understand the backend at the same time.

If I'm understanding that correctly, it's suggesting to, instead of running npm install --workspace package-a for each workspace, run npm install --workspaces (i.e. install workspaces for the whole project at once) and then extract the results for each workspace?

It seems to me there would be a couple of disadvantages:

  • if a user is only working with a particular workspace, the whole project needs to be installed - if there are issues affecting other parts of the project, (e.g. a. missing package, or package conflicts, as hypotheticals), I was thinking it might be better to be able to only need to install a single workspace
  • If the lockfile is updated (or other upstream cache invalidation), then all workspaces need to be re-installed which would take longer - unless there's a way around this?

@riisi
Copy link
Contributor Author

riisi commented May 12, 2024

I've reworked the package integration test to cover the case of importing another workspace package, which triggered the integrity file conflict. In doing so, I found a bug, which held me up for a while.

A newer version of pnpm (>8) than the default for node 16 (6.11.0) is needed for workspaces support. I found that the version specified on the subsystem options was not being used. Although it was being installed in a separate sandbox, the output was not captured, and on the process invocation for pnpm install, corepack was installing the default version.

I've left a couple of comments on the fix for now - I'm just not sure if I've missed anything re. the original intent for the prepare_corepack_tool rule @tobni ?

@riisi riisi added the category:bugfix Bug fixes for released features label May 12, 2024
Copy link
Contributor

@tobni tobni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the new bug you've found is real, I would argue keeping it to a separate issue and pull request is cleaner. I can look into this.

To make your integration test pass I believe removing src/js from the path will work as a temporary.

w.r.t to the pnpm version, I think maybe best we can do is bump the default? It is unfortunate, but I'm not keen on validating tool versions for features. I believe pants should aim to please instead of throwing errors that might not be relevant for a particular usecase. Users can use the nodejs subsystem and its package managers without writing javascript, after all.

Edit:
Honestly, I think just bumping to nodejs v18 should do the trick.

Comment on lines +563 to +564
# The working directory should be the build root - if not, the output_directory will not be collected
# working_directory=request.working_directory,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for this being true should be investigated. There should not be a requirement for this tool to be ran using the builld root as cwd in practice, and pants would not want to enforce that a package.json file is at this position.

I suspect the environment.corepack_home variable is a relative path instead of an absolute one, as you noted in the comment below.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, the reason we mount the package.json and move to its containing folder as CWD is to trigger corepacks default behaviour of reading the package.json of where it was invoked, and then have corepack persist the result in the location of environment.corepack_home

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially corepack_home is missing a {chroot} token.

Comment on lines +559 to +561
# TODO: Given below, is there ever any need to supply an input_digest (e.g. package.json files) here?
# as far as I can see, the version is always resolved for the workspace and passed in as a CLI arg to corepack
# input_digest=request.input_digest,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I believe supporting the default style of specifying package manager versions in package.json has real value. I think the issue is I accidentally only tested this with a package.json that specifies a pkg manager requirement at the build root, but it shouldn't be a requirement (as hinted by the passing of a working directory).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend: JavaScript JavaScript backend-related issues bug category:bugfix Bug fixes for released features
Projects
None yet
3 participants