New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Javascript backend - run node package install once per workspace #20826
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Happy to see contributions to the JS backend.
- There's some unused fields / code that should be removed following the removal of the rule calls / package caches,
- There ought to be at least one test case that support the implementation. I suspect all tests pass because of how much heavy lifting the package managers already do. At least exposing the integrity file case as passing would be a good case to cover.
- Did you consider Install entire node project when creating node_modules digest, subset in sandboxes #19236? I'm fine with this change as is because it is a path forward, but I'm curious if you spent any thought on it.
This README is mainly for understanding the current backend functionality for PR review purposes. | ||
Plan to migrate much of this content to user docs. | ||
|
||
## Definitions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only skimmed through this README, LMK if you want it thoroughly checked.
### Dependencies between packages | ||
|
||
In the example above, for an installation of 'package-a' to also install 'package-c', the following things must in place: | ||
- Both packages must be in a workspace - i.e. the directories added to a common parent package.json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is indeed true, but I think the backend is fairly close to being able to serve as a workspace manager as well. Whether that is complexity pants is ready to take on: I doubt.
|
||
No import, package.json dep: | ||
- node_modules missing source code for 1st party deps, everything else present | ||
- Probably in this case, it would be nice to generate a warning, e.g. "Package-a source code has a dependency on package-b but package-b is not present in package-a's listed dependencies in package.json". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably, but I'd also expect linters to detect unused deps more effectively. There's also a plethora of plugin-style packages that hijack instead of introducing explicit exports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true - that would work.
|
||
If any of the package.json files in the workspace is modified, this causes a reinstall, as the process for 'Preparing configured default npm version.', (and also the subsequent installation process) has these files as an input. | ||
|
||
It may be possible to limit this by only including the package.json files for packages' dependencies. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not when you also install the entirety of the workspace each time.
If pants want to cache these results and also support the fields presence in package.json
, we'd need to parse and re-write relevant subsets of that file to the processes that require package.json
contents, as that is the "pants way".
src/python/pants/backend/javascript/subsystems/package-lock.json
Outdated
Show resolved
Hide resolved
Thanks for reviewing so quickly! It will probably take me a while to go back through everything.
Only briefly. I've just been trying to find a path forward mainly while trying to understand the backend at the same time. If I'm understanding that correctly, it's suggesting to, instead of running It seems to me there would be a couple of disadvantages:
|
I've reworked the package integration test to cover the case of importing another workspace package, which triggered the integrity file conflict. In doing so, I found a bug, which held me up for a while. A newer version of pnpm (>8) than the default for node 16 (6.11.0) is needed for workspaces support. I found that the version specified on the subsystem options was not being used. Although it was being installed in a separate sandbox, the output was not captured, and on the process invocation for I've left a couple of comments on the fix for now - I'm just not sure if I've missed anything re. the original intent for the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While the new bug you've found is real, I would argue keeping it to a separate issue and pull request is cleaner. I can look into this.
To make your integration test pass I believe removing src/js
from the path will work as a temporary.
w.r.t to the pnpm version, I think maybe best we can do is bump the default? It is unfortunate, but I'm not keen on validating tool versions for features. I believe pants should aim to please instead of throwing errors that might not be relevant for a particular usecase. Users can use the nodejs
subsystem and its package managers without writing javascript, after all.
Edit:
Honestly, I think just bumping to nodejs v18 should do the trick.
# The working directory should be the build root - if not, the output_directory will not be collected | ||
# working_directory=request.working_directory, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for this being true should be investigated. There should not be a requirement for this tool to be ran using the builld root as cwd in practice, and pants would not want to enforce that a package.json
file is at this position.
I suspect the environment.corepack_home
variable is a relative path instead of an absolute one, as you noted in the comment below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, the reason we mount the package.json and move to its containing folder as CWD
is to trigger corepacks default behaviour of reading the package.json
of where it was invoked, and then have corepack persist the result in the location of environment.corepack_home
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Essentially corepack_home
is missing a {chroot}
token.
# TODO: Given below, is there ever any need to supply an input_digest (e.g. package.json files) here? | ||
# as far as I can see, the version is always resolved for the workspace and passed in as a CLI arg to corepack | ||
# input_digest=request.input_digest, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I believe supporting the default style of specifying package manager versions in package.json
has real value. I think the issue is I accidentally only tested this with a package.json
that specifies a pkg manager requirement at the build root, but it shouldn't be a requirement (as hinted by the passing of a working directory).
Fixes #19236 and fixes #19235.
I've been poking around at this backend for a while trying to understand it. I think I'm most of the way there...
I wanted to raise this PR now for feedback - I think it's fairly complete as is - but I'm not sure if I have any blind spots in both understanding the original design intent and how it's being used.
There is a lot more functionality than I expected going in - huge props to @tobni for that - it would be great to fix the remaining issues and get some docs out so it can be used more & we can get more feedback.
Current process
For each first party package, the backend runs 'npm install' on its first party dependency packages. This is a recursive process.
Each
npm install
command is executed in the context of that package, generating a 'node_modules' directory containing 1st and 3rd party deps (in separate directories). Source code (fromjavascript_sources
targets) is then copied in.The output of
node_modules
from any 1st party dependency installs is merged to formnode_modules
for the package.Problems
Per #19235, newer versions of package managers generate integrity files in each node_modules directories. These cause conflicts on merge.
Secondly, and more critically, it seems (I've not dug too much into this, but some past discussion here), that each
npm install
is effectively installing all transient deps.E.g. in the case of a tree like package-A -> package-B -> package-C -> package-D ... I think while the intent is for package-A install to only install direct dependents, it's installing all transitive, so there's a lot more IO than intended. Effectively, it is merging ((package-A + trans deps) + (package-B + trans deps) + (package-c + trans deps)...))
New approach
Perform one install for the package in the workspace context.
The install command is
npm install --workspace package-a
.The install command needs to have visibility of the root package.json, which contains the 'workspaces' configuration, as well as any dependent package.json. Currently this is all package.json in the project.
The output is a single node_modules directory at the project level (assuming default package manager config - hoisting options could be configured differently).
Advantages
Simpler and is similar to how the user would use the package manager natively to install in a mono-repo setup.
The backend is simpler to troubleshoot as there is only one sandbox per install.
We delegate more of the package resolution process to the package managers.
Performance should be theoretically better given the issue above - but I've not tested. If anyone has anything handy for testing, please let me know.
Disadvantages
We are not caching each module install within Pants, which I'm guessing was the original intent. But I'm not sure we're gaining much from this approach currently.