[Bug]: Flaky build failure: npm package directory copy fails with "No such file or directory" #1412

Open
alexeagle opened this issue Dec 19, 2023 · 24 comments · Fixed by #1533 or #1538
Labels
bug Something isn't working

Comments

@alexeagle
Member

What happened?

Errors look like

ERROR: /__w/monorepo/monorepo/BUILD.bazel:87:22: Copying directory npm__find-up__5.0.0/package failed: Exec failed due to IOException: /github/home/.cache/bazel/_bazel_root/818c72ad5a3a83f42a3d6f48674a13b0/execroot/__main__/external/npm__find-up__5.0.0/package (No such file or directory)

https://bazelbuild.slack.com/archives/CEZUUKQ6P/p1702916591850299

Copying directory npm_rulesjs__uuid__9.0.0/package failed: Exec failed due to IOException: /root/.cache/bazel/_bazel_root/8937bc5b791d3db8cd10f15350ebd801/execroot/__main__/external/npm_rulesjs__uuid__9.0.0/package (No such file or directory) 

One client observes these failures on about 2% of builds. That client uses --remote_download_minimal.

Version

One client observed on rules_js 1.32.2 and bazel 6.4.0

How to reproduce

Not sure yet.

Any other information?

No response

@alexeagle alexeagle added the bug Something isn't working label Dec 19, 2023
@github-actions github-actions bot added the untriaged Requires traige label Dec 19, 2023
@DavidZbarsky-at

I just hit this as well. Some empirical observations:

  • There were around 8 failures like this to start
  • Retrying the global build a few times fixed 0-2 of them each time (so 8 -> 7 -> 5 -> 4 -> 4 -> 3), suggesting maybe a race condition?
  • Retrying past 3 did not seem to help
  • Disabling BES made this go away

I saw this right after an upgrade to Bazel 7.0.0, but that may have just been causing a full invalidation. That said, I don't recall ever seeing this on a Bazel 7 nightly from a few weeks ago, even with fairly frequent expunges.
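
A minimal sketch for tracking failure counts like these across retries, assuming the build output has been saved as text. The regular expression is based only on the error lines quoted in this thread, and `failing_repos` is a hypothetical helper, not part of Bazel or rules_js.

```python
import re
from collections import Counter

# Pattern derived from the error lines quoted in this thread.
# The "[for tool]" suffix appears in some reports, so it is optional here.
COPY_FAILURE = re.compile(
    r"Copying directory (\S+)/package(?: \[for tool\])? failed: "
    r"Exec failed due to IOException"
)


def failing_repos(log_text: str) -> Counter:
    """Count CopyDirectory failures per external repository name."""
    return Counter(COPY_FAILURE.findall(log_text))


sample_log = """\
ERROR: /__w/monorepo/monorepo/BUILD.bazel:87:22: Copying directory npm__find-up__5.0.0/package failed: Exec failed due to IOException: /github/home/.cache/bazel/_bazel_root/818c72ad5a3a83f42a3d6f48674a13b0/execroot/__main__/external/npm__find-up__5.0.0/package (No such file or directory)
INFO: Elapsed time: 97.282s, Critical Path: 1.43s
"""

for repo, count in sorted(failing_repos(sample_log).items()):
    print(f"{repo}: {count}")
```

Running this over the log of each retry and comparing the sets of names would show whether the same packages keep failing or whether the failures move around between attempts.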

@alexeagle
Member Author

@DavidZbarsky-at which Bazel version were you on prior to the upgrade to Bazel 7? And could you be more precise about how you "disabled BES": which flags did you change?

@DavidZbarsky-at

We were previously on 7.0.0-pre.20231011.2.

I disabled the following flags:

#build --bes_results_url=https://app.buildbuddy.io/invocation/
#build --bes_backend=grpcs://remote.buildbuddy.io

@ewhauser

I wasn't seeing this on Bazel 6, but it has been happening pretty regularly since upgrading to Bazel 7. Here are our options:

INFO: Options provided by the client:
  Inherited 'common' options: --isatty=0 --terminal_columns=80
INFO: Reading rc options for 'run' from /__w/monorepo/monorepo/.aspect/bazelrc/correctness.bazelrc:
  Inherited 'common' options: --incompatible_disallow_empty_glob
INFO: Reading rc options for 'run' from /__w/monorepo/monorepo/.aspect/bazelrc/convenience.bazelrc:
  Inherited 'common' options: --enable_platform_specific_config --heap_dump_on_oom
INFO: Reading rc options for 'run' from /__w/monorepo/monorepo/.bazelrc:
  Inherited 'common' options: --incompatible_disallow_empty_glob=false --enable_bzlmod=false
INFO: Reading rc options for 'run' from /__w/monorepo/monorepo/.aspect/bazelrc/correctness.bazelrc:
  Inherited 'build' options: --noremote_upload_local_results --sandbox_default_allow_network=false --incompatible_strict_action_env --experimental_allow_tags_propagation --incompatible_default_to_explicit_init_py
INFO: Reading rc options for 'run' from /__w/monorepo/monorepo/.aspect/bazelrc/convenience.bazelrc:
  Inherited 'build' options: --keep_going --show_result=20
INFO: Reading rc options for 'run' from /__w/monorepo/monorepo/.aspect/bazelrc/javascript.bazelrc:
  Inherited 'build' options: --enable_runfiles
INFO: Reading rc options for 'run' from /__w/monorepo/monorepo/.aspect/bazelrc/performance.bazelrc:
  Inherited 'build' options: --noexperimental_check_output_files --incompatible_remote_results_ignore_disk --experimental_reuse_sandbox_directories --nolegacy_external_runfiles
INFO: Reading rc options for 'run' from /__w/monorepo/monorepo/.aspect/bazelrc/bazel6.bazelrc:
  Inherited 'build' options: --noexperimental_check_external_repository_files --reuse_sandbox_directories --noexperimental_action_cache_store_output_metadata
INFO: Reading rc options for 'run' from /__w/monorepo/monorepo/.bazelrc:
  Inherited 'build' options: --incompatible_strict_action_env=false --bes_upload_mode=fully_async --xcode_version_config=//:host_xcodes --noexperimental_convenience_symlinks --aspects //bazel/api_linter:defs.bzl%api_linter_aspect --output_groups=+api_linter --java_runtime_version=remotejdk_11 --java_language_version=11 --java_runtime_version=11 --nobuild_runfile_links --nolegacy_external_runfiles --experimental_repository_downloader_retries=5 --incompatible_default_to_explicit_init_py --@aspect_rules_ts//ts:skipLibCheck=always --incompatible_legacy_local_fallback
INFO: Reading rc options for 'run' from /__w/monorepo/monorepo/.aspect/bazelrc/ci.bazelrc:
  Inherited 'build' options: --announce_rc --show_timestamps --show_progress_rate_limit=60 --curses=yes --color=yes --remote_download_toplevel --remote_timeout=3600 --remote_upload_local_results --remote_local_fallback --grpc_keepalive_time=30s
INFO: Reading rc options for 'run' from /tmp/remote.bazelrc:
  Inherited 'build' options: --show_timestamps=false --remote_cache=******* --tls_certificate=******** --experimental_remote_cache_async --remote_timeout=3600

I tried (1) enabling runfiles and (2) disabling sandboxing to see if either of those helped, but neither had any impact.

@theospears

We speculated that configuring CopyDirectory not to run remotely via the following could help:

common:buildbuddy --modify_execution_info=CopyDirectory=+no-remote-exec

However, testing shows this was not the case - we still see the same error when the action runs locally.

@ewhauser

ewhauser commented Jan 9, 2024

I'm now seeing this in a non-CI scenario:

INFO: Invocation ID: 9b83f608-95a9-4690-a06a-707f772a013b
ERROR: /workspaces/monorepo/BUILD.bazel:87:22: Copying directory npm__at_twilio-paste_skeleton-loader__0.1.3__327301399/package failed: Exec failed due to IOException: /home/vscode/.cache/bazel/_bazel_vscode/f4ac7acc8046f57d430e720e89de5487/execroot/__main__/external/npm__at_twilio-paste_skeleton-loader__0.1.3__327301399/package (No such file or directory)
ERROR: /workspaces/monorepo/BUILD.bazel:87:22: Copying directory npm__at_twilio-paste_sibling-box__3.0.5__-301804951/package failed: Exec failed due to IOException: /home/vscode/.cache/bazel/_bazel_vscode/f4ac7acc8046f57d430e720e89de5487/execroot/__main__/external/npm__at_twilio-paste_sibling-box__3.0.5__-301804951/package (No such file or directory)

However, the directory exists locally:

ls -la /home/vscode/.cache/bazel/_bazel_vscode/f4ac7acc8046f57d430e720e89de5487/execroot/__main__/external/npm__at_twilio-paste_skeleton-loader__0.1.3__327301399
.rwxr-xr-x 624 vscode  9 Jan 17:35 BUILD.bazel
drwxr-xr-x   - vscode  9 Jan 17:35 package
.rw-r--r-- 46k vscode  9 Jan 17:35 package.tgz
.rw-r--r-- 154 vscode  9 Jan 17:35 WORKSPACE

If I comment out remote caching, then the issue goes away:

 #build --remote_cache=grpcs://secret:secret@secret.cache.host:9092

I'll note that the library it is complaining about probably does not exist on our remote cache server yet, because this change hasn't been pushed through yet.

@philsc

philsc commented Feb 14, 2024

We're also seeing this. I do find it very interesting that it's an exec failure. AFAICT it's not an error printed by the tool responsible for the action, it's bazel printing that error.

Whenever this happens, I cannot find the action in the profile produced by --profile. The failed actions appear to be missing, which seems to me a sign that the failure happens pretty early on in Bazel and not in the build rule.

I'm only talking from intuition, I have no proof for any of my speculation.
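
One way to run this kind of check, sketched under the assumption that the file written by --profile is a JSON trace with a top-level `traceEvents` list whose entries carry a `name` field, and that it is gzip-compressed when the file name ends in `.gz`. The helper names and the search string are hypothetical choices for illustration.

```python
import gzip
import json


def load_profile(path: str) -> dict:
    """Load a JSON trace profile, transparently handling a .gz suffix."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def events_matching(profile: dict, needle: str) -> list:
    """Return trace events whose name contains the given substring."""
    return [
        event
        for event in profile.get("traceEvents", [])
        if needle in event.get("name", "")
    ]


# Tiny in-memory stand-in for a real profile, for illustration only.
example_profile = {
    "traceEvents": [
        {"name": "Copying directory npm__find-up__5.0.0/package", "ph": "X"},
        {"name": "Compiling something else", "ph": "X"},
    ]
}

print(len(events_matching(example_profile, "Copying directory")))
```

An empty result for a package that appears in the error output would be consistent with the observation above that the failed action never shows up in the profile.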

@AustinSchuh

I was able to reproduce it 100% of the time with Bazel 7 if I do things in the right order.

~/.bazelrc has the following.

build --repository_cache=~/.cache/bazel/_bazel_austin/repository_cache
sync --repository_cache=~/.cache/bazel/_bazel_austin/repository_cache

I start by doing bazel clean --expunge

Doing a build of just the part of the repo with JS dependencies:

$ bazel build -c opt //ui/... --config=engflow --config=build_without_the_bytes --verbose_failures
Starting local Bazel server and connecting to it...
INFO: Invocation ID: 9c0ac736-ee47-461c-876d-565275794751
INFO: Streaming build results to: https://cluster.engflow.com/invocation/9c0ac736-ee47-461c-876d-565275794751
ERROR: /home/austin/local/three-shasta/BUILD.bazel:18:22: Copying directory npm__supports-hyperlinks__2.3.0/package [for tool] failed: Exec failed due to IOException: /media/austin/a7ea7e92-b2f9-4e20-8134-54e7a3d73a5c/home/austin/.cache/bazel/_bazel_austin/d42add3dc3a56cc3d91cd8759e714dfe/execroot/shasta/external/npm__supports-hyperlinks__2.3.0/package (No such file or directory)
ERROR: /home/austin/local/three-shasta/BUILD.bazel:18:22: Copying directory npm__signal-exit__3.0.7/package [for tool] failed: Exec failed due to IOException: /media/austin/a7ea7e92-b2f9-4e20-8134-54e7a3d73a5c/home/austin/.cache/bazel/_bazel_austin/d42add3dc3a56cc3d91cd8759e714dfe/execroot/shasta/external/npm__signal-exit__3.0.7/package (No such file or directory)
ERROR: /home/austin/local/three-shasta/BUILD.bazel:18:22: Copying directory npm__write-file-atomic__3.0.3/package [for tool] failed: Exec failed due to IOException: /media/austin/a7ea7e92-b2f9-4e20-8134-54e7a3d73a5c/home/austin/.cache/bazel/_bazel_austin/d42add3dc3a56cc3d91cd8759e714dfe/execroot/shasta/external/npm__write-file-atomic__3.0.3/package (No such file or directory)
INFO: Elapsed time: 97.282s, Critical Path: 1.43s
INFO: 1835 processes: 7 remote cache hit, 1797 internal, 31 local.
ERROR: Build did NOT complete successfully
INFO: Streaming build results to: https://cluster.engflow.com/invocation/9c0ac736-ee47-461c-876d-565275794751

I'm 3 for 3 now in my test of reproducing it. Any ideas what to look for or try?

If I run the exact same build command again, it works the second try.

And now that I ran the build command the second time, I can't reproduce the error... Time to dig in some more.

@snakethatlovesstaticlibs

I've also experienced this, though I'm not sure what's causing it.

@hjellek
Contributor

hjellek commented Feb 23, 2024

We are also seeing this on Bazel 7.0.2 (and on 7.1.0rc1), but for us it is now consistent on a single package (during tests) in CI:

ERROR: /home/runner/.cache/bazel/_bazel_runner/bccdf71c163d013389bc1939108453cd/external/npm_typescript/BUILD.bazel:32:17: Copying directory npm_typescript/package [for tool] failed: Exec failed due to IOException: /home/runner/.cache/bazel/_bazel_runner/bccdf71c163d013389bc1939108453cd/execroot/_main/external/npm_typescript/package (No such file or directory)

Given the name of the target I suspect that https://github.com/aspect-build/rules_ts/blob/main/ts/private/npm_repositories.bzl#L87 is involved, even though I am not sure how or why.

@gregmagolan
Member

gregmagolan commented Mar 6, 2024

The work-around documented here has resolved this for a few users we've talked to on Slack: https://github.com/aspect-build/rules_js/blob/main/docs/faq.md#flaky-build-failure-exec-failed-due-to-ioexception

Unclear what the root cause is but looks related to "build without the bytes" with remote-able copy actions from bazel-lib.

Note: if you're using persistent runners, then even with this fix landed at HEAD, your runner could still get the external repository into this bad state if a build runs on the runner for a PR whose base branch lacks the fix. After landing, you'll need to ask all developers to rebase PRs past the fix so that all builds on the persistent runner have the flags set.

@philsc

philsc commented Mar 6, 2024

I wanted to note that the workaround did not work in our case. It was still happening with 7.0.2.

@ewhauser

ewhauser commented Mar 7, 2024

I can also confirm that this workaround does not solve the issue

@gregmagolan
Member

gregmagolan commented Mar 7, 2024

Unfortunately, I don't have a repro of this issue in any of our builds on rules_js CI or in our internal uses. We are on 7.0.2 internally.

@ewhauser I believe you're on persistent runners without RBE.
@philsc You're on RBE with persistent runners correct?

Are you using bzlmod or WORKSPACE?

Have you double-checked that you don't have multiple --modify_execution_info flags set? This flag is not additive.
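
A rough sketch of how such a check could be scripted, assuming the flag accepts a comma-separated list of `mnemonic=[+-]key` entries in a single value. `find_flag_values` and `merged_flag` are hypothetical helpers, `CopyFile` is only an illustrative second mnemonic, and the scan deliberately ignores which command or `--config` each line belongs to.

```python
FLAG = "--modify_execution_info="


def find_flag_values(bazelrc_text: str) -> list[str]:
    """Return the value of every active --modify_execution_info flag, in file order."""
    values = []
    for line in bazelrc_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and commented-out lines
        for token in stripped.split():
            if token.startswith(FLAG):
                values.append(token[len(FLAG):])
    return values


def merged_flag(values: list[str]) -> str:
    """Combine several flag values into one comma-separated flag, dropping duplicates."""
    entries = []
    for value in values:
        for entry in value.split(","):
            if entry and entry not in entries:
                entries.append(entry)
    return FLAG + ",".join(entries)


# Hypothetical bazelrc contents, for illustration only.
sample = """\
build --modify_execution_info=CopyDirectory=+no-remote
# build --modify_execution_info=Ignored=+no-remote
build --modify_execution_info=CopyFile=+no-remote
"""

values = find_flag_values(sample)
if len(values) > 1:
    print("multiple settings found; a single combined flag would be:")
    print(merged_flag(values))
```

Because the scan does not distinguish between `build`, `common`, or `--config`-specific lines, a hit only indicates where to look; whether two settings actually collide depends on which of them apply to a given invocation.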

@philsc

philsc commented Mar 7, 2024

Huh. I would never have guessed that --modify_execution_info is not additive. I suspect that's where my mistake was. I will check.

@gregmagolan
Member

Should be fixed by #1533 which is included in the v1.39.1 release.

@matthewjh

@gregmagolan, so with that fix do we still need --modify_execution_info=CopyDirectory=+no-remote?

@gregmagolan gregmagolan reopened this Mar 20, 2024
@gregmagolan
Member

It looks like the fix just made the flake less frequent but didn't fix it. Re-opening this issue.

Our plan is now to switch the copy_directory rule that takes the source directory as its input to a new tar extract rule that takes the tgz file as an input. That should solve it once and for all. @thesayyn and @alexeagle are looking into expanding the tar rule in bazel-lib to support extraction.

@philsc

philsc commented Mar 21, 2024

As an aside: Would it be worth filing an upstream Bazel issue for this? This feels like a bug in Bazel that you're working around here.

@binoche9

binoche9 commented Mar 21, 2024

We've narrowed the problem (for us at least) down to the --experimental_merged_skyframe_analysis_execution (Skymeld) flag, which was made the default in Bazel 7. Adding --noexperimental_merged_skyframe_analysis_execution made things pass for us.

I believe that we already had --experimental_merged_skyframe_analysis_execution enabled in Bazel 6.4, so I'm guessing that something in the implementation behind that flag changed from 6.4 to 7 (I'm working on this upgrade, and this issue was blocking me)

I don't have a minimal repro, but my Bazel 7 upgrade was failing consistently without --noexperimental_merged_skyframe_analysis_execution and now consistently passes with it. That being said, the failure was surfacing for us in a very specific environment so I feel there must be more to triggering this issue, but I've been unable to tell what the additional trigger might be. It might be that the environment it was running in was a docker container, but that's speculation on my part.

@gregmagolan gregmagolan changed the title [Bug]: Flaky build failure: directory copy fails [Bug]: Flaky build failure: npm package directory copy fails with "No such file or directory" Mar 23, 2024
@gregmagolan
Member

gregmagolan commented Apr 10, 2024

Re-opening as even with the fix in #1538 there are still the corner cases that use source directories: packages with lifecycle hooks & packages with patches.

@gregmagolan gregmagolan reopened this Apr 10, 2024
@fmeum
Contributor

fmeum commented Apr 19, 2024

CC @joeleba Based on #1412 (comment) and the error message it looks more likely that this has something to do with IncrementalPackageRoots and not so much with remote execution. Do you see a potential for a rare race related to external repos?

@gregmagolan
Member

gregmagolan commented Apr 25, 2024

More detailed update on this issue now that #1538 has landed:

#1538 has "mostly" fixed this issue: for most packages (those that don't have lifecycle hooks or patches), the fix means they no longer use a CopyDirectory action with a source directory input to copy into the virtual store. Instead, those packages use a tar toolchain to extract the package .tgz directly into the virtual store.

As mentioned above, packages with lifecycle hooks and those with patches will require separate fixes in the future to no longer use source directory inputs.

@gregmagolan gregmagolan removed the untriaged Requires traige label Apr 25, 2024
@SinimaWath
Contributor

JFYI:

Updating to the latest version of rules_js fixed this issue for us in 100% of cases.

Projects
Status: 🏗 In progress