
No local fallback after cache timeout #20123

Open
Tracked by #19904
miscott2 opened this issue Nov 9, 2023 · 12 comments
Labels
P3 We're not considering working on this, but happy to review a PR. (No assignee) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: feature request

Comments

miscott2 commented Nov 9, 2023

Description of the bug:

While running a build, our Artifactory HTTP cache timed out for a request. We're obviously looking into why that happened, but we expected Bazel to fall back to running the action locally; instead it failed the build:
ERROR: /<workspace path>/BUILD:1802:10: Compiling <source file>.c failed: unable to finalize action: Download of '/<artifactory repo path>/cas/b8f31e5fda95495273a86cc5c7395298eb321490395ca90815a9184f2a9ec980' timed out. Received 0 bytes.

The documentation suggests --remote_local_fallback only applies to remote execution, but some comments on the bug tracker suggested it might also apply to remote caching, so we tried it and still saw the issue.

I can try to recreate the issue, but it's not totally trivial, as I'll need to set up an HTTP server that can deliberately time out. So I thought I'd check whether this is expected behavior, or whether there is a trivially obvious bug to someone who knows the Bazel source.

Which category does this issue belong to?

Core

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Not trivial to reproduce! I can work on that if it would be useful.
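If it helps, here is a minimal sketch of how I'd fake a timing-out cache. It assumes Bazel's HTTP cache protocol (blobs PUT/GET at /cas/&lt;sha256&gt;, action results at /ac/&lt;sha256&gt;); the stalled digest fragment and port are made-up placeholders:

```python
# Sketch of a repro harness. Assumptions: Bazel's HTTP cache protocol
# PUTs/GETs blobs at /cas/<sha256> and action results at /ac/<sha256>;
# "deadbeef" below is a hypothetical digest fragment, not a real one.
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

STALL_SUBSTRING = "deadbeef"  # hypothetical digest fragment to stall on
STALL_SECONDS = 120           # pick something longer than --remote_timeout


class StallingCacheHandler(BaseHTTPRequestHandler):
    store = {}  # request path -> blob bytes, shared across requests

    def do_PUT(self):
        # Accept uploads so Bazel can populate the cache normally.
        length = int(self.headers.get("Content-Length", 0))
        self.store[self.path] = self.rfile.read(length)
        self.send_response(200)
        self.end_headers()

    def do_GET(self):
        # Hang only on the chosen CAS entry to simulate the timeout.
        if STALL_SUBSTRING in self.path:
            time.sleep(STALL_SECONDS)
        blob = self.store.get(self.path)
        if blob is None:
            self.send_response(404)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Length", str(len(blob)))
        self.end_headers()
        self.wfile.write(blob)

    def log_message(self, *args):
        pass  # keep output quiet


def serve(port):
    server = ThreadingHTTPServer(("localhost", port), StallingCacheHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Then point a build at it with something like --remote_cache=http://localhost:&lt;port&gt; and a short --remote_timeout, so the stalled GET exceeds the deadline.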

Which operating system are you running Bazel on?

Linux - RHEL 8

What is the output of bazel info release?

release 7.0.0-pre.20231011.2- (@non-git)

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

Built from the 7.0.0-pre.20231011.2 release tag.

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No

Any other information, logs, or outputs that you want to share?

No response


tjgq commented Nov 14, 2023

The intention behind --remote_local_fallback was indeed for it to only apply to remote execution. So this should be considered a feature request to add a similar feature for remote caching.

One possibility is to fold this work into #19904 (but that would be a fairly large project, so we might still consider implementing this differently in the interim).

@tjgq tjgq removed the type: bug label Nov 14, 2023
@tjgq tjgq mentioned this issue Nov 14, 2023
12 tasks
@tjgq tjgq removed their assignment Nov 14, 2023
@tjgq tjgq added the P3 We're not considering working on this, but happy to review a PR. (No assignee) label Nov 14, 2023
JSGette commented Nov 16, 2023

We used --remote_local_fallback with --remote_cache=&lt;url&gt; and it works: in case of any outage on the remote cache server side, the build proceeded without caching. It still works in Bazel 6.3.2.

miscott2 (Author) commented:

@tjgq: @JSGette's comment made me re-check our logs. All the examples I can find relate to actions that download .d files as part of cc_common.compile(). While our builds are about two-thirds compile actions, the number of examples is starting to look suspicious. There are also examples of timeouts for actions that aren't part of cc_common.compile(), and those correctly show up as warnings and a local action is run.

Could there be something special about these .d files? I know that cc_common.compile() does a special end-of-action step to trim dependencies, which I assume uses the .d files. Could something about that step be making cache timeouts behave differently in this case?
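For context, my understanding (not confirmed against the Bazel source) is that the .d files are the small Makefile-fragment dependency files the compiler emits via -MD/-MF, which Bazel parses after each compile action to prune the action's input set, e.g.:

```
# foo.d, as emitted by gcc -MD (illustrative file names)
foo.o: foo.c foo.h bar/baz.h
```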

I'll try to do more investigation on our end as well. I did set up my own HTTP cache that times out for a specific CAS entry corresponding to a .d file, but so far I haven't reproduced the issue.

luispadron commented:
I'm also seeing something similar to this in Bazel 7 without BwtB (Build without the Bytes):

11:01:10 ERROR: Foo/BUILD.bazel:11:15: Compiling Foo.c failed: unable to finalize action: Missing digest: <number>/<number> for bazel-out/ios_arm64-opt-ios-arm64-min12.0-applebin_ios-ST-<sha>/bin/path/to/Foo.d

Interestingly, we're only seeing this for the .d files as well.

luispadron commented:
@tjgq is this actually a feature request? It feels like a bug, since this works just fine for us in Bazel 6.


tjgq commented May 13, 2024

There are two distinct issues here.

  1. The fact that --remote_local_fallback doesn't cause a fallback to occur for local execution with a remote cache is a feature request (the flag is only supposed to have an effect for remote execution).
  2. The fact that action execution fails with "unable to finalize action" may be a bug. Note that the error message in the original report is a timeout, while yours is a missing digest, so it's unclear that we're looking at the same root cause.

A missing digest means that Bazel was previously made aware of the existence of a digest in the remote cache, but it's no longer there by the time it tries to download it. The "build without the bytes" default has changed between Bazel 6 and 7, which widens the window between these two events (during which the blob can be evicted, i.e., deleted from the cache). It's completely up to the remote cache to decide for how long to keep an entry around; Bazel does not set an explicit lifetime nor ask for entries to be deleted.

Does the remote cache implementation you're using provide any sort of log that could be used to determine whether the missing digest used to be there, and if so, the reason why it was evicted?

The fact that this only happens with .d files is suspicious (they are, in fact, something of an edge case in Bazel), but to be frank, without a repro I'm not really sure where I should be looking for a bug.

luispadron commented:
Thanks for the reply. Yeah, the errors are slightly different, but they have a related cause, because of the .d files.

FWIW we're using --remote_download_outputs=all, so I was expecting nothing to change here for us. Any ideas what to check next besides the remote cache logs? I can open a separate issue for this if you think that makes sense.


tjgq commented May 13, 2024

I have a hunch: does setting --noexperimental_inmemory_dotd_files make the issue go away?

Otherwise, capturing a --experimental_remote_grpc_log (log of all of the interactions between Bazel and the remote cache) should make it possible to check whether Bazel was indeed told by the remote cache that the digest was present (and how much time elapsed until it tried to download it).
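For example, either or both could be dropped into a .bazelrc (flag names as above; the log path is just a placeholder):

```
# .bazelrc (illustrative)
build --noexperimental_inmemory_dotd_files
build --experimental_remote_grpc_log=/tmp/grpc.log
```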

luispadron commented:
Thanks for the suggestion; we're testing out --noexperimental_inmemory_dotd_files now.

luispadron commented:
@tjgq So --noexperimental_inmemory_dotd_files does seem to work; at least we haven't hit this issue in a few iterations. Should I open a separate issue for that, or is this known?


tjgq commented May 15, 2024

Thanks for confirming my suspicion; that gives me a hint as to where the problem might be. Do you mind filing a fresh issue so we can track it separately?

luispadron commented:
I filed #22387 thanks!
