Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--experimental_remote_cache_eviction_retries doesn't evict all metadata causing build failure #18696

Open
BalestraPatrick opened this issue Jun 15, 2023 · 13 comments
Assignees
Labels
more data needed P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@BalestraPatrick
Copy link
Member

Description of the bug:

Hello!

We have used --experimental_remote_cache_eviction_retries=1 with --remote_download_toplevel during the last few weeks, and we noticed from our data that in some cases, when Bazel exits with exit code 39, the following invocation will fail with the exact same error. This is not always the case (we have seen it correctly recover in some situations), but for one specific CppArchive action, we see it failing in about 1% of our builds.

The logs look like the following:

# first invocation
ERROR: /Users/path/BUILD.bazel:122:16: Linking fl.a failed: Failed to fetch blobs because they do not exist remotely.: Missing digest: 6c23fb8f89c7a10035172c6f6fcadcd37a58078ac7f141bce745ee4dee40899c/159248
...
Found remote cache eviction error, retrying the build...
...
# second invocation
ERROR: /Users/path/BUILD.bazel:122:16: Linking fl.a failed: Failed to fetch blobs because they do not exist remotely.: Missing digest: 6c23fb8f89c7a10035172c6f6fcadcd37a58078ac7f141bce745ee4dee40899c/159248

The only way to recover from this failure seems to be to run bazel clean or disable BwtB (--remote_download_toplevel).

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

In the gRPC log for the second invocation, we see the following:

---------------------------------------------------------
metadata {
  tool_details {
    tool_name: "bazel"
    tool_version: "6.1.0-7ccf98d" <- don't mind this, we build Bazel from source from our own fork and we currently pass the wrong version even though it's based off the 6.2.0 branch.
  }
  action_id: "7d45850557079c80512a065d2250586c4ac53cd91a468cc137e1ca59b3e1fbaa"
  tool_invocation_id: "007220b0-22de-401e-a407-5d6dd56225bf"
  correlated_invocations_id: "cdaeafdf-8dde-44e0-94a6-6dbc3c47c3a6"
  action_mnemonic: "CppArchive"
  target_id: "//path:lib"
  configuration_id: "702272d6c0c9a184ce867b340aaf98a381dfa7e5f316a7a64e42777e966f1345"
}
status {
  code: 1
  message: "com.google.devtools.build.lib.remote.common.CacheNotFoundException: Missing digest: 6c23fb8f89c7a10035172c6f6fcadcd37a58078ac7f141bce745ee4dee40899c/159248"
}
method_name: "google.bytestream.ByteStream/Write"
details {
  write {
  }
}
start_time {
  seconds: 1686831469
  nanos: 922000000
}
end_time {
  seconds: 1686831469
  nanos: 943000000
}

Which operating system are you running Bazel on?

macOS

What is the output of bazel info release?

6.2.0 @ 286306e

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@brentleyjones
Copy link
Contributor

@coeuvre

@iancha1992 iancha1992 added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Jun 16, 2023
@coeuvre
Copy link
Member

coeuvre commented Jun 16, 2023

That's interesting. Maybe I am wrong, but I don't think the code for cleaning Bazel's internal state relies on the specific actions, i.e. if it works for other actions, it should work for CppArchive.

For the missing artifact, how long between it was generated and requested to be downloaded by Bazel? And how long one invocation of this build usually be?

Is that possible to collect logs from server side of the access patterns for this particular blob? e.g. create time, refresh time, delete time, and etc.

@brentleyjones
Copy link
Contributor

For the missing artifact, how long between it was generated and requested to be downloaded by Bazel? And how long one invocation of this build usually be?

I don't think it matters? The first build fails with missing cache item, and it rebuilds the whole build, and the input shouldn't be assumed to exist, so the action that generates the missing item should be re-run. That's not happening. Instead the dependent action is simply run again (which is guaranteed to fail).

@coeuvre
Copy link
Member

coeuvre commented Jun 16, 2023

The generating action is expected to be re-run. Do you have execution log for the generating action from the second invoction? Otherwise, it might be the case that the file was evicted again before Bazel requested to download it.

@brentleyjones
Copy link
Contributor

brentleyjones commented Jun 16, 2023

Patrick should be able to provide that.

The build log shows only 2 actions though (status.txt and the dependent action), in the gRPC log for the second run there isn't any upload (or any calls really) for that blob/action, and it was part of the findMissingBlobs call, so I assume that means Bazel knows that it wasn't generated locally.

@coeuvre
Copy link
Member

coeuvre commented Jun 16, 2023

The build log shows only 2 actions though (status.txt and the dependent action)

This looks like a Bazel bug to me. Probably the state was not cleaned due to race conditions.

@coeuvre coeuvre self-assigned this Jun 16, 2023
@coeuvre coeuvre added P1 I'll work on this now. (Assignee required) and removed untriaged labels Jun 16, 2023
@BalestraPatrick
Copy link
Member Author

@coeuvre I sent you a gRPC log that I had from the second invocation that caused the build to fail. I hope it's useful. I can add some logging for collecting an execution log, and hope to catch it again in the next few days.

copybara-service bot pushed a commit that referenced this issue Jun 20, 2023
5 was picked randomly for --experimental_remote_cache_eviction_retries in the integration test. While it is working perfectly to test the rewinding, it cannot reports transient errors (e.g. #18696) because a next retry will probably fix that. This CL changes it to 1.

Also add a assert to check invocation ids from second attempt is different to catch #18694.

PiperOrigin-RevId: 541855893
Change-Id: I5f07cc1ed91a328454ed4949a04ff1acf6fa98b7
@brentleyjones
Copy link
Contributor

@bazelbuild/triage this is marked as a P1 but hasn't seen movement in 9 months. Can we get some eyes on it please? Thank you.

@iancha1992
Copy link
Member

@bazelbuild/triage this is marked as a P1 but hasn't seen movement in 9 months. Can we get some eyes on it please? Thank you.

cc: @coeuvre for visibility

@coeuvre
Copy link
Member

coeuvre commented Feb 28, 2024

IIRC, the log @BalestraPatrick shared with me didn't reveal the root cause.

How frequent is the issue now in Bazel 7? Can you capture the logs again given that the code has changed a lot in Bazel 7?

@coeuvre coeuvre added P2 We'll consider working on this in future. (Assignee optional) more data needed and removed P1 I'll work on this now. (Assignee required) labels Feb 28, 2024
@acecilia
Copy link

acecilia commented Apr 26, 2024

We are also experiencing this using Bazel 7, but is pretty hard to reproduce, I did not manage to reproduce it yet in a way that I can troubleshoot it

@oliviernotteghem
Copy link

We're hitting the same issue too, and often enough that BwoB is not usable for our (local) development environment. We're using Bazel 7. Can this be prioritize?

@coeuvre
Copy link
Member

coeuvre commented May 13, 2024

It's worth trying whether this is fixed by the upcoming release 7.2 which includes eda0fe4.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
more data needed P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug
Projects
None yet
Development

No branches or pull requests

8 participants