Build fails with exit code 34 rather than 39 on missing input files #21778

Closed
hoj-stripe opened this issue Mar 22, 2024 · 4 comments
Assignees
Labels
P1 I'll work on this now. (Assignee required) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@hoj-stripe

Description of the bug:

Setup:

  • Ten targets with one output file each (“prerequisite targets”).
  • One target that takes each of the ten prerequisite targets as input (“final target”).
  • Remote cache and remote execution.
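
For readers who want to picture the layout, here is a minimal sketch that renders a BUILD file with this shape. Only the names `target_0` to `target_9` and `all_files`, and the fact that the prerequisites are genrules, come from the logs in this issue. The output file names, the `cmd` strings, and the choice of a genrule for `all_files` are assumptions made for illustration, not the contents of the actual repro repository.

```python
def render_build_file(num_prerequisites: int = 10) -> str:
    """Return BUILD file text: N single-output genrules plus one consumer.

    Output names and commands are hypothetical; only the target names
    mirror the logs in the issue.
    """
    rules = []
    for i in range(num_prerequisites):
        # One "prerequisite target" with exactly one output file.
        rules.append(
            "genrule(\n"
            f'    name = "target_{i}",\n'
            f'    outs = ["target_{i}.txt"],\n'
            f'    cmd = "echo {i} > $@",\n'
            ")\n"
        )
    # The "final target" takes every prerequisite target as input.
    srcs = ", ".join(f'":target_{i}"' for i in range(num_prerequisites))
    rules.append(
        "genrule(\n"
        '    name = "all_files",\n'
        f"    srcs = [{srcs}],\n"
        '    outs = ["all_files.txt"],\n'
        '    cmd = "cat $(SRCS) > $@",\n'
        ")\n"
    )
    return "\n".join(rules)


if __name__ == "__main__":
    print(render_build_file())
```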

Reproduction:

  1. Build the prerequisite targets in output base A.
  2. Delete all entries from the remote cache (AC and CAS).
  3. Build the final target in output base B.
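
The two build steps can be sketched as command lines. The snippet below only assembles argument lists and does not run Bazel. `--output_base`, `--remote_cache`, and `--remote_executor` are existing Bazel options, and port 8980 for the executor comes from this issue. The cache endpoint, the output base paths, and the helper name `bazel_build_argv` are assumptions for illustration. Step 2 (clearing the AC and CAS) depends on the cache implementation and is left out.

```python
from typing import List

# Endpoints: the executor port is from the issue; the cache address is hypothetical.
REMOTE_EXECUTOR = "grpc://localhost:8980"
REMOTE_CACHE = "grpc://localhost:9092"


def bazel_build_argv(output_base: str, targets: List[str],
                     remote_cache: str = REMOTE_CACHE,
                     remote_executor: str = REMOTE_EXECUTOR) -> List[str]:
    """Assemble a `bazel build` command line for one output base."""
    return [
        "bazel",
        f"--output_base={output_base}",  # startup option, must precede the command
        "build",
        f"--remote_cache={remote_cache}",
        f"--remote_executor={remote_executor}",
        *targets,
    ]


# Step 1: build the ten prerequisite targets in output base A.
step_1 = bazel_build_argv("/tmp/output_base_a",
                          [f"//:target_{i}" for i in range(10)])

# Step 3: build the final target in a fresh output base B.
step_3 = bazel_build_argv("/tmp/output_base_b", ["//:all_files"])

if __name__ == "__main__":
    print(" ".join(step_1))
    print(" ".join(step_3))
```

Note that the cache and the executor are separate endpoints in this sketch, which is the configuration that turns out to matter later in the thread.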

Expected behavior:

Either Bazel rebuilds everything successfully, or it exits with 39 because the expected cache entries cannot be found remotely during evaluation.

Actual behavior:

Logs from step 3 indicate exit code 34:

Starting local Bazel server and connecting to it...
INFO: Invocation ID: 671b6d5b-516d-4685-8489-5b6e2e51c394
INFO: Analyzed target //:all_files (5 packages loaded, 27 targets configured).
[1 / 1] checking cached actions
[1 / 12] 8 actions, 0 running
    [Sched] Executing genrule //:target_2
    [Sched] Executing genrule //:target_3
    [Sched] Executing genrule //:target_6
    [Sched] Executing genrule //:target_0
    [Sched] Executing genrule //:target_7
    [Sched] Executing genrule //:target_5
    [Sched] Executing genrule //:target_9
    [Sched] Executing genrule //:target_1
ERROR: /pay/src/cache-eviction-hang/BUILD:1:9: Executing genrule //:target_6 failed: (Exit 34): FAILED_PRECONDITION: Failed to obtain action: Object not found
ERROR: /pay/src/cache-eviction-hang/BUILD:1:9: Executing genrule //:target_9 failed: (Exit 34): FAILED_PRECONDITION: Failed to obtain action: Object not found
ERROR: /pay/src/cache-eviction-hang/BUILD:1:9: Executing genrule //:target_1 failed: (Exit 34): FAILED_PRECONDITION: Failed to obtain action: Object not found
ERROR: /pay/src/cache-eviction-hang/BUILD:1:9: Executing genrule //:target_5 failed: (Exit 34): FAILED_PRECONDITION: Failed to obtain action: Object not found
ERROR: /pay/src/cache-eviction-hang/BUILD:1:9: Executing genrule //:target_3 failed: (Exit 34): FAILED_PRECONDITION: Failed to obtain action: Object not found
ERROR: /pay/src/cache-eviction-hang/BUILD:1:9: Executing genrule //:target_0 failed: (Exit 34): FAILED_PRECONDITION: Failed to obtain action: Object not found
ERROR: /pay/src/cache-eviction-hang/BUILD:1:9: Executing genrule //:target_2 failed: (Exit 34): FAILED_PRECONDITION: Failed to obtain action: Object not found
Target //:all_files failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 3.747s, Critical Path: 0.24s
INFO: 11 processes: 11 internal.
ERROR: Build did NOT complete successfully
Bazel exited with 34

Investigation:

We have not investigated this bug in depth; we found it by accident while trying to reproduce #21777.

Which category does this issue belong to?

Remote Execution

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Run a remote executor locally on port 8980. In our reproduction, we used https://github.com/buildbarn/bb-deployments/.
Check out https://github.com/hoj-stripe/cache-eviction-hang, then run ./repro_34_exit.sh.

Which operating system are you running Bazel on?

Ubuntu 20.04.6 LTS

What is the output of bazel info release?

release 7.1.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

cc @clintharrison @sushain97 @coeuvre

@github-actions github-actions bot added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Mar 22, 2024
@coeuvre coeuvre added P1 I'll work on this now. (Assignee required) and removed untriaged labels Mar 26, 2024
@coeuvre coeuvre self-assigned this Mar 26, 2024
copybara-service bot pushed a commit that referenced this issue Mar 28, 2024
and when it's missing, treat it as remote cache eviction.

Also revert the workaround for #19513.

Fixes #21777.
Potential fix for #21626 and #21778.

Closes #21825.

PiperOrigin-RevId: 619877088
Change-Id: Ib1204de8440b780e5a6ee6a563a87da08f196ca5
@coeuvre
Member

coeuvre commented Mar 28, 2024

Can you patch eda0fe4 and check whether this issue is still reproducible?

@hoj-stripe
Author

hoj-stripe commented Mar 28, 2024

I cherry-picked eda0fe4 on https://github.com/bazelbuild/bazel/tree/release-7.2.0, and this still happens :(

output.txt

It's possible that this failure is related to the implementation of buchgr's remote cache, but I'm still a little confused about why Bazel proceeds to upload the action digest blob after not finding the action cache entry, and why these uploads don't go through this check.

e.g.
2024/03/28 17:22:03 GRPC AC GET 1943efc159f068f5bf5da9f2999e9ed56f74440efd605d834bc4abd1e8220fb1 NOT FOUND
followed by
2024/03/28 17:22:03 GRPC BYTESTREAM WRITE COMPLETED: uploads/8ab0f09c-bd9d-4d33-a5a5-ba189ccd0ab3/blobs/1943efc159f068f5bf5da9f2999e9ed56f74440efd605d834bc4abd1e8220fb1/147

iancha1992 pushed a commit to iancha1992/bazel that referenced this issue Mar 28, 2024
and when it's missing, treat it as remote cache eviction.

Also revert the workaround for bazelbuild#19513.

Fixes bazelbuild#21777.
Potential fix for bazelbuild#21626 and bazelbuild#21778.

Closes bazelbuild#21825.

PiperOrigin-RevId: 619877088
Change-Id: Ib1204de8440b780e5a6ee6a563a87da08f196ca5
iancha1992 added a commit that referenced this issue Apr 2, 2024
and when it's missing, treat it as remote cache eviction.

Also revert the workaround for #19513.

Fixes #21777.
Potential fix for #21626 and #21778.

Closes #21825.

PiperOrigin-RevId: 619877088
Change-Id: Ib1204de8440b780e5a6ee6a563a87da08f196ca5

Commit
eda0fe4

Co-authored-by: Chi Wang <chiwang@google.com>
iancha1992 pushed a commit to iancha1992/bazel that referenced this issue Apr 9, 2024
and when it's missing, treat it as remote cache eviction.

Also revert the workaround for bazelbuild#19513.

Fixes bazelbuild#21777.
Potential fix for bazelbuild#21626 and bazelbuild#21778.

Closes bazelbuild#21825.

PiperOrigin-RevId: 619877088
Change-Id: Ib1204de8440b780e5a6ee6a563a87da08f196ca5
iancha1992 added a commit that referenced this issue Apr 9, 2024
and when it's missing, treat it as remote cache eviction.

Also revert the workaround for #19513.

Fixes #21777.
Potential fix for #21626 and #21778.

Closes #21825.

PiperOrigin-RevId: 619877088
Change-Id: Ib1204de8440b780e5a6ee6a563a87da08f196ca5

Commit
eda0fe4

Co-authored-by: Chi Wang <chiwang@google.com>
@coeuvre
Member

coeuvre commented May 13, 2024

For the second invocation, --remote_cache and --remote_executor point to different endpoints. I don't know how the remote executor was set up, but it seems that it doesn't read the AC/CAS from --remote_cache:

  1. The second invocation started with an empty remote cache and remote execution. So we have 2024/03/28 17:22:03 GRPC AC GET 1943efc159f068f5bf5da9f2999e9ed56f74440efd605d834bc4abd1e8220fb1 NOT FOUND.
  2. Bazel decided to execute the actions for target_* remotely at --remote_executor. Before sending the execution request, it needs to upload all the inputs, including the action blob, to --remote_cache. Hence we see 2024/03/28 17:22:03 GRPC BYTESTREAM WRITE COMPLETED: uploads/8ab0f09c-bd9d-4d33-a5a5-ba189ccd0ab3/blobs/1943efc159f068f5bf5da9f2999e9ed56f74440efd605d834bc4abd1e8220fb1/147
  3. After confirming that all the inputs had been uploaded to --remote_cache (the following GRPC CAS HEAD requests), Bazel sent the execution request to --remote_executor.
  4. --remote_executor didn't read the AC/CAS from --remote_cache. Instead, it used its own storage, which of course doesn't contain the action object.
  5. Bazel got back FAILED_PRECONDITION: Failed to obtain action: Object not found. Since this is a remote execution error, Bazel exited with 34.
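
The sequence above can be modelled with a toy simulation. This is a simplified sketch, not Bazel's or Buildbarn's actual code: the `Store` and `Executor` classes and the `run_action` function are hypothetical names, and the stores are plain dictionaries. It only illustrates why an upload to one store followed by a lookup in a different store ends in the error and exit code reported in this issue.

```python
import hashlib

REMOTE_ERROR_EXIT_CODE = 34  # the exit code seen in this issue for remote execution errors


class ExecutionFailed(Exception):
    """Raised by the toy executor when it cannot load the action."""


class Store:
    """Toy content-addressed store: digest -> blob."""

    def __init__(self) -> None:
        self.blobs = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.blobs[digest] = data
        return digest

    def has(self, digest: str) -> bool:
        return digest in self.blobs


class Executor:
    """Toy executor that reads actions only from the storage it was given."""

    def __init__(self, storage: Store) -> None:
        self.storage = storage

    def execute(self, action_digest: str) -> None:
        if not self.storage.has(action_digest):
            raise ExecutionFailed(
                "FAILED_PRECONDITION: Failed to obtain action: Object not found")


def run_action(action: bytes, remote_cache: Store, executor: Executor) -> int:
    """Upload the action to the cache, then ask the executor to run it."""
    digest = remote_cache.put(action)   # step 2: upload to --remote_cache
    assert remote_cache.has(digest)     # step 3: confirm the upload
    try:
        executor.execute(digest)        # steps 3-4: executor looks in its own storage
    except ExecutionFailed:
        return REMOTE_ERROR_EXIT_CODE   # step 5: remote execution error
    return 0


if __name__ == "__main__":
    shared = Store()
    print(run_action(b"action", shared, Executor(shared)))    # same storage
    print(run_action(b"action", Store(), Executor(Store())))  # split storage
```

With a shared store the toy build succeeds; with separate stores the executor never sees the uploaded action, which matches the behavior described in steps 4 and 5.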

@hoj-stripe
Author

Ahh yeah, that makes sense. I think Buildbarn's bare remote executor assumes that its remote cache is the Buildbarn storage rather than whatever was passed in the --remote_cache= argument, so the misconfiguration is my mistake. Thank you for looking into this. This issue can be closed, since it isn't a bug in Bazel.

@hoj-stripe hoj-stripe closed this as not planned May 13, 2024