
Output download failed: digest mismatch #12927

Closed
federicohml opened this issue Jan 29, 2021 · 3 comments
Assignees
Labels
more data needed P1 I'll work on this now. (Assignee required) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@federicohml


Description of the problem / feature request:

After upgrading to Bazel 4.0.0, I get the following error:

com.google.devtools.build.lib.remote.BulkTransferException
        at com.google.devtools.build.lib.remote.RemoteCache.waitForBulkTransfer(RemoteCache.java:227)
        at com.google.devtools.build.lib.remote.RemoteCache.download(RemoteCache.java:338)
        at com.google.devtools.build.lib.remote.RemoteSpawnRunner.downloadAndFinalizeSpawnResult(RemoteSpawnRunner.java:487)
        at com.google.devtools.build.lib.remote.RemoteSpawnRunner.exec(RemoteSpawnRunner.java:307)
        at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:240)
        at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:140)
        at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:102)
        at com.google.devtools.build.lib.actions.SpawnStrategy.beginExecution(SpawnStrategy.java:47)
        at com.google.devtools.build.lib.exec.SpawnStrategyResolver.beginExecution(SpawnStrategyResolver.java:65)
        at com.google.devtools.build.lib.rules.cpp.CppLinkAction.beginExecution(CppLinkAction.java:306)
        at com.google.devtools.build.lib.actions.Action.execute(Action.java:127)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$5.execute(SkyframeActionExecutor.java:855)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:1016)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:975)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:129)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:81)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:472)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:834)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:307)
        at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:477)
        at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:398)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
        Suppressed: java.io.IOException: Output download failed: Expected digest 'da191677a1b73bbf16d06669b32ec2a3b65fdcb316f833429ee8446b18b221dd/400451240' does not match received digest '07b7519d3f1492ed9f64ff8396487c340a0608dacc0b05fcffcf06149c7e19ff/401499816'.
                at com.google.devtools.build.lib.remote.util.Utils.verifyBlobContents(Utils.java:201)
                at com.google.devtools.build.lib.remote.GrpcCacheClient$1.onCompleted(GrpcCacheClient.java:372)
                at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:476)
                at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
                at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
                at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
                at com.google.devtools.build.lib.remote.util.NetworkTime$NetworkTimeCall$1.onClose(NetworkTime.java:113)
                at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:413)
                at io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66)
                at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:742)
                at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:721)
                at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
                at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
                ... 3 more


What operating system are you running Bazel on?

Ubuntu 20.04

What's the output of bazel info release?

release 4.0.0


Any other information, logs, or outputs that you want to share?

The exact same command that produced the output above was tested with Bazel 3.4.1 and 3.7.2; both work correctly.

@coeuvre
Member

coeuvre commented Feb 1, 2021

Can you share more details?

  • Which remote cache server does the build use?
  • Does the error occur for certain types of actions or just randomly?
  • Can you check whether the content of the file in the remote cache is correct?
  • Do you have a minimal example which can reproduce this error?

@coeuvre coeuvre self-assigned this Feb 1, 2021
@coeuvre coeuvre added more data needed P1 I'll work on this now. (Assignee required) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug labels Feb 1, 2021
@scele
Contributor

scele commented Feb 17, 2021

@coeuvre I think I have a reliable repro and also a root cause.

Repro case

1. Start a grpc cache server

To make the repro consistent, we hack the byte stream server to artificially delay sending each chunk by 1s.

#!/bin/bash
rm -rf bb-storage
git clone https://github.com/buildbarn/bb-storage.git
cd bb-storage
git checkout e2bed5b267e52baa66cf45512aa369df8ab821b5 # fixed commit

patch -p1 <<EOF
--- a/pkg/blobstore/grpcservers/byte_stream_server.go
+++ b/pkg/blobstore/grpcservers/byte_stream_server.go
@@ -3,6 +3,7 @@ package grpcservers
 import (
 	"context"
 	"io"
+	"time"
 
 	"github.com/buildbarn/bb-storage/pkg/blobstore"
 	"github.com/buildbarn/bb-storage/pkg/blobstore/buffer"
@@ -48,6 +49,8 @@ func (s *byteStreamServer) Read(in *bytestream.ReadRequest, out bytestream.ByteS
 		if readErr != nil {
 			return readErr
 		}
+		time.Sleep(1 * time.Second) // XXX
+
 		if writeErr := out.Send(&bytestream.ReadResponse{Data: readBuf}); writeErr != nil {
 			return writeErr
 		}
EOF

cat > config.jsonnet <<EOF
{
  blobstore: {
    contentAddressableStorage: {
      'local': {
        keyLocationMapInMemory: {
          entries: 16 * 1024 * 1024,
        },
        keyLocationMapMaximumGetAttempts: 8,
        keyLocationMapMaximumPutAttempts: 32,
        oldBlocks: 8,
        currentBlocks: 24,
        newBlocks: 3,
        blocksInMemory: {
          blockSizeBytes: ((8 * 1024 * 1024 * 1024) / (8 + 24 + 3)) | 0,
        },
      },
    },
    actionCache: {
      'local': {
        keyLocationMapInMemory: {
          entries: 16 * 1024 * 1024,
        },
        keyLocationMapMaximumGetAttempts: 8,
        keyLocationMapMaximumPutAttempts: 32,
        oldBlocks: 8,
        currentBlocks: 24,
        newBlocks: 3,
        blocksInMemory: {
          blockSizeBytes: ((4 * 1024 * 1024 * 1024) / (8 + 24 + 3)) | 0,
        },
      },
    },
  },
  global: { diagnosticsHttpListenAddress: ':9980' },
  grpcServers: [{
    listenAddresses: [':8980'],
    authenticationPolicy: { allow: {} },
  }],
  schedulers: {
    bar: { endpoint: { address: 'bar-scheduler:8981' } },
  },
  allowAcUpdatesForInstanceNamePrefixes: ['default'],
  maximumMessageSizeBytes: 16 * 1024 * 1024,
}
EOF

bazel run //cmd/bb_storage -- $PWD/config.jsonnet

2. Run bazel build

We force bazel to download a 256K CAS blob.
It will be downloaded in four 64K chunks, and downloading each chunk takes 1s due to the artificial delay in the server.
We use --remote_timeout=2 to inject a timeout into the download.

#!/bin/bash
set -ex

rm -rf bazel_test
mkdir bazel_test
cd bazel_test
echo > WORKSPACE
cat > BUILD <<EOF
genrule(
    name = "bigfile",
    outs = ["bigfile.img"],
    cmd = "fallocate -l 256K \$@",
)

genrule(
    name = "foo",
    srcs = [
        ":bigfile",
        "volatile.txt",
    ],
    outs = ["foo.txt"],
    cmd = "ls -lh \$(location :bigfile) > \$@",
)
EOF
touch volatile.txt

bazel version
# Make sure bigfile.img is written to remote cache.
bazel clean
bazel build :foo --remote_cache=grpc://localhost:8980 --remote_instance_name=default
# Now run the "foo" action with --remote_download_minimal so bazel needs to download bigfile.img from remote cache.
bazel clean
echo $(date) >> volatile.txt
set +e
bazel build :foo --remote_cache=grpc://localhost:8980 --remote_instance_name=default --remote_download_minimal --remote_timeout=2 --remote_retries=0 --experimental_remote_grpc_log=grpc.log
EXIT_CODE=$?
set -e
echo "Exit code: $EXIT_CODE"

cp $(bazel info server_log) ./java.log
remote_client --grpc_log grpc.log printlog > grpc.log.txt || true

Now, we see the following error reliably:

Executing genrule //:foo failed: Exec failed due to IOException: null

and in java.log:

...
Caused by: com.google.devtools.build.lib.remote.BulkTransferException
>       at com.google.devtools.build.lib.remote.RemoteCache.waitForBulkTransfer(RemoteCache.java:227)
>       at com.google.devtools.build.lib.remote.RemoteActionInputFetcher.prefetchFiles(RemoteActionInputFetcher.java:121)
>       at com.google.devtools.build.lib.exec.AbstractSpawnStrategy$SpawnExecutionContextImpl.prefetchInputs(AbstractSpawnStrategy.java:232)
>       at com.google.devtools.build.lib.remote.RemoteSpawnCache.lookup(RemoteSpawnCache.java:243)
>       at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:135)
>       ... 18 more
>       Suppressed: java.io.IOException: Output download failed: Expected digest '8a39d2abd3999ab73c34db2476849cddf303ce389b35826850f9a700589b4a90/262144' does not match received digest 'dae4d3dce1bb0a9414f61a65ee07622fa225ec01e6efe6df6e78f9ad5c58480d/327680'.
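
The failing check (Utils.verifyBlobContents in the stack trace above) amounts to comparing a SHA-256 hash and a byte count against the expected digest. A minimal Python analogue, purely illustrative and not Bazel's actual Java code:

```python
import hashlib

def verify_blob(expected_hash: str, expected_size: int, data: bytes) -> None:
    """Raise if the received bytes do not match the expected digest.

    Illustrative sketch of the check that produces the
    'Output download failed' message; not Bazel's actual
    Utils.verifyBlobContents implementation.
    """
    actual_hash = hashlib.sha256(data).hexdigest()
    if (actual_hash, len(data)) != (expected_hash, expected_size):
        raise IOError(
            f"Output download failed: Expected digest "
            f"'{expected_hash}/{expected_size}' does not match received "
            f"digest '{actual_hash}/{len(data)}'.")
```

Note that the digest is a hash/size pair, which is why the error message above reveals not just a content mismatch but a size difference.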

Root cause analysis

This issue is also present at bazel HEAD.

I bisected bazel, and the issue started happening when grpc-java was upgraded from 1.26.0 to 1.31.1. (Bazel HEAD and 4.0 are currently using grpc-java 1.32.2.)

The issue is fixed if I upgrade grpc-java to 1.33.1 or 1.34.1.

Then I bisected grpc-java, and the issue was fixed by grpc/grpc-java@7ca6c02

In the related grpc/grpc-java#7105 there was a comment stating:

Note: a user has noticed a bug where onMessage() could be called after onClose() on client-side (b/169277773). The problem was present starting in 1.27 and fixed in 1.33, so we strongly believe it was this bug.

That means this was even more serious of a bug than previously believed. I'm very surprised more didn't break. Maybe more broke, and people just couldn't debug it. I'm going to update the release notes to mention it can produce weird callback orders.

And indeed this is what is happening in bazel too! I added prints to src/main/java/com/google/devtools/build/lib/remote/GrpcCacheClient.java#requestRead and observed that onNext gets called after onError, which screws up offset bookkeeping. The result is that we end up downloading more bytes than the CAS blob actually has. This is already visible in @federicohml's original bug report, which has

Suppressed: java.io.IOException: Output download failed: Expected digest
  'da191677a1b73bbf16d06669b32ec2a3b65fdcb316f833429ee8446b18b221dd/400451240' does not match received digest
  '07b7519d3f1492ed9f64ff8396487c340a0608dacc0b05fcffcf06149c7e19ff/401499816'.

Note how the received digest is larger in size than the expected digest!
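
The offset corruption can be sketched with a toy model (Python, purely illustrative; the class and callback names do not match Bazel's actual GrpcCacheClient). With the repro's 256 KiB blob arriving as four 64 KiB chunks, one stray onNext after the error inflates the received size from 262144 to 327680 bytes, exactly the size pair seen in the repro's error message:

```python
import hashlib

class ChunkDownloader:
    """Toy model of a streaming CAS read: tracks a byte offset and a
    running digest across received chunks. Illustrative only; this is
    not Bazel's actual code."""

    def __init__(self, expected_size: int):
        self.expected_size = expected_size
        self.digest = hashlib.sha256()
        self.offset = 0
        self.error = None

    def on_next(self, chunk: bytes) -> None:
        # In grpc-java 1.27 through 1.32, this callback could still fire
        # after on_error; with no guard here, the offset grows past the
        # blob's real size and the digest covers duplicated bytes.
        self.digest.update(chunk)
        self.offset += len(chunk)

    def on_error(self, err: Exception) -> None:
        self.error = err  # a retry would resume from self.offset

# Four 64 KiB chunks complete the 256 KiB blob from the repro.
dl = ChunkDownloader(expected_size=262144)
for _ in range(4):
    dl.on_next(b"\0" * 65536)
dl.on_error(TimeoutError("deadline exceeded"))
dl.on_next(b"\0" * 65536)  # stray message delivered after the error
print(dl.offset)  # 327680 bytes received for a 262144-byte blob
```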

scele added a commit to scele/bazel that referenced this issue Feb 18, 2021
Before this change, BulkTransferExceptions would be reported like this:

  Executing genrule //:foo failed: Exec failed due to IOException: null

By implementing getMessage(), we get a more informative message, in the case
when there is only one suppressed exception:

  Executing genrule //:foo failed: Exec failed due to IOException:
    Output download failed: Expected digest
    '8a39d2abd3999ab73c34db2476849cddf303ce389b35826850f9a700589b4a90/262144'
    does not match received digest
    'dae4d3dce1bb0a9414f61a65ee07622fa225ec01e6efe6df6e78f9ad5c58480d/327680'.

and in the case when there are multiple suppressed exceptions:

  Executing genrule //:foo failed: Exec failed due to IOException:
    2 errors during bulk transfer

This helps diagnose issues like bazelbuild#12927
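
The message-selection logic described in this commit can be sketched as follows (Python, illustrative only; the actual fix implements getMessage() on Bazel's Java BulkTransferException):

```python
class BulkTransferError(Exception):
    """Sketch of the reporting fix: with one underlying failure,
    surface its message directly; with several, report a count.
    Illustrative Python, not Bazel's Java class."""

    def __init__(self, suppressed):
        self.suppressed = list(suppressed)
        super().__init__(self._message())

    def _message(self) -> str:
        if len(self.suppressed) == 1:
            # Single failure: show the underlying error instead of "null".
            return str(self.suppressed[0])
        return f"{len(self.suppressed)} errors during bulk transfer"

print(BulkTransferError([IOError("Output download failed: digest mismatch")]))
print(BulkTransferError([IOError("a"), IOError("b")]))
```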
scele added a commit to scele/bazel that referenced this issue Feb 24, 2021
Part 1: add v1.34.1 version to third_party/grpc
Note: partly switches to v1.34.1 too as not all bits are versioned and
      some of unversioned bits are used from other third_party targets

grpc-java versions 1.27 through 1.32 had a bug where messages could arrive
after the call was reported closed.  In the case of bazel, this meant that
in GrpcCacheClient, onNext could be called after onError.  This leads to
offset bookkeeping getting out of sync, and corrupts the CAS blob download.

bazelbuild#12927
scele added a commit to scele/bazel that referenced this issue Feb 24, 2021
Part 2: Switch to grpc 1.34.1.

grpc-java versions 1.27 through 1.32 had a bug where messages could arrive
after the call was reported closed.  In the case of bazel, this meant that
in GrpcCacheClient, onNext could be called after onError.  This leads to
offset bookkeeping getting out of sync, and corrupts the CAS blob download.

bazelbuild#12927
scele added a commit to scele/bazel that referenced this issue Feb 24, 2021
Part 3: remove 1.32.x from third_party/grpc.

grpc-java versions 1.27 through 1.32 had a bug where messages could arrive
after the call was reported closed.  In the case of bazel, this meant that
in GrpcCacheClient, onNext could be called after onError.  This leads to
offset bookkeeping getting out of sync, and corrupts the CAS blob download.

bazelbuild#12927
scele added a commit to scele/bazel that referenced this issue Feb 24, 2021
Part 1: add v1.33.1 version to third_party/grpc
Note: partly switches to v1.33.1 too as not all bits are versioned and
      some of unversioned bits are used from other third_party targets

grpc-java versions 1.27 through 1.32 had a bug where messages could arrive
after the call was reported closed.  In the case of bazel, this meant that
in GrpcCacheClient, onNext could be called after onError.  This leads to
offset bookkeeping getting out of sync, and corrupts the CAS blob download.

bazelbuild#12927
scele added a commit to scele/bazel that referenced this issue Feb 24, 2021
Part 2: Switch to grpc 1.33.1.

grpc-java versions 1.27 through 1.32 had a bug where messages could arrive
after the call was reported closed.  In the case of bazel, this meant that
in GrpcCacheClient, onNext could be called after onError.  This leads to
offset bookkeeping getting out of sync, and corrupts the CAS blob download.

bazelbuild#12927
bazel-io pushed a commit that referenced this issue Feb 24, 2021
Before this change, BulkTransferExceptions would be reported like this:
```
  Executing genrule //:foo failed: Exec failed due to IOException: null
```
By implementing getMessage(), we get a more informative message, in the case
when there is only one suppressed exception:
```
  Executing genrule //:foo failed: Exec failed due to IOException:
    Output download failed: Expected digest '8a39d2abd3999ab73c34db2476849cddf303ce389b35826850f9a700589b4a90/262144'
    does not match received digest 'dae4d3dce1bb0a9414f61a65ee07622fa225ec01e6efe6df6e78f9ad5c58480d/327680'.
```
and in the case when there are multiple suppressed exceptions:
```
  Executing genrule //:foo failed: Exec failed due to IOException:
    2 errors during bulk transfer
```
This helps diagnose issues like #12927

Closes #13061.

PiperOrigin-RevId: 359250576
@philwo philwo mentioned this issue Feb 24, 2021
bazel-io pushed a commit that referenced this issue Feb 25, 2021
Part 1: add v1.33.1 version to third_party/grpc
Note: partly switches to v1.33.1 too as not all bits are versioned and
      some of unversioned bits are used from other third_party targets

grpc-java versions 1.27 through 1.32 had a bug where messages could arrive
after the call was reported closed.  In the case of bazel, this meant that
in GrpcCacheClient, onNext could be called after onError.  This leads to
offset bookkeeping getting out of sync, and corrupts the CAS blob download.

#12927
bazel-io pushed a commit that referenced this issue Feb 26, 2021
Part 2: Switch to grpc 1.33.1.

grpc-java versions 1.27 through 1.32 had a bug where messages could arrive
after the call was reported closed.  In the case of bazel, this meant that
in GrpcCacheClient, onNext could be called after onError.  This leads to
offset bookkeeping getting out of sync, and corrupts the CAS blob download.

#12927

Closes #13105.

PiperOrigin-RevId: 359732452
bazel-io pushed a commit that referenced this issue Mar 1, 2021
Part 3: remove 1.32.x from third_party/grpc.

grpc-java versions 1.27 through 1.32 had a bug where messages could arrive
after the call was reported closed.  In the case of bazel, this meant that
in GrpcCacheClient, onNext could be called after onError.  This leads to
offset bookkeeping getting out of sync, and corrupts the CAS blob download.

#12927

Closes: #13106
@scele
Contributor

scele commented Mar 1, 2021

Fixed by #13104, #13105, #13106 - hopefully this can be closed now @federicohml?

I have asked for the fix to be included in the 4.1 release; that is tracked in #13099 (I've also rebased the fix on top of 4.0 here: https://github.com/scele/bazel/commits/issue_12927_bazel4)

philwo pushed a commit that referenced this issue Mar 15, 2021
Part 1: add v1.33.1 version to third_party/grpc
Note: partly switches to v1.33.1 too as not all bits are versioned and
      some of unversioned bits are used from other third_party targets

grpc-java versions 1.27 through 1.32 had a bug where messages could arrive
after the call was reported closed.  In the case of bazel, this meant that
in GrpcCacheClient, onNext could be called after onError.  This leads to
offset bookkeeping getting out of sync, and corrupts the CAS blob download.

#12927
philwo pushed a commit that referenced this issue Mar 15, 2021
Part 2: Switch to grpc 1.33.1.

grpc-java versions 1.27 through 1.32 had a bug where messages could arrive
after the call was reported clsoed.  In the case of bazel, this meant that
in GrpcCacheClient, onNext could be called after onError.  This leads to
offset bookkeeping getting out of sync, and corrupts the CAS blob download.

#12927
philwo pushed a commit that referenced this issue Mar 15, 2021
Part 3: remove 1.32.x from third_party/grpc.

grpc-java versions 1.27 through 1.32 had a bug where messages could arrive
after the call was reported clsoed.  In the case of bazel, this meant that
in GrpcCacheClient, onNext could be called after onError.  This leads to
offset bookkeeping getting out of sync, and corrupts the CAS blob download.

#12927

Closes: #13106
philwo pushed a commit that referenced this issue Mar 15, 2021
Before this change, BulkTransferExceptions would be reported like this:
```
  Executing genrule //:foo failed: Exec failed due to IOException: null
```
By implementing getMessage(), we get a more informative message, in the case
when there is only one suppressed exception:
```
  Executing genrule //:foo failed: Exec failed due to IOException:
    Output download failed: Expected digest '8a39d2abd3999ab73c34db2476849cddf303ce389b35826850f9a700589b4a90/262144'
    does not match received digest 'dae4d3dce1bb0a9414f61a65ee07622fa225ec01e6efe6df6e78f9ad5c58480d/327680'.
```
and in the case when there are multiple suppressed exceptions:
```
  Executing genrule //:foo failed: Exec failed due to IOException:
    2 errors during bulk transfer
```
This helps diagnose issues like #12927

Closes #13061.

PiperOrigin-RevId: 359250576
philwo pushed a commit that referenced this issue Mar 15, 2021
When using the dynamic scheduler, local actions may get interrupted or
cancelled when the remote strategy is faster (e.g., remote cache hit).
Ordinarily this isn't a problem, except when the local action is sharing
a file download future with another local action. The interrupted thread
of the local action cancels the future, and causes a CancellationException
when the other local action thread tries to retrieve it.

This resolves that problem by not letting threads/callers share the same
future instance. The shared download future is only cancelled if all
callers have requested cancellation.

Fixes #12927.

PiperOrigin-RevId: 362009791
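
The cancellation scheme described in this commit can be sketched as a reference count over a shared download (Python, illustrative only; class and method names are hypothetical, not Bazel's actual RemoteActionInputFetcher API):

```python
import threading

class SharedDownload:
    """Sketch of the fix: callers share one underlying download but hold
    separate handles, and the download is cancelled only once every
    outstanding handle has been cancelled. Hypothetical names."""

    def __init__(self):
        self._lock = threading.Lock()
        self._handles = 0
        self.cancelled = False

    def new_handle(self) -> None:
        # Each local action requesting this file takes its own handle.
        with self._lock:
            self._handles += 1

    def cancel_handle(self) -> bool:
        # Returns True only when the last outstanding handle cancels;
        # an interrupted dynamic-scheduler branch no longer tears down
        # a download another action is still waiting on.
        with self._lock:
            self._handles -= 1
            if self._handles == 0:
                self.cancelled = True
            return self.cancelled
```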