Build failing on remote cache problems unexpectedly #22119

Open
guw opened this issue Apr 25, 2024 · 2 comments
Labels: P1 (I'll work on this now; assignee required), team-Remote-Exec (issues and PRs for the Execution (Remote) team), type: bug

Comments

@guw
Contributor

guw commented Apr 25, 2024

Description of the bug:

Our build is unreliable. The culprit seems to be remote cache problems (we use Google Cloud Storage).

(22:37:24) ERROR: /...BUILD.bazel:3:17: Building ....jar (229 source files, 1 source jar) failed: unable to finalize action: Connection reset
...
(22:37:25) ERROR: Build did NOT complete successfully

This is unexpected because we have the following in .bazelrc:

common --remote_local_fallback

Remote cache reliability issues should not impact the Bazel build. In particular, intermittent network issues should not fail a build, because failed builds are expensive. If a cache upload or download fails, the build should treat the remote cache as unreliable and continue without it.
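
To make the expectation concrete, here is a minimal sketch of the behavior described above. It is a toy Python model, not Bazel's implementation: `FlakyCache`, `run_action`, and the `warnings` list are hypothetical names introduced only for illustration. The point is that a failed cache download or upload is downgraded to a warning and the action still completes locally.

```python
from typing import Callable, List, Optional


class FlakyCache:
    """Hypothetical stand-in for a remote cache whose connection keeps resetting."""

    def get(self, key: str) -> Optional[str]:
        raise ConnectionResetError("Connection reset")

    def put(self, key: str, value: str) -> None:
        raise ConnectionResetError("Connection reset")


def run_action(
    key: str,
    execute_locally: Callable[[], str],
    cache: FlakyCache,
    warnings: List[str],
) -> str:
    """Run one action, treating every remote cache failure as non-fatal."""
    # Try to download a cached result; a network error is only a warning.
    try:
        hit = cache.get(key)
        if hit is not None:
            return hit
    except OSError as err:  # ConnectionResetError is a subclass of OSError
        warnings.append(f"Remote Cache: {err}")

    # Cache miss or cache failure: do the work locally.
    result = execute_locally()

    # Try to upload the result; again, a failure must not fail the action.
    try:
        cache.put(key, result)
    except OSError as err:
        warnings.append(f"Remote Cache: {err}")

    return result


warnings: List[str] = []
result = run_action("some-action-digest", lambda: "jar-bytes", FlakyCache(), warnings)
print(result, warnings)
```

In this model both the download and the upload fail, yet the action returns its locally built result and only two warnings are recorded. That is the outcome we expected from the real build.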

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

release 7.1.1

@guw guw changed the title from "Build failing on remote cache problems with --remote_local_fallback" to "Build failing on remote cache problems unexpectedly" Apr 25, 2024
@iancha1992 iancha1992 added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Apr 25, 2024
@iancha1992
Member

@guw Could you please provide complete steps to reproduce this issue?

@guw
Contributor Author

guw commented Apr 26, 2024

@iancha1992 I am not sure how. This seems to depend on network issues within Google Cloud. All we do is run bazel build //... from a Google Cloud compute instance, with a GCS bucket as the remote cache.

We did notice one detail: it seems to fail only when compiling unit tests (within java_test). We see multiple connection reset problems in the build logs for other compile actions, and most of them seem to recover.

Example:

(19:24:39) WARNING: Remote Cache: Connection reset
 com.google.devtools.build.lib.remote.common.BulkTransferException: Connection reset
 	at com.google.devtools.build.lib.remote.util.RxUtils$BulkTransferExceptionCollector.onResult(RxUtils.java:112)
 	at io.reactivex.rxjava3.internal.operators.flowable.FlowableCollectSingle$CollectSubscriber.onNext(FlowableCollectSing
...
 	Suppressed: java.net.SocketException: Connection reset
 		at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:401)
 		at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:434)
 		at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:256)
 		at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
 		at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:357)
 		at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)
 		... 8 more

But it still worked:
(19:26:29) INFO: Elapsed time: 684.726s, Critical Path: 199.22s
(19:26:29) INFO: 86024 processes: 43402 remote cache hit, 24489 internal, 16892 processwrapper-sandbox, 1241 worker.
(19:26:29) INFO: Build completed successfully, 86024 total actions

But the ones that fail are usually compiling a unit test class, and they also don't print a stack trace.
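
In case it helps with triage, below is a rough sketch of how the recovered and fatal connection resets could be told apart in a saved build log. It assumes the log lines look like the excerpts quoted in this issue, that is `(HH:MM:SS) SEVERITY: message`; the function name `summarize_resets` is made up for this example and is not part of Bazel.

```python
import re
from collections import Counter

# Matches lines such as "(19:24:39) WARNING: Remote Cache: Connection reset".
LINE_RE = re.compile(r"^\((\d{2}:\d{2}:\d{2})\) (ERROR|WARNING|INFO): (.*)$")


def summarize_resets(log_text: str) -> Counter:
    """Count 'Connection reset' messages per severity in a build log."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        # Stack trace lines do not carry a timestamp prefix and are skipped.
        if match and "Connection reset" in match.group(3):
            counts[match.group(2)] += 1
    return counts


sample = """\
(19:24:39) WARNING: Remote Cache: Connection reset
 com.google.devtools.build.lib.remote.common.BulkTransferException: Connection reset
(22:37:24) ERROR: /...BUILD.bazel:3:17: Building ....jar (229 source files, 1 source jar) failed: unable to finalize action: Connection reset
(22:37:25) ERROR: Build did NOT complete successfully
"""

counts = summarize_resets(sample)
print(dict(counts))
```

A WARNING count corresponds to resets the build recovered from, while an ERROR count corresponds to resets that failed an action.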

@meisterT meisterT added P1 I'll work on this now. (Assignee required) and removed untriaged labels Apr 30, 2024