[6.3.0] Use failure_rate instead of failure count for circuit breaker #18559

Merged
iancha1992 merged 7 commits into bazelbuild:release-6.3.0 on Jun 13, 2023

Conversation

amishra-u
Contributor

@amishra-u amishra-u commented Jun 1, 2023

Continuation of #18359
I ran multiple experiments to find the optimal failure threshold and failure window interval with different remote_timeout values, for a healthy remote cache, a semi-healthy (overloaded) remote cache, and an unhealthy remote cache.
As I described here, even with a healthy remote cache the circuit still tripped 5-10% of the time, and we were not getting the best result.

Issues related to the failure count:

  1. When the remote cache is healthy, builds are fast, and Bazel makes a high number of calls to the buildfarm. As a result, even with a moderate failure rate, the failure count may exceed the threshold.
  2. Additionally, write calls, which have a higher probability of failure compared to other calls, are batched immediately after the completion of an action's build. This further increases the chances of breaching the failure threshold within the defined window interval.
  3. On the other hand, when the remote cache is unhealthy or semi-healthy, builds are significantly slowed down, and Bazel makes fewer calls to the remote cache.

Finding a configuration that works well for both healthy and unhealthy remote caches was not feasible. Therefore, I changed the approach to use the failure rate and easily found a configuration that worked effectively in both scenarios.
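
To make the difference concrete, here is a minimal Java sketch comparing the two trip conditions. The call volumes, the count threshold of 50, and the 10% rate threshold are made-up numbers for illustration, not values from the experiments. `TripConditions` is a hypothetical helper, not the code in this PR.

```java
// Illustrative only: a hypothetical helper, not Bazel's circuit breaker code.
final class TripConditions {
  // Count-based: trip once failures in the window exceed a fixed number.
  static boolean tripsOnCount(int failures, int maxFailures) {
    return failures > maxFailures;
  }

  // Rate-based: trip once failures exceed a fraction of all calls in the window.
  static boolean tripsOnRate(int failures, int totalCalls, double maxFailureRate) {
    if (totalCalls == 0) {
      return false; // no calls yet, nothing to judge
    }
    return (double) failures / totalCalls > maxFailureRate;
  }

  public static void main(String[] args) {
    // Healthy cache (assumed numbers): many calls, low failure rate.
    int healthyCalls = 10_000;
    int healthyFailures = 100; // 1% of calls fail
    // Unhealthy cache (assumed numbers): few calls, high failure rate.
    int unhealthyCalls = 100;
    int unhealthyFailures = 40; // 40% of calls fail

    System.out.println(tripsOnCount(healthyFailures, 50));   // true: healthy cache trips
    System.out.println(tripsOnCount(unhealthyFailures, 50)); // false: unhealthy cache is missed
    System.out.println(tripsOnRate(healthyFailures, healthyCalls, 0.10));     // false
    System.out.println(tripsOnRate(unhealthyFailures, unhealthyCalls, 0.10)); // true
  }
}
```

With these assumed numbers, no single count threshold separates the two cases in the right direction, while a rate threshold does.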

Closes #18539
commit 10fb5f6

amishra-u and others added 5 commits May 30, 2023 15:47
Copy of bazelbuild#18120: I accidentally closed bazelbuild#18120 during a rebase and don't have permission to reopen it.

We have noticed that any problems with the remote cache have a detrimental effect on build times. On investigation we found that the interface for the circuit breaker was left unimplemented.

To address this issue, I implemented a failure circuit breaker, which includes three new Bazel flags: 1) experimental_circuitbreaker_strategy, 2) experimental_remote_failure_threshold, and 3) experimental_remote_failure_window.

In this implementation, I have implemented the failure strategy for the circuit breaker and used the failure count to trip the circuit.

Reasoning behind using failure count instead of failure rate: To measure the failure rate I also need the success count. Both the failure and success counts would need to be AtomicIntegers, as both will be modified concurrently by multiple threads. Even though getAndIncrement is a very lightweight operation, at a very high request rate it might contribute to latency.
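
As a rough illustration of the bookkeeping a failure rate needs, the sketch below keeps two `AtomicInteger` counters and derives the rate from them. `CallCounters` and its method names are hypothetical and this is not the code from the PR. It only shows that every call, successful or not, costs one atomic increment.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the two counters a failure rate needs; not the PR's code.
final class CallCounters {
  private final AtomicInteger successes = new AtomicInteger();
  private final AtomicInteger failures = new AtomicInteger();

  void recordSuccess() {
    successes.getAndIncrement();
  }

  void recordFailure() {
    failures.getAndIncrement();
  }

  // Failure rate in [0, 1]; 0 when nothing has been recorded yet.
  double failureRate() {
    int failed = failures.get();
    int total = failed + successes.get();
    return total == 0 ? 0.0 : (double) failed / total;
  }

  public static void main(String[] args) throws InterruptedException {
    CallCounters counters = new CallCounters();
    Thread[] workers = new Thread[4];
    for (int t = 0; t < workers.length; t++) {
      // Each worker records 1,000 calls, every tenth one as a failure.
      workers[t] = new Thread(() -> {
        for (int i = 0; i < 1_000; i++) {
          if (i % 10 == 0) {
            counters.recordFailure();
          } else {
            counters.recordSuccess();
          }
        }
      });
      workers[t].start();
    }
    for (Thread worker : workers) {
      worker.join();
    }
    System.out.println(counters.failureRate()); // 400 failures out of 4,000 calls
  }
}
```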

Reasoning behind using failure circuit breaker: A new instance of Retrier.CircuitBreaker is created for each build. Therefore, if the circuit breaker trips during a build, the remote cache will be disabled for that build. However, it will be enabled again for the next build, as a new instance of Retrier.CircuitBreaker will be created. If needed, we may also add a cool-down strategy in the future, e.g. failure_and_cool_down_strategy.
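
The per-build lifecycle described above can be sketched as follows. `BuildScopedBreaker` is a hypothetical stand-in for an implementation of Retrier.CircuitBreaker, whose real interface is not shown in this thread, and the threshold of 3 failures is an arbitrary value for the example.

```java
// Hypothetical stand-in for a per-build circuit breaker; not Bazel's actual class.
final class BuildScopedBreaker {
  private final int maxFailures;
  private int failures;
  private boolean open; // once open, stays open for this instance's lifetime

  BuildScopedBreaker(int maxFailures) {
    this.maxFailures = maxFailures;
  }

  synchronized void recordFailure() {
    failures++;
    if (failures > maxFailures) {
      open = true; // no cool-down: this instance never closes again
    }
  }

  synchronized boolean remoteCacheEnabled() {
    return !open;
  }

  public static void main(String[] args) {
    // First build: four failures exceed the assumed threshold of 3.
    BuildScopedBreaker firstBuild = new BuildScopedBreaker(3);
    for (int i = 0; i < 4; i++) {
      firstBuild.recordFailure();
    }
    System.out.println(firstBuild.remoteCacheEnabled()); // false

    // Next build: a fresh instance starts with the remote cache enabled.
    BuildScopedBreaker nextBuild = new BuildScopedBreaker(3);
    System.out.println(nextBuild.remoteCacheEnabled()); // true
  }
}
```

Because the state lives in the instance, creating a new breaker for each build acts as the reset, so no cool-down logic is needed.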

closes bazelbuild#18136

Closes bazelbuild#18359.

PiperOrigin-RevId: 536349954
Change-Id: I5e1c57d4ad0ce07ddc4808bf1f327bc5df6ce704
@amishra-u amishra-u changed the base branch from master to release-6.3.0 June 1, 2023 18:23
@amishra-u amishra-u marked this pull request as ready for review June 8, 2023 00:39
@iancha1992 iancha1992 requested a review from coeuvre June 8, 2023 17:27
@iancha1992 iancha1992 added the awaiting-review and team-Remote-Exec labels Jun 8, 2023
@iancha1992 iancha1992 enabled auto-merge (squash) June 8, 2023 17:31
@iancha1992 iancha1992 removed the awaiting-review label Jun 13, 2023
@iancha1992 iancha1992 merged commit e802842 into bazelbuild:release-6.3.0 Jun 13, 2023
28 checks passed
copybara-service bot pushed a commit that referenced this pull request Jul 24, 2023
Baseline:  758b44d

Release Notes:

+ Automatic code cleanup. (#18417)
+ Update CODEOWNERS for 6.3.0 (#18369)
+ Overrides specified by non-root modules no longer cause an error, and are silently ignored instead. They were originally treated as an error to allow for the future possibility of overrides in the transitive dependency graph working together; but we've deemed that infeasible (and even if it was, it'd be so complicated and confusing to users that it would not be a good addition). (#18388)
+ Add implementation deps support for Objective-C (#18372)
+ Update release notes scripts (#18400)
+ Prevent CredentialHelperEnvironment crash when invoking Bazel outside of a workspace. (#18430)
+ Use wall-time for credential helper invalidation (#18413)
+ blaze_util_posix: handle killpg failures (#18403)
+ Pass version to java_runtimes created by local_java_repository (#18415)
+ Add jsonproto option to query --output flag (#18438)
+ Don't eagerly flatten a `NestedSet` in `RepoMappingManifestAction` (#18419)
+ rules_go & rules_python are failing in Downstream CI with Bazel@HEAD (#18447)
+ Move credential helper setup into remote_helpers.sh so it can be reused by other shell tests. (#18453)
+ Wire credential helper to repository fetching. (#18429)
+ Updates/fixes to relnotes script (#18470)
+ Report percentual download progress in repository rules (#18471)
+ Support remote symlink outputs when building without the bytes. (#18476)
+ Enrich local BEP upload errors with file path and digest possible. (#18481)
+ Set `GTEST_SHARD_STATUS_FILE` in test setup (#18482)
+ Fix relnotes script (#18491)
+ Fix Xcode 14.3 compatibility (#18490)
+ Fix #18493. (#18514)
+ Extend the credential helper default timeout to 10s. (#18527)
+ Fix formatting of release notes (#18534)
+ Use extension rather than local names in ModuleExtensionMetadata (#18536)
+ [credentialhelper] Ignore all errors when writing stdin (#18540)
+ Improve error on invalid `-//foo` and `-@repo//foo` options (#18516)
+ Implement failure circuit breaker (#18541)
+ Actually check `TEST_SHARD_STATUS_FILE` has been touched (#18418)
+ Ignore hash string casing (#18414)
+ Error if repository name isn't supplied (#18425)
+ Track repo rule label attributes after the first non-existent one (#18412)
+ Add ServerCapabilities into RemoteExecutionClient (#18442)
+ RemoteExecutionService: support output_symlinks in ActionResult (#18441)
+ RemoteExecutionService: Action.Command to set output_paths (#18440)
+ Use local_termination_grace_seconds when testing LinuxSandbox availability (#18568)
+ Fix dangling string literal in `extension_metadata` docs (#18598)
+ Include actual MODULE.bazel location in stack traces (#18612)
+ Make cpp file extensions case sensitive again (#18552)
+ Fix error when script is run after the final tag is created. (#18638)
+ Fix WORKSPACE toolchain resolution with `--enable_bzlmod` (#18649)
+ Add `ActionExecutionMetadata` as a parameter to `ActionInputPrefetcher#prefetchFiles`. (#18656)
+ Use failure_rate instead of failure count for circuit breaker  (#18559)
+ Update ignored_error logic for circuit_breaker (#18662)
+ Don't rewind the build if invocation id stays the same (#18670)
+ Fix potential memory leak in UI (#18659)
+ Test that a credential helper can supply credentials for bzlmod. (#18663)
+ Add flag --experimental_collect_code_coverage_for_generated_files. (#18664)
+ Options specified on the pseudo-command `common` in `.rc` files are now ignored by commands that do not support them as long as they are valid options for *any* Bazel command. Previously, commands that did not support all options given for `common` would fail to run. These previous semantics of `common` are now available via the new `always` pseudo-command. Closes #18130. (#18609)
+ Fix split post-processing of LLVM-based coverage (#18737)
+ Allow module extension usages to be isolated (#18727)
+ BEGIN_PUBLIC (#18729)
+ Declare credential helpers to be a stable feature. (#18752)
+ Add a new provider for injecting native libs in android_binary (#18753)
+ Properly handle invalid credential files (#18779)
+ The REPO.bazel and MODULE.bazel files are now also considered workspace boundary markers. (#18787)
+ Report remote execution messages as events (#18780)
+ Fail on isolated extension usages without imports (#18793)
+ Add changes to cc_shared_library from head to 6.3 (#18606)
+ Remove option to disable FJP. (#18791)
+ Update to latest turbine version (#18803)
+ None. None (#18808)
+ Wait for outputs downloads before emitting local BEP events that reference these outputs. (#18815)
+ Perform builtins injection for WORKSPACE-loaded bzl files. (#18819)
+ Fix non-declared symlink issue for local actions when BwoB. (#18817)
+ Make grep_includes optional inside cc_common.register_linkstamp_compile_action (#18823)
+ add feature on windows toolchain with right tag (#18654)
+ coverage_common.instrumented_files_info now has a metadata_files argument (#18838)
+ Download directory output for test actions (#18846)
+ Teach DexMapper to not separate synthetic classes from their context … (#18853)
+ **[Incompatible]** query --output=proto --order_output=deps now returns targets in topological order (previously there was no ordering). (#18870)
+ Revert "Don't eagerly flatten a `NestedSet` in `RepoMappingManifestAction` (#18419)" (#18886)
+ Additional source inputs can now be specified for compilation in cc_library targets using the additional_compiler_inputs attribute, and these inputs can be used in the $(location) function. Fixes #18766. (#18882)
+ Open-source Google test `ConvenienceSymlinkTest` (#18890)
+ Update Error Prone to 2.20.0 (#18885)
+ Check if json.gz files exist, not the gcov version. (#18889)
+ Lockfile updates (#18894)
+ handle exception instead of crashing (#18895)
+ Add a new provider for passing dex related artifacts in android_binary (#18899)
+ Prevent most side effects of yanked modules (#18908)
+ Restore the classic desugar tool in the Bazel 6.3.0 branch so that the Bazel Android tools can be built for 6.3.0 without breaking backwards compatibility (#18909)
+ Update java_tools to v12.5 (#18868)
+ Add ActionCacheStatistics to BEP (#18914)
+ Adjust --top_level_targets_for_symlinks (#18916)
+ Track dev/non-dev `use_extension` calls (#18918)
+ Overrides specified by non-root modules no longer cause an error, and are silently ignored instead. They were originally treated as an error to allow for the future possibility of overrides in the transitive dependency graph working together; but we've deemed that infeasible (and even if it was, it'd be so complicated and confusing to users that it would not be a good addition). (#18921)
+ Rollforward of https://github.com/bazelbuild/bazel/commit/482d2be27ab… (#18773)
+ Update Android tools to 0.27.2 for fixes to DexMapper for https://gith... (#18891)
+ Report dev/non-dev deps imported via non-dev/dev usages (#18922)
+ Add reverted 'isolate' changes (#18928)
+ Identify isolated extensions by exported name (#18923)
+ test-setup.sh: Attempt to raise the original signal once more (#18932)
+ Ignore broken classic desugar tests (#18933)
+ Disable UseCorrectAssertInTests by default (#18948)
+ Fix VS 2022 autodetection (#18960)
+ Fix absolute file paths showing up in lockfiles (#18993)
+ Add support for isolated extension usages to the lockfile (#19008)

Acknowledgements:

This release contains contributions from many people at Google, as well as amishra-u, Andreas Herrmann, Andy Hamon, andyrinne12, Benjamin Lee, Benjamin Peterson, Brentley Jones, Chirag Ramani, Christopher Rydell, Daniel Wagner-Hall, Ed Schouten, Fabian Brandstetter, Fabian Meumertzheim, Greg, Ivan Golub, Jon Landis, JY Lin, Kai Zhang, Keith Smiley, kotlaja, lripoche, oquenchil, Pavan Singh, Rasrack, Son Luong Ngoc, Takeo Sawada, Vertexwahn, Xùdōng Yáng, Yannic.
iancha1992 pushed a commit that referenced this pull request Jul 24, 2023
Labels
team-Remote-Exec Issues and PRs for the Execution (Remote) team
Development

Successfully merging this pull request may close these issues.

bazel run test_target doesn't convey test exit code
3 participants