New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gax-java: Flaky BatcherImplTest.testThrottlingBlocking - Zero interactions with this mock #1193
Comments
There's been many attempts to fix it with no luck: googleapis/gax-java#1615. |
@burkedavison How often did you see this? I haven't seen it in a while, like Mike mentioned, there has been some attempts and it should be less frequent now. If this still happens pretty often, we should prioritize it again. |
So far, it's happened only once. |
Noticed a different flaky tests in
See the exact log here |
Noticed another occurrence of this flaky test. |
Another, different flaky test in
|
Noticed another occurrence of this flaky test. |
Another, different flaky test in
|
Flaked last night on Mac |
Flaked on nightly Ubuntu |
Flake: #1432 |
Java 17 on macOS-12 failed: #1614
|
Nightly build for Java 17 on macOS-12 failed. #1619
|
Nightly build for Java 11 on ubuntu-22.04 failed. #1626
|
Failed on Java8 for Windows:
|
Failed on Java8 for Windows:
|
#1957
|
#1955
|
Reproduce the issueInstall the required modules locally
IntelliJ's repeat test N times (This didn't work)I tried to run the test 100 times and it failed at 2nd run.
Maybe the test method is not meant to run twice on the same instance. Repeat 100 times with shell script (It didn't work)at sdk-platform-java repository clone directory:
It all succeeded, unfortunately. I think the sequential execution nature does not work to reproduce the issue. Run the tests concurrently (It worked)I created a project to run the test concurrently. https://github.com/suztomo/gax-batcher-impl-test-reproducer/blob/main/README.md I'm now able to reproduce the problem. Understand the testNext step: read the test code.
Next step: Where should It seems within sdk-platform-java/gax-java/gax/src/main/java/com/google/api/gax/batching/BatcherImpl.java Lines 288 to 292 in 4c74107
BatcherImpl$PushCurrentBatchRunnable 's run() method calls sendOutstanding() . PushCurrentBatchRunnable is submitted in BatcherImpl's constructor:
With debug logging, logback, and jul-to-slf4j bridge (776eb1e), I got the following logging output (a normal successful invocation):
There are three threads: main, pool-1-thread-1, pool-2-thread-1, and pool-2-thread-2. BatcherImpl$PushCurrentBatchRunnable uses pool-1-thread-1. While it tries to check the need of Simply increasing the timeout value in Mockito.verify(callContext, Mockito.timeout...) from 100 to 1000 resulted in another error:
What does that The Javadoc tells it's "for tracking batch total throttled time". sdk-platform-java/gax-java/gax/src/main/java/com/google/api/gax/batching/BatcherImpl.java Line 291 in 776eb1e
It's how much FlowController blocked. While it got 10 milliseconds lag, there's no guarantee that
When the sleeping(50 ms) before flowController.release starts earlier than Normal case:
Rare case:
To avoid such bad operation order, we can make the "sleep(50 ms)" start only after the thread of "throttle start (stopwatch start)" becomes WAITING state:
|
At the end of the test, it verifies that the batcher waited more than throttled time by capturing the argument of `callContext.withOption()`. There were two problems: # "Zero interactions with this mock" problem ``` Error: Failures: Error: BatcherImplTest.testThrottlingBlocking:899 apiCallContext.withOption( <Capturing argument>, <Capturing argument> ); Wanted 1 time: -> at com.google.api.gax.batching.BatcherImplTest.testThrottlingBlocking(BatcherImplTest.java:899) But was 0 times. ``` The callContext.withOption method is called by `BatcherImpl.sendOutstanding()` method via `BatcherImpl$PushCurrentBatchRunnable` in **another thread**. https://togithub.com/googleapis/sdk-platform-java/blob/4c741077d614093d08665e9ddd83fb0e332b7881/gax-java/gax/src/main/java/com/google/api/gax/batching/BatcherImpl.java#L206 Technically, there's no guarantee that the thread calls the withOption method within a certain timeframe, especially when we run the tests concurrently in a machine with many CPU cores (the nightly test setup). That's why we occasionally saw "Zero interactions with this mock" problem. In my experiment (https://togithub.com/suztomo/gax-batcher-impl-test-reproducer/blob/main/README.md), increasing 100 ms timeout to 1000 ms just worked good to prevent false positives. The assertion is not about timing, but about BatcherImpl recording the throttled time correctly. After fixing this problem, then I saw another problem below. # "expected to be at least: 50 but was: 48" problem ``` [ERROR] com.google.api.gax.batching.BatcherImplTest.testThrottlingBlocking_gax_test06 -- Time elapsed: 5.138 s <<< FAILURE! expected to be at least: 50 but was : 48 at com.google.api.gax.batching.BatcherImplTest.testThrottlingBlocking_gax_test06(BatcherImplTest.java:926) ``` At the end of the test, it verifies that the batcher waited more than throttled time by capturing the argument of `callContext.withOption()`. In ideal, usual scenarios, 1. In thread A, the batcher blocks at `batcher.add` (because of flowController) the thread and starts stopwatch to measure throttled time. 2. In thread B (created by the BatcherImplTest) sleeps 50 ms (`throttledTime`) 3. In thread B, BatcherImplTest calls `flowController.release()`, which wakes up the thread A. 4. The batcher records it was blocked more than 50 ms. However, in rare cases, these events happens in this order: 1. Thread B starts sleeping 50 ms. 2. In thread A the batcher blocks at `batcher.add` (because of flowController) and starts stopwatch to measure throttled time. 3. In thread B, after 50 ms sleep, BatcherImplTest calls `flowController.release()`, which wakes up the thread A. 5. The batcher records it was blocked **less than** 50 ms. This makes the test assertion fail with "_expected to be at least: 50 but was: 48_" message. To prevent this rare case, BatcherImplTest needs to wait before sleeping 50 ms in thread B. In this pull request, I add a do-while loop to check the thread A (the thread of `batcher.add`) is in Thread.State.WAITING state before making the sleeping of 50 ms. - [x] Make sure to open an issue as a [bug/issue](https://togithub.com/googleapis/gapic-generator-java/issues/new/choose) before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea - [x] Ensure the tests and linter pass - [x] Code coverage does not decrease (if any source code was changed) - [x] Appropriate docs were updated (if necessary) Fixes #1193 ☕️
The below test failed and then passed on rerun.
The text was updated successfully, but these errors were encountered: