Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gax-java: Flaky BatcherImplTest.testThrottlingBlocking - Zero interactions with this mock #1193

Closed
burkedavison opened this issue Dec 21, 2022 · 27 comments · Fixed by #2001
Closed
Assignees
Labels
priority: p3 Desirable enhancement or fix. May not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@burkedavison
Copy link
Contributor

The below test failed and then passed on rerun.

Error:  Failures: 
Error:    BatcherImplTest.testThrottlingBlocking:897 
Wanted but not invoked:
apiCallContext.withOption(
    <Capturing argument>,
    <Capturing argument>
);
-> at com.google.api.gax.batching.BatcherImplTest.testThrottlingBlocking(BatcherImplTest.java:897)
Actually, there were zero interactions with this mock.
@burkedavison burkedavison added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Dec 21, 2022
@meltsufin
Copy link
Member

There's been many attempts to fix it with no luck: googleapis/gax-java#1615.

@blakeli0
Copy link
Collaborator

@burkedavison How often did you see this? I haven't seen it in a while, like Mike mentioned, there has been some attempts and it should be less frequent now. If this still happens pretty often, we should prioritize it again.

@burkedavison
Copy link
Contributor Author

So far, it's happened only once.

@blakeli0
Copy link
Collaborator

Noticed a different flaky tests in ScheduledRetryingExecutorTest.testCancelOuterFutureBeforeStart

Error:  Tests run: 30, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 4.38 s <<< FAILURE! - in com.google.api.gax.retrying.ScheduledRetryingExecutorTest
Error:  com.google.api.gax.retrying.ScheduledRetryingExecutorTest.testCancelOuterFutureBeforeStart[Custom retry settings: false]  Time elapsed: 0.023 s  <<< FAILURE!
org.mockito.exceptions.verification.NoInteractionsWanted: 

No interactions wanted here:
-> at com.google.api.gax.retrying.AbstractRetryingExecutorTest.testCancelOuterFutureBeforeStart(AbstractRetryingExecutorTest.java:247)
But found this interaction on mock 'tracer':
-> at com.google.api.gax.retrying.FailingCallable.call(FailingCallable.java:85)
Actually, above is the only interaction with this mock.
	at com.google.api.gax.retrying.AbstractRetryingExecutorTest.testCancelOuterFutureBeforeStart(AbstractRetryingExecutorTest.java:247)

See the exact log here

@blakeli0
Copy link
Collaborator

blakeli0 commented Jan 6, 2023

Noticed another occurrence of this flaky test.

@burkedavison
Copy link
Contributor Author

burkedavison commented Jan 17, 2023

Another, different flaky test in HttpJsonDirectServerStreamingCallableTest.testServerStreaming

Error:  com.google.api.gax.httpjson.HttpJsonDirectServerStreamingCallableTest.testServerStreaming  Time elapsed: 10.068 s  <<< FAILURE!
value of: await(...)
expected to be true
	at com.google.api.gax.httpjson.HttpJsonDirectServerStreamingCallableTest.testServerStreaming(HttpJsonDirectServerStreamingCallableTest.java:211)

Build log

@burkedavison
Copy link
Contributor Author

Noticed another occurrence of this flaky test.

@burkedavison
Copy link
Contributor Author

Another flake.

@burkedavison
Copy link
Contributor Author

Another, different flaky test in AbstractRetryingExecutorTest.testCancelOuterFutureBeforeStart

Error:  Failures: 
Error:    ScheduledRetryingExecutorTest>AbstractRetryingExecutorTest.testCancelOuterFutureBeforeStart:247 
No interactions wanted here:
-> at com.google.api.gax.retrying.AbstractRetryingExecutorTest.testCancelOuterFutureBeforeStart(AbstractRetryingExecutorTest.java:247)
But found this interaction on mock 'tracer':
-> at com.google.api.gax.retrying.FailingCallable.call(FailingCallable.java:85)
Actually, above is the only interaction with this mock.

Build log

@burkedavison
Copy link
Contributor Author

Another flake.

@burkedavison
Copy link
Contributor Author

Flaked last night on Mac

@burkedavison
Copy link
Contributor Author

Flaked on nightly Ubuntu

@burkedavison
Copy link
Contributor Author

burkedavison commented Mar 1, 2023

Flake on Mac Java 11 and Java 17

@burkedavison
Copy link
Contributor Author

Flake: #1432

@blakeli0
Copy link
Collaborator

Two more flaky tests on nightly Mac tests: #1566 and #1579

@alicejli
Copy link
Contributor

alicejli commented Apr 7, 2023

Java 17 on macOS-12 failed: #1614

Error:  Tests run: 25, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 39.922 s <<< FAILURE! - in com.google.api.gax.batching.BatcherImplTest
Error:  com.google.api.gax.batching.BatcherImplTest.testThrottlingBlocking  Time elapsed: 0.357 s  <<< FAILURE!
adding elements to batcher should be blocked by FlowControlled
	at com.google.api.gax.batching.BatcherImplTest.testThrottlingBlocking(BatcherImplTest.java:885)

@alicejli
Copy link
Contributor

Nightly build for Java 11 on macOS-12 failed. #1620

Error:  Failures: 
Error:    BatcherImplTest.testThrottlingBlocking:885 adding elements to batcher should be blocked by FlowControlled
[INFO] 
Error:  Tests run: 530, Failures: 1, Errors: 0, Skipped: 0

#1620

@alicejli
Copy link
Contributor

Nightly build for Java 17 on macOS-12 failed. #1619

Error:  Failures: 
Error:    BatcherImplTest.testThrottlingBlocking:898 
Wanted but not invoked:
apiCallContext.withOption(
    <Capturing argument>,
    <Capturing argument>
);
-> at com.google.api.gax.batching.BatcherImplTest.testThrottlingBlocking(BatcherImplTest.java:898)
Actually, there were zero interactions with this mock.

#1619

@alicejli
Copy link
Contributor

Nightly build for Java 11 on ubuntu-22.04 failed. #1626

Error:  com.google.api.gax.batching.BatcherImplTest.testThrottlingBlocking  Time elapsed: 0.094 s  <<< FAILURE!
Wanted but not invoked:
apiCallContext.withOption(
    <Capturing argument>,
    <Capturing argument>
);
-> at com.google.api.gax.batching.BatcherImplTest.testThrottlingBlocking(BatcherImplTest.java:898)
Actually, there were zero interactions with this mock.

	at com.google.api.gax.batching.BatcherImplTest.testThrottlingBlocking(BatcherImplTest.java:898)

#1626

@alicejli
Copy link
Contributor

Failed on Java8 for Windows:

Error:  Failures: 
Error:    BatcherImplTest.testThrottlingBlocking:899 
apiCallContext.withOption(
    <Capturing argument>,
    <Capturing argument>
);
Wanted 1 time:
-> at com.google.api.gax.batching.BatcherImplTest.testThrottlingBlocking(BatcherImplTest.java:899)
But was 0 times.

#1838

@alicejli
Copy link
Contributor

alicejli commented Sep 6, 2023

Failed on Java8 for Windows:

Error:  Failures: 
Error:    BatcherImplTest.testThrottlingBlocking:899 
apiCallContext.withOption(
    <Capturing argument>,
    <Capturing argument>
);
Wanted 1 time:
-> at com.google.api.gax.batching.BatcherImplTest.testThrottlingBlocking(BatcherImplTest.java:899)
But was 0 times.

[INFO] 
Error:  Tests run: 552, Failures: 1, Errors: 0, Skipped: 0

@alicejli
Copy link
Contributor

alicejli commented Sep 6, 2023

#1957
Failed on Java 11 for Windows:

Error:  Failures: 
Error:    BatcherImplTest.testThrottlingBlocking:899 
Wanted but not invoked:
apiCallContext.withOption(
    <Capturing argument>,
    <Capturing argument>
);
-> at com.google.api.gax.batching.BatcherImplTest.testThrottlingBlocking(BatcherImplTest.java:899)
Actually, there were zero interactions with this mock.

@alicejli
Copy link
Contributor

alicejli commented Sep 6, 2023

#1955
Fialed on Java 8 for Windows:

Error:  Failures: 
Error:    BatcherImplTest.testThrottlingBlocking:899 
apiCallContext.withOption(
    <Capturing argument>,
    <Capturing argument>
);
Wanted 1 time:
-> at com.google.api.gax.batching.BatcherImplTest.testThrottlingBlocking(BatcherImplTest.java:899)
But was 0 times.

@lqiu96 lqiu96 changed the title gax-java: Flaky BatcherImplTest.testThrottlingBlocking gax-java: Flaky BatcherImplTest.testThrottlingBlocking - Zero interactions with this mock Sep 7, 2023
@suztomo
Copy link
Member

suztomo commented Sep 13, 2023

Reproduce the issue

Install the required modules locally

~/sdk-platform-java $ mvn -am -pl gax-java/gax install

IntelliJ's repeat test N times (This didn't work)

I tried to run the test 100 times and it failed at 2nd run.

Screenshot 2023-09-12 at 11 14 52 PM
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@497486b3[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@736caf7a[Wrapped task = com.google.api.gax.batching.BatcherImpl$PushCurrentBatchRunnable@4ce7fffa]] rejected from java.util.concurrent.ScheduledThreadPoolExecutor@63f8276e[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 74]

	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:340)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor.scheduleWithFixedDelay(ScheduledThreadPoolExecutor.java:680)
	at java.base/java.util.concurrent.Executors$DelegatedScheduledExecutorService.scheduleWithFixedDelay(Executors.java:788)
	at com.google.api.gax.batching.BatcherImpl.<init>(BatcherImpl.java:206)
	at com.google.api.gax.batching.BatcherImplTest.testThrottlingBlocking(BatcherImplTest.java:855)

Maybe the test method is not meant to run twice on the same instance.

Repeat 100 times with shell script (It didn't work)

at sdk-platform-java repository clone directory:

$ repeat 100 mvn -pl gax-java/gax test -Dtest="BatcherImplTest#testThrottlingBlocking" -Dfmt.skip

It all succeeded, unfortunately. I think the sequential execution nature does not work to reproduce the issue.

Run the tests concurrently (It worked)

I created a project to run the test concurrently.

https://github.com/suztomo/gax-batcher-impl-test-reproducer/blob/main/README.md

I'm now able to reproduce the problem.

Understand the test

Next step: read the test code.

  • Batcher: "Represents a batching context where individual elements will be accumulated and flushed in a large batch request at some point in the future. The buffered elements can be flushed manually or when triggered by an internal threshold. This is intended to be used for high throughput scenarios at the cost of latency. Batcher instances are not thread safe. To use across different threads, create a new Batcher instance per thread."
    new BatcherImpl<>(
              SQUARER_BATCHING_DESC_V2,
              callLabeledIntSquarer,
              labeledIntList,
              settings,
              EXECUTOR,
              flowController,
              callContext)
    
    
  • FlowController: "Provides flow control capability."
      FlowController flowController =
          new FlowController(
              FlowControlSettings.newBuilder()
                  .setLimitExceededBehavior(LimitExceededBehavior.Block)
                  .setMaxOutstandingElementCount(1L)
                  .build());
    

image

Next step: Where should callContext.withOption(key, value) be supposed to be called?

It seems within BatcherImpl.sendOutstanding()

ApiCallContext callContextWithOption = null;
if (callContext != null) {
callContextWithOption =
callContext.withOption(THROTTLED_TIME_KEY, accumulatedBatch.totalThrottledTimeMs);
}
. BatcherImpl$PushCurrentBatchRunnable's run() method calls sendOutstanding(). PushCurrentBatchRunnable is submitted in BatcherImpl's constructor:

      scheduledFuture =
          executor.scheduleWithFixedDelay(runnable, delay, delay, TimeUnit.MILLISECONDS);

With debug logging, logback, and jul-to-slf4j bridge (776eb1e), I got the following logging output (a normal successful invocation):

09:51:52.154 [main] INFO  c.g.api.gax.batching.BatcherImplTest -- Starting testThrottlingBlocking
09:51:53.089 [pool-2-thread-1] DEBUG c.g.api.gax.batching.BatcherImplTest -- calling batcher.add(1)
09:51:53.113 [main] DEBUG c.g.api.gax.batching.BatcherImplTest -- Caught TimeoutException as expected
09:51:53.113 [main] DEBUG c.g.api.gax.batching.BatcherImplTest -- calling future.get(3, TimeUnit.SECONDS)
09:51:53.152 [pool-2-thread-2] DEBUG c.g.api.gax.batching.BatcherImplTest -- calling flowController.release()
09:51:53.153 [pool-2-thread-2] DEBUG c.g.api.gax.batching.BatcherImplTest -- finished flowController.release()
09:51:53.175 [pool-2-thread-1] DEBUG c.g.api.gax.batching.BatcherImplTest -- batcher.add(1) finished
09:51:53.176 [main] DEBUG c.g.api.gax.batching.BatcherImplTest -- finished future.get(3, TimeUnit.SECONDS)
09:51:53.186 [pool-1-thread-1] DEBUG c.g.api.gax.batching.BatcherImpl -- callContext.withOption called
09:51:53.193 [main] DEBUG c.g.api.gax.batching.BatcherImplTest -- Mockito.verify(callContext, ...) succeeded
09:51:53.215 [main] INFO  c.g.api.gax.batching.BatcherImplTest -- Finishing testThrottlingBlocking

There are three threads: main, pool-1-thread-1, pool-2-thread-1, and pool-2-thread-2.

BatcherImpl$PushCurrentBatchRunnable uses pool-1-thread-1. While it tries to check the need of callContext.withOption every 1 millisecond, there's no guarantee that the event on the pool-1-thread-1 thread runs before main thread's Mockito.verify(callContext, Mockito.timeout.. operation.

Simply increasing the timeout value in Mockito.verify(callContext, Mockito.timeout...) from 100 to 1000 resulted in another error:

[ERROR] Failures: 
[ERROR]   BatcherImplTest.testThrottlingBlocking:903 expected to be at least: 50
but was                : 48

What does that callContext.withOptions(total_throttled_time, 48) mean?

The Javadoc tells it's "for tracking batch total throttled time".

callContext.withOption(THROTTLED_TIME_KEY, accumulatedBatch.totalThrottledTimeMs);

It's how much FlowController blocked.

Screenshot 2023-09-15 at 10 33 33 AM

While it got 10 milliseconds lag, there's no guarantee that batcher.add is called before Thread.sleep(throttledTime).

image

02:49:56.358 [gax-test06][main] INFO  c.g.api.gax.batching.BatcherImplTest -- Starting testThrottlingBlocking
02:49:59.283 [gax-test06][pool-3-thread-1] DEBUG c.g.api.gax.batching.BatcherImplTest -- calling batcher.add(1)
02:49:59.287 [gax-test06][pool-3-thread-2] DEBUG c.g.api.gax.batching.BatcherImplTest -- start sleeping 50 ms
02:49:59.293 [gax-test06][pool-3-thread-1] DEBUG c.g.api.gax.batching.BatcherImpl -- stopwatch started
02:49:59.297 [gax-test06][main] DEBUG c.g.api.gax.batching.BatcherImplTest -- Caught TimeoutException as expected
02:49:59.297 [gax-test06][main] DEBUG c.g.api.gax.batching.BatcherImplTest -- calling future.get(3, TimeUnit.SECONDS)
02:49:59.337 [gax-test06][pool-3-thread-2] DEBUG c.g.api.gax.batching.BatcherImplTest -- Sleep finished. Calling flowController.release()
02:49:59.338 [gax-test06][pool-3-thread-2] DEBUG c.g.api.gax.batching.BatcherImplTest -- finished flowController.release()
02:49:59.340 [gax-test06][pool-3-thread-1] DEBUG c.g.api.gax.batching.BatcherImpl -- stopwatch.elapsed (throttledTimeMs): 48
02:49:59.355 [gax-test06][pool-3-thread-1] DEBUG c.g.api.gax.batching.BatcherImpl -- add() called with throttledTimeMs= 48 (now totalThrottledTimeMs: 48)
02:49:59.364 [gax-test06][pool-2-thread-1] DEBUG c.g.api.gax.batching.BatcherImpl -- callContext.withOption called
02:49:59.371 [gax-test06][pool-3-thread-1] DEBUG c.g.api.gax.batching.BatcherImplTest -- batcher.add(1) finished
02:49:59.376 [gax-test06][main] DEBUG c.g.api.gax.batching.BatcherImplTest -- finished future.get(3, TimeUnit.SECONDS)
02:49:59.391 [gax-test06][main] DEBUG c.g.api.gax.batching.BatcherImplTest -- Mockito.verify(callContext, ...) succeeded
[ERROR] com.google.api.gax.batching.BatcherImplTest.testThrottlingBlocking_gax_test06 -- Time elapsed: 5.138 s <<< FAILURE!
expected to be at least: 50
but was                : 48
	at com.google.api.gax.batching.BatcherImplTest.testThrottlingBlocking_gax_test06(BatcherImplTest.java:926)

When the sleeping(50 ms) before flowController.release starts earlier than
BatcherImpl's stopwatch start time, the recorded throttledTime may
be shorter than throttledTime.

Normal case:

  1. throttle start (stopwatch start)
  2. sleep(50 ms) starts
  3. sleep(50 ms) finishes and calls flowController.release
  4. stopwatch finishes, recording more than or equal to 50 ms.

Rare case:

  1. sleep(50 ms) starts
  2. throttle start (stopwatch start)
  3. sleep(50 ms) finishes and calls flowController.release
  4. stopwatch finishes, recording less than 50 ms. assertThat(value.getValue()).isAtLeast(throttledTime) fails.

To avoid such bad operation order, we can make the "sleep(50 ms)" start only after the thread of "throttle start (stopwatch start)" becomes WAITING state:

      List<Thread> batcherAddThreadHolder = Collections.synchronizedList(new ArrayList<>());
      Future future =
          executor.submit(
              new Runnable() {
                Thread batcherAddThread;
                @Override
                public void run() {
                  batcherAddThreadHolder.add(Thread.currentThread());
                  logger.fine("calling batcher.add(1)");
                  batcher.add(1);
                  logger.fine("batcher.add(1) finished");
                }
              });
      do {
        // Wait until batcher.add blocks and starts the stopwatch.
        Thread.sleep(10);
      } while (batcherAddThreadHolder.isEmpty() || batcherAddThreadHolder.get(0).getState() != Thread.State.WAITING);

@suztomo suztomo self-assigned this Sep 13, 2023
suztomo added a commit to suztomo/gapic-generator-java that referenced this issue Sep 18, 2023
gcf-merge-on-green bot pushed a commit that referenced this issue Sep 18, 2023
At the end of the test, it verifies that the batcher waited more than throttled time by capturing the argument of `callContext.withOption()`. There were two problems:

# "Zero interactions with this mock" problem

```
Error:  Failures: 
Error:    BatcherImplTest.testThrottlingBlocking:899 
apiCallContext.withOption(
    <Capturing argument>,
    <Capturing argument>
);
Wanted 1 time:
-> at com.google.api.gax.batching.BatcherImplTest.testThrottlingBlocking(BatcherImplTest.java:899)
But was 0 times.
```

The callContext.withOption method is called by `BatcherImpl.sendOutstanding()` method via
`BatcherImpl$PushCurrentBatchRunnable` in **another thread**. https://togithub.com/googleapis/sdk-platform-java/blob/4c741077d614093d08665e9ddd83fb0e332b7881/gax-java/gax/src/main/java/com/google/api/gax/batching/BatcherImpl.java#L206

Technically, there's no guarantee that the thread calls the withOption method within a certain timeframe, especially when we run the tests concurrently in a machine with many CPU cores (the nightly test setup). That's why we occasionally saw "Zero interactions with this mock" problem. 

In my experiment (https://togithub.com/suztomo/gax-batcher-impl-test-reproducer/blob/main/README.md), increasing 100 ms timeout to 1000 ms just worked good to prevent false positives. The assertion is not about timing, but about BatcherImpl recording the throttled time correctly.

After fixing this problem, then I saw another problem below.

# "expected to be at least: 50 but was: 48" problem

```
[ERROR] com.google.api.gax.batching.BatcherImplTest.testThrottlingBlocking_gax_test06 -- Time elapsed: 5.138 s <<< FAILURE!
expected to be at least: 50
but was                : 48
	at com.google.api.gax.batching.BatcherImplTest.testThrottlingBlocking_gax_test06(BatcherImplTest.java:926)
```

At the end of the test, it verifies that the batcher waited more than throttled time by capturing the argument of `callContext.withOption()`. In ideal, usual scenarios,

1. In thread A, the batcher blocks at `batcher.add` (because of flowController) the thread and starts stopwatch to measure throttled time.
2. In thread B (created by the BatcherImplTest) sleeps 50 ms (`throttledTime`)
3. In thread B, BatcherImplTest calls `flowController.release()`, which wakes up the thread A.
4. The batcher records it was blocked more than 50 ms.

However, in rare cases, these events happens in this order:

1. Thread B starts sleeping 50 ms.
2. In thread A the batcher blocks at `batcher.add` (because of flowController) and starts stopwatch to measure throttled time.
3. In thread B, after 50 ms sleep, BatcherImplTest calls `flowController.release()`, which wakes up the thread A.
5. The batcher records it was blocked **less than** 50 ms. This makes the test assertion fail with "_expected to be at least: 50 but was: 48_" message.

To prevent this rare case, BatcherImplTest needs to wait before sleeping 50 ms in thread B. In this pull request, I add a do-while loop to check the thread A (the thread of `batcher.add`) is in Thread.State.WAITING state before making the sleeping of 50 ms.


- [x] Make sure to open an issue as a [bug/issue](https://togithub.com/googleapis/gapic-generator-java/issues/new/choose) before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
- [x] Ensure the tests and linter pass
- [x] Code coverage does not decrease (if any source code was changed)
- [x] Appropriate docs were updated (if necessary)

Fixes #1193  ☕️
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: p3 Desirable enhancement or fix. May not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants