
robustness: make qps below threshold retriable test. #17725

Closed

Conversation

siyuanfoundation
Contributor

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

For issue #17717

Errors like "Requiring minimal 200.000000 qps for test results to be reliable, got 72.143568 qps." could be temporary. Proposing to make them retriable.

Signed-off-by: Siyuan Zhang <sizhang@google.com>
Member

@jmhbnz jmhbnz left a comment


I like the idea - nice work @siyuanfoundation

Edit: On marginal hardware the retries might just shift the problem to a timeout completing the robustness run within 30mins/200mins, but I still think it's better than the current situation, where a single performance flake throws off the entire robustness run.

@siyuanfoundation
Contributor Author

cc @serathius @ahrtr

Member

@serathius serathius left a comment


Don't like the idea. If we can get the required performance on GitHub Actions, we should be able to get it on Prow. If you look at the recent failures in GitHub Actions, they are not directly related to low qps; it's just a consequence of another issue.

@serathius
Member

I still think it's better than the current situation, where a single performance flake throws off the entire robustness run.

This is a separate, unrelated problem. Robustness test reports are always written to the same directory, which depends on the name of the test. This means we cannot have multiple failures in a run, as a later failure would override the previous result. I'm open to changing it if you think this is a problem, but I don't think stopping on the first failure is bad, at least as long as we keep the number of failures low; there is no point in having a robustness test with a high failure rate.

@jmhbnz
Member

jmhbnz commented Apr 6, 2024

Don't like the idea. If we can get the required performance on GitHub Actions, we should be able to get it on Prow. If you look at the recent failures in GitHub Actions, they are not directly related to low qps; it's just a consequence of another issue.

It's not necessarily that we can't get the qps; clearly later iterations on Prow do. For some reason the first one doesn't.

Putting that aside, I still think this feature makes sense. Why wouldn't we make the test suite more resilient to these annoying performance flakes? This seems like a very sensible feature.

@serathius
Member

Just like we don't slap retries on every other test, we need to understand the underlying problem. I would need to understand what the problem with Prow is first.

@siyuanfoundation
Contributor Author

siyuanfoundation commented Apr 6, 2024

We don't slap retries on every test because most tests are written to detect real problems when they fail. Judging just from the error message Requiring minimal X qps for test results to be reliable, my understanding is that it only means we should not consider this run, not that there is a real failure. For similar reasons, the InjectFailpoint error is retried 3 times.

Here is my line of thought. If 1% of test runs do not satisfy the qps minimum, is that really a problem?
Since hundreds of tests are run each day, it is very likely that some test fails for this reason: even with just 100 runs, the probability that all of them clear the bar is only 0.99^100 ≈ 36%. But we cannot just ignore those failures as "flaky", because each run uses an exploratory config, and a single test run failure can mean a real issue. So 1% of tests not satisfying the qps minimum means a manual check is needed almost every day. Is it really worth it?

On the other hand, if we say 10% of tests not satisfying the qps minimum is a real problem, 3 retries put the single-test success rate at 1 - 0.1^3 = 0.999, and for 100 runs the probability of at least one test failing is 1 - 0.999^100 ≈ 10%. So in the end, you would still see the suite failing about 10% of the time.
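For illustration only (not part of this PR), here is a short Go sketch that reproduces the back-of-the-envelope numbers above; the 100-runs-per-day figure and the two flake rates are the assumptions taken from this comment:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	const runsPerDay = 100

	// Scenario 1: 1% of runs miss the qps minimum and there are no retries.
	// P(all runs pass) = 0.99^100 ≈ 36%, so some run fails on most days.
	pFlake := 0.01
	pAllPass := math.Pow(1-pFlake, runsPerDay)
	fmt.Printf("no retries, 1%% flake rate:  P(some run fails per day) ≈ %.0f%%\n", (1-pAllPass)*100)

	// Scenario 2: 10% of runs miss the qps minimum, but each run is retried
	// up to 3 times, so a run only fails if all 3 attempts flake.
	pFlake = 0.10
	pRunFails := math.Pow(pFlake, 3) // 0.001
	pAllPass = math.Pow(1-pRunFails, runsPerDay)
	fmt.Printf("3 retries, 10%% flake rate:  P(some run fails per day) ≈ %.0f%%\n", (1-pAllPass)*100)
}
```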

```diff
@@ -124,7 +153,7 @@ func (s testScenario) run(ctx context.Context, t *testing.T, lg *zap.Logger, clu
 	maxRevisionChan := make(chan int64, 1)
 	g.Go(func() error {
 		defer close(maxRevisionChan)
-		operationReport = traffic.SimulateTraffic(ctx, t, lg, clus, s.profile, s.traffic, finishTraffic, baseTime, ids)
+		operationReport, retriableErr = traffic.SimulateTraffic(ctx, t, lg, clus, s.profile, s.traffic, finishTraffic, baseTime, ids)
```
Contributor


nit: Considering that this is the only place we set retriableErr, this should be fine, but if we set it in any of the other goroutines in the future, that might be a race.

How about returning the err instead of nil in this goroutine and catching it as part of g.Wait() below?
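For illustration, a minimal sketch of this suggestion with hypothetical stand-ins (runTraffic for traffic.SimulateTraffic, errLowQPS for the low-qps error): return the error from the goroutine and pick it up from g.Wait(), so no shared retriableErr variable is needed:

```go
package main

import (
	"context"
	"errors"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// errLowQPS is a hypothetical sentinel for a retriable low-qps failure.
var errLowQPS = errors.New("qps below threshold")

// runTraffic is a stand-in for traffic.SimulateTraffic.
func runTraffic(ctx context.Context) error {
	return errLowQPS
}

func main() {
	g, ctx := errgroup.WithContext(context.Background())
	g.Go(func() error {
		// Returning the error here lets errgroup record it, instead of
		// writing to a variable shared with other goroutines.
		return runTraffic(ctx)
	})

	if err := g.Wait(); err != nil {
		if errors.Is(err, errLowQPS) {
			fmt.Println("retriable failure:", err)
			return
		}
		fmt.Println("non-retriable failure:", err)
	}
}
```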

Comment on lines -117 to +147

```diff
-	t.Error(err)
-	cancel()
+	t.Fatal(err)
```
Contributor


Just trying to understand this better: if we use t.Fatal here, we won't end up retrying the test, right?
Referring to the comment here: #17725 (comment)
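For context, a hypothetical retry helper (not code from this PR) showing the concern: t.Fatal stops the test immediately, so a retry loop only works if the attempt returns an error instead of calling Fatal itself:

```go
package robustness

import "testing"

// runWithRetries is a hypothetical helper: it retries attempt up to
// maxAttempts times and only fails the test once all attempts are exhausted.
// If attempt called t.Fatal itself, the test would stop on the first failure
// and the remaining retries would never run.
func runWithRetries(t *testing.T, maxAttempts int, attempt func() error) {
	t.Helper()
	var err error
	for i := 1; i <= maxAttempts; i++ {
		if err = attempt(); err == nil {
			return
		}
		t.Logf("attempt %d/%d failed: %v", i, maxAttempts, err)
	}
	t.Fatalf("all %d attempts failed, last error: %v", maxAttempts, err)
}
```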

@serathius
Member

For similar reasons, the InjectFailpoint error is retried 3 times.

I consider retrying InjectFailpoint a mistake; retrying that error doesn't really help, because if the failpoint fails it will just fail twice. The situation only improved after we found and fixed underlying issues in the process handling code. The reason I introduced it was that the issue was so rare (once every couple hundred invocations) that I could not reproduce it locally, so I wanted to slap a band-aid on it. I don't think that was a good decision.

@serathius
Member

Here is my line of thought. If 1% of tests do not satisfy the qps min, is that really a problem?

No, it's not, as long as we investigate the reason, track how flakiness changes over time, and don't hide it behind retries. Retrying obscures the test results without understanding the underlying issue. Low qps is caused not only by infrastructure noise, but also by other underlying availability issues. Only by observing all the flakes and categorizing them can we discover issues like #17455.

Of course, categorizing issues in the robustness tests brings additional toil. For now I'm happy to handle it, with plans to share the knowledge with the community via dedicated meetings, and potentially to automate the process at some point via string matching.
