
robustness: make qps below threshold retriable test. #17725

Closed

Conversation

siyuanfoundation
Contributor

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

For issue #17717

Errors like "Requiring minimal 200.000000 qps for test results to be reliable, got 72.143568 qps." could be temporary. Proposing to make them retriable.

Signed-off-by: Siyuan Zhang <sizhang@google.com>
Member

@jmhbnz jmhbnz left a comment


I like the idea - nice work @siyuanfoundation

Edit: On marginal hardware the retries might just shift the problem to a timeout completing the robustness run within 30mins/200mins, but I still think it's better than the current situation, where a single performance flake throws off the entire robustness run.

@siyuanfoundation
Contributor Author

cc @serathius @ahrtr

Member

@serathius serathius left a comment


Don't like the idea. If we can get the required performance on GitHub Actions, we should be able to get it on Prow. If you look at the recent failures in GitHub Actions, they are not directly related to low qps; it's just a consequence of another issue.

@serathius
Member

I still think it's better than the current situation, where a single performance flake throws off the entire robustness run.

This is a separate, unrelated problem. Robustness test reports are always written to the same directory, which depends on the name of the test. This means we cannot have multiple failures in a run, as a later failure would override the previous result. I'm open to changing it if you think this is a problem, but I don't think stopping on the first failure is bad, at least as long as we keep the number of failures low; there is no point in having a robustness test with a high failure rate.

@jmhbnz
Member

jmhbnz commented Apr 6, 2024

Don't like the idea. If we can get the required performance on GitHub Actions, we should be able to get it on Prow. If you look at the recent failures in GitHub Actions, they are not directly related to low qps; it's just a consequence of another issue.

It's not necessarily that we can't get the qps; clearly later iterations on Prow do. For some reason the first one doesn't.

Putting that aside, I still think this feature makes sense. Why wouldn't we make the test suite more resilient to these annoying performance flakes? This seems like a very sensible feature.

@serathius
Member

Just like we don't slap retries on every other test, we need to understand the underlying problem. I would need to understand what the problem with Prow is first.

@siyuanfoundation
Contributor Author

siyuanfoundation commented Apr 6, 2024

We don't slap retries on every test because most tests are written to detect real problems when they fail. Judging just from the error message Requiring minimal X qps for test results to be reliable, my understanding is that it only means we should not consider this run, not that there is a real failure. For similar reasons, the InjectFailpoint error is retried 3 times.

Here is my line of thought. If 1% of test runs do not satisfy the qps minimum, is that really a problem?
Since hundreds of tests are run each day, it is very likely that some test fails for this reason: even with just 100 runs, the probability that all of them clear the bar is only 0.99^100 ≈ 36%. But we cannot just ignore those failures as "flaky", because each run uses an exploratory config, and a single test run failure can mean a real issue. So 1% of tests not satisfying the qps minimum means a manual check is needed almost every day. Is it really worth it?

On the other hand, if we say 10% of tests not satisfying the qps minimum is a real problem, 3 retries put the single-test success rate at 1 - 0.1^3 = 0.999, and for 100 runs the probability of at least one test failing is 1 - 0.999^100 ≈ 10%. So in the end, you would still see the suite failing about 10% of the time.
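For illustration only (not part of this PR), here is a short Go sketch that reproduces the back-of-the-envelope numbers above; the 100-runs-per-day figure and the two flake rates are the assumptions taken from this comment:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	const runsPerDay = 100

	// Scenario 1: 1% of runs miss the qps minimum and there are no retries.
	// P(all runs pass) = 0.99^100 ≈ 36%, so some run fails on most days.
	pFlake := 0.01
	pAllPass := math.Pow(1-pFlake, runsPerDay)
	fmt.Printf("no retries, 1%% flake rate:  P(some run fails per day) ≈ %.0f%%\n", (1-pAllPass)*100)

	// Scenario 2: 10% of runs miss the qps minimum, but each run is retried
	// up to 3 times, so a run only fails if all 3 attempts flake.
	pFlake = 0.10
	pRunFails := math.Pow(pFlake, 3) // 0.001
	pAllPass = math.Pow(1-pRunFails, runsPerDay)
	fmt.Printf("3 retries, 10%% flake rate:  P(some run fails per day) ≈ %.0f%%\n", (1-pAllPass)*100)
}
```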

```diff
@@ -124,7 +153,7 @@ func (s testScenario) run(ctx context.Context, t *testing.T, lg *zap.Logger, clu
 	maxRevisionChan := make(chan int64, 1)
 	g.Go(func() error {
 		defer close(maxRevisionChan)
-		operationReport = traffic.SimulateTraffic(ctx, t, lg, clus, s.profile, s.traffic, finishTraffic, baseTime, ids)
+		operationReport, retriableErr = traffic.SimulateTraffic(ctx, t, lg, clus, s.profile, s.traffic, finishTraffic, baseTime, ids)
```
Contributor


nit: Considering that this is the only place we set retriableErr, this should be fine, but if we set it in any of the other goroutines in the future, that might be a race.

How about returning the err instead of nil in this goroutine and catching it as part of g.Wait() below?
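For illustration, a minimal sketch of this suggestion with hypothetical stand-ins (runTraffic for traffic.SimulateTraffic, errLowQPS for the low-qps error): return the error from the goroutine and pick it up from g.Wait(), so no shared retriableErr variable is needed:

```go
package main

import (
	"context"
	"errors"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// errLowQPS is a hypothetical sentinel for a retriable low-qps failure.
var errLowQPS = errors.New("qps below threshold")

// runTraffic is a stand-in for traffic.SimulateTraffic.
func runTraffic(ctx context.Context) error {
	return errLowQPS
}

func main() {
	g, ctx := errgroup.WithContext(context.Background())
	g.Go(func() error {
		// Returning the error here lets errgroup record it, instead of
		// writing to a variable shared with other goroutines.
		return runTraffic(ctx)
	})

	if err := g.Wait(); err != nil {
		if errors.Is(err, errLowQPS) {
			fmt.Println("retriable failure:", err)
			return
		}
		fmt.Println("non-retriable failure:", err)
	}
}
```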

Comment on lines -117 to +147

```diff
-	t.Error(err)
-	cancel()
+	t.Fatal(err)
```
Contributor


Just trying to understand this better: if we use t.Fatal here, we won't end up retrying the test, right?
Referring to the comment here: #17725 (comment)
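For context, a hypothetical retry helper (not code from this PR) showing the concern: t.Fatal stops the test immediately, so a retry loop only works if the attempt returns an error instead of calling Fatal itself:

```go
package robustness

import "testing"

// runWithRetries is a hypothetical helper: it retries attempt up to
// maxAttempts times and only fails the test once all attempts are exhausted.
// If attempt called t.Fatal itself, the test would stop on the first failure
// and the remaining retries would never run.
func runWithRetries(t *testing.T, maxAttempts int, attempt func() error) {
	t.Helper()
	var err error
	for i := 1; i <= maxAttempts; i++ {
		if err = attempt(); err == nil {
			return
		}
		t.Logf("attempt %d/%d failed: %v", i, maxAttempts, err)
	}
	t.Fatalf("all %d attempts failed, last error: %v", maxAttempts, err)
}
```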

@serathius
Member

For similar reasons, the InjectFailpoint error is retried 3 times.

I consider retrying InjectFailpoint a mistake; retrying that error doesn't really help, because if the failpoint fails it will just fail twice. The situation only improved after we found and fixed underlying issues in the process handling code. The reason I introduced it was that the issue was so rare (once every couple hundred invocations) that I could not reproduce it locally, so I wanted to slap a band-aid on it. I don't think that was a good decision.

@serathius
Member

Here is my line of thought. If 1% of tests do not satisfy the qps min, is that really a problem?

No, it's not, as long as we investigate the reason, track how flakiness changes over time, and don't hide it behind retries. Retrying obscures the test results without understanding the underlying issue. Low qps is caused not only by infrastructure noise, but also by other underlying availability issues. Only by observing all the flakes and categorizing them can we discover issues like #17455.

Of course, categorizing issues in the robustness tests brings additional toil. For now I'm happy to handle it, with plans to share the knowledge with the community via dedicated meetings, and potentially to automate the process at some point via string matching.
