Measure testgrid flakiness didn't detect flakes that happened #17773

serathius · 2024-04-11T12:14:53Z

What would you like to be added?

https://github.com/etcd-io/etcd/actions/runs/8584759086/job/23525383783

Why is this needed?

The last run, https://github.com/etcd-io/etcd/actions/runs/8584759086/job/23525383783 on April 7th, didn't detect a flake that happened on April 4th.

siyuanfoundation · 2024-04-11T16:03:09Z

The flaky detection is meant to detect tests that fail sometimes, not one-off failures.
This test fails about 2% of the time. @serathius Do you think a 2% threshold is reasonable?

serathius · 2024-04-11T17:55:14Z

Hmm, not sure. The 16% flakiness on the main branch doesn't seem great: https://github.com/etcd-io/etcd/actions/runs/8584643093/job/23525115058

siyuanfoundation · 2024-04-11T18:29:08Z

I think the 16 % flakiness on main branch includes all the workflows on a PR. I am seeing a lot of flakiness wrt arm64.

jmhbnz · 2024-04-11T20:09:21Z

> I think the 16 % flakiness on main branch includes all the workflows on a PR. I am seeing a lot of flakiness wrt arm64.

Raised a flake issue for TestMemberAdd e2e on arm64 and amd64. I have seen it fail a few times in GitHub actions for arm64 and there are also instances in TestGrid for amd64 on prow.

#17778

serathius · 2024-04-12T07:25:25Z

> The flaky detection is meant to detect tests that fail sometimes, not one-off failures.
> This test fails about 2% of the time. @serathius Do you think a 2% threshold is reasonable?

Maybe we could improve visibility. What surprised me was the fact that the tool didn't mention any flakes. Could we maybe log flakes below 2%, with a note that the rate is too low to file an issue?
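
A minimal sketch of that reporting idea, written in Python purely for illustration. The function name `summarize_flakes`, the input shape, and the 2% default are assumptions made for this example; the thread does not show how the actual tool is structured.

```python
def summarize_flakes(results, issue_threshold=0.02):
    """Return one report line per test that failed at least once.

    results: dict mapping test name -> (failed_runs, total_runs).
    This input shape is hypothetical, not the real tool's data model.
    """
    lines = []
    for name, (failed, total) in sorted(results.items()):
        if total == 0 or failed == 0:
            continue  # never ran or never failed: nothing to report
        rate = failed / total
        if rate >= issue_threshold:
            note = "file an issue"
        else:
            note = "too low to file an issue"
        lines.append(f"{name}: {failed}/{total} failed ({rate:.1%}), {note}")
    return lines


if __name__ == "__main__":
    sample = {"TestA": (1, 100), "TestB": (5, 100), "TestC": (0, 100)}
    for line in summarize_flakes(sample):
        print(line)
```

Every test with at least one failure is printed, so low-rate flakes stay visible even when no issue is filed for them.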

serathius · 2024-04-30T10:23:02Z

The reports are very nice.

My suggestions:

Make them easier to find, maybe use https://github.com/marketplace/actions/publish-test-report to publish a report in the summary.
The 10% per-test threshold is very high, so it will not report anything. From a contributor's perspective, I don't care about the flakiness of a single test; I care about my PR having flakes that waste my time on retries. I would recommend changing the threshold to be per suite: if the whole suite's flakiness is above 10%, we file an issue for the most flaky tests. This way we catch cases where tests with low flakiness are not a problem individually but are in aggregate, like 10 tests with a flakiness of 1% each. We can start by reporting just the top flaky test in the suite and iterate on it later.

To go into more detail, let's define a measure of bad contributor experience due to CI, something like the time wasted on CI to merge a PR. I would call this TTM (time to merge), a reflection of how long it takes to test a PR and the flakiness of those tests. I would expect TTM to equal something like max(TSDi^(1+TSFi) for each i), where TSDi is the duration of test suite i and TSFi is the flakiness of test suite i. Because retries can be done at the test suite level, we need to count it per suite. If we set a target for TTM, different suites might have different acceptable flakiness, as it's easier and faster to retry a 1 minute test than a 30 minute one. Of course, this assumes that the time to notice a failure and retry is zero, which is a simplification. However, at a high level this is my mental model of the problem. If we include TTR (time to retry), then TTM = max(TSDi^TSFi + (TSDi+TTR)^TSFi for each i).
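
For anyone who wants to play with the numbers, the two expressions above can be transcribed literally into Python. This is only a sketch of the rough mental model as written, not a validated metric. The function name `ttm`, the tuple layout, and the use of minutes as the unit are assumptions.

```python
def ttm(suites, ttr=None):
    """Evaluate the TTM expressions from the comment above, as written.

    suites: list of (duration, flakiness) pairs, one per test suite.
            Durations are assumed to be in minutes, flakiness in [0, 1].
    ttr:    optional time to retry, in the same unit as the durations.
    """
    if not suites:
        raise ValueError("at least one suite is required")
    if ttr is None:
        # TTM = max(TSDi^(1+TSFi) for each i)
        return max(d ** (1 + f) for d, f in suites)
    # TTM = max(TSDi^TSFi + (TSDi+TTR)^TSFi for each i)
    return max(d ** f + (d + ttr) ** f for d, f in suites)


if __name__ == "__main__":
    # A 1 minute suite and a 30 minute suite with the same 10% flakiness.
    suites = [(1.0, 0.10), (30.0, 0.10)]
    print(ttm(suites))
    print(ttm(suites, ttr=5.0))
```

With equal flakiness, the longer suite dominates the first expression, which matches the point that a 30 minute suite can tolerate less flakiness than a 1 minute one.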

serathius added the type/feature label Apr 11, 2024

siyuanfoundation mentioned this issue Apr 12, 2024

testgrid: print out all failed tests for visibility. #17785

Merged

siyuanfoundation mentioned this issue May 1, 2024

testgrid-analysis: create issues based on test set flakiness instead of individual tests. #17924

Draft
