[Flaky Test][sig-node] kubernetes-unit-test TestHTTP1DoNotReuseRequestAfterTimeout is flaky #106569
@Rajalakshmi-Girish: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@aojea ^^ |
/assign |
interesting, it doesn't have any failures in the other jobs; it seems it only flakes on ppc64le: https://storage.googleapis.com/k8s-triage/index.html?pr=1&test=TestHTTP Maybe a golang issue? |
I don't have access to a ppc64le system, but it would be good to run the test with the stress tool:

```
# compile the test as a binary
$ go test -run ^TestReconnectBroken k8s.io/client-go/rest -c
# run the test using https://pkg.go.dev/golang.org/x/tools/cmd/stress
$ stress ./rest.test -test.run ^TestReconnectBroken
5s: 24 runs so far, 0 failures
10s: 72 runs so far, 0 failures
15s: 96 runs so far, 0 failures
20s: 144 runs so far, 0 failures
25s: 192 runs so far, 0 failures
30s: 216 runs so far, 0 failures
35s: 264 runs so far, 0 failures
40s: 312 runs so far, 0 failures
45s: 336 runs so far, 0 failures
50s: 384 runs so far, 0 failures
``` |
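A note for anyone reproducing this (not from the original thread): the stress tool used above is not part of the standard Go distribution; assuming a recent Go toolchain, it can be installed with:

```
$ go install golang.org/x/tools/cmd/stress@latest
```

stress runs the given binary repeatedly across parallel workers, reports the run/failure counts seen above, and writes the output of any failing run to a temp file for inspection. |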
This testgrid tab is slated for removal as the job is not maintained: kubernetes/test-infra#23249. Conformance tests are required to pass on ppc64le and are tracked here: https://testgrid.k8s.io/conformance-ppc64le#Periodic%20ppc64le%20conformance%20test%20on%20local%20cluster But I don't think we necessarily have support for unit tests. |
Yes, the dashboard is going away, but we need to find a better place for it: the k8s unit tests would need a run on the ppc64le architecture. |
It's not only about being able to have a job; there have to be resources. This case is a good example: the test works perfectly on x86 :/ #106569 (comment) What is the next step? Who will investigate the golang toolchain and the specifics of the ppc64le environment? |
Well, we have been running these tests for years and helping to fix them (we found a few generic flaky tests as well); they just live in the wrong group, but they will find a new home soon :)
I feel unit tests do give us a lot of confidence in the code; it's worth identifying the issues. |
Completely agree; we need resources to debug and identify the issues. |
@aojea When I run it using the stress tool, the test flakes on both ppc64le and x86_64. |
@aojea the golang version was |
yeah, that is the thing, we have this situation: same test, same golang version, and it only flakes on ppc64le. It is legit to think that the architecture can be the problem here; now, who can rule out that the architecture is the problem? BTW, it seems golang has some reported architecture issues that may be related: #106569 (comment) EDIT: wait, I didn't see that it failed on x86 too 👀 |
there were some HTTP changes in golang (golang/go@f9cb33c); however, I can't reproduce it on my host :/ |
@aojea Is this somehow related to golang/go#49741? |
I compiled the test binary using the |
:/ not a single failure in 20 mins
|
@aojea Can you please try:
|
Why was the binary for |
I was lazy and carried over the CLI flags, but it doesn't matter; the generated binary is the same:

```
$ go test -run ^TestReconnectBroken k8s.io/client-go/rest -c -race -o test_withflag
$ go test k8s.io/client-go/rest -c -race -o test_withoutflag
$ md5sum test_*
cda6ddc9d31f4c760c93b34ad78981dc  test_withflag
cda6ddc9d31f4c760c93b34ad78981dc  test_withoutflag
``` |
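This matches how `go test -c` works (an aside, not spelled out in the thread): the `-run` pattern is a run-time filter, not a compile-time one, so it cannot change the compiled binary. A compiled test binary takes the filter via `-test.run` when executed, e.g.:

```
# the same binary can run any subset of its tests
$ ./test_withoutflag -test.run '^TestReconnectBroken$' -test.v
```
|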
@aojea I have seen that there are no failures after increasing the timeout here from 100 milliseconds to 200 milliseconds.

Before increasing the timeout:

After increasing the timeout:

Machine information:

Is there any way I can find out where it is taking longer, or why it expects some extra time? |
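For illustration only (this is not the actual client-go test; the names below are invented), the pattern being tuned looks roughly like this in Go: a client with a short timeout against a deliberately slow server, where a 100ms deadline can be too tight on a slow or loaded machine:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// A deliberately slow handler: it sleeps longer than the client
	// timeout, so the request is expected to fail with a timeout error.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(500 * time.Millisecond)
	}))
	defer srv.Close()

	// Bumping the timeout from 100ms to 200ms, as described above, gives
	// a slow or loaded machine more headroom before the deadline fires
	// for the wrong reason.
	client := &http.Client{Timeout: 200 * time.Millisecond}

	_, err := client.Get(srv.URL)
	fmt.Println(err) // expect a Client.Timeout error, by design
}
```
|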
Maybe we can just bump the timeout in the test; it's not like we need such granularity, and we are talking about milliseconds only... maybe something in the standard library is racing? |
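A cheap way to probe that hypothesis (a suggestion, not something run in this thread; all flags are standard `go test` options):

```
# repeat the flaky test many times with the race detector enabled
$ go test -race -count=100 -run '^TestHTTP1DoNotReuseRequestAfterTimeout$' k8s.io/client-go/rest
```
|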
submitted #106716 to bump the timeout on this test; thanks for pushing on this |
Which jobs are flaking?
periodic-kubernetes-unit-test-ppc64le
Which tests are flaking?
TestHTTP1DoNotReuseRequestAfterTimeout
Since when has it been flaking?
16th November 2021
Testgrid link
https://k8s-testgrid.appspot.com/sig-node-ppc64le#unit-tests
Reason for failure (if possible)
This test was added by change #104844.
It seems it has been flaky since it was added.
Anything else we need to know?
Below is the trace from the job:
Relevant SIG(s)
/sig node