
[Flaky Test][sig-node] kubernetes-unit-test TestHTTP1DoNotReuseRequestAfterTimeout is being Flaky #106569

Closed · Rajalakshmi-Girish opened this issue Nov 20, 2021 · 23 comments · Fixed by #106716
Labels: kind/flake, needs-triage, sig/node

Comments

@Rajalakshmi-Girish (Contributor) commented Nov 20, 2021

Which jobs are flaking?

periodic-kubernetes-unit-test-ppc64le

Which tests are flaking?

  1. vendor/k8s.io/client-go/rest.TestHTTP1DoNotReuseRequestAfterTimeout
  2. vendor/k8s.io/client-go/rest.TestHTTP1DoNotReuseRequestAfterTimeout/HTTP1
  3. vendor/k8s.io/client-go/rest.TestHTTP1DoNotReuseRequestAfterTimeout/HTTP2

Since when has it been flaking?

16th November 2021

Testgrid link

https://k8s-testgrid.appspot.com/sig-node-ppc64le#unit-tests

Reason for failure (if possible)

This test was added by #104844.

It seems to have been flaky since it was added.

Anything else we need to know?

Below is the trace from the job:

 === FAIL: vendor/k8s.io/client-go/rest TestHTTP1DoNotReuseRequestAfterTimeout/HTTP1 (0.13s)
    request_test.go:3016: Unexpected error: Get "https://127.0.0.1:34701/foo?timeout=100ms": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021/11/20 12:06:03 http: TLS handshake error from 127.0.0.1:37890: read tcp 127.0.0.1:34701->127.0.0.1:37890: use of closed network connection
    --- FAIL: TestHTTP1DoNotReuseRequestAfterTimeout/HTTP1 (0.13s)
=== FAIL: vendor/k8s.io/client-go/rest TestHTTP1DoNotReuseRequestAfterTimeout (0.30s) 

Relevant SIG(s)

/sig node

@Rajalakshmi-Girish added the kind/flake label Nov 20, 2021
@k8s-ci-robot added the sig/node and needs-triage labels Nov 20, 2021
@k8s-ci-robot (Contributor):

@Rajalakshmi-Girish: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Rajalakshmi-Girish (Contributor, Author):

Proof of flakiness:
[image]

@Rajalakshmi-Girish changed the title from [Flaky Test][sig-node-ppc64le] TestHTTP1DoNotReuseRequestAfterTimeout to [Flaky Test][sig-node-ppc64le] kubernetes-unit-test TestHTTP1DoNotReuseRequestAfterTimeout is being Flaky Nov 20, 2021
@Rajalakshmi-Girish changed the title to [Flaky Test][sig-node] kubernetes-unit-test TestHTTP1DoNotReuseRequestAfterTimeout is being Flaky Nov 20, 2021
@Rajalakshmi-Girish (Contributor, Author):

@aojea ^^

@aojea (Member) commented Nov 20, 2021

/assign

@aojea (Member) commented Nov 20, 2021

Interesting: it doesn't fail in any of the other jobs; it seems to flake only on ppc64le.

https://storage.googleapis.com/k8s-triage/index.html?pr=1&test=TestHTTP

Maybe a Go issue?

golang/go#49680
golang/go#44304

@aojea (Member) commented Nov 20, 2021

I don't have access to a ppc64le system, but it would be good to run the test with the stress tool:

# compile the test package into a test binary
$ go test -run ^TestReconnectBroken k8s.io/client-go/rest -c
# run the test using https://pkg.go.dev/golang.org/x/tools/cmd/stress
$ stress ./rest.test -test.run ^TestReconnectBroken
5s: 24 runs so far, 0 failures
10s: 72 runs so far, 0 failures
15s: 96 runs so far, 0 failures
20s: 144 runs so far, 0 failures
25s: 192 runs so far, 0 failures
30s: 216 runs so far, 0 failures
35s: 264 runs so far, 0 failures
40s: 312 runs so far, 0 failures
45s: 336 runs so far, 0 failures
50s: 384 runs so far, 0 failures
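
For anyone trying to reproduce locally, the stress tool referenced above can be installed with (assuming a Go 1.16+ toolchain):

# install the stress binary into $(go env GOPATH)/bin
go install golang.org/x/tools/cmd/stress@latest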

@ehashman added this to Triage in SIG Node CI/Test Board Nov 22, 2021
@ehashman (Member):

This testgrid tab is slated for removal as the job is not maintained: kubernetes/test-infra#23249

Conformance tests are required to pass on ppc64le which are tracked here: https://testgrid.k8s.io/conformance-ppc64le#Periodic%20ppc64le%20conformance%20test%20on%20local%20cluster

But I don't think we necessarily have support for unit tests.

@Rajalakshmi-Girish (Contributor, Author):

> This testgrid tab is slated for removal as the job is not maintained: kubernetes/test-infra#23249
>
> Conformance tests are required to pass on ppc64le which are tracked here: https://testgrid.k8s.io/conformance-ppc64le#Periodic%20ppc64le%20conformance%20test%20on%20local%20cluster
>
> But I don't think we necessarily have support for unit tests.

Yes, the conformance dashboard under the sig-node-ppc64le testgrid tab can be removed, since it duplicates the conformance-ppc64le tab.

But we need to find a better home for the unit-tests dashboard; removing it altogether as part of the sig-node testgrid cleanup is not right!

Kubernetes unit tests still need to run on the ppc64le architecture.

@aojea (Member) commented Nov 23, 2021

> Kubernetes unit tests still need to run on the ppc64le architecture.

It's not only about having a job; there also have to be resources to investigate. This case is a good example: the test works perfectly on x86 :/ #106569 (comment)

What is the next step? Who will investigate the Go toolchain and the specifics of the ppc64le environment?

@mkumatag (Member):

> This testgrid tab is slated for removal as the job is not maintained: kubernetes/test-infra#23249

Well, we have been running these tests for years and helping to fix them (we found a few generic flaky tests as well); they just live in the wrong group and will find a new home soon :)

> Conformance tests are required to pass on ppc64le which are tracked here: https://testgrid.k8s.io/conformance-ppc64le#Periodic%20ppc64le%20conformance%20test%20on%20local%20cluster
>
> But I don't think we necessarily have support for unit tests.

I feel unit tests give us a lot of confidence in the code; it's worth identifying the issues.

@mkumatag (Member):

> Kubernetes unit tests still need to run on the ppc64le architecture.

> It's not only about having a job; there also have to be resources to investigate. This case is a good example: the test works perfectly on x86 :/ #106569 (comment)
>
> What is the next step? Who will investigate the Go toolchain and the specifics of the ppc64le environment?

Completely agree; we need resources to debug and identify the issues.

@Rajalakshmi-Girish (Contributor, Author):

@aojea When I run it with the stress tool, the test flakes on both ppc64le and x86_64.

  • on ppc64le:
[root@rajalakshmi-workspace1 kubernetes]# stress ./rest.test -test.run TestHTTP1DoNotReuseRequestAfterTimeout

/tmp/go-stress-20211123T070635-1678334821
--- FAIL: TestHTTP1DoNotReuseRequestAfterTimeout (0.45s)
    --- FAIL: TestHTTP1DoNotReuseRequestAfterTimeout/HTTP1 (0.28s)
        request_test.go:2974: TEST Connected from 127.0.0.1:47712 on /foo
        request_test.go:2974: TEST Connected from 127.0.0.1:47712 on /hang
        request_test.go:2976: TEST hanging 127.0.0.1:47712
        request_test.go:2974: TEST Connected from 127.0.0.1:47730 on /foo
        request_test.go:3030: Unexpected error: Get "https://127.0.0.1:35107/foo?timeout=100ms": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
FAIL


ERROR: exit status 1


/tmp/go-stress-20211123T070635-1511241200
2021/11/23 07:06:38 http: TLS handshake error from 127.0.0.1:52740: read tcp 127.0.0.1:36005->127.0.0.1:52740: read: connection reset by peer
--- FAIL: TestHTTP1DoNotReuseRequestAfterTimeout (0.29s)
    --- FAIL: TestHTTP1DoNotReuseRequestAfterTimeout/HTTP1 (0.11s)
        request_test.go:3016: Unexpected error: Get "https://127.0.0.1:36005/foo?timeout=100ms": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
FAIL


ERROR: exit status 1

5s: 72 runs so far, 2 failures (2.78%)
  • on x86_64:
[root@toad1 kubernetes]# stress ./rest.test -test.run TestHTTP1DoNotReuseRequestAfterTimeout
5s: 50 runs so far, 0 failures

/tmp/go-stress-20211123T025122-1432112823
--- FAIL: TestHTTP1DoNotReuseRequestAfterTimeout (0.27s)
    --- FAIL: TestHTTP1DoNotReuseRequestAfterTimeout/HTTP2 (0.14s)
        request_test.go:3016: Unexpected error: Get "https://127.0.0.1:44741/foo?timeout=100ms": context deadline exceeded
        request_test.go:2974: TEST Connected from 127.0.0.1:38406 on /foo
FAIL


ERROR: exit status 1

10s: 101 runs so far, 1 failures (0.99%)
15s: 156 runs so far, 1 failures (0.64%)
20s: 204 runs so far, 1 failures (0.49%)
25s: 250 runs so far, 1 failures (0.40%)
30s: 298 runs so far, 1 failures (0.34%)
35s: 351 runs so far, 1 failures (0.28%)
40s: 402 runs so far, 1 failures (0.25%)
45s: 454 runs so far, 1 failures (0.22%)
50s: 501 runs so far, 1 failures (0.20%)
55s: 553 runs so far, 1 failures (0.18%)
1m0s: 605 runs so far, 1 failures (0.17%)
1m5s: 655 runs so far, 1 failures (0.15%)

/tmp/go-stress-20211123T025122-2483598037
2021/11/23 02:52:32 http: TLS handshake error from 127.0.0.1:51906: read tcp 127.0.0.1:39957->127.0.0.1:51906: use of closed network connection
--- FAIL: TestHTTP1DoNotReuseRequestAfterTimeout (0.37s)
    --- FAIL: TestHTTP1DoNotReuseRequestAfterTimeout/HTTP1 (0.24s)
        request_test.go:2974: TEST Connected from 127.0.0.1:51894 on /foo
        request_test.go:2974: TEST Connected from 127.0.0.1:51894 on /hang
        request_test.go:2976: TEST hanging 127.0.0.1:51894
        request_test.go:3030: Unexpected error: Get "https://127.0.0.1:39957/foo?timeout=100ms": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
FAIL


ERROR: exit status 1

1m10s: 706 runs so far, 2 failures (0.28%)

@aojea The Go version was:

[root@toad1 kubernetes]# go version
go version go1.17.3 linux/amd64
[root@toad1 kubernetes]#

@aojea (Member) commented Nov 23, 2021

> @aojea When I run it with the stress tool, the test flakes on both ppc64le and x86_64.

Yeah, that is the thing; we had this situation:

same test, same Go version, and it only flaked on ppc64le.

It is legitimate to think that the architecture could be the problem here; now, who can rule the architecture out?

BTW, it seems Go has some reported architecture-specific issues that may be related: #106569 (comment)

EDIT

Wait, I didn't see that it failed on x86 too 👀

@aojea (Member) commented Nov 23, 2021

There were some HTTP changes in Go: golang/go@f9cb33c

However, I can't reproduce it on my host :/

stress ./rest.test -test.run TestHTTP1DoNotReuseRequestAfterTimeout -test.v
5s: 538 runs so far, 0 failures
10s: 1096 runs so far, 0 failures
15s: 1656 runs so far, 0 failures
20s: 2208 runs so far, 0 failures
25s: 2760 runs so far, 0 failures
30s: 3312 runs so far, 0 failures
35s: 3867 runs so far, 0 failures
40s: 4419 runs so far, 0 failures
45s: 4977 runs so far, 0 failures
go version
go version go1.17.3 linux/amd64
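
A load-dependent flake like this is sometimes easier to reproduce by oversubscribing the CPU; assuming the stock x/tools stress flags, something like:

# run more concurrent test processes than cores to create scheduler pressure
# (-p defaults to the number of CPUs)
stress -p 16 ./rest.test -test.run TestHTTP1DoNotReuseRequestAfterTimeout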

@Rajalakshmi-Girish (Contributor, Author) commented Nov 24, 2021

@aojea Is this somehow related to golang/go#49741?
Can you please take a look?

@Rajalakshmi-Girish (Contributor, Author):

> There were some HTTP changes in Go: golang/go@f9cb33c
>
> However, I can't reproduce it on my host :/ [stress output elided]

I compiled the test binary with the -race flag enabled.
We see this flake only when race detection is enabled. @aojea Can you please try passing -race when compiling the binary for the stress tool?

I followed https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/flaky-tests.md#deflaking-unit-tests

@aojea (Member) commented Nov 24, 2021

:/ not a single failure in 20 mins

$ go test -run ^TestReconnectBroken k8s.io/client-go/rest -c -race                                    
$ stress ./rest.test -test.run TestHTTP1DoNotReuseRequestAfterTimeout -test.v
5s: 392 runs so far, 0 failures
10s: 816 runs so far, 0 failures

...
18m55s: 95126 runs so far, 0 failures
19m0s: 95545 runs so far, 0 failures
19m5s: 95966 runs so far, 0 failures

@Rajalakshmi-Girish (Contributor, Author):

@aojea Can you please try:

go test ./vendor/k8s.io/client-go/rest -c -race
stress ./rest.test -test.run TestHTTP1DoNotReuseRequestAfterTimeout -test.v

@Rajalakshmi-Girish (Contributor, Author):

Why was the binary generated with ^TestReconnectBroken when we are running the stress tool against TestHTTP1DoNotReuseRequestAfterTimeout?
Please help me understand if I am missing something.

@aojea (Member) commented Nov 24, 2021

> Why was the binary generated with ^TestReconnectBroken when we are running the stress tool against TestHTTP1DoNotReuseRequestAfterTimeout? Please help me understand if I am missing something.

I was lazy and carried over the CLI flags, but it doesn't matter: -run selects which tests the compiled binary executes at run time, it is not a build input, so go test -c generates the same binary either way:

$ go test -run ^TestReconnectBroken k8s.io/client-go/rest -c -race -o test_withflag
$ go test k8s.io/client-go/rest -c -race -o test_withoutflag
$ md5sum test_*
cda6ddc9d31f4c760c93b34ad78981dc  test_withflag
cda6ddc9d31f4c760c93b34ad78981dc  test_withoutflag

@Karthik-K-N (Contributor):

@aojea I have seen that there are no failures after increasing the timeout from 100 milliseconds to 200 milliseconds here.

Before increasing the timeout:

[root@rajalakshmi-workspace1 kubernetes]# stress ./rest.test -test.run TestHTTP1DoNotReuseRequestAfterTimeout

/tmp/go-stress-20211129T115302-2252016254
--- FAIL: TestHTTP1DoNotReuseRequestAfterTimeout (0.29s)
    --- FAIL: TestHTTP1DoNotReuseRequestAfterTimeout/HTTP1 (0.10s)
        request_test.go:2974: TEST Connected from 127.0.0.1:57294 on /foo
        request_test.go:3016: Unexpected error: Get "https://127.0.0.1:36767/foo?timeout=100ms": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
FAIL


ERROR: exit status 1

5s: 74 runs so far, 1 failures (1.35%)
10s: 158 runs so far, 1 failures (0.63%)

/tmp/go-stress-20211129T115302-927173386
--- FAIL: TestHTTP1DoNotReuseRequestAfterTimeout (0.30s)
    --- FAIL: TestHTTP1DoNotReuseRequestAfterTimeout/HTTP1 (0.10s)
        request_test.go:2974: TEST Connected from 127.0.0.1:56260 on /foo
        request_test.go:3016: Unexpected error: Get "https://127.0.0.1:33675/foo?timeout=100ms": context deadline exceeded
FAIL

After increasing the timeout:

5s: 54 runs so far, 0 failures
10s: 112 runs so far, 0 failures
15s: 169 runs so far, 0 failures
20s: 228 runs so far, 0 failures
25s: 291 runs so far, 0 failures
30s: 353 runs so far, 0 failures
35s: 413 runs so far, 0 failures
40s: 475 runs so far, 0 failures
45s: 537 runs so far, 0 failures
50s: 598 runs so far, 0 failures
55s: 659 runs so far, 0 failures
1m0s: 721 runs so far, 0 failures
1m5s: 784 runs so far, 0 failures
1m10s: 842 runs so far, 0 failures

Machine information

[root@rajalakshmi-workspace1 kubernetes]# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):           1
NUMA node(s):        1
Model:               2.2 (pvr 004e 0202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-7
Physical sockets:    2
Physical chips:      1
Physical cores/chip: 10

Is there any way I can find where it's taking longer, or why it needs the extra time?

@aojea (Member) commented Nov 29, 2021

> @aojea I have seen that there are no failures after increasing the timeout from 100 milliseconds to 200 milliseconds here. [stress output and machine information elided]

Maybe we can just bump the timeout in the test; it's not like we need such granularity, and we are talking about milliseconds only... maybe something in the standard library is racing?
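
To illustrate the failure mode and the kind of fix being proposed, a minimal, self-contained Go sketch (this is not the actual request_test.go code; the 90ms server delay and the exact numbers are made up for illustration):

package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// Stand-in for the test server: it responds just under the old 100ms
	// budget, so any client-side scheduling or TLS-handshake jitter (for
	// example under -race or on a loaded machine) can push the request past
	// the deadline, which is the flake mode seen in the traces above.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(90 * time.Millisecond)
		fmt.Fprintln(w, "ok")
	}))
	defer srv.Close()

	// Doubling the client timeout to 200ms keeps the test's intent (the
	// request must still complete promptly) while absorbing the jitter.
	client := &http.Client{Timeout: 200 * time.Millisecond}
	resp, err := client.Get(srv.URL)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	resp.Body.Close()
	fmt.Println("status:", resp.Status)
}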

@aojea (Member) commented Nov 29, 2021

Submitted #106716 to bump the timeout on this test; thanks for keeping at it.

SIG Node CI/Test Board automation moved this from Triage to Done Nov 29, 2021