
[Flaky Test][sig-node] kubernetes-unit-test TestHTTP1DoNotReuseRequestAfterTimeout is being Flaky #106569

Closed · Rajalakshmi-Girish opened this issue Nov 20, 2021 · 23 comments · Fixed by #106716
Labels: kind/flake, needs-triage, sig/node

Comments

@Rajalakshmi-Girish (Contributor) commented Nov 20, 2021

Which jobs are flaking?

periodic-kubernetes-unit-test-ppc64le

Which tests are flaking?

  1. vendor/k8s.io/client-go/rest.TestHTTP1DoNotReuseRequestAfterTimeout
  2. vendor/k8s.io/client-go/rest.TestHTTP1DoNotReuseRequestAfterTimeout/HTTP1
  3. vendor/k8s.io/client-go/rest.TestHTTP1DoNotReuseRequestAfterTimeout/HTTP2

Since when has it been flaking?

16th November 2021

Testgrid link

https://k8s-testgrid.appspot.com/sig-node-ppc64le#unit-tests

Reason for failure (if possible)

This test was added by #104844.

It seems to have been flaky since it was added.

Anything else we need to know?

Below is the trace from the job:

 === FAIL: vendor/k8s.io/client-go/rest TestHTTP1DoNotReuseRequestAfterTimeout/HTTP1 (0.13s)
    request_test.go:3016: Unexpected error: Get "https://127.0.0.1:34701/foo?timeout=100ms": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2021/11/20 12:06:03 http: TLS handshake error from 127.0.0.1:37890: read tcp 127.0.0.1:34701->127.0.0.1:37890: use of closed network connection
    --- FAIL: TestHTTP1DoNotReuseRequestAfterTimeout/HTTP1 (0.13s)
=== FAIL: vendor/k8s.io/client-go/rest TestHTTP1DoNotReuseRequestAfterTimeout (0.30s) 

Relevant SIG(s)

/sig node

@Rajalakshmi-Girish added the kind/flake label Nov 20, 2021
@k8s-ci-robot added the sig/node and needs-triage labels Nov 20, 2021
@k8s-ci-robot (Contributor):

@Rajalakshmi-Girish: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Rajalakshmi-Girish (Contributor, Author):

Proof of flakiness:
[image]

@Rajalakshmi-Girish changed the title from [Flaky Test][sig-node-ppc64le] TestHTTP1DoNotReuseRequestAfterTimeout to [Flaky Test][sig-node-ppc64le] kubernetes-unit-test TestHTTP1DoNotReuseRequestAfterTimeout is being Flaky Nov 20, 2021
@Rajalakshmi-Girish changed the title to [Flaky Test][sig-node] kubernetes-unit-test TestHTTP1DoNotReuseRequestAfterTimeout is being Flaky Nov 20, 2021
@Rajalakshmi-Girish (Contributor, Author):

@aojea ^^

@aojea (Member) commented Nov 20, 2021

/assign

@aojea (Member) commented Nov 20, 2021

Interesting: it doesn't fail in any of the other jobs; it seems to flake only on ppc64le.

https://storage.googleapis.com/k8s-triage/index.html?pr=1&test=TestHTTP

Maybe a Go issue?

golang/go#49680
golang/go#44304

@aojea (Member) commented Nov 20, 2021

I don't have access to a ppc64le system, but it would be good to run the test with the stress tool:

# compile the test package into a test binary
$ go test -run ^TestReconnectBroken k8s.io/client-go/rest -c
# run the test using https://pkg.go.dev/golang.org/x/tools/cmd/stress
$ stress ./rest.test -test.run ^TestReconnectBroken
5s: 24 runs so far, 0 failures
10s: 72 runs so far, 0 failures
15s: 96 runs so far, 0 failures
20s: 144 runs so far, 0 failures
25s: 192 runs so far, 0 failures
30s: 216 runs so far, 0 failures
35s: 264 runs so far, 0 failures
40s: 312 runs so far, 0 failures
45s: 336 runs so far, 0 failures
50s: 384 runs so far, 0 failures
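
For anyone trying to reproduce locally, the stress tool referenced above can be installed with (assuming a Go 1.16+ toolchain):

# install the stress binary into $(go env GOPATH)/bin
go install golang.org/x/tools/cmd/stress@latest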

@ehashman added this to Triage in SIG Node CI/Test Board Nov 22, 2021
@ehashman (Member):

This testgrid tab is slated for removal as the job is not maintained: kubernetes/test-infra#23249

Conformance tests are required to pass on ppc64le which are tracked here: https://testgrid.k8s.io/conformance-ppc64le#Periodic%20ppc64le%20conformance%20test%20on%20local%20cluster

But I don't think we necessarily have support for unit tests.

@Rajalakshmi-Girish (Contributor, Author):

> This testgrid tab is slated for removal as the job is not maintained: kubernetes/test-infra#23249
>
> Conformance tests are required to pass on ppc64le which are tracked here: https://testgrid.k8s.io/conformance-ppc64le#Periodic%20ppc64le%20conformance%20test%20on%20local%20cluster
>
> But I don't think we necessarily have support for unit tests.

Yes, the conformance dashboard under the sig-node-ppc64le testgrid tab can be removed, since it duplicates the conformance-ppc64le tab.

But we need to find a better home for the unit-tests dashboard; removing it altogether as part of the sig-node testgrid cleanup is not right!

Kubernetes unit tests still need to run on the ppc64le architecture.

@aojea (Member) commented Nov 23, 2021

> Kubernetes unit tests still need to run on the ppc64le architecture.

It's not only about having a job; there also have to be resources to investigate. This case is a good example: the test works perfectly on x86 :/ #106569 (comment)

What is the next step? Who will investigate the Go toolchain and the specifics of the ppc64le environment?

@mkumatag (Member):

> This testgrid tab is slated for removal as the job is not maintained: kubernetes/test-infra#23249

Well, we have been running these tests for years and helping to fix them (we found a few generic flaky tests as well); they just live in the wrong group and will find a new home soon :)

> Conformance tests are required to pass on ppc64le which are tracked here: https://testgrid.k8s.io/conformance-ppc64le#Periodic%20ppc64le%20conformance%20test%20on%20local%20cluster
>
> But I don't think we necessarily have support for unit tests.

I feel unit tests give us a lot of confidence in the code; it's worth identifying the issues.

@mkumatag (Member):

> Kubernetes unit tests still need to run on the ppc64le architecture.

> It's not only about having a job; there also have to be resources to investigate. This case is a good example: the test works perfectly on x86 :/ #106569 (comment)
>
> What is the next step? Who will investigate the Go toolchain and the specifics of the ppc64le environment?

Completely agree; we need resources to debug and identify the issues.

@Rajalakshmi-Girish (Contributor, Author):

@aojea When I run it with the stress tool, the test flakes on both ppc64le and x86_64.

  • on ppc64le:
[root@rajalakshmi-workspace1 kubernetes]# stress ./rest.test -test.run TestHTTP1DoNotReuseRequestAfterTimeout

/tmp/go-stress-20211123T070635-1678334821
--- FAIL: TestHTTP1DoNotReuseRequestAfterTimeout (0.45s)
    --- FAIL: TestHTTP1DoNotReuseRequestAfterTimeout/HTTP1 (0.28s)
        request_test.go:2974: TEST Connected from 127.0.0.1:47712 on /foo
        request_test.go:2974: TEST Connected from 127.0.0.1:47712 on /hang
        request_test.go:2976: TEST hanging 127.0.0.1:47712
        request_test.go:2974: TEST Connected from 127.0.0.1:47730 on /foo
        request_test.go:3030: Unexpected error: Get "https://127.0.0.1:35107/foo?timeout=100ms": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
FAIL


ERROR: exit status 1


/tmp/go-stress-20211123T070635-1511241200
2021/11/23 07:06:38 http: TLS handshake error from 127.0.0.1:52740: read tcp 127.0.0.1:36005->127.0.0.1:52740: read: connection reset by peer
--- FAIL: TestHTTP1DoNotReuseRequestAfterTimeout (0.29s)
    --- FAIL: TestHTTP1DoNotReuseRequestAfterTimeout/HTTP1 (0.11s)
        request_test.go:3016: Unexpected error: Get "https://127.0.0.1:36005/foo?timeout=100ms": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
FAIL


ERROR: exit status 1

5s: 72 runs so far, 2 failures (2.78%)
  • on x86_64:
[root@toad1 kubernetes]# stress ./rest.test -test.run TestHTTP1DoNotReuseRequestAfterTimeout
5s: 50 runs so far, 0 failures

/tmp/go-stress-20211123T025122-1432112823
--- FAIL: TestHTTP1DoNotReuseRequestAfterTimeout (0.27s)
    --- FAIL: TestHTTP1DoNotReuseRequestAfterTimeout/HTTP2 (0.14s)
        request_test.go:3016: Unexpected error: Get "https://127.0.0.1:44741/foo?timeout=100ms": context deadline exceeded
        request_test.go:2974: TEST Connected from 127.0.0.1:38406 on /foo
FAIL


ERROR: exit status 1

10s: 101 runs so far, 1 failures (0.99%)
15s: 156 runs so far, 1 failures (0.64%)
20s: 204 runs so far, 1 failures (0.49%)
25s: 250 runs so far, 1 failures (0.40%)
30s: 298 runs so far, 1 failures (0.34%)
35s: 351 runs so far, 1 failures (0.28%)
40s: 402 runs so far, 1 failures (0.25%)
45s: 454 runs so far, 1 failures (0.22%)
50s: 501 runs so far, 1 failures (0.20%)
55s: 553 runs so far, 1 failures (0.18%)
1m0s: 605 runs so far, 1 failures (0.17%)
1m5s: 655 runs so far, 1 failures (0.15%)

/tmp/go-stress-20211123T025122-2483598037
2021/11/23 02:52:32 http: TLS handshake error from 127.0.0.1:51906: read tcp 127.0.0.1:39957->127.0.0.1:51906: use of closed network connection
--- FAIL: TestHTTP1DoNotReuseRequestAfterTimeout (0.37s)
    --- FAIL: TestHTTP1DoNotReuseRequestAfterTimeout/HTTP1 (0.24s)
        request_test.go:2974: TEST Connected from 127.0.0.1:51894 on /foo
        request_test.go:2974: TEST Connected from 127.0.0.1:51894 on /hang
        request_test.go:2976: TEST hanging 127.0.0.1:51894
        request_test.go:3030: Unexpected error: Get "https://127.0.0.1:39957/foo?timeout=100ms": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
FAIL


ERROR: exit status 1

1m10s: 706 runs so far, 2 failures (0.28%)

@aojea The Go version was:

[root@toad1 kubernetes]# go version
go version go1.17.3 linux/amd64
[root@toad1 kubernetes]#

@aojea (Member) commented Nov 23, 2021

> @aojea When I run it with the stress tool, the test flakes on both ppc64le and x86_64.

Yeah, that is the thing; we had this situation:

same test, same Go version, and it only flaked on ppc64le.

It is legitimate to think that the architecture could be the problem here; now, who can rule the architecture out?

BTW, it seems Go has some reported architecture-specific issues that may be related: #106569 (comment)

EDIT

Wait, I didn't see that it failed on x86 too 👀

@aojea (Member) commented Nov 23, 2021

There were some HTTP changes in Go: golang/go@f9cb33c

However, I can't reproduce it on my host :/

stress ./rest.test -test.run TestHTTP1DoNotReuseRequestAfterTimeout -test.v
5s: 538 runs so far, 0 failures
10s: 1096 runs so far, 0 failures
15s: 1656 runs so far, 0 failures
20s: 2208 runs so far, 0 failures
25s: 2760 runs so far, 0 failures
30s: 3312 runs so far, 0 failures
35s: 3867 runs so far, 0 failures
40s: 4419 runs so far, 0 failures
45s: 4977 runs so far, 0 failures
go version
go version go1.17.3 linux/amd64
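
A load-dependent flake like this is sometimes easier to reproduce by oversubscribing the CPU; assuming the stock x/tools stress flags, something like:

# run more concurrent test processes than cores to create scheduler pressure
# (-p defaults to the number of CPUs)
stress -p 16 ./rest.test -test.run TestHTTP1DoNotReuseRequestAfterTimeout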

@Rajalakshmi-Girish (Contributor, Author) commented Nov 24, 2021

@aojea Is this somehow related to golang/go#49741?
Can you please take a look?

@Rajalakshmi-Girish (Contributor, Author):

> There were some HTTP changes in Go: golang/go@f9cb33c
>
> However, I can't reproduce it on my host :/ [stress output elided]

I compiled the test binary with the -race flag enabled.
We see this flake only when race detection is enabled. @aojea Can you please try passing -race when compiling the binary for the stress tool?

I followed https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/flaky-tests.md#deflaking-unit-tests

@aojea (Member) commented Nov 24, 2021

:/ not a single failure in 20 mins

$ go test -run ^TestReconnectBroken k8s.io/client-go/rest -c -race                                    
$ stress ./rest.test -test.run TestHTTP1DoNotReuseRequestAfterTimeout -test.v
5s: 392 runs so far, 0 failures
10s: 816 runs so far, 0 failures

...
18m55s: 95126 runs so far, 0 failures
19m0s: 95545 runs so far, 0 failures
19m5s: 95966 runs so far, 0 failures

@Rajalakshmi-Girish (Contributor, Author):

@aojea Can you please try:

go test ./vendor/k8s.io/client-go/rest -c -race
stress ./rest.test -test.run TestHTTP1DoNotReuseRequestAfterTimeout -test.v

@Rajalakshmi-Girish (Contributor, Author):

Why was the binary generated with ^TestReconnectBroken when we are running the stress tool against TestHTTP1DoNotReuseRequestAfterTimeout?
Please help me understand if I am missing something.

@aojea (Member) commented Nov 24, 2021

> Why was the binary generated with ^TestReconnectBroken when we are running the stress tool against TestHTTP1DoNotReuseRequestAfterTimeout? Please help me understand if I am missing something.

I was lazy and carried over the CLI flags, but it doesn't matter: -run selects which tests the compiled binary executes at run time, it is not a build input, so go test -c generates the same binary either way:

$ go test -run ^TestReconnectBroken k8s.io/client-go/rest -c -race -o test_withflag
$ go test k8s.io/client-go/rest -c -race -o test_withoutflag
$ md5sum test_*
cda6ddc9d31f4c760c93b34ad78981dc  test_withflag
cda6ddc9d31f4c760c93b34ad78981dc  test_withoutflag

@Karthik-K-N (Contributor):

@aojea I have seen that there are no failures after increasing the timeout from 100 milliseconds to 200 milliseconds here.

Before increasing the timeout:

[root@rajalakshmi-workspace1 kubernetes]# stress ./rest.test -test.run TestHTTP1DoNotReuseRequestAfterTimeout

/tmp/go-stress-20211129T115302-2252016254
--- FAIL: TestHTTP1DoNotReuseRequestAfterTimeout (0.29s)
    --- FAIL: TestHTTP1DoNotReuseRequestAfterTimeout/HTTP1 (0.10s)
        request_test.go:2974: TEST Connected from 127.0.0.1:57294 on /foo
        request_test.go:3016: Unexpected error: Get "https://127.0.0.1:36767/foo?timeout=100ms": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
FAIL


ERROR: exit status 1

5s: 74 runs so far, 1 failures (1.35%)
10s: 158 runs so far, 1 failures (0.63%)

/tmp/go-stress-20211129T115302-927173386
--- FAIL: TestHTTP1DoNotReuseRequestAfterTimeout (0.30s)
    --- FAIL: TestHTTP1DoNotReuseRequestAfterTimeout/HTTP1 (0.10s)
        request_test.go:2974: TEST Connected from 127.0.0.1:56260 on /foo
        request_test.go:3016: Unexpected error: Get "https://127.0.0.1:33675/foo?timeout=100ms": context deadline exceeded
FAIL

After increasing the timeout:

5s: 54 runs so far, 0 failures
10s: 112 runs so far, 0 failures
15s: 169 runs so far, 0 failures
20s: 228 runs so far, 0 failures
25s: 291 runs so far, 0 failures
30s: 353 runs so far, 0 failures
35s: 413 runs so far, 0 failures
40s: 475 runs so far, 0 failures
45s: 537 runs so far, 0 failures
50s: 598 runs so far, 0 failures
55s: 659 runs so far, 0 failures
1m0s: 721 runs so far, 0 failures
1m5s: 784 runs so far, 0 failures
1m10s: 842 runs so far, 0 failures

Machine information

[root@rajalakshmi-workspace1 kubernetes]# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):           1
NUMA node(s):        1
Model:               2.2 (pvr 004e 0202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
NUMA node0 CPU(s):   0-7
Physical sockets:    2
Physical chips:      1
Physical cores/chip: 10

Is there any way I can find where it's taking longer, or why it needs the extra time?

@aojea (Member) commented Nov 29, 2021

> @aojea I have seen that there are no failures after increasing the timeout from 100 milliseconds to 200 milliseconds here. [stress output and machine information elided]

Maybe we can just bump the timeout in the test; it's not like we need such granularity, and we are talking about milliseconds only... maybe something in the standard library is racing?
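
To illustrate the failure mode and the kind of fix being proposed, a minimal, self-contained Go sketch (this is not the actual request_test.go code; the 90ms server delay and the exact numbers are made up for illustration):

package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// Stand-in for the test server: it responds just under the old 100ms
	// budget, so any client-side scheduling or TLS-handshake jitter (for
	// example under -race or on a loaded machine) can push the request past
	// the deadline, which is the flake mode seen in the traces above.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(90 * time.Millisecond)
		fmt.Fprintln(w, "ok")
	}))
	defer srv.Close()

	// Doubling the client timeout to 200ms keeps the test's intent (the
	// request must still complete promptly) while absorbing the jitter.
	client := &http.Client{Timeout: 200 * time.Millisecond}
	resp, err := client.Get(srv.URL)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	resp.Body.Close()
	fmt.Println("status:", resp.Status)
}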

@aojea (Member) commented Nov 29, 2021

Submitted #106716 to bump the timeout on this test; thanks for keeping at it.

SIG Node CI/Test Board automation moved this from Triage to Done Nov 29, 2021