
Investigate flaky test-http-server-request-timeout-keepalive #42741

Open
richardlau opened this issue Apr 14, 2022 · 20 comments · Fixed by #42846

Labels
flaky-test Issues and PRs related to the tests with unstable failures on the CI.

Comments

@richardlau
Member

Test

test-http-server-request-timeout-keepalive

Platform

FreeBSD, macOS

Console output

17:13:37 not ok 1424 parallel/test-http-server-request-timeout-keepalive
17:13:37   ---
17:13:37   duration_ms: 3.487
17:13:37   severity: fail
17:13:37   exitcode: 1
17:13:37   stack: |-
17:13:37     node:assert:123
17:13:37       throw new AssertionError(obj);
17:13:37       ^
17:13:37     
17:13:37     AssertionError [ERR_ASSERTION]: Expected values to be strictly equal:
17:13:37     + actual - expected
17:13:37     
17:13:37     + 'HTTP/1.1 200 OK\r\n' +
17:13:37     +   'Content-Type: text/plain\r\n' +
17:13:38     +   'Date: Thu, 14 Apr 2022 16:13:37 GMT\r\n' +
17:13:38     +   'Connection: close\r\n' +
17:13:38     +   'Transfer-Encoding: chunked\r\n' +
17:13:38     +   '\r\n' +
17:13:38     +   '0\r\n' +
17:13:38     +   '\r\n'
17:13:38     - ''
17:13:38         at Socket.<anonymous> (/usr/home/iojs/build/workspace/node-test-commit-freebsd/nodes/freebsd12-x64/test/parallel/test-http-server-request-timeout-keepalive.js:82:12)
17:13:38         at Socket.<anonymous> (/usr/home/iojs/build/workspace/node-test-commit-freebsd/nodes/freebsd12-x64/test/common/index.js:438:15)
17:13:38         at Socket.emit (node:events:539:35)
17:13:38         at endReadableNT (node:internal/streams/readable:1344:12)
17:13:38         at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
17:13:38       generatedMessage: true,
17:13:38       code: 'ERR_ASSERTION',
17:13:38       actual: 'HTTP/1.1 200 OK\r\n' +
17:13:38         'Content-Type: text/plain\r\n' +
17:13:38         'Date: Thu, 14 Apr 2022 16:13:37 GMT\r\n' +
17:13:38         'Connection: close\r\n' +
17:13:38         'Transfer-Encoding: chunked\r\n' +
17:13:38         '\r\n' +
17:13:38         '0\r\n' +
17:13:38         '\r\n',
17:13:38       expected: '',
17:13:38       operator: 'strictEqual'
17:13:38     }
17:13:38     
17:13:38     Node.js v18.0.0-pre
17:13:38   ...

Build links

Additional information

Test was recently updated by #41263.

cc @ShogunPanda
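
For context, the server side of this test boils down to something like the following sketch. It is simplified and not the actual test code; the createServer options mirror the diff quoted later in this thread, and the 2000 ms value is the base timeout the test scales with common.platformTimeout.

'use strict';
const { createServer } = require('http');

// Minimal sketch: a server that gives each request 2 seconds to arrive in
// full. A keep-alive connection whose next request trickles in more slowly
// than requestTimeout should be cut off instead of receiving a 200 OK,
// which is what the failing assertion above is checking for.
const requestTimeout = 2000;

const server = createServer({
  headersTimeout: 0,   // disable the separate headers timeout
  requestTimeout,      // per-request timeout in milliseconds
}, (req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end();
});

server.listen(0, () => {
  console.log(`listening on port ${server.address().port}`);
});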

@richardlau richardlau added the flaky-test label on Apr 14, 2022
@ShogunPanda
Contributor

I will take care of that tomorrow morning.

At first glance, I just have to fine-tune the timers.

@nodejs/build Can I get access to that BSD machine?

xtx1130 pushed a commit to xtx1130/node that referenced this issue Apr 25, 2022
Refs: nodejs#41263

PR-URL: nodejs#42846
Fixes: nodejs#42741
Reviewed-By: Rich Trott <rtrott@gmail.com>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Mohammed Keyvanzadeh <mohammadkeyvanzade94@gmail.com>
Reviewed-By: Darshan Sen <raisinten@gmail.com>
Reviewed-By: Paolo Insogna <paolo@cowtech.it>
@tniessen
Member

Looks like #42846 reduced the error rate by some amount but it's still flaky. We can probably further increase the base timeout and see if that helps, even if the test will become unbearably slow.

@ShogunPanda
Contributor

I have a further change in #42812 which should reduce it even more.
I'd suggest waiting for that to land and seeing what happens.

Anyway, given that the success rate is above 99%, I think we could consider closing this (since retries in CI are normal) and reopening only if there is a consistent failure.

@richardlau
Member Author

I have a further change in #42812 which should reduce it even more. I'd suggest waiting for that to land and seeing what happens.

Anyway, given that the success rate is above 99%, I think we could consider closing this (since retries in CI are normal) and reopening only if there is a consistent failure.

This issue is tracking that the test is flaky, which it still appears to be, so I'm reopening.

@richardlau richardlau reopened this Apr 25, 2022
@tniessen
Member

It's interesting that it seems to only happen on macOS and FreeBSD. I might run a few stress tests.

@tniessen
Member

tniessen commented Apr 26, 2022

Patch (assuming the issue is timing-related):

diff --git a/test/parallel/test-http-server-headers-timeout-keepalive.js b/test/parallel/test-http-server-headers-timeout-keepalive.js
index 05531087ed83..ba82cb0e0c9c 100644
--- a/test/parallel/test-http-server-headers-timeout-keepalive.js
+++ b/test/parallel/test-http-server-headers-timeout-keepalive.js
@@ -24,7 +24,7 @@ function performRequestWithDelay(client, firstDelay, secondDelay, closeAfter) {
   }, firstDelay + secondDelay).unref();
 }
 
-const headersTimeout = common.platformTimeout(2000);
+const headersTimeout = common.platformTimeout(5000);
 const server = createServer({
   headersTimeout,
   requestTimeout: 0,
diff --git a/test/parallel/test-http-server-request-timeout-keepalive.js b/test/parallel/test-http-server-request-timeout-keepalive.js
index 2466e1ee7a95..eb6d64b2f2bb 100644
--- a/test/parallel/test-http-server-request-timeout-keepalive.js
+++ b/test/parallel/test-http-server-request-timeout-keepalive.js
@@ -24,7 +24,7 @@ function performRequestWithDelay(client, firstDelay, secondDelay, closeAfter) {
   }, firstDelay + secondDelay).unref();
 }
 
-const requestTimeout = common.platformTimeout(2000);
+const requestTimeout = common.platformTimeout(5000);
 const server = createServer({
   headersTimeout: 0,
   requestTimeout,

First run (master and with patch):

                   osx11                                 freebsd12-x64
master             OK: 902   NOT OK: 98    TOTAL: 1000   OK: 863    NOT OK: 137   TOTAL: 1000
increased timeout  OK: 676   NOT OK: 324   TOTAL: 1000   OK: 1000   NOT OK: 0     TOTAL: 1000

Second run (master and with patch):

                   osx11                                 freebsd12-x64
master             OK: 910   NOT OK: 90    TOTAL: 1000   OK: 819    NOT OK: 181   TOTAL: 1000
increased timeout  OK: 713   NOT OK: 287   TOTAL: 1000   OK: 1000   NOT OK: 0     TOTAL: 1000

Unless I'm mistaken, increasing the timeout fixes the issues on FreeBSD but somehow makes it worse on macOS?

I started a third run with fresh revisions because I find this hard to believe (master and with patch).

Update: Third run (master and with patch):

                   osx11                                 freebsd12-x64
master             OK: 916   NOT OK: 84    TOTAL: 1000   OK: 838    NOT OK: 162   TOTAL: 1000
increased timeout  OK: 721   NOT OK: 279   TOTAL: 1000   OK: 1000   NOT OK: 0     TOTAL: 1000

@tniessen
Member

Median success rate (see the patch and the per-run tables above):

                   osx11    freebsd12-x64
master (2000ms)    91.0 %   83.8 %
patched (5000ms)   71.3 %   100 %

@ShogunPanda
Contributor

@richardlau Thanks, I think one commit closed this by mistake.

@ShogunPanda
Contributor

@tniessen This is very strange. I code on macOS and have never gotten a failure...

targos pushed a commit that referenced this issue Apr 28, 2022
Refs: #41263

PR-URL: #42846
Fixes: #42741
Reviewed-By: Rich Trott <rtrott@gmail.com>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Mohammed Keyvanzadeh <mohammadkeyvanzade94@gmail.com>
Reviewed-By: Darshan Sen <raisinten@gmail.com>
Reviewed-By: Paolo Insogna <paolo@cowtech.it>
@tniessen
Member

@richardlau Based on #42741 (comment), would you be okay with further increasing the timeout to fix the test on FreeBSD and with marking it as flaky on macOS? Do GitHub Actions respect our list of flaky tests?

@richardlau
Member Author

@richardlau Based on #42741 (comment), would you be okay with further increasing the timeout to fix the test on FreeBSD and with marking it as flaky on macOS?

I would be okay increasing the timeout and marking macOS flaky although we'd still keep this open for tracking as we do for other tests marked flaky.

Do GitHub actions respect our list of flaky tests?

They should do:

FLAKY_TESTS: dontcare

node/Makefile

Lines 526 to 530 in 68fb0bf

test-ci: | clear-stalled bench-addons-build build-addons build-js-native-api-tests build-node-api-tests doc-only
	out/Release/cctest --gtest_output=xml:out/junit/cctest.xml
	$(PYTHON) tools/test.py $(PARALLEL_ARGS) -p tap --logfile test.tap \
		--mode=$(BUILDTYPE_LOWER) --flaky-tests=$(FLAKY_TESTS) \
		$(TEST_CI_ARGS) $(CI_JS_SUITES) $(CI_NATIVE_SUITES) $(CI_DOC)

@richardlau
Member Author

I've no idea if it's the same underlying cause, but I've just spotted this on LinuxONE:
https://ci.nodejs.org/job/node-test-commit-linuxone/31851/nodes=rhel8-s390x/consoleFull

12:30:49 not ok 1273 parallel/test-http-server-request-timeout-keepalive
12:30:49   ---
12:30:49   duration_ms: 6.565
12:30:49   severity: fail
12:30:49   exitcode: 1
12:30:49   stack: |-
12:30:49     Mismatched <anonymous> function calls. Expected exactly 1, actual 2.
12:30:49         at Proxy.mustCall (/home/iojs/build/workspace/node-test-commit-linuxone/test/common/index.js:396:10)
12:30:49         at Server.<anonymous> (/home/iojs/build/workspace/node-test-commit-linuxone/test/parallel/test-http-server-request-timeout-keepalive.js:77:27)
12:30:49         at Server.<anonymous> (/home/iojs/build/workspace/node-test-commit-linuxone/test/common/index.js:438:15)
12:30:49         at Object.onceWrapper (node:events:641:28)
12:30:49         at Server.emit (node:events:527:28)
12:30:49         at emitListeningNT (node:net:1414:10)
12:30:49         at process.processTicksAndRejections (node:internal/process/task_queues:81:21)
12:30:49   ...

@ShogunPanda
Contributor

I think this might have been an error on my part. I have:

client.on('error', errOrEnd);
client.on('end', errOrEnd);

So the handler being called twice is legitimate if things are not closed fast enough.
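
A self-contained sketch of that double invocation and one hypothetical way to guard against it (an EventEmitter stands in for the client socket; this is not necessarily how it was addressed upstream):

'use strict';
const { EventEmitter } = require('events');

// Stand-in for the test's client socket: both 'error' and 'end' funnel
// into the same callback, so a connection that emits both events calls
// errOrEnd twice and trips a common.mustCall(errOrEnd) expectation of
// exactly one call.
const client = new EventEmitter();

let finished = false;
function errOrEnd(err) {
  if (finished) return;     // hypothetical guard: ignore the second event
  finished = true;
  console.log('errOrEnd ran once, triggered by', err ? 'error' : 'end');
}

client.on('error', errOrEnd);
client.on('end', errOrEnd);

// Simulate the race seen on the slow machine: the socket errors out and
// the readable side also ends. Without the guard, errOrEnd runs twice.
client.emit('error', new Error('socket hang up'));
client.emit('end');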

@ShogunPanda
Contributor

ShogunPanda commented Apr 29, 2022

Should I mark this test and all the tests in #42893 as flaky (without raising the timeout)?
They all rely on timers, so I think we're going to have a hard time guaranteeing they don't fail on slow platforms...

(Just to clarify: they pass on most platforms, so I don't think we're ignoring a bug.)
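
To illustrate why these tests are so sensitive to machine speed, here is a rough, hypothetical sketch of the split-request pattern they use. It is loosely based on the performRequestWithDelay helper visible in the diff above; the port, headers, and delay values are placeholders, not the actual test values.

'use strict';
const net = require('net');

// Send a single request in pieces so that the gap between the pieces is
// what the server's requestTimeout has to measure. On a loaded CI machine
// the real elapsed time drifts away from firstDelay/secondDelay, so the
// total can land on either side of the timeout and flip the test outcome.
function performRequestWithDelay(client, firstDelay, secondDelay) {
  client.write('GET / HTTP/1.1\r\nHost: localhost\r\n');

  setTimeout(() => {
    client.write('Connection: keep-alive\r\n');
  }, firstDelay).unref();

  setTimeout(() => {
    client.write('\r\n');          // finally complete the request
  }, firstDelay + secondDelay).unref();
}

// Assumes a server with a ~2000 ms requestTimeout, such as the sketch
// earlier in this thread, is listening on PORT.
const PORT = 8080;
const client = net.connect(PORT, () => {
  performRequestWithDelay(client, 1500, 1000); // total just over 2000 ms
});
client.setEncoding('utf8');
client.on('data', (chunk) => console.log('received:', JSON.stringify(chunk)));
client.on('error', (err) => console.error(err.message));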

@tniessen
Member

I would be okay increasing the timeout and marking macOS flaky although we'd still keep this open for tracking as we do for other tests marked flaky.

@richardlau I opened #42962 to mark the two known problematic tests as flaky on macOS. Increasing the timeout seems to fix the FreeBSD failures.

They all rely on timers, so I think we're going to have a hard time guaranteeing they don't fail on slow platforms...

The strange part is that the error rate seems to increase when the timeout increases (see #42741 (comment)), so I am not sure if macOS being slow is the issue here.

tniessen added a commit to tniessen/node that referenced this issue Apr 30, 2022
aduh95 pushed a commit that referenced this issue May 2, 2022
Refs: #42741

PR-URL: #42926
Reviewed-By: Michaël Zasso <targos@protonmail.com>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
@ShogunPanda
Contributor

ShogunPanda commented May 3, 2022

@tniessen You are right. I actually spotted a bug while working on #42812; see the changes in https://github.com/nodejs/node/pull/42812/files#diff-bbcf7a115fe4680f3a7c076a487c89bd4ce935e7d2a726cfcb3f69331705e25eR260.

Do you mind trying again, first with 2 and then with 5 seconds, once that lands?

Both related PRs are now in master. Can you please try again with 2 and/or 5 seconds?

@ShogunPanda
Contributor

@tniessen Any update on this? Are tests improving?

RafaelGSS pushed a commit that referenced this issue May 10, 2022
Refs: #42741

PR-URL: #42926
Reviewed-By: Michaël Zasso <targos@protonmail.com>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
@joyeecheung
Member

It seems the flake is still there: nodejs/reliability#273

Reason: parallel/test-http-server-request-timeouts-mixed
Type: JS_TEST_FAILURE
Failed PR: 13 (#42960, #42963, #42623, #42968, #42867, #37769, #42423, #42601, #43005, #42970, #43036, #42725, #43023)
Appeared: test-orka-macos10.15-x64-1, test-macstadium-macos11.0-arm64-3, test-orka-macos11-x64-2, test-nearform-macos10.15-x64-1, test-orka-macos11-x64-1
First CI: https://ci.nodejs.org/job/node-test-pull-request/43865/
Last CI: https://ci.nodejs.org/job/node-test-pull-request/43960/
Example:
not ok 1166 parallel/test-http-server-request-timeouts-mixed
  ---
  duration_ms: 3.255
  severity: fail
  exitcode: 1
  stack: |-
    node:assert:399
        throw err;
        ^
    
    AssertionError [ERR_ASSERTION]: The expression evaluated to a falsy value:
    
      assert(request2.completed)
    
        at Timeout._onTimeout (/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx1015/test/parallel/test-http-server-request-timeouts-mixed.js:106:5)
        at listOnTimeout (node:internal/timers:564:17)
        at process.processTimers (node:internal/timers:507:7) {
      generatedMessage: true,
      code: 'ERR_ASSERTION',
      actual: false,
      expected: true,
      operator: '=='
    }
    
    Node.js v19.0.0-pre
  ...

juanarbol pushed a commit that referenced this issue May 31, 2022
Refs: #41263

PR-URL: #42846
Fixes: #42741
Reviewed-By: Rich Trott <rtrott@gmail.com>
Reviewed-By: Luigi Pinca <luigipinca@gmail.com>
Reviewed-By: Mohammed Keyvanzadeh <mohammadkeyvanzade94@gmail.com>
Reviewed-By: Darshan Sen <raisinten@gmail.com>
Reviewed-By: Paolo Insogna <paolo@cowtech.it>
@aduh95 aduh95 reopened this Jun 28, 2022
@LiviaMedeiros
Contributor

I just figured out what happened here.
The issue was mistakenly closed by a cross-repository reference: a commit carrying a Fixes: reference to this issue was pushed to another repository.
That push contained up to 567 of the latest commits (starting from 2022-04-01) from the node repository, rebased on top of test commits.

Checking with curl https://api.github.com/users/LiviaMedeiros/events?per_page=100 | jq '.[]|select(.type=="IssuesEvent")' confirmed that only this issue was affected.

I deeply apologize for this embarrassing incident. In the future, I'll keep anything like that on a separate GitHub account.
Please let me know if there are other side effects or anything to undo.

@aduh95 thanks for reopening this.
Pinging you as a TSC member (in case any action on my mistake is applicable).

@targos
Member

targos commented Sep 28, 2022

I got a failure in #44741 (comment)
