Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate flaky test-worker-syntax-error-file #22762

Closed
Trott opened this issue Sep 8, 2018 · 12 comments
Closed

Investigate flaky test-worker-syntax-error-file #22762

Trott opened this issue Sep 8, 2018 · 12 comments
Labels
flaky-test Issues and PRs related to the tests with unstable failures on the CI. worker Issues and PRs related to Worker support.

Comments

@Trott
Copy link
Member

Trott commented Sep 8, 2018

  • Version: 11.0.0-pre (master)
  • Platform: win10 vs2017
  • Subsystem: test worker

https://ci.nodejs.org/job/node-test-binary-windows/19875/COMPILED_BY=vs2017,RUNNER=win10,RUN_SUBSET=3/console

01:14:51 not ok 518 parallel/test-worker-syntax-error-file
01:14:51   ---
01:14:51   duration_ms: 0.211
01:14:51   severity: fail
01:14:51   exitcode: 1
01:14:51   stack: |-
01:14:51     Mismatched <anonymous> function calls. Expected exactly 1, actual 0.
01:14:51         at Object.mustCall (c:\workspace\node-test-binary-windows\test\common\index.js:312:10)
01:14:51         at Object.<anonymous> (c:\workspace\node-test-binary-windows\test\parallel\test-worker-syntax-error-file.js:13:24)
01:14:51         at Module._compile (internal/modules/cjs/loader.js:689:30)
01:14:51         at Object.Module._extensions..js (internal/modules/cjs/loader.js:700:10)
01:14:51         at Module.load (internal/modules/cjs/loader.js:599:32)
01:14:51         at tryModuleLoad (internal/modules/cjs/loader.js:538:12)
01:14:51         at Function.Module._load (internal/modules/cjs/loader.js:530:3)
01:14:51         at Function.Module.runMain (internal/modules/cjs/loader.js:742:12)
01:14:51         at startup (internal/bootstrap/node.js:270:19)
01:14:51   ...

@nodejs/workers

@Trott Trott added the flaky-test Issues and PRs related to the tests with unstable failures on the CI. label Sep 8, 2018
@addaleax addaleax added the worker Issues and PRs related to Worker support. label Sep 8, 2018
@addaleax
Copy link
Member

addaleax commented Sep 9, 2018

I have a hard time attempting to reproduce this… do people on @nodejs/platform-windows have more luck on Windows?

@bzoz
Copy link
Contributor

bzoz commented Sep 13, 2018

Cannot reproduce on Win10

@joyeecheung
Copy link
Member

Not exactly sure but this seems to be related:

Reason parallel/test-worker-syntax-error
Type JS_TEST_FAILURE
Failed PR 2 (#22843, #22239)
Appeared test-azure_msft-win10-x64-3, test-rackspace-win2012r2-x64-1
First CI https://ci.nodejs.org/job/node-test-pull-request/17202/
Last CI https://ci.nodejs.org/job/node-test-pull-request/17215/
Example
not ok 521 parallel/test-worker-syntax-error
  ---
  duration_ms: 0.195
  severity: fail
  exitcode: 1
  stack: |-
    Mismatched <anonymous> function calls. Expected exactly 1, actual 0.
        at Object.mustCall (c:\workspace\node-test-binary-windows\test\common\index.js:310:10)
        at Object.<anonymous> (c:\workspace\node-test-binary-windows\test\parallel\test-worker-syntax-error.js:12:24)
        at Module._compile (internal/modules/cjs/loader.js:693:30)
        at Object.Module._extensions..js (internal/modules/cjs/loader.js:704:10)
        at Module.load (internal/modules/cjs/loader.js:603:32)
        at tryModuleLoad (internal/modules/cjs/loader.js:542:12)
        at Function.Module._load (internal/modules/cjs/loader.js:534:3)
        at Function.Module.runMain (internal/modules/cjs/loader.js:746:12)
        at startup (internal/bootstrap/node.js:270:19)
  ...

refack pushed a commit to JaneaSystems/node that referenced this issue Sep 19, 2018
Refs: nodejs#20750
Refs: nodejs#22327
Refs: nodejs#22762
Refs: nodejs/reliability#16

PR-URL: nodejs#22941
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Refael Ackermann <refack@gmail.com>
targos pushed a commit that referenced this issue Sep 20, 2018
Refs: #20750
Refs: #22327
Refs: #22762
Refs: nodejs/reliability#16

PR-URL: #22941
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Refael Ackermann <refack@gmail.com>
@Trott
Copy link
Member Author

Trott commented Oct 24, 2018

https://ci.nodejs.org/job/node-test-commit-linux/22621/nodes=centos7-64-gcc6/console

22:48:52 not ok 2016 parallel/test-worker-syntax-error-file
22:48:52   ---
22:48:52   duration_ms: 0.343
22:48:52   severity: fail
22:48:52   exitcode: 1
22:48:52   stack: |-
22:48:52     Mismatched <anonymous> function calls. Expected exactly 1, actual 0.
22:48:52         at Object.mustCall (/home/iojs/build/workspace/node-test-commit-linux/nodes/centos7-64-gcc6/test/common/index.js:306:10)
22:48:52         at Object.<anonymous> (/home/iojs/build/workspace/node-test-commit-linux/nodes/centos7-64-gcc6/test/parallel/test-worker-syntax-error-file.js:13:24)
22:48:52         at Module._compile (internal/modules/cjs/loader.js:707:30)
22:48:52         at Object.Module._extensions..js (internal/modules/cjs/loader.js:718:10)
22:48:52         at Module.load (internal/modules/cjs/loader.js:605:32)
22:48:52         at tryModuleLoad (internal/modules/cjs/loader.js:544:12)
22:48:52         at Function.Module._load (internal/modules/cjs/loader.js:536:3)
22:48:52         at Function.Module.runMain (internal/modules/cjs/loader.js:760:12)
22:48:52         at startup (internal/bootstrap/node.js:304:19)
22:48:52   ...

@refack
Copy link
Contributor

refack commented Nov 19, 2018

AIX
https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix61-ppc64/18988/
https://ci.nodejs.org/computer/test-osuosl-aix61-ppc64_be-2/

Mismatched <anonymous> function calls. Expected exactly 1, actual 0.
    at Object.mustCall (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/common/index.js:312:10)
    at Object.<anonymous> (/home/iojs/build/workspace/node-test-commit-aix/nodes/aix61-ppc64/test/parallel/test-worker-syntax-error.js:12:24)
    at Module._compile (internal/modules/cjs/loader.js:722:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:733:10)
    at Module.load (internal/modules/cjs/loader.js:620:32)
    at tryModuleLoad (internal/modules/cjs/loader.js:560:12)
    at Function.Module._load (internal/modules/cjs/loader.js:552:3)
    at Function.Module.runMain (internal/modules/cjs/loader.js:775:12)
    at startup (internal/bootstrap/node.js:300:19)

@addaleax
Copy link
Member

addaleax commented Feb 3, 2019

Couldn't reproduce this in 10000 runs on either x64 Linux or x64 Windows.

@Trott
Copy link
Member Author

Trott commented Feb 3, 2019

ncu-ci walk commit isn't pulling this up. I'm going to close this, but feel free to re-open if there's reason to believe it's still a bug. There's a good chance other fixes that have landed have also fixed this.

@Trott Trott closed this as completed Feb 3, 2019
@Trott Trott reopened this Apr 29, 2019
@Trott
Copy link
Member Author

Trott commented Apr 29, 2019

It's back.

https://ci.nodejs.org/job/node-test-commit-linux/27284/nodes=centos7-64-gcc6/console

test-rackspace-centos7-x64-1

00:13:00 not ok 2142 parallel/test-worker-syntax-error-file
00:13:00   ---
00:13:00   duration_ms: 0.487
00:13:00   severity: fail
00:13:00   exitcode: 1
00:13:00   stack: |-
00:13:00     Mismatched <anonymous> function calls. Expected exactly 1, actual 0.
00:13:00         at Object.mustCall (/home/iojs/build/workspace/node-test-commit-linux/nodes/centos7-64-gcc6/test/common/index.js:339:10)
00:13:00         at Object.<anonymous> (/home/iojs/build/workspace/node-test-commit-linux/nodes/centos7-64-gcc6/test/parallel/test-worker-syntax-error-file.js:12:24)
00:13:00         at Module._compile (internal/modules/cjs/loader.js:759:30)
00:13:00         at Object.Module._extensions..js (internal/modules/cjs/loader.js:770:10)
00:13:00         at Module.load (internal/modules/cjs/loader.js:628:32)
00:13:00         at Function.Module._load (internal/modules/cjs/loader.js:555:12)
00:13:00         at Function.Module.runMain (internal/modules/cjs/loader.js:824:10)
00:13:00         at internal/main/run_main_module.js:17:11
00:13:00   ...

@addaleax
Copy link
Member

@Trott I don’t think this can be fixed without somebody logging into the CI machines and figuring out where one can get a reasonable reproduction rate, and then work on that machine

(Debug logs might also be helpful, esp. NODE_DEBUG_NATIVE=message,worker)

@Trott
Copy link
Member Author

Trott commented Apr 29, 2019

Was able to replicate it on test-rackspace-centos7-x64-1 with ./tools/test.py -J --repeat 1000 test/parallel/test-worker-syntax-error-file.js in /home/iojs/build/workspace/node-test-commit-linux/nodes/centos7-64-gcc6. Had to get through about 200 runs before it failed.

Running it prefaced with NODE_DEBUG_NATIVE=message,worker generates this output upon failure:

=== release test-worker-syntax-error-file ===                   
Path: parallel/test-worker-syntax-error-file
Mismatched <anonymous> function calls. Expected exactly 1, actual 0.
    at Object.mustCall (/home/iojs/build/workspace/node-test-commit-linux/nodes/centos7-64-gcc6/test/common/index.js:339:10)
    at Object.<anonymous> (/home/iojs/build/workspace/node-test-commit-linux/nodes/centos7-64-gcc6/test/parallel/test-worker-syntax-error-file.js:12:24)
    at Module._compile (internal/modules/cjs/loader.js:759:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:770:10)
    at Module.load (internal/modules/cjs/loader.js:628:32)
    at Function.Module._load (internal/modules/cjs/loader.js:555:12)
    at Function.Module.runMain (internal/modules/cjs/loader.js:824:10)
    at internal/main/run_main_module.js:17:11
Worker (0:2) Creating new worker instance with thread id 1
MessagePort (0:3) Created message port
Worker (0:2) Preparation for worker 1 finished
MessagePort (0:3) Start receiving messages
MessagePort (0:3) Start receiving messages
MessagePort (0:6) Created message port
MessagePort (0:7) Created message port
MessagePort (0:6) Start receiving messages
MessagePort (0:7) Closing message port, data set = 1
MessagePort (0:6) Start receiving messages
MessagePort (0:7) MessagePort::OnClose()
Worker (0:2) Creating isolate for worker with id 1
Worker (0:2) Starting worker with id 1
Worker (0:2) Created Environment for worker with id 1
MessagePort (1:2) Created message port
Worker (0:2) Created message port for worker 1
MessagePort (1:2) Start receiving messages
MessagePort (1:2) Start receiving messages
Worker (0:2) Loaded environment for worker 1
MessagePort (1:2) Running MessagePort::OnMessage()
MessagePort (1:2) MessagePort has message, receiving = 1
MessagePort (1:3) Created message port
MessagePort (0:3) Adding message to incoming queue
MessagePort (0:3) Running MessagePort::OnMessage()
MessagePort (0:3) MessagePort has message, receiving = 1
MessagePort (0:3) MessagePort has message, receiving = 1
MessagePort (0:3) Adding message to incoming queue
Worker (0:2) Worker 1 called Exit(1)
Worker (0:2) Exiting thread for worker 1 with exit code 1
MessagePort (1:2) Closing message port, data set = 1
MessagePort (1:2) Closing message port, data set = 1
MessagePort (1:3) Closing message port, data set = 1
MessagePort (1:3) MessagePort::OnClose()
MessagePort (1:2) MessagePort::OnClose()
MessagePort (0:3) Closing message port, data set = 1
MessagePort (0:6) Running MessagePort::OnMessage()
MessagePort (0:6) MessagePort has message, receiving = 1
MessagePort (0:6) Closing message port, data set = 1
MessagePort (0:6) MessagePort::OnClose()
MessagePort (0:3) MessagePort::OnClose()
Worker (0:2) Worker 1 thread stops
Worker (0:2) Worker 1 dispose isolate
Worker (0:2) Worker 1 destroyed
Command: out/Release/node /home/iojs/build/workspace/node-test-commit-linux/nodes/centos7-64-gcc6/test/parallel/test-worker-syntax-error-file.js

@addaleax
Copy link
Member

@Trott Thanks, that’s definitely helpful.

MessagePort (0:3) Closing message port, data set = 1

For some reasons, it seems the internal message port on the parent side closes prematurely, before the drainMessagePort() call in the onexit handler runs. I don’t understand why this happens – the message port should be closed because the sibling in the child thread closes, but that should only happen after releasing all currently pending messages.

@Trott Do you think you could run this with this extra debugging patch + post failure output again? I’m not sure if it helps, but it could confirm that a) we close the message port with a message still pending and b) help figure out why the message port is being closed in this way to begin with.

diff --git a/src/node_messaging.cc b/src/node_messaging.cc
index 77ed959477f2..37f0080f9c46 100644
--- a/src/node_messaging.cc
+++ b/src/node_messaging.cc
@@ -516,6 +516,7 @@ void MessagePort::TriggerAsync() {
 
 void MessagePort::Close(v8::Local<v8::Value> close_callback) {
   Debug(this, "Closing message port, data set = %d", static_cast<int>(!!data_));
+  DumpBacktrace(stderr);
 
   if (data_) {
     // Wrap this call with accessing the mutex, so that TriggerAsync()
@@ -640,7 +641,8 @@ bool MessagePort::IsSiblingClosed() const {
 }
 
 void MessagePort::OnClose() {
-  Debug(this, "MessagePort::OnClose()");
+  Debug(this, "MessagePort::OnClose() with %zd pending messages",
+        (ssize_t)(data_ ? data_->incoming_messages_.size() : -1));
   if (data_) {
     data_->owner_ = nullptr;
     data_->Disentangle();

@Trott
Copy link
Member Author

Trott commented Apr 30, 2019

@addaleax Here's the output with your patch:

=== release test-worker-syntax-error-file ===                   
Path: parallel/test-worker-syntax-error-file
Mismatched <anonymous> function calls. Expected exactly 1, actual 0.
    at Object.mustCall (/home/iojs/build/workspace/node-test-commit-linux/nodes/centos7-64-gcc6/test/common/index.js:339:10)
    at Object.<anonymous> (/home/iojs/build/workspace/node-test-commit-linux/nodes/centos7-64-gcc6/test/parallel/test-worker-syntax-error-file.js:12:24)
    at Module._compile (internal/modules/cjs/loader.js:759:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:770:10)
    at Module.load (internal/modules/cjs/loader.js:628:32)
    at Function.Module._load (internal/modules/cjs/loader.js:555:12)
    at Function.Module.runMain (internal/modules/cjs/loader.js:824:10)
    at internal/main/run_main_module.js:17:11
Worker (0:2) Creating new worker instance with thread id 1
MessagePort (0:3) Created message port
Worker (0:2) Preparation for worker 1 finished
MessagePort (0:3) Start receiving messages
MessagePort (0:3) Start receiving messages
MessagePort (0:6) Created message port
MessagePort (0:7) Created message port
MessagePort (0:6) Start receiving messages
MessagePort (0:7) Closing message port, data set = 1
 1: 0x9d1c42 node::worker::Message::Serialize(node::Environment*, v8::Local<v8::Context>, v8::Local<v8::Value>, v8::Local<v8::Value>, v8::Local<v8::Object>) [out/Release/node]
 2: 0x9d2037 node::worker::MessagePort::PostMessage(node::Environment*, v8::Local<v8::Value>, v8::Local<v8::Value>) [out/Release/node]
 3: 0x9d252c node::worker::MessagePort::PostMessage(v8::FunctionCallbackInfo<v8::Value> const&) [out/Release/node]
 4: 0xb8c926  [out/Release/node]
 5: 0xb8e839 v8::internal::Builtin_HandleApiCall(int, unsigned long*, v8::internal::Isolate*) [out/Release/node]
 6: 0x1a84022  [out/Release/node]
MessagePort (0:6) Start receiving messages
MessagePort (0:7) MessagePort::OnClose() with -1 pending messages
Worker (0:2) Creating isolate for worker with id 1
Worker (0:2) Starting worker with id 1
Worker (0:2) Created Environment for worker with id 1
MessagePort (1:2) Created message port
Worker (0:2) Created message port for worker 1
MessagePort (1:2) Start receiving messages
MessagePort (1:2) Start receiving messages
Worker (0:2) Loaded environment for worker 1
MessagePort (1:2) Running MessagePort::OnMessage()
MessagePort (1:2) MessagePort has message, receiving = 1
MessagePort (1:3) Created message port
MessagePort (0:3) Adding message to incoming queue
MessagePort (0:3) Running MessagePort::OnMessage()
MessagePort (0:3) MessagePort has message, receiving = 1
MessagePort (0:3) MessagePort has message, receiving = 1
MessagePort (0:3) Adding message to incoming queue
Worker (0:2) Worker 1 called Exit(1)
Worker (0:2) Exiting thread for worker 1 with exit code 1
MessagePort (1:2) Closing message port, data set = 1
 1: 0x9d0d4e node::worker::MessagePort::Close(v8::Local<v8::Value>) [out/Release/node]
 2: 0xa19338  [out/Release/node]
 3: 0xa1dc36 node::worker::Worker::Run() [out/Release/node]
 4: 0xa1e19c  [out/Release/node]
 5: 0x7fa13e1c3dd5  [/lib64/libpthread.so.0]
 6: 0x7fa13deecead clone [/lib64/libc.so.6]
MessagePort (1:2) Closing message port, data set = 1
 1: 0x9d0d4e node::worker::MessagePort::Close(v8::Local<v8::Value>) [out/Release/node]
 2: 0x932db5 node::Environment::CleanupHandles() [out/Release/node]
 3: 0x93304b node::Environment::RunCleanup() [out/Release/node]
 4: 0xa1938f  [out/Release/node]
 5: 0xa1dc36 node::worker::Worker::Run() [out/Release/node]
 6: 0xa1e19c  [out/Release/node]
 7: 0x7fa13e1c3dd5  [/lib64/libpthread.so.0]
 8: 0x7fa13deecead clone [/lib64/libc.so.6]
MessagePort (1:3) Closing message port, data set = 1
 1: 0x9d0d4e node::worker::MessagePort::Close(v8::Local<v8::Value>) [out/Release/node]
 2: 0x932db5 node::Environment::CleanupHandles() [out/Release/node]
 3: 0x93304b node::Environment::RunCleanup() [out/Release/node]
 4: 0xa1938f  [out/Release/node]
 5: 0xa1dc36 node::worker::Worker::Run() [out/Release/node]
 6: 0xa1e19c  [out/Release/node]
 7: 0x7fa13e1c3dd5  [/lib64/libpthread.so.0]
 8: 0x7fa13deecead clone [/lib64/libc.so.6]
MessagePort (1:3) MessagePort::OnClose() with 0 pending messages
MessagePort (1:2) MessagePort::OnClose() with 2 pending messages
Worker (0:2) Worker 1 thread stops
Worker (0:2) Worker 1 dispose isolate
MessagePort (0:3) Closing message port, data set = 1
 1: 0x9d392b node::worker::MessagePort::OnMessage() [out/Release/node]
 2: 0xae597f  [out/Release/node]
 3: 0xaf7228  [out/Release/node]
 4: 0xae62bb uv_run [out/Release/node]
 5: 0x9cdabc node::NodeMainInstance::Run() [out/Release/node]
 6: 0x964041 node::Start(int, char**) [out/Release/node]
 7: 0x7fa13de113d5 __libc_start_main [/lib64/libc.so.6]
 8: 0x902655  [out/Release/node]
MessagePort (0:6) Running MessagePort::OnMessage()
MessagePort (0:6) MessagePort has message, receiving = 1
MessagePort (0:6) Closing message port, data set = 1
 1: 0x9d392b node::worker::MessagePort::OnMessage() [out/Release/node]
 2: 0xae597f  [out/Release/node]
 3: 0xaf7228  [out/Release/node]
 4: 0xae62bb uv_run [out/Release/node]
 5: 0x9cdabc node::NodeMainInstance::Run() [out/Release/node]
 6: 0x964041 node::Start(int, char**) [out/Release/node]
 7: 0x7fa13de113d5 __libc_start_main [/lib64/libc.so.6]
 8: 0x902655  [out/Release/node]
MessagePort (0:6) MessagePort::OnClose() with 0 pending messages
MessagePort (0:3) MessagePort::OnClose() with 1 pending messages
Worker (0:2) Worker 1 destroyed

addaleax added a commit to addaleax/node that referenced this issue May 14, 2019
When a `MessagePort` connected to another `MessagePort` closes, the
latter `MessagePort` will be closed as well. Until now, this is done
by testing whether the ports are still entangled after processing
messages. This leaves open a race condition window in which messages
sent just before the closure can be lost when timing is unfortunate.
(A description of the timing is in the test file.)

This can be addressed by using a special message instead, which is
the last message received by a `MessagePort`. This way, all previously
sent messages are processed first.

Fixes: nodejs#22762
targos pushed a commit that referenced this issue May 18, 2019
When a `MessagePort` connected to another `MessagePort` closes, the
latter `MessagePort` will be closed as well. Until now, this is done
by testing whether the ports are still entangled after processing
messages. This leaves open a race condition window in which messages
sent just before the closure can be lost when timing is unfortunate.
(A description of the timing is in the test file.)

This can be addressed by using a special message instead, which is
the last message received by a `MessagePort`. This way, all previously
sent messages are processed first.

Fixes: #22762
PR-URL: #27705
Reviewed-By: Rich Trott <rtrott@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flaky-test Issues and PRs related to the tests with unstable failures on the CI. worker Issues and PRs related to Worker support.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants