
Windows CI failures: parallel/test-trace-events-fs-sync #25512

Closed
Trott opened this issue Jan 15, 2019 · 46 comments
Labels
build Issues and PRs related to build files or the CI. confirmed-bug Issues with confirmed bugs. flaky-test Issues and PRs related to the tests with unstable failures on the CI. libuv Issues and PRs related to the libuv dependency or the uv binding. trace_events Issues and PRs related to V8, Node.js core, and userspace code trace events. windows Issues and PRs related to the Windows platform.

Comments

@Trott
Member

Trott commented Jan 15, 2019

Seeing 3221225477 (access violation) non-stop on win10 + vs2017 on CI in a trace_events test:

00:19:43 not ok 494 parallel/test-trace-events-fs-sync
00:19:43   ---
00:19:43   duration_ms: 0.781
00:19:43   severity: fail
00:19:43   exitcode: 1
00:19:43   stack: |-
00:19:43     c:\workspace\node-test-binary-windows\test\parallel\test-trace-events-fs-sync.js:140
00:19:43         throw new Error(`${tr}:\n${util.inspect(proc)}`);
00:19:43         ^
00:19:43     
00:19:43     Error: fs.sync.fchmod:
00:19:43     { status: 3221225477,
00:19:43       signal: null,
00:19:43       output: [ null, '', '' ],
00:19:43       pid: 3272,
00:19:43       stdout: '',
00:19:43       stderr: '' }
00:19:43         at Object.<anonymous> (c:\workspace\node-test-binary-windows\test\parallel\test-trace-events-fs-sync.js:140:11)
00:19:43         at Module._compile (internal/modules/cjs/loader.js:722:30)
00:19:43         at Object.Module._extensions..js (internal/modules/cjs/loader.js:733:10)
00:19:43         at Module.load (internal/modules/cjs/loader.js:621:32)
00:19:43         at tryModuleLoad (internal/modules/cjs/loader.js:564:12)
00:19:43         at Function.Module._load (internal/modules/cjs/loader.js:556:3)
00:19:43         at Function.Module.runMain (internal/modules/cjs/loader.js:775:12)
00:19:43         at executeUserCode (internal/bootstrap/node.js:433:15)
00:19:43         at startExecution (internal/bootstrap/node.js:370:3)
00:19:43   ...

Examples:

https://ci.nodejs.org/job/node-test-binary-windows/23086/COMPILED_BY=vs2017,RUNNER=win10,RUN_SUBSET=0/console
https://ci.nodejs.org/job/node-test-binary-windows/23085/COMPILED_BY=vs2017,RUNNER=win10,RUN_SUBSET=0/
https://ci.nodejs.org/job/node-test-binary-windows/23084/COMPILED_BY=vs2017,RUNNER=win10,RUN_SUBSET=0/

...and many others...started happening in the last 24 hours or so. Not sure if something changed in our code or if something changed on CI or what. First noted (to my knowledge) by @gireeshpunathil in #22712 (comment) and #22865 (comment).

That was rebased onto 7f91329. So if the problem is in our code (and not something that is only being surfaced now but has been there for a while or else something that is a problem with the CI host and not a problem with the test or code), then it would be either in that commit or one shortly before it.

@nodejs/trace-events

[refack] Added context - the above 3 failures are on 3 different workers. AFAICT all failures are similar and happen while testing fchmod.
Test call site: https://github.com/nodejs/node/blob/master/test/parallel/test-trace-events-fs-sync.js#L122
and setup site: https://github.com/nodejs/node/blob/master/test/parallel/test-trace-events-fs-sync.js#L33-L36
where this test case is six entries deep.

@Trott added the windows, build, trace_events, and flaky-test labels Jan 15, 2019
@addaleax
Member

Is 4da7e6e a candidate? That would be 6 hours before 7f91329, but as I said in the PR, I was a bit scared of how wide-reaching that change may be…

@gireeshpunathil
Member

@addaleax - it is not. I am able to recreate the issue both with and without the said commit. My challenge is that I am unable to get a user dump on my system (for some unknown reason, it is not producing one).

@Trott
Member Author

Trott commented Jan 15, 2019

Happy to report that it's now only happening frequently in CI, and not non-stop. 😅

@refack
Contributor

refack commented Jan 15, 2019

So 3221225477 is 0xC0000005, or STATUS_ACCESS_VIOLATION, and it's a PITA to figure out the exact cause of... But since we're seeing it on Win10 and WRT FS access, my first guess is that it's related to either the built-in Windows content indexing service or the built-in anti-virus.
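
For readers following along, a minimal sketch (not from the thread; it only illustrates the conversion refack describes) of how that decimal status maps to the NTSTATUS constant:

#include <cstdint>
#include <cstdio>

int main() {
  // Decimal exit status reported by the test harness.
  const uint32_t status = 3221225477u;
  // Printed in hex this is 0xC0000005, i.e. STATUS_ACCESS_VIOLATION.
  std::printf("0x%08X\n", status);
  return 0;
}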

@gireeshpunathil
Member

gireeshpunathil commented Jan 15, 2019

@refack - do you have guidance on how to enable user dumps on Win 10? I referred to and followed several online docs, but none of them actually helped; a dump is never produced on 0xC...5, or I am not seeing it. Thanks!

@refack
Contributor

refack commented Jan 15, 2019

semi-bisect: testing commit 92e95f1 which is the parent of 4da7e6e

@gireeshpunathil the best experience I had was with enabling silent-exit dumps: #13947 (comment)

@refack
Contributor

refack commented Jan 15, 2019

A few data points:

  1. I can repro (~1% of runs) but I can't get a core dump or an event logged.
  2. Can repro with a limited test, replacing
     for (const tr in tests) {
     with
     for (const tr in {'fs.sync.readdir':1}) {
  3. I'm thinking the severity peak might have been a transient thing. It did not happen spontaneously for 7 hours - https://ci.nodejs.org/job/node-test-binary-windows/COMPILED_BY=vs2017,RUNNER=win10,RUN_SUBSET=0/buildTimeTrend

@joyeecheung
Member

Some data points from nodejs/reliability#20

Reason: parallel/test-trace-events-fs-sync
Type: JS_TEST_FAILURE
Failed PR: 15 (#25412, #25447, #25441, #25429, #25443, #25480, #22712, #25415, #25364, #25481, #25084, #25503, #25473, #25455, #23100)
Appeared: test-azure_msft-win10-x64-3, test-azure_msft-win10-x64-2, test-azure_msft-win10-x64-4
First CI: https://ci.nodejs.org/job/node-test-pull-request/20058/
Last CI: https://ci.nodejs.org/job/node-test-pull-request/20142/
Example:
not ok 494 parallel/test-trace-events-fs-sync
  ---
  duration_ms: 0.794
  severity: fail
  exitcode: 1
  stack: |-
    c:\workspace\node-test-binary-windows\test\parallel\test-trace-events-fs-sync.js:140
        throw new Error(`${tr}:\n${util.inspect(proc)}`);
        ^
    
    Error: fs.sync.fchmod:
    { status: 3221225477,
      signal: null,
      output: [ null, '', '' ],
      pid: 1756,
      stdout: '',
      stderr: '' }
        at Object.<anonymous> (c:\workspace\node-test-binary-windows\test\parallel\test-trace-events-fs-sync.js:140:11)
        at Module._compile (internal/modules/cjs/loader.js:722:30)
        at Object.Module._extensions..js (internal/modules/cjs/loader.js:733:10)
        at Module.load (internal/modules/cjs/loader.js:621:32)
        at tryModuleLoad (internal/modules/cjs/loader.js:564:12)
        at Function.Module._load (internal/modules/cjs/loader.js:556:3)
        at Function.Module.runMain (internal/modules/cjs/loader.js:775:12)
        at executeUserCode (internal/bootstr...

@gireeshpunathil
Member

At this point the effort is to get a dump; the recreate seems to be easy. This fails only on Windows, and we don't have a single dump to see what the state of the child process was when it caused the access violation.

@gireeshpunathil
Member

I made a minimal piece of code that potentially recreates the same thing, and I got this:

Assertion failed: 0, file d:\gireesh\node\deps\uv\src\win\handle-inl.h, line 160, repeatedly.

However, a user dump is not there to figure out the context. Either gflags is not working for me, or it is not configured appropriately.

So we are on the exit path for sure, but then a bad handle came our way. Could it be that it is already destroyed?

@gireeshpunathil
Member

ok, I modified the code to honor abort-on-exception, and got this info from the VS debugger (data copied from its UI frames, so not very readable)

message:
Unhandled exception at 0x00007FF718B87B39 in node.exe: Fatal program exit requested.

call stack:

 	node.exe!abort() Line 77	C++
 	node.exe!common_assert_to_stderr<wchar_t>(const wchar_t * const expression, const wchar_t * const file_name, const unsigned int line_number) Line 186	C++
 	node.exe!_wassert(const wchar_t * expression, const wchar_t * file_name, unsigned int line_number) Line 443	C++
>	node.exe!uv_run(uv_loop_s * loop, <unnamed-enum-UV_RUN_DEFAULT> mode) Line 534	C
 	node.exe!uv__thread_start(void * arg) Line 104	C
 	node.exe!thread_start<unsigned int (__cdecl*)(void * __ptr64)>(void * const parameter) Line 115	C++
 	verifier.dll!00007ff93419507d()	Unknown
 	kernel32.dll!00007ff94de88102()	Unknown
 	ntdll.dll!00007ff94f56c5b4()	Unknown

loop content around endgame_handles:

-		loop	0x000001fa8540d508 {data=0xe0e0e0e0e0e0e0e0 active_handles=1 handle_queue=0x000001fa8540d518 {0x000001fa8540d618, ...} ...}	uv_loop_s *
		data	0xe0e0e0e0e0e0e0e0	void *
		active_handles	1	unsigned int
+		handle_queue	0x000001fa8540d518 {0x000001fa8540d618, 0x000001fa854114b0}	void *[2]
+		active_reqs	{unused=0x000001fa8540d528 {0xe0e0e0e000000000, 0xe0e0e0e0e0e0e0e0} count=0 }	<anonymous-tag>
		stop_flag	0	unsigned int
		iocp	0x00000000000002f8	void *
		time	21888192669	unsigned __int64
+		pending_reqs_tail	0x0000000000000000 {data=??? type=??? reserved=0x0000000000000010 {???, ???, ???, ???, ???, ???} ...}	uv_req_s *
-		endgame_handles	0xf0f0f0f0f0f0f0f0 {data=??? loop=??? type=??? ...}	uv_handle_s *
		data	<Unable to read memory>	
		loop	<Unable to read memory>	
		type	<Unable to read memory>	
		close_cb	<Unable to read memory>	
+		handle_queue	0xf0f0f0f0f0f0f110 {???, ???}	void *[2]
+		u	{fd=??? reserved=0xf0f0f0f0f0f0f120 {???, ???, ???, ???} }	<anonymous-tag>
		endgame_next	<Unable to read memory>	
		flags	<Unable to read memory>	
		timer_heap	0x000001fa8540e1d0	void *
+		prepare_handles	0x0000000000000000 {data=??? loop=??? type=??? ...}	uv_prepare_s *
+		check_handles	0x0000000000000000 {data=??? loop=??? type=??? ...}	uv_check_s *
+		idle_handles	0x0000000000000000 {data=??? loop=??? type=??? ...}	uv_idle_s *
+		next_prepare_handle	0x0000000000000000 {data=??? loop=??? type=??? ...}	uv_prepare_s *
+		next_check_handle	0x0000000000000000 {data=??? loop=??? type=??? ...}	uv_check_s *
+		next_idle_handle	0x0000000000000000 {data=??? loop=??? type=??? ...}	uv_idle_s *
+		poll_peer_sockets	0x000001fa8540d598 {0, 0, 0}	unsigned __int64[3]
		active_tcp_streams	0	unsigned int
		active_udp_streams	0	unsigned int
		timer_counter	0	unsigned __int64
+		wq	0x000001fa8540d5c0 {0x000001fa8540d5c0, 0x000001fa8540d5c0}	void *[2]
+		wq_mutex	{DebugInfo=0x000001fa8540e250 {Type=0 CreatorBackTraceIndex=148 CriticalSection=0x000001fa8540d5d0 {...} ...} ...}	_RTL_CRITICAL_SECTION
+		wq_async	{data=0xe0e0e0e0e0e0e0e0 loop=0x000001fa8540d508 {data=0xe0e0e0e0e0e0e0e0 active_handles=1 handle_queue=...} ...}	uv_async_s

so it is the same as "Assertion failed: 0, file d:\gireesh\node\deps\uv\src\win\handle-inl.h, line 160", with bad handle(s) appearing for cleanup.

@gireeshpunathil
Member

looking through the threads, 2 of them are relevant while the others are inert, just blocked on condition variables.

failing thread (0x3d7c, some helper thread?)

node!abort+0x35
node!common_assert_to_stderr<wchar_t>+0xc8
node!_wassert+0x72
node!uv_run+0x3da
node!uv__thread_start+0x4d
node!thread_start<unsigned int (__cdecl*)(void * __ptr64)>+0x50
verifier!VerifierGetPropertyValueByName+0x1aefd
kernel32!BaseThreadInitThunk+0x22
ntdll!RtlUserThreadStart+0x34

main thread (0x280c)

ntdll!ZwWaitForAlertByThreadId+0x14
ntdll!RtlSleepConditionVariableCS+0xd0
KERNELBASE!SleepConditionVariableCS+0x37
node!uv_cond_wait+0x10
node!node::tracing::NodeTraceBuffer::`scalar deleting destructor'+0x63
node!v8::platform::tracing::TracingController::Initialize+0x24
node!node::tracing::Agent::~Agent+0x74
node!node::Start+0x56a
node!wmain+0x1bc
node!__scrt_common_main_seh+0x10c
kernel32!BaseThreadInitThunk+0x22
ntdll!RtlUserThreadStart+0x34

@gireeshpunathil
Member

will see if f39b3e3 is of any relevance.

@gireeshpunathil
Member

while this still needs to be narrowed down, it is more or less evident that the event loop is getting bad elements from the destruction that is happening in the main thread.

if so, is @addaleax's "reset StopTracingAgent() before platform teardown" (#25472) going to be a solution for this?

@gireeshpunathil
Member

fwiw, the failing thread is identified as the Agent thread that runs the event loop on tracing_loop_ in the lambda within Agent::Start

@gireeshpunathil
Member

gireeshpunathil commented Jan 16, 2019

that leads to a point (of dilemma for me): is it fair / safe to execute uv_run from 2 different threads? One from the agent thread created in Agent::Start, and the other (main) through Agent::~Agent?

edit: to execute uv_run => to execute uv_run on the same loop object

@bzoz
Contributor

bzoz commented Jan 17, 2019

The code looks ok. We wait for the thread to join in StopTracing before calling uv_run from another thread.

I can't get this to reproduce on my box. Could you inspect the content of the handle that causes the crash?

@gireeshpunathil
Member

@bzoz - sure; been trying hard to get a repro. Will update when I have one.

@addaleax
Member

is it fair / safe to execute uv_run from 2 different threads?

@gireeshpunathil I’m pretty sure the answer to that is “no”.

@refack
Contributor

refack commented Jan 17, 2019

is it fair / safe to execute uv_run from 2 different threads?

Other runtimes do this (e.g. Electron & Julia), but in a thread-safe way (conditionally on a mutex or other locks).

@bzoz
Contributor

bzoz commented Jan 17, 2019

There is a uv_thread_join on the thread that runs the loop before the other thread gives the loop a spin.

@addaleax
Member

Oh, yeah, running it concurrently is the issue. I think consecutively running on different threads should not be an issue on its own.
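
A minimal standalone sketch of the hand-off pattern bzoz and addaleax describe (assumptions: plain libuv, illustrative names, not Node's actual agent code) - the same loop is run consecutively, never concurrently, by two threads:

#include <uv.h>
#include <cassert>

// First owner of the loop: the worker thread runs it to completion.
static void worker(void* arg) {
  uv_run(static_cast<uv_loop_t*>(arg), UV_RUN_DEFAULT);
}

int main() {
  uv_loop_t loop;
  assert(uv_loop_init(&loop) == 0);

  uv_thread_t tid;
  assert(uv_thread_create(&tid, worker, &loop) == 0);

  // Hand-off point: the join guarantees the worker is no longer inside
  // uv_run() before the main thread touches the same loop again.
  uv_thread_join(&tid);

  // Consecutive (not concurrent) use of the same loop from a second thread.
  uv_run(&loop, UV_RUN_DEFAULT);
  uv_loop_close(&loop);
  return 0;
}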

@gireeshpunathil
Member

One challenge in getting the needed info: right now the failing code runs as a child process. If we move it to the parent, it does not fail anymore; and if run as is, it fails but does not produce dumps.

I am running with SetErrorMode(0) in both libuv and the Python harness; that addresses the dump generation issue for the parent, but not for the child!

Is there a specific flag I could pass to CreateProcessW? I don't see anything in uv_spawn that inhibits dump generation for the child.

@gireeshpunathil
Member

gireeshpunathil commented Jan 31, 2019

ok, now that the problem site is localized, I wrapped the faulty lines in __try / __except blocks and ran the original test; control goes to the __except block, so I can confirm that the failure is indeed here!

the only possibility I can think of is concurrent access to the endgame_handles list and/or its chain by 2 threads.

if 2 threads need to operate on the same loop object, I don't know how many places would need exclusive access, but in this case at least uv_want_endgame and uv_process_endgames would.
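
For context, the list in question is an intrusive singly linked list with no locking; a simplified sketch of the push/pop pattern (mirroring the code visible in the patch in the next comment, with reduced types rather than libuv's real structs):

// Simplified sketch of the endgame list handling under discussion.
struct handle_t {
  handle_t* endgame_next;
};

struct loop_t {
  handle_t* endgame_handles;
};

// Mirrors uv_want_endgame(): push the handle onto the loop's LIFO list.
static void want_endgame(loop_t* loop, handle_t* handle) {
  handle->endgame_next = loop->endgame_handles;
  loop->endgame_handles = handle;
}

// Mirrors uv_process_endgames(): pop handles until the list is empty.
// Neither side takes a lock, so two threads operating on the same loop
// concurrently could corrupt the list - the scenario suspected above.
static void process_endgames(loop_t* loop) {
  while (loop->endgame_handles != nullptr) {
    handle_t* handle = loop->endgame_handles;
    loop->endgame_handles = handle->endgame_next;
    // ... the handle's close/endgame logic would run here ...
  }
}

int main() {
  loop_t loop{nullptr};
  handle_t a{nullptr}, b{nullptr};
  want_endgame(&loop, &a);
  want_endgame(&loop, &b);
  process_endgames(&loop);  // pops b, then a - LIFO order
  return 0;
}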

@gireeshpunathil
Member

Given that we precisely know the failing line, I patched libuv thus:

  • adding an exception handler to get the exception record
  • adding instrumentation to print the relevant elements
diff --git a/deps/uv/src/win/handle-inl.h b/deps/uv/src/win/handle-inl.h
index 82c657d579..88bc38470b 100644
--- a/deps/uv/src/win/handle-inl.h
+++ b/deps/uv/src/win/handle-inl.h
@@ -91,16 +91,32 @@ INLINE static void uv_want_endgame(uv_loop_t* loop, uv_handle_t* handle) {

     handle->endgame_next = loop->endgame_handles;
     loop->endgame_handles = handle;
+    fprintf(stderr, "uv_want_endgame, loop: 0x%p, handle: 0x%p, type: %d, requester: %d\n", loop, handle, handle->type, GetCurrentThreadId());
   }
 }

+inline DWORD GGG(LPEXCEPTION_POINTERS ptr)
+{
+  EXCEPTION_RECORD* record = ptr->ExceptionRecord;
+  fprintf(stderr, "code: %d, flags: %d\n", record->ExceptionCode, record->ExceptionFlags);
+  return EXCEPTION_EXECUTE_HANDLER;
+}
+

 INLINE static void uv_process_endgames(uv_loop_t* loop) {
   uv_handle_t* handle;

+
+
   while (loop->endgame_handles) {
-    handle = loop->endgame_handles;
-    loop->endgame_handles = handle->endgame_next;
+    __try {
+      handle = loop->endgame_handles;
+      fprintf(stderr, "uv_process_endgame, loop: 0x%p, handle: 0x%p, type: %d, executor: %d\n", loop, handle, handle->type, GetCurrentThreadId());
+      loop->endgame_handles = handle->endgame_next;
+    }
+    __except(GGG(GetExceptionInformation())) {
+      fprintf(stderr, "surprise in the end game!\n");
+    }

     handle->flags &= ~UV_HANDLE_ENDGAME_QUEUED;

and got this:

uv_want_endgame, loop: 0x00007FF620A97810, handle: 0x00000099BDBBED20, type: 13, requester: 117392
uv_want_endgame, loop: 0x00007FF620A97810, handle: 0x00000099BDBBEDC0, type: 2, requester: 117392
uv_want_endgame, loop: 0x00007FF620A97810, handle: 0x00000099BDBBEE38, type: 6, requester: 117392
uv_want_endgame, loop: 0x00007FF620A97810, handle: 0x00000099BDBBEEB0, type: 9, requester: 117392
uv_want_endgame, loop: 0x00007FF620A97810, handle: 0x00000099BDBBEF28, type: 2, requester: 117392
uv_process_endgame, loop: 0x00007FF620A97810, handle: 0x00000099BDBBEF28, type: 2, executor: 117392
uv_process_endgame, loop: 0x00007FF620A97810, handle: 0x00000099BDBBEEB0, type: 9, executor: 117392
uv_process_endgame, loop: 0x00007FF620A97810, handle: 0x00000099BDBBEE38, type: 6, executor: 117392
uv_process_endgame, loop: 0x00007FF620A97810, handle: 0x00000099BDBBEDC0, type: 2, executor: 117392
uv_process_endgame, loop: 0x00007FF620A97810, handle: 0x00000099BDBBED20, type: 13, executor: 117392
// no problem, one full cycle of end game
// from the many types of handles, this looks like the main loop

uv_want_endgame, loop: 0x00007FF620A97810, handle: 0x00007FF620A951D0, type: 1, requester: 117392
uv_want_endgame, loop: 0x00007FF620A97810, handle: 0x0000025AD5BEB170, type: 1, requester: 117392
// 2 new parties registered for end game, but did not fire yet.

uv_want_endgame, loop: 0x0000025AD5BC0728, handle: 0x0000025AD5BC1FB0, type: 1, requester: 62228
uv_want_endgame, loop: 0x0000025AD5BC0728, handle: 0x0000025AD5BC2090, type: 1, requester: 62228
uv_process_endgame, loop: 0x0000025AD5BC0728, handle: 0x0000025AD5BC2090, type: 1, executor: 62228
uv_process_endgame, loop: 0x0000025AD5BC0728, handle: 0x0000025AD5BC1FB0, type: 1, executor: 62228
// failing thread (agent's thread) runs one cycle of its end game. All are async handles.

uv_want_endgame, loop: 0x0000025AD5BC9238, handle: 0x0000025AD5BC9408, type: 1, requester: 83608
uv_process_endgame, loop: 0x0000025AD5BC9238, handle: 0x0000025AD5BC9408, type: 1, executor: 83608
// some other loop, some other thread. don't know what it is.

uv_want_endgame, loop: 0x0000025AD5BC0728, handle: 0x0000025AD5BC2580, type: 1, requester: 62228
uv_want_endgame, loop: 0x0000025AD5BC0728, handle: 0x0000025AD5BC2660, type: 1, requester: 62228
uv_process_endgame, loop: 0x0000025AD5BC0728, handle: 0x0000025AD5BC2660, type: 1, executor: 62228
uv_process_endgame, loop: 0x0000025AD5BC0728, handle: 0x0000025AD5BC2580, type: 1, executor: 62228
// failing thread comes again, this time only to crash.

code: -1073741819, flags: 0
surprise in the end game!

Observations:

  • Unlike what I suspected earlier, no 2 threads seem to interact with one loop
  • The problem always happens when the handle is an async handle
  • The problem always occurs on the tracing agent's thread
  • Thread 62228 is the only thread that entered the endgame twice

Inferences:

  • Given the code in uv_want_endgame and uv_process_endgame, this does not seem to be a problem in libuv itself
  • Given that the loop->endgame_handles link is only modified by one thread, there is no synchronization / cache coherence issue
  • Given that uv_run can be run multiple times, re-entering uv_process_endgame seems to be valid
  • The endgame_next field in uv_handle_t is neither initialized to NULL in the handle init, nor nullified in the handle close. However, doing that does not seem to solve the issue; plus, that cannot explain the current crash.

@Trott added the libuv label Feb 2, 2019
@Trott
Member Author

Trott commented Feb 2, 2019

@nodejs/libuv

@gireeshpunathil
Member

ok, so here we are!

In addition to the above, add this one too:

diff --git a/src/tracing/node_trace_buffer.cc b/src/tracing/node_trace_buffer.cc
index 3b7119f6d5..9b907a6a81 100644
--- a/src/tracing/node_trace_buffer.cc
+++ b/src/tracing/node_trace_buffer.cc
@@ -110,6 +110,7 @@ NodeTraceBuffer::NodeTraceBuffer(size_t max_chunks,
 }

 NodeTraceBuffer::~NodeTraceBuffer() {
+  fprintf(stderr, "destroying exit signal handle 0x%p\n", &exit_signal_);
   uv_async_send(&exit_signal_);
   Mutex::ScopedLock scoped_lock(exit_mutex_);
   while (!exited_) {

and we get here:

uv_want_endgame, loop: 0x000002B61728B058, handle: 0x000002B61728B228, type: 1, requester: 11804
uv_process_endgame, loop: 0x000002B61728B058, handle: 0x000002B61728B228, type: 1, executor: 11804
...

destroying exit signal handle 0x000002B617283090
...
uv_want_endgame, loop: 0x000002B61727F618, handle: 0x000002B617282FB0, type: 1, requester: 103264
uv_want_endgame, loop: 0x000002B61727F618, handle: 0x000002B617283090, type: 1, requester: 103264
uv_process_endgame, loop: 0x000002B61727F618, handle: 0x000002B617283090, type: 1, executor: 103264
uv_process_endgame, loop: 0x000002B61727F618, handle: 0x000002B617282FB0, type: 1, executor: 103264
surprise in the end game!
  • the bad async handle is indeed the exit signal from the trace buffer.
  • the tracer is being destroyed as part of re-init.
  • exit_signal_ is a part of the NodeTraceBuffer object, not even a pointer. So it does not survive the delete.
  • we send a signal on a piece of memory which we delete soon after.
  • by the time the end game runs, this memory is unavailable.

The issue exists on all platforms; the crash will depend on how the C++ runtime deals with deleted memory. Some keep it aside for future use, some unmap it, in which case a crash is certain. From the symptom I assume Win 10 frees it up.

As the handle is used for cross-thread communication, it should outlive the Buffer object.
And as we don't seem to create multiple TraceBuffer objects, how about making it static?

cc @addaleax

@gireeshpunathil added the confirmed-bug and trace_events labels and removed the trace_events label Feb 2, 2019
@addaleax
Member

addaleax commented Feb 2, 2019

@gireeshpunathil It looks like you’ve struck gold, but I’m not quite sure that what you’re suspecting as the reason is correct:

exit_signal_ is closed inside its own callback, and ~NodeTraceBuffer does not finish until exited_ is set, which only happens once exit_signal_ has been closed. Freeing the memory after the handle is closed is normal and should be okay.

However, there’s something that sticks out about your last log output: uv_want_endgame and uv_process_endgame happen in reverse order for the two NodeTraceBuffer handles.

We tell the thread that deletes NodeTraceBuffer to continue once exit_signal_ is closed, but we actually try to close flush_signal_ afterwards as well, when the other thread might have continued, and so we run into trouble there.
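
A small standalone libuv demo of the ordering hazard described here (a sketch, assuming plain libuv rather than the actual NodeTraceBuffer code; the names in the comments follow the thread):

#include <uv.h>
#include <cstdio>

static void on_async(uv_async_t*) {}  // never fired in this demo

static void on_close(uv_handle_t* handle) {
  std::printf("close callback ran for %s\n",
              static_cast<const char*>(handle->data));
}

int main() {
  uv_loop_t loop;
  uv_loop_init(&loop);

  uv_async_t flush_signal, exit_signal;
  uv_async_init(&loop, &flush_signal, on_async);
  uv_async_init(&loop, &exit_signal, on_async);
  flush_signal.data = const_cast<char*>("flush_signal_");
  exit_signal.data = const_cast<char*>("exit_signal_");

  // uv_close() is requested for flush_signal_ first, then exit_signal_ ...
  uv_close(reinterpret_cast<uv_handle_t*>(&flush_signal), on_close);
  uv_close(reinterpret_cast<uv_handle_t*>(&exit_signal), on_close);

  // ... but the close callbacks run in the reverse order: exit_signal_
  // first. If exit_signal_'s callback is what unblocks ~NodeTraceBuffer,
  // the object can be freed while flush_signal_'s callback is still pending.
  uv_run(&loop, UV_RUN_DEFAULT);

  uv_loop_close(&loop);
  return 0;
}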

To confirm, running with this patch doesn’t make it crash on Linux for me, but it does make valgrind complain:

@@ -176,9 +177,12 @@ void NodeTraceBuffer::ExitSignalCb(uv_async_t* signal) {
            [](uv_handle_t* signal) {
       NodeTraceBuffer* buffer =
           reinterpret_cast<NodeTraceBuffer*>(signal->data);
+      {
       Mutex::ScopedLock scoped_lock(buffer->exit_mutex_);
       buffer->exited_ = true;
       buffer->exit_cond_.Signal(scoped_lock);
+      }
+      sleep(1);
   });
 }

addaleax added a commit to addaleax/node that referenced this issue Feb 2, 2019
Libuv does not guarantee that handles have their close
callbacks called in the order in which they were added
(and in fact, currently calls them in reverse order).

This patch ensures that the `flush_signal_` handle
is no longer in use (i.e. its close callback has already
been run) when we signal to the main thread that
`~NodeTraceBuffer` may be destroyed.

Credit for debugging goes to Gireesh Punathil.

Fixes: nodejs#25512
@addaleax
Member

addaleax commented Feb 2, 2019

Suggested solution: addaleax/node@60311fe

Stress test CI: https://ci.nodejs.org/job/node-stress-single-test/2146/

@gireeshpunathil
Member

@addaleax - I will also run the test with your patch, as well as go try to understand the changes.

@addaleax
Member

addaleax commented Feb 2, 2019

@gireeshpunathil Just occurred to me that this might be somewhat easier to explain with your debug output:

destroying exit signal handle 0x000002B617283090  // enter destructor
...
// uv_close() for flush_signal_
uv_want_endgame, loop: 0x000002B61727F618, handle: 0x000002B617282FB0, type: 1, requester: 103264
// uv_close() for exit_signal_
uv_want_endgame, loop: 0x000002B61727F618, handle: 0x000002B617283090, type: 1, requester: 103264
// uv_close() callback for exit_signal_
uv_process_endgame, loop: 0x000002B61727F618, handle: 0x000002B617283090, type: 1, executor: 103264
/* XXX The destructor continues NOW and memory is released,
 * even though flush_signal_ still has a pending callback */
// uv_close() callback for flush_signal_
uv_process_endgame, loop: 0x000002B61727F618, handle: 0x000002B617282FB0, type: 1, executor: 103264
surprise in the end game!

@refack
Contributor

refack commented Feb 2, 2019

Another data point. From our CI, binaries created on https://ci.nodejs.org/computer/test-rackspace-win2008r2-x64-3/builds reproduce readily. Binaries from other machines not so much 😕

@refack
Contributor

refack commented Feb 2, 2019

Strong evidence that #25896 fixes the issue. Good job y'all 🎩

  1. On https://ci.nodejs.org/computer/test-rackspace-win2008r2-x64-3/builds, checked out 60311fe
  2. Built node-after.zip
  3. Ran git reset --hard HEAD~1 (i.e. 28c0f84)
  4. Built node-pre.zip
  5. Tested on https://ci.nodejs.org/computer/test-azure_msft-win10-x64-3/
  6. Results ("pre" first, "after" second):
C:\workspace\node-test-binary-windows>python tools\test.py --repeat 10000 parallel/test-trace-events-fs-sync
=== release test-trace-events-fs-sync ===
Path: parallel/test-trace-events-fs-sync
C:\workspace\node-test-binary-windows\test\parallel\test-trace-events-fs-sync.js:140
    throw new Error(`${tr}:\n${util.inspect(proc)}`);
    ^

Error: fs.sync.utimes:
{ status: 3221225477,
  signal: null,
  output: [ null, '', '' ],
  pid: 192,
  stdout: '',
  stderr: '' }
    at Object.<anonymous> (C:\workspace\node-test-binary-windows\test\parallel\test-trace-events-fs-sync.js:140:11)
    at Module._compile (internal/modules/cjs/loader.js:735:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:746:10)
    at Module.load (internal/modules/cjs/loader.js:627:32)
    at tryModuleLoad (internal/modules/cjs/loader.js:570:12)
    at Function.Module._load (internal/modules/cjs/loader.js:562:3)
    at Function.Module.runMain (internal/modules/cjs/loader.js:798:12)
    at internal/main/run_main_module.js:27:11
Command: C:\workspace\node-test-binary-windows\Release\node.exe C:\workspace\node-test-binary-windows\test\parallel\test-trace-events-fs-sync.js
=== release test-trace-events-fs-sync ===
Path: parallel/test-trace-events-fs-sync
C:\workspace\node-test-binary-windows\test\parallel\test-trace-events-fs-sync.js:140
    throw new Error(`${tr}:\n${util.inspect(proc)}`);
    ^

Error: fs.sync.fstat:
{ status: 3221225477,
  signal: null,
  output: [ null, '', '' ],
  pid: 460,
  stdout: '',
  stderr: '' }
    at Object.<anonymous> (C:\workspace\node-test-binary-windows\test\parallel\test-trace-events-fs-sync.js:140:11)
    at Module._compile (internal/modules/cjs/loader.js:735:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:746:10)
    at Module.load (internal/modules/cjs/loader.js:627:32)
    at tryModuleLoad (internal/modules/cjs/loader.js:570:12)
    at Function.Module._load (internal/modules/cjs/loader.js:562:3)
    at Function.Module.runMain (internal/modules/cjs/loader.js:798:12)
    at internal/main/run_main_module.js:27:11
Command: C:\workspace\node-test-binary-windows\Release\node.exe C:\workspace\node-test-binary-windows\test\parallel\test-trace-events-fs-sync.js
[00:27|%   0|+   7|-   2]: Done

C:\workspace\node-test-binary-windows>python tools\test.py --repeat 10000 parallel/test-trace-events-fs-sync
[15:47|%   3|+ 307|-   0]: Done

C:\workspace\node-test-binary-windows>

@gireeshpunathil
Member

New crash with patch from #25896

uv_want_endgame, loop: 0x000001E67201F838, handle: 0x000001E67201FCD0, type: 1, requester: 63340
uv_want_endgame, loop: 0x000001E67201F838, handle: 0x000001E67201FDB0, type: 1, requester: 63340
uv_process_endgame, loop: 0x000001E67201F838, handle: 0x000001E67201FDB0, type: 1, executor: 63340
uv_process_endgame, loop: 0x000001E67201F838, handle: 0x000001E67201FCD0, type: 1, executor: 63340
bad type!!!
  • Please note that these may not be the exit / flush signal handle pair, as the printf is missing.
  • Please also note that "bad type" comes from my instrumentation, at the bottom of the endgame loop, where it asserts when the handle type cannot be determined - it surely is indicative of access to released memory.

But then what could these handles be? Let us find out.

@gireeshpunathil
Member

node::tracing::NodeTraceWriter::ExitSignalCb too?

@gireeshpunathil
Member

@addaleax - I can confirm that the change is relevant to NodeTraceWriter::ExitSignalCb as well.

With the new patch, the test failed once without enough context; that prompted me to look around for similar patterns. But I could not get enough evidence on this, as the failure suddenly just stopped.

Then I followed your suggestions on Linux (adding timing etc.) but did not see any failure. I tried this patch (in theory this should be ok, as we are on the back edge of deleting the object, so clearing it manually is fine):

index 3b7119f..33e12e4 100644
--- a/src/tracing/node_trace_buffer.cc
+++ b/src/tracing/node_trace_buffer.cc
@@ -114,6 +114,12 @@ NodeTraceBuffer::~NodeTraceBuffer() {
   Mutex::ScopedLock scoped_lock(exit_mutex_);
   while (!exited_) {
     exit_cond_.Wait(scoped_lock);
+    if (exited_) {
+     exit_signal_.type = (uv_handle_type) -1;
+     flush_signal_.type = (uv_handle_type) -1;
+    }
+
   }
 }

and saw what we used to see on Windows - but as assertion failures on a bad handle type, because I manually cleared it.

Then I examined a freed object and saw that it was intact - so we were using freed memory on Linux too.

So applying the same theory, I can confirm that NodeTraceWriter::~NodeTraceWriter is vulnerable too.

In hindsight, do we / should we have an option to direct malloc to fill freed-up memory with garbage? Such an option would help us detect issues faster. I looked at MALLOPT and MALLOCOPTIONS but did not find one.

@addaleax
Member

addaleax commented Feb 3, 2019

@gireeshpunathil Yeah, thanks for catching that ~NodeTraceWriter has the same issue.

I think filling memory with garbage is something we'd only want to do in debug mode -- but maybe we can implement that in some way, yes?
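
One rough sketch of what such debug-mode poisoning could look like (purely illustrative, not an existing Node.js facility): override the global sized operator delete and fill freed blocks with a recognizable pattern, so a use-after-free fails loudly instead of reading plausible data.

#include <cstdlib>
#include <cstring>
#include <new>

// Debug-only sketch: poison freed memory with a recognizable pattern.
void* operator new(std::size_t size) {
  if (void* ptr = std::malloc(size)) return ptr;
  throw std::bad_alloc();
}

void operator delete(void* ptr, std::size_t size) noexcept {
  if (ptr == nullptr) return;
  std::memset(ptr, 0xDA, size);  // poison before releasing
  std::free(ptr);
}

void operator delete(void* ptr) noexcept {
  // Unsized fallback: the block cannot be poisoned without its size.
  std::free(ptr);
}

int main() {
  int* value = new int(42);
  delete value;  // with C++14 sized deallocation, the freed block is poisoned
  return 0;
}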

addaleax added a commit that referenced this issue Feb 6, 2019
Libuv does not guarantee that handles have their close
callbacks called in the order in which they were added
(and in fact, currently calls them in reverse order).

This patch ensures that the `flush_signal_` handle
is no longer in use (i.e. its close callback has already
been run) when we signal to the main thread that
`~NodeTraceBuffer` may be destroyed.

The same applies for `~NodeTraceWriter`.

Credit for debugging goes to Gireesh Punathil.

Fixes: #25512
Co-authored-by: Gireesh Punathil <gpunathi@in.ibm.com>

PR-URL: #25896
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Refael Ackermann <refack@gmail.com>
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Reviewed-By: Eugene Ostroukhov <eostroukhov@google.com>
BethGriggs pushed a commit that referenced this issue Apr 16, 2019
`json_trace_writer_` is protected by `stream_mutex_`,
but one access to it was not guarded by a lock on said mutex.

Refs: #25512

PR-URL: #25624
Reviewed-By: Denys Otrishko <shishugi@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Richard Lau <riclau@uk.ibm.com>
MylesBorins pushed a commit that referenced this issue May 16, 2019
Libuv does not guarantee that handles have their close
callbacks called in the order in which they were added
(and in fact, currently calls them in reverse order).

This patch ensures that the `flush_signal_` handle
is no longer in use (i.e. its close callback has already
been run) when we signal to the main thread that
`~NodeTraceBuffer` may be destroyed.

The same applies for `~NodeTraceWriter`.

Credit for debugging goes to Gireesh Punathil.

Fixes: #25512
Co-authored-by: Gireesh Punathil <gpunathi@in.ibm.com>

PR-URL: #25896
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Refael Ackermann <refack@gmail.com>
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Reviewed-By: Eugene Ostroukhov <eostroukhov@google.com>
dr-js added a commit to dr-js/dr-dev that referenced this issue Apr 30, 2020