unix,win: fix threadpool race condition #2021
Conversation
90891b4 introduced a race condition when accessing `slow_io_work_running`: the counter is incremented and later decremented as part of the worker thread loop, but the two accesses were protected by different mutexes. This fixes the race condition by making sure both accesses are protected by the global `mutex` of `threadpool.c`. This fixes a number of flaky Node.js tests. Refs: libuv#1845 Refs: nodejs/reliability#18 Refs: nodejs/node#23089 Refs: nodejs/node#23067 Refs: nodejs/node#23066 Refs: nodejs/node#23219
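To make the locking change concrete, here is a minimal sketch of the pattern the fix enforces. This is not the actual `threadpool.c` code: the worker body, the iteration counts, and `main` are stand-ins, and only the names `mutex` and `slow_io_work_running` are taken from the commit message. The point is that both accesses to the counter go through the same global mutex; the buggy variant differed only in taking a different lock around the decrement, which is enough to lose updates (see the demonstration further down in the thread).

```c
/* Minimal sketch of the fixed locking pattern; NOT the real threadpool.c.
 * Compile with: cc sketch.c -luv */
#include <stdio.h>
#include <uv.h>

#define NUM_WORKERS 4
#define ITEMS_PER_WORKER 100000

static uv_mutex_t mutex;                  /* stands in for threadpool.c's global `mutex` */
static unsigned int slow_io_work_running; /* the counter the fix protects */

static void worker(void* arg) {
  int i;
  (void) arg;
  for (i = 0; i < ITEMS_PER_WORKER; i++) {
    uv_mutex_lock(&mutex);
    slow_io_work_running++;  /* "a slow I/O work item was picked up" */
    uv_mutex_unlock(&mutex);

    /* ... the slow work itself would run here, without holding the lock ... */

    uv_mutex_lock(&mutex);   /* the fix: the SAME mutex guards the decrement */
    slow_io_work_running--;
    uv_mutex_unlock(&mutex);
  }
}

int main(void) {
  uv_thread_t threads[NUM_WORKERS];
  int i;

  uv_mutex_init(&mutex);
  for (i = 0; i < NUM_WORKERS; i++)
    uv_thread_create(&threads[i], worker, NULL);
  for (i = 0; i < NUM_WORKERS; i++)
    uv_thread_join(&threads[i]);

  /* With one mutex guarding both accesses, no update is lost and the
   * counter always returns to zero. */
  printf("slow_io_work_running = %u (expected 0)\n", slow_io_work_running);
  uv_mutex_destroy(&mutex);
  return 0;
}
```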
CI: https://ci.nodejs.org/view/libuv/job/libuv-test-commit/1050/ (:heavy_check_mark: except infra failure on z/OS) Would be cool to get a release out today if that’s possible? /cc @cjihrig
Line 115 in 28139bf:
should we signal only if …
I think you’re right, but I kind of don’t want to mix that improvement with a fix for a bug which has been haunting us for weeks 😄 Would you mind opening a separate PR?
@addaleax - I am trying to reason out the problem and the fix; please help me. … with the implication that … this PR fixes that by adjusting the … But then it just takes us to the same point as the behavior prior to …
@gireeshpunathil I’m not sure whether you’ve seen it, but is nodejs/node#23099 (comment) the answer to your question?
yes, I just saw it, thanks.
@addaleax would it be possible to add a test that reproduces the issue you described (sending lots of fast DNS requests)?
@santigimeno I’m not sure – this seems to be platform-dependent in some way. In theory, just raising the …
Code and Learn is in less than a week and it would be really great to have Node.js CI green for it. I imagine a libuv release between now and then would be asking a lot, but maybe if this lands here, we can float it as a patch on our master branch?
I was attempting a stable recreate of the problem this PR addresses, but could not: my (JS) test case involved a unique domain name generator, a tight loop in which those names are looked up, and the whole thing repeated at intervals, with the interval frequency lying on the boundary of the resolution speed; the expectation was that some lookups would pile up and eventually hang. The pileup happens, but it looks like the system recovers from it. I tried with a 2-CPU Mac and a 120-CPU Linux box, but the pattern is the same. So maybe that …
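For reference, a libuv-level (C) sketch of that kind of reproducer could look like the following. This is hypothetical, not the actual JS test case: the batch size, the 10 ms interval, and the `test%u.invalid` hostnames are made-up parameters, and as noted below, whether the hang actually triggers appears to be platform-dependent.

```c
/* Hypothetical reproducer sketch: fire batches of getaddrinfo requests for
 * unique hostnames on a repeating timer, so lookups pile up on the slow-I/O
 * portion of the threadpool. Compile with: cc repro.c -luv */
#include <stdio.h>
#include <stdlib.h>
#include <uv.h>

#define BATCH_SIZE 64
#define BATCHES 100

static int batches_started;
static unsigned int seq;

static void on_resolved(uv_getaddrinfo_t* req, int status, struct addrinfo* res) {
  uv_freeaddrinfo(res);  /* status will usually be an error for .invalid names */
  free(req);
}

static void on_timer(uv_timer_t* timer) {
  char name[64];
  int i;

  for (i = 0; i < BATCH_SIZE; i++) {
    uv_getaddrinfo_t* req = malloc(sizeof(*req));
    /* Unique name each time, to defeat any resolver caching. */
    snprintf(name, sizeof(name), "test%u.invalid", seq++);
    if (uv_getaddrinfo(timer->loop, req, on_resolved, name, NULL, NULL) != 0)
      free(req);  /* synchronous failure: callback will not run */
  }

  if (++batches_started == BATCHES)
    uv_timer_stop(timer);
}

int main(void) {
  uv_loop_t* loop = uv_default_loop();
  uv_timer_t timer;

  uv_timer_init(loop, &timer);
  /* Fire immediately, then every 10 ms; the interval should sit near the
   * boundary of how fast lookups complete for the pileup to occur. */
  uv_timer_start(&timer, on_timer, 0, 10);

  uv_run(loop, UV_RUN_DEFAULT);  /* would hang here if decrements got lost */
  printf("completed %d batches\n", batches_started);
  return 0;
}
```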
@gireeshpunathil I think the issue is more of a platform-dependent one in some way, so trying to reproduce on Windows might be a better idea? (I couldn’t do so on Linux either.)
I don’t see how it could “magically” recover by itself…
I mean - irrespective of the degree of the race condition and data corruption, when one or more engaged workers return, they decrement the slow I/O work count and thereby allow some work to progress; not optimally, but eventually all the pending work completes? Isn't that a possibility?
Fwiw, I could reproduce this bug on Windows (somewhat flakily) by bumping …
@gireeshpunathil As far as I could tell, the issue is that those decrements end up getting lost because of the race condition?
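To make "decrements getting lost" concrete: `counter++` and `counter--` are each a read-modify-write sequence, and if the two are guarded by different mutexes, the critical sections do not exclude each other, so interleaved reads and writes can drop updates. A small, hypothetical demonstration (made-up names, not libuv's actual code):

```c
/* Demonstrates how guarding ++ and -- with DIFFERENT mutexes loses updates:
 * the two critical sections don't exclude each other, so the non-atomic
 * read-modify-write sequences can interleave. Compile with: cc race.c -luv */
#include <stdio.h>
#include <uv.h>

#define ITERATIONS 1000000

static uv_mutex_t mutex_a;  /* guards increments */
static uv_mutex_t mutex_b;  /* guards decrements: the bug */
static volatile unsigned int counter;

static void incrementer(void* arg) {
  int i;
  (void) arg;
  for (i = 0; i < ITERATIONS; i++) {
    uv_mutex_lock(&mutex_a);
    counter++;               /* load, add, store: can interleave with -- */
    uv_mutex_unlock(&mutex_a);
  }
}

static void decrementer(void* arg) {
  int i;
  (void) arg;
  for (i = 0; i < ITERATIONS; i++) {
    uv_mutex_lock(&mutex_b); /* different lock: no mutual exclusion */
    counter--;
    uv_mutex_unlock(&mutex_b);
  }
}

int main(void) {
  uv_thread_t t1, t2;

  uv_mutex_init(&mutex_a);
  uv_mutex_init(&mutex_b);
  uv_thread_create(&t1, incrementer, NULL);
  uv_thread_create(&t2, decrementer, NULL);
  uv_thread_join(&t1);
  uv_thread_join(&t2);

  /* Should be 0, but typically is not: updates were lost. If this were
   * slow_io_work_running, a lost decrement would leave libuv convinced
   * that slow I/O work is still running. */
  printf("counter = %u (expected 0)\n", counter);
  return 0;
}
```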
@cjihrig I guess we want 48 hours before Code + Learn for the libuv upgrade PR, which would mean a release before Wednesday… would that be okay with you?
I'll make the release tomorrow (assuming this lands). There are actually a number of other Node bug fixes queued up for release. Proposal: #2024
90891b4 introduced a race condition when accessing `slow_io_work_running` – it is being increased and later decreased as part of the worker thread loop, but was accessed with different mutexes during these operations. This fixes the race condition by making sure both accesses are protected through the global `mutex` of `threadpool.c`. This fixes a number of flaky Node.js tests. Refs: #1845 Refs: nodejs/reliability#18 Refs: nodejs/node#23089 Refs: nodejs/node#23067 Refs: nodejs/node#23066 Refs: nodejs/node#23219 PR-URL: #2021 Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com> Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>