
unix,win: fix threadpool race condition #2021

Closed

Conversation

addaleax
Contributor

@addaleax addaleax commented Oct 4, 2018

90891b4 introduced a race
condition when accessing `slow_io_work_running` – it is being
increased and later decreased as part of the worker thread loop,
but was accessed with different mutexes during these operations.

This fixes the race condition by making sure both accesses
are protected through the global `mutex` of `threadpool.c`.

This fixes a number of flaky Node.js tests.

Refs: #1845
Refs: nodejs/reliability#18
Refs: nodejs/node#23089
Refs: nodejs/node#23067
Refs: nodejs/node#23066
Refs: nodejs/node#23219
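
For illustration, a minimal sketch of the locking pattern this change enforces. This is not the actual threadpool.c code; only `mutex` and `slow_io_work_running` correspond to real names, everything else is simplified.

```c
/* Sketch only: shows the locking pattern described above, not the verbatim
 * libuv code. `mutex` and `slow_io_work_running` mirror the names used in
 * threadpool.c; the rest is illustrative. */
#include <uv.h>

static uv_mutex_t mutex;                  /* the global threadpool mutex */
static unsigned int slow_io_work_running; /* workers currently doing slow I/O */

static void run_slow_io_work(void (*work)(void* arg), void* arg) {
  uv_mutex_lock(&mutex);
  slow_io_work_running++;                 /* account for this worker... */
  uv_mutex_unlock(&mutex);

  work(arg);                              /* the potentially slow operation */

  uv_mutex_lock(&mutex);
  slow_io_work_running--;                 /* ...and release it under the SAME
                                             mutex; before this fix the decrement
                                             happened under a different lock, so
                                             updates could race and get lost */
  uv_mutex_unlock(&mutex);
}
```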

@addaleax
Contributor Author

addaleax commented Oct 4, 2018

CI: https://ci.nodejs.org/view/libuv/job/libuv-test-commit/1050/ (:heavy_check_mark: except infra failure on z/OS)
z/OS rebuild: https://ci.nodejs.org/job/libuv-test-commit-zos/1028/ (:heavy_check_mark:)
Node.js CI: https://ci.nodejs.org/job/node-test-commit/22032/ (:heavy_check_mark:)
Node.js stress test CI: https://ci.nodejs.org/job/node-stress-single-test/2060/ (:heavy_check_mark: not flaky on AIX; contains only failures of the form nodejs/reliability#18 (comment))

Would be cool to get a release out today if that’s possible? /cc @cjihrig

@gireeshpunathil
Contributor

uv_cond_signal(&cond);

Should we signal only if slow_work_thread_threshold() is met? Otherwise the idle workers get notified and go back to dormancy for no reason.
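
Roughly, something like the following sketch is what I mean; the names follow threadpool.c as referenced in this thread, but the structure is simplified and this is not a patch:

```c
/* Rough sketch of the suggestion, not a patch: only wake a worker when the
 * slow-I/O threshold still leaves room for it to pick the work up. Names
 * mirror threadpool.c as referenced in this thread; declarations are
 * simplified. */
#include <uv.h>

static uv_mutex_t mutex;
static uv_cond_t cond;
static unsigned int nthreads;
static unsigned int idle_threads;
static unsigned int slow_io_work_running;

/* Assumed to be roughly half the pool, as in threadpool.c. */
static unsigned int slow_work_thread_threshold(void) {
  return (nthreads + 1) / 2;
}

static void maybe_wake_worker(void) {
  /* Caller holds `mutex`. Skipping the signal when the threshold is already
   * reached avoids waking an idle worker that would just go back to sleep. */
  if (idle_threads > 0 &&
      slow_io_work_running < slow_work_thread_threshold())
    uv_cond_signal(&cond);
}
```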

@addaleax
Contributor Author

addaleax commented Oct 5, 2018

Should we signal only if slow_work_thread_threshold() is met? Otherwise the idle workers get notified and go back to dormancy for no reason.

I think you’re right, but I kind of don’t want to mix that improvement with a fix for a bug which has been haunting us for weeks 😄 Would you mind opening a separate PR?

@gireeshpunathil
Contributor

@addaleax - I am trying to reason out the problem, and the fix; please help me.

With 90891b4, the mutex was only partially held for operations involving the slow_io counter, and other threads were able to access it before one thread completed its work.

The implication is that more slow_io work is carried out than expected, yet the count appears to stay below its defined threshold.

This PR fixes that by adjusting the slow_io work count under the exclusive mutex, so that only one source of truth is visible.

But then doesn't that just take us back to the behavior prior to 90891b4? How were the tests failing? I am unable to understand that. Will you please clarify?

@addaleax
Contributor Author

addaleax commented Oct 5, 2018

@gireeshpunathil I’m not sure whether you’ve seen it, but is nodejs/node#23099 (comment) the answer to your question?

@gireeshpunathil
Contributor

yes, I just saw it, thanks.

@santigimeno
Member

@addaleax would it be possible to add a test that reproduces the issue you described (sending lots of fast DNS requests)?

@addaleax
Contributor Author

addaleax commented Oct 5, 2018

@santigimeno I’m not sure – this seems to be platform-dependent in some way. In theory, just raising the CONCURRENT_COUNT in test-getaddrinfo.c should work, but it always passes for me regardless of how high I set it (as do the flaky Node.js tests).

@Trott
Contributor

Trott commented Oct 7, 2018

Code and Learn is in less than a week and it would be really great to have Node.js CI green for it. I imagine a libuv release between now and then would be asking a lot, but maybe if this lands here, we can float it as a patch on our master branch?

@gireeshpunathil
Contributor

I was attempting a stable recreation of the problem this PR addresses, but could not: my (JS) test case involved a unique domain name generator, a tight loop in which those names are looked up, and the whole thing repeated at intervals, with the interval frequency lying on the boundary of the resolution speed; the expectation being that some lookups pile up and the process eventually hangs. The pileup happens, but it looks like the system recovers from it. I tried with a 2-CPU Mac and a 120-CPU Linux box, but the pattern is the same. So maybe slow_io_work_running, though mis-computed, comes back to normal when work ceases? The end result is that the tests just run slowly, with the issue indistinguishable from a non-issue. It is just my theory.

@addaleax
Contributor Author

addaleax commented Oct 7, 2018

@gireeshpunathil I think the issue is more of a platform-dependent one in some way, so trying to reproduce on Windows might be a better idea? (I couldn’t do so on Linux either.)

So maybe slow_io_work_running, though mis-computed, comes back to normal when work ceases?

I don’t see how it could “magically” recover by itself…

@gireeshpunathil
Contributor

I mean: irrespective of the degree of the race condition and data corruption, when one or more engaged workers return, they decrement the slow I/O work count and thereby allow some work to progress; not optimally, but eventually all the pending work completes. Isn't that a possibility?

@addaleax
Contributor Author

addaleax commented Oct 7, 2018

Fwiw, I could reproduce this bug on Windows (somewhat flakily) by bumping CONCURRENT_COUNT to 1000 in test-getaddrinfo.c. I'm not sure whether there are any issues with adding that change to this PR, though, since the value does seem kind of high for a test that looks like it might do internet lookups?
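
For reference, a standalone stress program along these lines exercises the same path; it is not test-getaddrinfo.c itself, and the request count and hostname are just illustrative:

```c
/* Standalone reproduction sketch, not the libuv test itself: fire a large
 * burst of concurrent getaddrinfo requests (classified as slow I/O by the
 * threadpool) and wait for all of them to complete. */
#include <stdio.h>
#include <stdlib.h>
#include <uv.h>

#define CONCURRENT_COUNT 1000   /* illustrative; matches the value tried above */

static int resolved;

static void on_resolved(uv_getaddrinfo_t* req, int status, struct addrinfo* res) {
  if (status == 0)
    uv_freeaddrinfo(res);
  resolved++;
  free(req);
}

int main(void) {
  uv_loop_t* loop = uv_default_loop();
  int i;

  for (i = 0; i < CONCURRENT_COUNT; i++) {
    uv_getaddrinfo_t* req = malloc(sizeof(*req));
    /* "localhost" keeps the lookups off the network; NULL hints/service. */
    if (uv_getaddrinfo(loop, req, on_resolved, "localhost", NULL, NULL) != 0)
      abort();
  }

  /* With the race, the threadpool can stop making progress and this may hang. */
  uv_run(loop, UV_RUN_DEFAULT);
  printf("resolved %d of %d requests\n", resolved, CONCURRENT_COUNT);
  return 0;
}
```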

when one or more engaged workers return, they decrement the slow I/O work count and thereby allow some work to progress

@gireeshpunathil As far as I could tell, the issue is that those decrements end up getting lost because of the race condition?
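
To make the "lost decrement" concrete, here is a toy program (not libuv code) in which the incrementing and decrementing sides each take a different mutex, so the two read-modify-write sequences can interleave and updates get lost:

```c
/* Toy demonstration of a lost update: two threads modify one counter, each
 * under a DIFFERENT mutex, so there is no mutual exclusion between them. */
#include <stdio.h>
#include <uv.h>

#define ITERATIONS 1000000

static uv_mutex_t mutex_a;   /* used only by the incrementing thread */
static uv_mutex_t mutex_b;   /* used only by the decrementing thread */
static long counter;

static void incrementer(void* arg) {
  (void) arg;
  for (int i = 0; i < ITERATIONS; i++) {
    uv_mutex_lock(&mutex_a);
    counter++;
    uv_mutex_unlock(&mutex_a);
  }
}

static void decrementer(void* arg) {
  (void) arg;
  for (int i = 0; i < ITERATIONS; i++) {
    uv_mutex_lock(&mutex_b);
    counter--;
    uv_mutex_unlock(&mutex_b);
  }
}

int main(void) {
  uv_thread_t t1, t2;
  uv_mutex_init(&mutex_a);
  uv_mutex_init(&mutex_b);
  uv_thread_create(&t1, incrementer, NULL);
  uv_thread_create(&t2, decrementer, NULL);
  uv_thread_join(&t1);
  uv_thread_join(&t2);
  /* With a single shared mutex this always prints 0; with two different
   * mutexes it usually does not, i.e. updates were lost. */
  printf("final counter: %ld\n", counter);
  return 0;
}
```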

@addaleax
Contributor Author

addaleax commented Oct 7, 2018

@cjihrig I guess we want 48 hours before Code + Learn for the libuv upgrade PR, which would mean a release before Wednesday… would that be okay with you?

@cjihrig
Contributor

cjihrig commented Oct 7, 2018

I'll make the release tomorrow (assuming this lands). There are actually a number of other Node bug fixes queued up for release. Proposal: #2024

addaleax added a commit that referenced this pull request Oct 7, 2018
90891b4 introduced a race
condition when accessing `slow_io_work_running` – it is being
increased and later decreased as part of the worker thread loop,
but was accessed with different mutexes during these operations.

This fixes the race condition by making sure both accesses
are protected through the global `mutex` of `threadpool.c`.

This fixes a number of flaky Node.js tests.

Refs: #1845
Refs: nodejs/reliability#18
Refs: nodejs/node#23089
Refs: nodejs/node#23067
Refs: nodejs/node#23066
Refs: nodejs/node#23219
PR-URL: #2021
Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>
@addaleax
Contributor Author

addaleax commented Oct 7, 2018

@cjihrig Thanks!

Went ahead and landed this in daf04e8. If somebody has strong feelings about the test I’ve mentioned above, feel free to let me know and I’ll gladly open a PR.

@addaleax addaleax closed this Oct 7, 2018
@addaleax addaleax deleted the threadpool-fix branch October 7, 2018 22:52