Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix data race on Windows file descriptors #11304

Merged
merged 7 commits into from
Jun 29, 2022

Conversation

OlivierNicole
Copy link
Contributor

On Windows, the function win_CRT_fd_of_filedescr is called by several functions of the Unix module, to wit:

  • Unix.in_channel_of_descr
  • Unix.out_channel_of_descr
  • Unix.fsync
  • Unix.dup2
  • Unix.isatty

This function can cause races on the crt_fd field of the filedescr structure. This PR removes the data race itself by making the field atomic.

Currently not equipped with a Windows machine, I have not been able to test these changes.

Not fixed by this PR

This PR does not change the fact that concurrent calls to to this function can call the Windows API function _open_osfhandle several times on the same handle (as doing so would require some form of locking). Microsoft's documentation is unclear about the possible consequences of this (no consequences, file descriptor leaks, or error).

In addition, concurrent calls to win_CRT_fd_of_filedescr and Unix.close can still lead to fd leaks.

@xavierleroy
Copy link
Contributor

concurrent calls to to this function can call the Windows API function _open_osfhandle several times on the same handle

This defeats the purpose of caching the result of _open_osfhandle in the crt_fd field of struct filedescr...

(as doing so would require some form of locking).

Indeed, it might be safer and easier to put a CRITICAL_SECTION in each struct filedescr and use it to protect crt_fd.

But you can also get it right with only atomic instructions + some busy-waiting: the thread that needs to call _open_osfhandle puts a special GETTING_CRT_FD value in crt_fd before the call, and other threads spin while crt_fd is GETTING_CRT_FD.

@OlivierNicole OlivierNicole force-pushed the fix_racy_winfd branch 3 times, most recently from 34e2061 to 3ab0489 Compare June 8, 2022 15:09
@OlivierNicole
Copy link
Contributor Author

Indeed, it might be safer and easier to put a CRITICAL_SECTION in each struct filedescr and use it to protect crt_fd.

But you can also get it right with only atomic instructions + some busy-waiting: the thread that needs to call _open_osfhandle puts a special GETTING_CRT_FD value in crt_fd before the call, and other threads spin while crt_fd is GETTING_CRT_FD.

I propose such a busy-waiting implementation in the updated version. It wasn't clear to me how to implement it with a CRITICAL_SECTION, more precisely, when the CRITICAL_SECTION could be finalized, since blocking calls are forbidden in GC finalization, and maybe that includes DeleteCriticalSection?

Again, I haven't been able to test this, not having a Windows system at the moment.

@dra27 dra27 added this to the 5.0 milestone Jun 14, 2022
@xavierleroy
Copy link
Contributor

I took the liberty to push directly to your branch:

  • a small change to your implementation that uses the SPIN_WAIT macro to spin more carefully;
  • a test for the race condition that fails most of the time in the current trunk but seems to work OK with the new implementation.

I tested the code on the Mingw-64 system used for the Jenkins CI. It builds and passes the test suite.

Code reviews welcome, as always.

@OlivierNicole
Copy link
Contributor Author

Thank you for this. I rebased the PR on trunk after the recent merge of #10926 and ensured that our functions follow the new naming policy.

Your improvements and tests look good to me.

@xavierleroy
Copy link
Contributor

@dra27 @damiendoligez @sadiqj @ctk21 : could we get a quick review for this code? Thanks.

Copy link
Member

@dra27 dra27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay - partly because of faffing around with other things, but mainly because there was something I didn't like about this and I hadn't been able to put my finger on it. I now have - I think this would be fine, but I wonder if we can avoid the introduction of a new function?

Any uses of CRT_fd_val as an rvalue in user code I believe start to get warnings about _Atomic - or, if not, should we actually remove the macro, on the basis that it's no longer a safe way to access that field?

Entire review is a suggestion only - the PR could be merged as-is.

otherlibs/unix/channels_win32.c Outdated Show resolved Hide resolved
otherlibs/unix/channels_win32.c Show resolved Hide resolved
otherlibs/unix/channels_win32.c Outdated Show resolved Hide resolved
otherlibs/unix/unixsupport.h Show resolved Hide resolved
otherlibs/unix/close_win32.c Show resolved Hide resolved
otherlibs/unix/dup_win32.c Show resolved Hide resolved
otherlibs/unix/channels_win32.c Show resolved Hide resolved
Copy link
Contributor

@sadiqj sadiqj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @dra27 that it's weird to have functions and the macros. I take the opposite view on the solution though, it might be nicer to have functions to retrieve/set which deal with the atomics and eliminate the macro entirely? At the moment the atomics are strewn across every use and may be forgotten in future uses of CRT_fd_val.

@@ -55,7 +55,8 @@ value caml_win32_alloc_handle(HANDLE h)
caml_alloc_custom(&handle_ops, sizeof(struct filedescr), 0, 1);
Handle_val(res) = h;
Descr_kind_val(res) = KIND_HANDLE;
CRT_fd_val(res) = NO_CRT_FD;
atomic_store_explicit(&CRT_fd_val(res), (intnat)NO_CRT_FD,
memory_order_relaxed);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This applies to all the relaxed stores in this change)

What's the rationale behind relaxed atomics? I see at least here we're doing some non-atomic stores before which makes me wonder.

@OlivierNicole
Copy link
Contributor Author

What's the rationale behind relaxed atomics? I see at least here we're doing some non-atomic stores before which makes me wonder.

I was mistaken, I thought that for initializing writes we still needed relaxed atomics, but it appears that plain stores are fine.

Regarding macros vs functions: I tend to like more the idea of removing the CRT_fd_val macro entirely, keeping only functions, as macros used for both l- and rvalues have caused problems before. But I am fine with either way.

@xavierleroy
Copy link
Contributor

Like @OlivierNicole, I'd rather unexport the CRT_fd_val macro and export only the getter function caml_win32_CRT_fd_of_filedescr.

Re "non-atomic" stores, you mean syntactic assignments a = ... instead of atomic_store(&a, ...), right? I understand both are equivalent, but please correct me if I'm wrong.

Re relaxed atomics, I think what we need here is sequential consistency for each crt_fd of each struct filedescr, but no consistency guarantees between several memory locations. In which case relaxed atomics might be enough. But performance is not critical and SC atomics are so much easier to reason about...

OlivierNicole added a commit to OlivierNicole/ocaml that referenced this pull request Jun 24, 2022
@OlivierNicole
Copy link
Contributor Author

I unexported the CRT_fd_val macro, but kept the exports of caml_win32_CRT_fd_of_filedescr and caml_win32_get_CRT_fd (since both are used in other files, such as close_win32.c.

Regarding relaxed atomics, in my understanding since these are initializing writes, they will go through a fence before another thread has a chance to access the custom block, so there should be no difference between them and plain stores.

@OlivierNicole
Copy link
Contributor Author

If a is a l-value of _Atomic-qualified type, even plain assignments a = ... are atomic (at the seq_cst consistency level, if I'm not mistaken). There is no such thing as a non-atomic write to an _Atomic-qualified l-value. But an atomic_store_explicit(..., memory_order_relaxed) can be used to produce a plain processor store.

I wasn't aware of this. In that case, I think I should revert to using atomic_store_explicit(..., memory_order_relaxed) for the initializing writes in caml_win32_alloc_handle, caml_win32_alloc_socket, caml_unix_filedescr_of_channel and caml_unix_filedescr_of_fd as atomic stores are not necessary there.

@xavierleroy
Copy link
Contributor

I think I should revert to using atomic_store_explicit(..., memory_order_relaxed) for the initializing writes in caml_win32_alloc_handle, caml_win32_alloc_socket, caml_unix_filedescr_of_channel and caml_unix_filedescr_of_fd as atomic stores are not necessary there.

Do you really think the performance difference matters? If not, let's keep the code as simple as possible, i.e. with SC atomics everywhere. If so, I think memory_order_relaxed could be used for all accesses to crt_fd, on the basis that it's still SC per location. But we need confirmation from a memory model expert.

OlivierNicole and others added 7 commits June 27, 2022 19:22
On Windows, the function `win_CRT_fd_of_filedescr` is called by several
functions of the Unix module, to wit:
- `Unix.in_channel_of_descr`
- `Unix.out_channel_of_descr`
- `Unix.fsync`
- `Unix.dup2`
- `Unix.isatty`

This function can cause races on the `crt_fd` field of the `filedescr`
structure.
This removes the data race itself. However, it does not guarantee that
the Windows API function `_open_osfhandle` will not be called several
times on the same handle. In addition, concurrent calls to
`win_CRT_fd_of_filedescr` and `Unix.close` can still lead to fd leaks.
@xavierleroy
Copy link
Contributor

Rebased to fix the conflict and restart CI.

Copy link
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good to me in its current form. It's likely that the SC atomics are too strong, but do not really hurt.

Copy link
Member

@dra27 dra27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to go

@sadiqj sadiqj merged commit 663e8d2 into ocaml:trunk Jun 29, 2022
@sadiqj
Copy link
Contributor

sadiqj commented Jun 29, 2022

Thanks @OlivierNicole

@OlivierNicole OlivierNicole deleted the fix_racy_winfd branch June 30, 2022 08:22
dra27 added a commit that referenced this pull request Jul 2, 2022
dra27 pushed a commit that referenced this pull request Jul 2, 2022
Fix data race on Windows file descriptors

(cherry picked from commit 663e8d2)
@dra27
Copy link
Member

dra27 commented Jul 2, 2022

Cherry-picked to 5.0 along with a commit fixing the mingw-w64 build.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants