Resource-safe C interface, part 2 #8997

gadmm · 2019-09-29T20:44:43Z

This PR continues with the resource-safe C interface to allow writing resource-safe C functions in the presence of asynchronous exceptions. The ultimately goal is to make runtime correct in the face of asynchronous exceptions. It introduces new _exn variants of current functions to allow cleaning-up after asynchronous exceptions, in the public interface:

caml_{enter,leave}_blocking_section_exn.
Functions from caml/fail.h that raise exceptions, e.g. caml_failwith_exn and many others.

In the second part, these are used to fix leaks in several places of the runtime. The general strategy is to make the exception raising explicit through error codes / exception monad by replacing raising functions by functions that return encoded exceptions. This has been made possible by the work on making polling predictable in the C API.

This contains a proposal for a public-facing interface. This is best read commit-per-commit, with further details in the commit logs.

Left for future work:

Audit the runtime for further leaks by documenting all raising C functions with // raises. This PR fixes quite a few leaks but those only were obvious ones.
Do the same for the Out_of_memory exception raised by some allocation functions, and find a general strategy to deal with it.

gadmm · 2020-03-02T19:30:59Z

Thanks for the label, the PR is currently waiting for me to get back to it and rebase. Comments about the text of the PR are still welcome, whereas it is useless to comment on the implementation at the moment.

gadmm · 2020-07-09T11:44:06Z

This PR is no longer an RFC. It has been simplified to add only what I found necessary in the API, and I fixed the leaks that I had identified previously. It is ready for review. I think this can interest @xavierleroy, @stedolan and @jhjourdan. It fixes a couple of real bugs and I think that the API additions will be consensual.

There are a couple of CI failures that I need to investigate.

gadmm · 2020-07-12T16:43:56Z

(Note: the description at the top has been updated.)

I have added more patches resulting from a full audit of async exceptions in io.c, fixing several leaks, and also the channel state invalidation noticed by @stedolan at #9722. I expect that there are no more leaks/state invalidation in io.c after this.

jhjourdan · 2020-07-13T14:35:53Z

(Appveyor MSVC64 build is failing.)

Add new variants caml_*_blocking_section_exn for the user who want to clean-up resources in case of an asynchronous exception.

Simplify for next commit.

Introduce variants of the functions that do not immediately raise the exception, allowing for resource cleanup. This is used in this way in an ulterior commit to avod duplicating cleanup code. Another motivating example is given at ocaml#7423: > caml_invalid_argument(str) is no return and does not free it's argument. So calling it with a string constructed dynamically will mean it'll never get freed. This led at the time to the introduction of caml_invalid_argument_value and caml_failwith_value that accept an OCaml string as argument. The present commit gives a general solution that fits with the rest of the resource-safe API.

Add non-raising variants of caml_enter/leave_blocking_section. The intention is to mimick masking in the release path, when one already holds an exception. Masking is not there yet, so we introduce these functions temporarily. They remain internal, to avoid redundancy in the API.

Prepares for resource-safety audit of io.c

Only stdin is global in Stdlib, and it is not seekable, so its state remained valid on exception. However it does not cost much to keep the offset and the max pointer up to date in all situtations when an interrupt arises. In addition, one can be suspicious that channel locks may leak, which can cause deadlocks. However, this is already dealt with Unlock_exn.

Similar to the previous commit

jhjourdan

Overall, I think this is a good change: it fixes many resource leaks and real bugs in several functions of the runtime, and introduce a good practice for handling exception in C code.

However, I have some general remarks:

Having caml_leave_blocking_section run async callbacks makes it difficult to free resources/raise the right exception in code that calls it. As discussed in EINTR-based signals, again #9722, I think that not running them would simplify much the code, without delaying too much async callbacks.
You will need to rebase this PR on top of EINTR-based signals, again #9722 once it will be merged, since both PR touch the smae piece of code.
Stylistically, I prefer not to use gotos when possible. This makes it much easier to review the resource flow in the function. So could you please only use them when this indeed factorizes the cleanup code?

jhjourdan · 2020-07-13T15:20:02Z

runtime/caml/signals.h

+CAMLextern void caml_enter_blocking_section (void); // raises
+CAMLextern void caml_leave_blocking_section (void); // raises
+
+CAMLextern value caml_enter_blocking_section_exn (void);
+/* Same as [caml_enter_blocking_section], but return the exception if
+   any. The runtime lock is held upon return if and only if the result
+   satisfies Is_exception_result. */
+
+CAMLextern value caml_leave_blocking_section_exn (void);
+/* Same as [caml_leave_blocking_section], but return the exception if
+   any. The runtime lock is held upon return in all cases. */


Since the discussion in #9722, I am now more and more convinced that caml_leave_blocking_section should not execute async callbacks (hence no need for caml_leave_blocking_section_exn). The reason is that, first, as @stedolan explained, it is almost certain that the blocking section returned a result that we need to exploit before doing anything else which might interrupt the call, and second, we will run polling code soon, so there is no risk to loose async callbacks.

On the other hand, polling in caml_enter_blocking_section is necessary because we will probably block for a some time and and should avoid waiting for async callbacks during this time. However, in order to get exception safety without having to many versions of these functions, I would prefer the approach in #9722 where we use a function like caml_enter_blocking_section_no_pending. Then, one should simply call caml_process_pending_actions_exn before if one wants to both poll async callbacks and releasing the runtime lock.

Thanks for the review, I just notice it now.

Indeed, I will rebase on top of #9722 and follow whatever decision is made there. However, I mentioned at #9722 a serious breakage of not polling in caml_leave_blocking_section that I just noticed: it serves to poll after a system call returns EINTR, just before Sys_error/Unix_error(EINTR,_) is raised. Once the exception is raised (changing a function raising Sys.Break to a function raising Sys_error, say), the argument that one will poll soon enough no longer applies for reasons I explained there. And since it is possible that this pattern could be found in user C libraries, fixing the Sys and Unix modules by hand would not be enough. I propose to discuss this at #9722.

jhjourdan · 2020-07-13T16:42:27Z

runtime/debugger.c

@@ -139,7 +140,10 @@ static void open_connection(void)
  if (dbg_socket == -1)
    caml_fatal_error("_open_osfhandle failed");
 #endif
-  dbg_in = caml_open_descriptor_in(dbg_socket);
+  dbg_socket_in = dup(dbg_socket);


As far as I know, dup is not available on Windows.

I have to use _dup, right?

To be honest, I have no idea.

If we want to actually call dup on unix, then the best is probably to ask @dra27 about what is the right thing to do here.

But see #9786: the original implementation corresponds to a common pattern in OCaml programming with the Unix module, which, I think, should not be considered a bug. So, I think I would prefer keeping the initial implementation, bug guard somewhere else about the double close of file descriptors used in channels.

dup is aliased to _dup, so it'll work - it would be better to use _dup on Windows, but goodness knows if/when Microsoft will actually remove the POSIX aliases.

What you're doing should work fine on Windows; I have no comment on whether it's necessarily what's wanted on either platform.

jhjourdan · 2020-07-13T16:44:05Z

runtime/io.c

 {
  struct channel * channel;

  channel = (struct channel *) caml_stat_alloc(sizeof(struct channel));
  channel->fd = fd;
-  caml_enter_blocking_section();
+  *exn = caml_enter_blocking_section_exn();


Could we avoid writing to *exn when no exception is called?

There are two possible conventions:

Only write to *exn if there is an exception.

Always write to *exn; i.e. treat exn as an output parameter.

In the latter case, the caller does not need to initialise *exn. Question for seasoned C programmers: are there any coding guidelines saying that one is better or more expected than the other?

IMHO, a common pattern for out parameters is to not write them when there is no corresponding result, and signal by another channel (i.e., the return value of the function) that the out parameter has been (or not) written.

Ok but then I would think that exn is the error code, and the channel is the output param. I.e. the signature should be

CAMLexport value caml_open_descriptor_in_exn(int fd, struct channel ** chan)

instead of:

CAMLexport struct channel * caml_open_descriptor_in_exn(int fd, value * exn)

I thought about doing this, but preferred a change that would be perceived as smaller. It seems that we will have to do it anyway.

I'd rather not change the return type of the function. I think this is misleading. Why cannot you guarantee that [exn] is written iff the return value is NULL?

I could, but in addition I would prefer if there is a convention that for _exn functions one has to first check for Is_exception_result.

Sure. But this is a convention to be followed by the caller side. This does not prevent you to follow the usual convention that out parameters are only written in case it is needed. Of course, there will be some redundancy, but is that an issue?

jhjourdan · 2020-07-13T17:43:53Z

runtime/dynlink_nat.c


  p = caml_stat_strdup_to_os(String_val(filename));
-  caml_enter_blocking_section();
+  exn = caml_enter_blocking_section_exn();
+  if (Is_exception_result(exn)) goto cleanup1;


Suggested change

if (Is_exception_result(exn)) goto cleanup1;

if (Is_exception_result(exn)) {

caml_stat_free(p);

caml_raise(Extract_exception(exn));

}

Avoiding goto when possible is always good. In this case, I don't think the code becomes more complex.

jhjourdan · 2020-07-13T17:44:17Z

runtime/dynlink_nat.c

  dlhandle = caml_dlopen(p, 1, Int_val(global));
-  caml_leave_blocking_section();
-  caml_stat_free(p);
+  exn = caml_leave_blocking_section_exn();


Suggested change

exn = caml_leave_blocking_section_exn();

exn = caml_leave_blocking_section_exn();

caml_stat_free(p);

Why not freeing p here? This would make the change smaller, the lifetime of p shorter and the resources of the function easier to track.

Of course, you then have to free it in the exceptional case above.

jhjourdan · 2020-07-14T06:52:57Z

runtime/signals.c

@@ -180,6 +180,11 @@ CAMLexport value caml_enter_blocking_section_exn(void)
  return Val_unit;
 }

+void caml_enter_blocking_section_noexn(void)


Could you please synchronize with @stedolan for the name of the function. As far as I'm concerned, I have a slight preference for _nopending, since it makes it clear that we are missing a polling point and that it should be replaced if possible by another polling point.

jhjourdan · 2020-07-14T06:54:04Z

runtime/dynlink_nat.c

-  // TODO: blocking section with masking
-  if (dlhandle)
+  if (dlhandle) {
+    caml_enter_blocking_section_noexn();
    caml_dlclose(dlhandle);
+    caml_leave_blocking_section_noexn();
+  }


Other calls to caml_dlclose are not guarded by blocking sections. Not that I really care about it, but if we decide that this call is blocking, then it should be uniform.

jhjourdan · 2020-07-14T20:52:57Z

runtime/caml/osdeps.h

+   In case of error, sets [exn] to an encoded exception value
+   [Sys_error], [Sys_blocked_io] or an exception raised from a signal
+   handler, if any. If the return value is -1 then [exn] contains an
+   encoded exception.  */


This interface is rather weird. Usually, we would expect that when there is an error, no byte is read.

Please make it clear that it is possible that an error is returned and some byte is read. Moreover, is there a point in returning -1 in the case of an error when no byte is read? Why not returning 0 in this case?

Anyway, note that we would not have this weird interface if caml_leave_blocking_section did not run async callbacks.

Also, these functions write (Val_unit) to *exn even if no errror appears. Please either make this clear in the comment or fix this behavior.

BTW, as far as I understand, if caml_leave_blocking_section did not run async callbacks, then all these changes would not be necessary, because it would be safe to raise exceptions from io.c functions.

jhjourdan · 2020-07-15T09:00:26Z

runtime/io.c

-/* caml_do_read is exported for Cash */
+// raises
+static int read_fd_into_channel(struct channel * channel, char * buf)
+{
+  int n;
+  value exn = Val_unit;
+  n = caml_read_fd_exn(channel->fd, channel->flags, buf,
+                       channel->end - buf, &exn);
+  if (n <= 0) {
+    caml_raise_if_exception(exn);
+    CAMLassert(n == 0);
+    return n;
+  }
+  channel->offset += n;
+  channel->max = buf + n;
+  caml_raise_if_exception(exn);
+  return n;
+}
+
+/* caml_do_read is exported for Cash. Raises. */


This change is only necessary because caml_leave_blocking_section runs async callbacks.

jhjourdan · 2020-07-15T09:04:26Z

runtime/io.c

+//raises
+static void write_channel_to_fd(struct channel * channel, int towrite)
+{
+  int written;
+  value exn = Val_unit;
+  written = caml_write_fd_exn(channel->fd, channel->flags,
+                              channel->buff, towrite, &exn);
+  if (written < 0) {
+    caml_raise_if_exception(exn);
+    CAMLassert(0);
+  }
+  if (written < towrite)
+    memmove(channel->buff, channel->buff + written, towrite - written);
+  channel->offset += written;
+  channel->curr = channel->buff + towrite - written;
+  caml_raise_if_exception(exn);
+}
+


Same remark as in previous commit: this change would not be necessary if caml_leave_blocking_section did not run async callbacks.

gadmm · 2020-09-09T17:34:51Z

Is there any desire to have this in for 4.12 and if so what would be the time frame?

I saw comments on the goto style on another thread so I wanted to address this now:

Stylistically, I prefer not to use gotos when possible. This makes it much easier to review the resource flow in the function. So could you please only use them when this indeed factorizes the cleanup code?

This PR indeed explicitly advocates in favour of using the goto style for resource clean-up.

Using gotos separates the success path from the failure path. This makes the success path easier to read, which matters first, and the failure path should be deduced in a mechanical way, which should be easy to check. Since in the C code the exceptions will be asynchronous exceptions or other exceptions unrelated to the current purpose of the code, they always induce a failure path that you want to separate from the success path.
This is one of the valid and idiomatic uses of goto in C, so I do not think falls in any of the "prefer not to use goto" cases. It is true that it is useful to avoid code duplication, clean-up several resources in a stack-like fashion, (and improve cache locality for the success path, but I hope it does not matter in this PR), and you do not object to those cases. But in my experience, for those reasons, as you write code and as it evolves, it naturally ends up written with a goto even if the final code could be simplified, and it is extra effort to artificially avoid the idiom (with no advantage other than being able to say that one avoided a goto).

We will see how it ends up with the simplification at #9722, but I believe that the codebase will gain in readability if exception-handling code starts being written in a systematic fashion.

stedolan · 2020-09-15T11:19:37Z

My impression from reading the code here is that a great deal of it was working around the fact that caml_leave_blocking_section could previously raise exceptions. Now that that's no longer the case, how much of this patch is still necessary to achieve resource safety?

For instance, there is quite a lot of refactoring here to thread exceptions back from caml_read_fd in the case where it both successfully reads some data and also raises an exception. Now that caml_leave_blocking_section does not raise, this case is no longer possible.

In particular, are the non-raising versions of the exception-raising functions (e.g. caml_raise_end_of_file_exn and friends) still necessary? The only uses I could find were to support the threading of exceptions above.

gadmm · 2020-09-15T11:49:50Z

My impression from reading the code here is that a great deal of it was working around the fact that caml_leave_blocking_section could previously raise exceptions. Now that that's no longer the case, how much of this patch is still necessary to achieve resource safety?

There are still real bugs, but we will see better when I get back to it :) (which can happen faster if I can please have an idea of the schedule.)

For instance, there is quite a lot of refactoring here to thread exceptions back from caml_read_fd in the case where it both successfully reads some data and also raises an exception. Now that caml_leave_blocking_section does not raise, this case is no longer possible.

Surely, but keep in mind that this was written as part of an evolution path meant to carefully maintain backwards compatibility, which is a good enough reason for the added complexity.

In particular, are the non-raising versions of the exception-raising functions (e.g. caml_raise_end_of_file_exn and friends) still necessary? The only uses I could find were to support the threading of exceptions above.

The PR is first meant as extending the public API. As the commit log explains, the _exn variants to build exceptions are meant to solve the kind of problem at #7423 in the proper way. Of course, I like it better when the compiler code cleanup allows to demonstrate the advocated style, but this example might have to go.

I could keep this part of the new api in my freezer until it is useful for the compiler codebase, as I did until now, but the _exn idiom is more limited if one cannot build exceptions values in the first place, and so it is harder for the user to write resource-safe code.

damiendoligez · 2021-01-27T15:49:26Z

Where do we stand with this PR? Obviously, we missed the 4.12 feature freeze and the fact that #9722 was merged should greatly simplify this one, right?

gadmm · 2021-01-27T16:00:44Z

We are waiting that I rewrite it after #9722, but which should give a simpler PR indeed. It was not clear to me what was the time frame for 4.12 (see above), so I put it on my plate for 4.13. Maybe that your reminder will provide me with the nudge needed to get back to it.

This was referenced Oct 11, 2019

Resource-safe C interface, part 0 #9037

Merged

Sys.readdir + async signal deadlock #9036

Closed

damiendoligez added the high-priority label Feb 25, 2020

gadmm mentioned this pull request Jun 29, 2020

EINTR-based signals, again #9722

Merged

1 task

gadmm force-pushed the resource-safe-api-pt2 branch 2 times, most recently from 3fb0706 to 22ebc9c Compare July 9, 2020 09:55

gadmm changed the title ~~[RFC] Resource-safe C interface, part 2~~ Resource-safe C interface, part 2 Jul 9, 2020

gadmm marked this pull request as ready for review July 9, 2020 10:08

gadmm force-pushed the resource-safe-api-pt2 branch from 22ebc9c to 39bece0 Compare July 12, 2020 16:33

gadmm force-pushed the resource-safe-api-pt2 branch 2 times, most recently from 590b64a to a62da4b Compare July 12, 2020 21:35

gadmm added 15 commits July 14, 2020 01:27

Resource-safe C interface for blocking sections

8017e78

Add new variants caml_*_blocking_section_exn for the user who want to clean-up resources in case of an asynchronous exception.

[minor] Merge runtime/fail_nat.c and runtime/fail_byt.c

abf5f6b

Simplify for next commit.

caml_open_descriptor*: fix several leaks and a double free

b24cb95

dynlink_nat.c: fix leaks

3b5cbdc

dynlink.c : fix leak

36b7697

caml_sys_open: fix leak

bf35149

sys.c: more leaks fixed

21472b2

Add caml_{read,write}_fd_exn

361ef7c

Prepares for resource-safety audit of io.c

Audit of caml_write_fd: update channel state

412e1a7

Similar to the previous commit

intern.c: fix leak

7144611

extern.c: fix leak

764a9e0

Finish audit of io.c: fix leaks, and update state in seek functions

d2832a1

gadmm added 2 commits July 14, 2020 01:28

Changes

3b91249

might fix MSVC compilation

9a751be

gadmm force-pushed the resource-safe-api-pt2 branch from 053ed5e to 9a751be Compare July 13, 2020 23:29

jhjourdan requested changes Jul 15, 2020

View reviewed changes

jhjourdan mentioned this pull request Jul 23, 2020

Do not poll for async callbacks in caml_leave_blocking_section #9796

Closed

xavierleroy removed the high-priority label Oct 7, 2020

gadmm mentioned this pull request Feb 20, 2022

Reintroduce caml_final_do_calls_exn #11039

Merged

gadmm mentioned this pull request Oct 5, 2023

Resource-safe C interface: caml/fail.h #12407

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Resource-safe C interface, part 2 #8997

Resource-safe C interface, part 2 #8997

gadmm commented Sep 29, 2019 •

edited

gadmm commented Mar 2, 2020

gadmm commented Jul 9, 2020

gadmm commented Jul 12, 2020

jhjourdan commented Jul 13, 2020

jhjourdan left a comment

jhjourdan Jul 13, 2020

gadmm Jul 19, 2020

jhjourdan Jul 13, 2020

gadmm Jul 19, 2020

jhjourdan Jul 21, 2020

dra27 Jul 27, 2020

jhjourdan Jul 13, 2020

gadmm Jul 19, 2020 •

edited

jhjourdan Jul 21, 2020

gadmm Jul 22, 2020

jhjourdan Jul 22, 2020

gadmm Jul 22, 2020

jhjourdan Jul 22, 2020

jhjourdan Jul 13, 2020

jhjourdan Jul 13, 2020

jhjourdan Jul 14, 2020

jhjourdan Jul 14, 2020

jhjourdan Jul 14, 2020

jhjourdan Jul 15, 2020

jhjourdan Jul 15, 2020

jhjourdan Jul 15, 2020

jhjourdan Jul 15, 2020

gadmm commented Sep 9, 2020

stedolan commented Sep 15, 2020

gadmm commented Sep 15, 2020

damiendoligez commented Jan 27, 2021

gadmm commented Jan 27, 2021

	exn = caml_leave_blocking_section_exn();
	exn = caml_leave_blocking_section_exn();
	caml_stat_free(p);

Resource-safe C interface, part 2 #8997

Are you sure you want to change the base?

Resource-safe C interface, part 2 #8997

Conversation

gadmm commented Sep 29, 2019 • edited

gadmm commented Mar 2, 2020

gadmm commented Jul 9, 2020

gadmm commented Jul 12, 2020

jhjourdan commented Jul 13, 2020

jhjourdan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gadmm Jul 19, 2020 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gadmm commented Sep 9, 2020

stedolan commented Sep 15, 2020

gadmm commented Sep 15, 2020

damiendoligez commented Jan 27, 2021

gadmm commented Jan 27, 2021

gadmm commented Sep 29, 2019 •

edited

gadmm Jul 19, 2020 •

edited