Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resource-safe C interface, part 2 #8997

Open
wants to merge 17 commits into
base: trunk
Choose a base branch
from

Conversation

gadmm
Copy link
Contributor

@gadmm gadmm commented Sep 29, 2019

This PR continues with the resource-safe C interface to allow writing resource-safe C functions in the presence of asynchronous exceptions. The ultimately goal is to make runtime correct in the face of asynchronous exceptions. It introduces new _exn variants of current functions to allow cleaning-up after asynchronous exceptions, in the public interface:

  • caml_{enter,leave}_blocking_section_exn.
  • Functions from caml/fail.h that raise exceptions, e.g. caml_failwith_exn and many others.

In the second part, these are used to fix leaks in several places of the runtime. The general strategy is to make the exception raising explicit through error codes / exception monad by replacing raising functions by functions that return encoded exceptions. This has been made possible by the work on making polling predictable in the C API.

This contains a proposal for a public-facing interface. This is best read commit-per-commit, with further details in the commit logs.

Left for future work:

  • Audit the runtime for further leaks by documenting all raising C functions with // raises. This PR fixes quite a few leaks but those only were obvious ones.
  • Do the same for the Out_of_memory exception raised by some allocation functions, and find a general strategy to deal with it.

@gadmm
Copy link
Contributor Author

gadmm commented Mar 2, 2020

Thanks for the label, the PR is currently waiting for me to get back to it and rebase. Comments about the text of the PR are still welcome, whereas it is useless to comment on the implementation at the moment.

@gadmm gadmm mentioned this pull request Jun 29, 2020
1 task
@gadmm gadmm force-pushed the resource-safe-api-pt2 branch 2 times, most recently from 3fb0706 to 22ebc9c Compare July 9, 2020 09:55
@gadmm gadmm changed the title [RFC] Resource-safe C interface, part 2 Resource-safe C interface, part 2 Jul 9, 2020
@gadmm gadmm marked this pull request as ready for review July 9, 2020 10:08
@gadmm
Copy link
Contributor Author

gadmm commented Jul 9, 2020

This PR is no longer an RFC. It has been simplified to add only what I found necessary in the API, and I fixed the leaks that I had identified previously. It is ready for review. I think this can interest @xavierleroy, @stedolan and @jhjourdan. It fixes a couple of real bugs and I think that the API additions will be consensual.

There are a couple of CI failures that I need to investigate.

@gadmm
Copy link
Contributor Author

gadmm commented Jul 12, 2020

(Note: the description at the top has been updated.)

I have added more patches resulting from a full audit of async exceptions in io.c, fixing several leaks, and also the channel state invalidation noticed by @stedolan at #9722. I expect that there are no more leaks/state invalidation in io.c after this.

@gadmm gadmm force-pushed the resource-safe-api-pt2 branch 2 times, most recently from 590b64a to a62da4b Compare July 12, 2020 21:35
@jhjourdan
Copy link
Contributor

(Appveyor MSVC64 build is failing.)

gadmm added 15 commits July 14, 2020 01:27
Add new variants caml_*_blocking_section_exn for the user who want to
clean-up resources in case of an asynchronous exception.
Introduce variants of the functions that do not immediately raise the
exception, allowing for resource cleanup.

This is used in this way in an ulterior commit to avod duplicating
cleanup code.

Another motivating example is given at ocaml#7423:

> caml_invalid_argument(str) is no return and does not free it's
argument. So calling it with a string constructed dynamically will
mean it'll never get freed.

This led at the time to the introduction of
caml_invalid_argument_value and caml_failwith_value that accept an
OCaml string as argument. The present commit gives a general solution
that fits with the rest of the resource-safe API.
Add non-raising variants of caml_enter/leave_blocking_section.

The intention is to mimick masking in the release path, when one
already holds an exception. Masking is not there yet, so we introduce
these functions temporarily. They remain internal, to avoid redundancy
in the API.
Prepares for resource-safety audit of io.c
Only stdin is global in Stdlib, and it is not seekable, so its state
remained valid on exception. However it does not cost much to keep the
offset and the max pointer up to date in all situtations when an
interrupt arises.

In addition, one can be suspicious that channel locks may leak, which
can cause deadlocks. However, this is already dealt with Unlock_exn.
Similar to the previous commit
Copy link
Contributor

@jhjourdan jhjourdan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I think this is a good change: it fixes many resource leaks and real bugs in several functions of the runtime, and introduce a good practice for handling exception in C code.

However, I have some general remarks:

  • Having caml_leave_blocking_section run async callbacks makes it difficult to free resources/raise the right exception in code that calls it. As discussed in EINTR-based signals, again #9722, I think that not running them would simplify much the code, without delaying too much async callbacks.
  • You will need to rebase this PR on top of EINTR-based signals, again #9722 once it will be merged, since both PR touch the smae piece of code.
  • Stylistically, I prefer not to use gotos when possible. This makes it much easier to review the resource flow in the function. So could you please only use them when this indeed factorizes the cleanup code?

Comment on lines +33 to +43
CAMLextern void caml_enter_blocking_section (void); // raises
CAMLextern void caml_leave_blocking_section (void); // raises

CAMLextern value caml_enter_blocking_section_exn (void);
/* Same as [caml_enter_blocking_section], but return the exception if
any. The runtime lock is held upon return if and only if the result
satisfies Is_exception_result. */

CAMLextern value caml_leave_blocking_section_exn (void);
/* Same as [caml_leave_blocking_section], but return the exception if
any. The runtime lock is held upon return in all cases. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the discussion in #9722, I am now more and more convinced that caml_leave_blocking_section should not execute async callbacks (hence no need for caml_leave_blocking_section_exn). The reason is that, first, as @stedolan explained, it is almost certain that the blocking section returned a result that we need to exploit before doing anything else which might interrupt the call, and second, we will run polling code soon, so there is no risk to loose async callbacks.

On the other hand, polling in caml_enter_blocking_section is necessary because we will probably block for a some time and and should avoid waiting for async callbacks during this time. However, in order to get exception safety without having to many versions of these functions, I would prefer the approach in #9722 where we use a function like caml_enter_blocking_section_no_pending. Then, one should simply call caml_process_pending_actions_exn before if one wants to both poll async callbacks and releasing the runtime lock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review, I just notice it now.

Indeed, I will rebase on top of #9722 and follow whatever decision is made there. However, I mentioned at #9722 a serious breakage of not polling in caml_leave_blocking_section that I just noticed: it serves to poll after a system call returns EINTR, just before Sys_error/Unix_error(EINTR,_) is raised. Once the exception is raised (changing a function raising Sys.Break to a function raising Sys_error, say), the argument that one will poll soon enough no longer applies for reasons I explained there. And since it is possible that this pattern could be found in user C libraries, fixing the Sys and Unix modules by hand would not be enough. I propose to discuss this at #9722.

@@ -139,7 +140,10 @@ static void open_connection(void)
if (dbg_socket == -1)
caml_fatal_error("_open_osfhandle failed");
#endif
dbg_in = caml_open_descriptor_in(dbg_socket);
dbg_socket_in = dup(dbg_socket);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I know, dup is not available on Windows.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to use _dup, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I have no idea.

If we want to actually call dup on unix, then the best is probably to ask @dra27 about what is the right thing to do here.

But see #9786: the original implementation corresponds to a common pattern in OCaml programming with the Unix module, which, I think, should not be considered a bug. So, I think I would prefer keeping the initial implementation, bug guard somewhere else about the double close of file descriptors used in channels.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dup is aliased to _dup, so it'll work - it would be better to use _dup on Windows, but goodness knows if/when Microsoft will actually remove the POSIX aliases.

What you're doing should work fine on Windows; I have no comment on whether it's necessarily what's wanted on either platform.

{
struct channel * channel;

channel = (struct channel *) caml_stat_alloc(sizeof(struct channel));
channel->fd = fd;
caml_enter_blocking_section();
*exn = caml_enter_blocking_section_exn();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we avoid writing to *exn when no exception is called?

Copy link
Contributor Author

@gadmm gadmm Jul 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two possible conventions:

  • Only write to *exn if there is an exception.
  • Always write to *exn; i.e. treat exn as an output parameter.

In the latter case, the caller does not need to initialise *exn. Question for seasoned C programmers: are there any coding guidelines saying that one is better or more expected than the other?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, a common pattern for out parameters is to not write them when there is no corresponding result, and signal by another channel (i.e., the return value of the function) that the out parameter has been (or not) written.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok but then I would think that exn is the error code, and the channel is the output param. I.e. the signature should be

CAMLexport value caml_open_descriptor_in_exn(int fd, struct channel ** chan)

instead of:

CAMLexport struct channel * caml_open_descriptor_in_exn(int fd, value * exn)

I thought about doing this, but preferred a change that would be perceived as smaller. It seems that we will have to do it anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not change the return type of the function. I think this is misleading. Why cannot you guarantee that [exn] is written iff the return value is NULL?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could, but in addition I would prefer if there is a convention that for _exn functions one has to first check for Is_exception_result.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. But this is a convention to be followed by the caller side. This does not prevent you to follow the usual convention that out parameters are only written in case it is needed. Of course, there will be some redundancy, but is that an issue?


p = caml_stat_strdup_to_os(String_val(filename));
caml_enter_blocking_section();
exn = caml_enter_blocking_section_exn();
if (Is_exception_result(exn)) goto cleanup1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (Is_exception_result(exn)) goto cleanup1;
if (Is_exception_result(exn)) {
caml_stat_free(p);
caml_raise(Extract_exception(exn));
}

Avoiding goto when possible is always good. In this case, I don't think the code becomes more complex.

dlhandle = caml_dlopen(p, 1, Int_val(global));
caml_leave_blocking_section();
caml_stat_free(p);
exn = caml_leave_blocking_section_exn();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
exn = caml_leave_blocking_section_exn();
exn = caml_leave_blocking_section_exn();
caml_stat_free(p);

Why not freeing p here? This would make the change smaller, the lifetime of p shorter and the resources of the function easier to track.

Of course, you then have to free it in the exceptional case above.

@@ -180,6 +180,11 @@ CAMLexport value caml_enter_blocking_section_exn(void)
return Val_unit;
}

void caml_enter_blocking_section_noexn(void)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please synchronize with @stedolan for the name of the function. As far as I'm concerned, I have a slight preference for _nopending, since it makes it clear that we are missing a polling point and that it should be replaced if possible by another polling point.

Comment on lines 104 to 108
// TODO: blocking section with masking
if (dlhandle)
if (dlhandle) {
caml_enter_blocking_section_noexn();
caml_dlclose(dlhandle);
caml_leave_blocking_section_noexn();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other calls to caml_dlclose are not guarded by blocking sections. Not that I really care about it, but if we decide that this call is blocking, then it should be uniform.

Comment on lines +38 to +41
In case of error, sets [exn] to an encoded exception value
[Sys_error], [Sys_blocked_io] or an exception raised from a signal
handler, if any. If the return value is -1 then [exn] contains an
encoded exception. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This interface is rather weird. Usually, we would expect that when there is an error, no byte is read.

Please make it clear that it is possible that an error is returned and some byte is read. Moreover, is there a point in returning -1 in the case of an error when no byte is read? Why not returning 0 in this case?

Anyway, note that we would not have this weird interface if caml_leave_blocking_section did not run async callbacks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, these functions write (Val_unit) to *exn even if no errror appears. Please either make this clear in the comment or fix this behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, as far as I understand, if caml_leave_blocking_section did not run async callbacks, then all these changes would not be necessary, because it would be safe to raise exceptions from io.c functions.

runtime/io.c Outdated
Comment on lines 304 to 322
/* caml_do_read is exported for Cash */
// raises
static int read_fd_into_channel(struct channel * channel, char * buf)
{
int n;
value exn = Val_unit;
n = caml_read_fd_exn(channel->fd, channel->flags, buf,
channel->end - buf, &exn);
if (n <= 0) {
caml_raise_if_exception(exn);
CAMLassert(n == 0);
return n;
}
channel->offset += n;
channel->max = buf + n;
caml_raise_if_exception(exn);
return n;
}

/* caml_do_read is exported for Cash. Raises. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is only necessary because caml_leave_blocking_section runs async callbacks.

runtime/io.c Outdated
Comment on lines 198 to 215
//raises
static void write_channel_to_fd(struct channel * channel, int towrite)
{
int written;
value exn = Val_unit;
written = caml_write_fd_exn(channel->fd, channel->flags,
channel->buff, towrite, &exn);
if (written < 0) {
caml_raise_if_exception(exn);
CAMLassert(0);
}
if (written < towrite)
memmove(channel->buff, channel->buff + written, towrite - written);
channel->offset += written;
channel->curr = channel->buff + towrite - written;
caml_raise_if_exception(exn);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same remark as in previous commit: this change would not be necessary if caml_leave_blocking_section did not run async callbacks.

@gadmm
Copy link
Contributor Author

gadmm commented Sep 9, 2020

Is there any desire to have this in for 4.12 and if so what would be the time frame?

I saw comments on the goto style on another thread so I wanted to address this now:

Stylistically, I prefer not to use gotos when possible. This makes it much easier to review the resource flow in the function. So could you please only use them when this indeed factorizes the cleanup code?

This PR indeed explicitly advocates in favour of using the goto style for resource clean-up.

  1. Using gotos separates the success path from the failure path. This makes the success path easier to read, which matters first, and the failure path should be deduced in a mechanical way, which should be easy to check. Since in the C code the exceptions will be asynchronous exceptions or other exceptions unrelated to the current purpose of the code, they always induce a failure path that you want to separate from the success path.

  2. This is one of the valid and idiomatic uses of goto in C, so I do not think falls in any of the "prefer not to use goto" cases. It is true that it is useful to avoid code duplication, clean-up several resources in a stack-like fashion, (and improve cache locality for the success path, but I hope it does not matter in this PR), and you do not object to those cases. But in my experience, for those reasons, as you write code and as it evolves, it naturally ends up written with a goto even if the final code could be simplified, and it is extra effort to artificially avoid the idiom (with no advantage other than being able to say that one avoided a goto).

We will see how it ends up with the simplification at #9722, but I believe that the codebase will gain in readability if exception-handling code starts being written in a systematic fashion.

@stedolan
Copy link
Contributor

My impression from reading the code here is that a great deal of it was working around the fact that caml_leave_blocking_section could previously raise exceptions. Now that that's no longer the case, how much of this patch is still necessary to achieve resource safety?

For instance, there is quite a lot of refactoring here to thread exceptions back from caml_read_fd in the case where it both successfully reads some data and also raises an exception. Now that caml_leave_blocking_section does not raise, this case is no longer possible.

In particular, are the non-raising versions of the exception-raising functions (e.g. caml_raise_end_of_file_exn and friends) still necessary? The only uses I could find were to support the threading of exceptions above.

@gadmm
Copy link
Contributor Author

gadmm commented Sep 15, 2020

My impression from reading the code here is that a great deal of it was working around the fact that caml_leave_blocking_section could previously raise exceptions. Now that that's no longer the case, how much of this patch is still necessary to achieve resource safety?

There are still real bugs, but we will see better when I get back to it :) (which can happen faster if I can please have an idea of the schedule.)

For instance, there is quite a lot of refactoring here to thread exceptions back from caml_read_fd in the case where it both successfully reads some data and also raises an exception. Now that caml_leave_blocking_section does not raise, this case is no longer possible.

Surely, but keep in mind that this was written as part of an evolution path meant to carefully maintain backwards compatibility, which is a good enough reason for the added complexity.

In particular, are the non-raising versions of the exception-raising functions (e.g. caml_raise_end_of_file_exn and friends) still necessary? The only uses I could find were to support the threading of exceptions above.

The PR is first meant as extending the public API. As the commit log explains, the _exn variants to build exceptions are meant to solve the kind of problem at #7423 in the proper way. Of course, I like it better when the compiler code cleanup allows to demonstrate the advocated style, but this example might have to go.

I could keep this part of the new api in my freezer until it is useful for the compiler codebase, as I did until now, but the _exn idiom is more limited if one cannot build exceptions values in the first place, and so it is harder for the user to write resource-safe code.

@damiendoligez
Copy link
Member

Where do we stand with this PR? Obviously, we missed the 4.12 feature freeze and the fact that #9722 was merged should greatly simplify this one, right?

@gadmm
Copy link
Contributor Author

gadmm commented Jan 27, 2021

We are waiting that I rewrite it after #9722, but which should give a simpler PR indeed. It was not clear to me what was the time frame for 4.12 (see above), so I put it on my plate for 4.13. Maybe that your reminder will provide me with the nudge needed to get back to it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants