Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed memory leak from sigaltstack. #11167

Merged
merged 3 commits into from
Apr 12, 2022
Merged

Conversation

azewierzejew
Copy link
Contributor

Currently stack for handling signals on the first domain always leaks.
As far as I have read the code extra spawned domains have their stacks cleared on exit.
The following fix clears the stack on the original domain.

It wouldn't free data if caml_shutdown was called on different thread than caml_startup_pooled.
It wouldn't break anything either way (as the function would do nothing).
But I don't know whether calling those functions from different threads is allowed.

@kayceesrk kayceesrk requested a review from abbysmal April 7, 2022 03:59
@gasche
Copy link
Member

gasche commented Apr 7, 2022

As a newcomer to this part of the runtime, I was surprised by this PR because "freeing the signal stack" was already discussed in #10726 and I understood it to be a solved issue. Here is what I understand:

  • create_domain mallocs a signal stack (caml_init_signal_stack)
  • domain_terminate frees the signal stack (caml_free_signal_stack)
  • but the main domain (domain 0) is created with create_domain by caml_init_domains (itself calld by caml_init_gc called by caml_startup_common) but not collected with domain_terminate (right?), so its own signal stack remains unfreed

(Why is this not an issue for other cleanup things that are done in domain_terminate? I see for example the same pattern with CAML_EVENTLOG_{INIT,TEARDOWN}.)

Ideally each init function would have a corresponding free/teardown function, with symmetric code, and the caml_shutdown function would call the free_foo functions for each init_foo function called in caml_startup_*, and nothing more. The proposed patch does something that seems correct but is also abstract-breaking in this sense -- we cleanup directly in caml_shutdown a resource that was allocated deep inside startup chain. It may be nice to have caml_free_domains in domain.c, and caml_free_gc in gc_ctrl.c, and call caml_free_gc from caml_shutdown.

@gasche
Copy link
Member

gasche commented Apr 7, 2022

One aspect I wonder about is: shouldn't we call domain_terminate on the main domain as well? Some of its work is meant for synchronization with other running domains and wouldn't be necessary, but that seems fine? Are there other reasons for not doing it?

(cc @dra27 @kayceesrk @ctk21 @Engil)

@azewierzejew
Copy link
Contributor Author

@gasche Vast majority of other allocation is done with caml_stat_* functions so there is no need to free.
Stack can't be allocated on the pool, because then there will be a dangling pointer when pool is freed.

As for if it is the best solution, I have no idea.

@dra27
Copy link
Member

dra27 commented Apr 7, 2022

Just as a note, I think everything in #10726 was (intentionally) lost as part of the multicore PR merge, so it's not completely surprising that this might have reappeared. The sigaltstack freeing in 4.x is tied in with 4.x's signal handling, which is completely overhauled in 5.0

@azewierzejew
Copy link
Contributor Author

Just as a note, I think everything in #10726 was (intentionally) lost as part of the multicore PR merge, so it's not completely surprising that this might have reappeared. The sigaltstack freeing in 4.x is tied in with 4.x's signal handling, which is completely overhauled in 5.0

If I'm correct that trunk of 5.0 is https://github.com/ocaml-multicore/ocaml-multicore then it isn't fixed as there is the complementary problem (freeing stack without reseting).

I've created an issue for that ocaml-multicore/ocaml-multicore#838
so you can choose were it should be discussed.

@dra27
Copy link
Member

dra27 commented Apr 7, 2022

The ocaml-multicore repository is mostly for historical interest now (earlier versions of multicore, and so forth) - it's just branch trunk on ocaml/ocaml which matters now (5.0 will be branched from it relatively soon).

@azewierzejew
Copy link
Contributor Author

As a newcomer to this part of the runtime, I was surprised by this PR because "freeing the signal stack" was already discussed in #10726 and I understood it to be a solved issue. Here is what I understand:

* `create_domain` mallocs a signal stack (`caml_init_signal_stack`)

* `domain_terminate` frees the signal stack (`caml_free_signal_stack`)

* but the main domain (domain 0) is created with `create_domain` by `caml_init_domains` (itself calld by `caml_init_gc` called by `caml_startup_common`) but not collected with `domain_terminate` (right?), so its own signal stack remains unfreed

(Why is this not an issue for other cleanup things that are done in domain_terminate? I see for example the same pattern with CAML_EVENTLOG_{INIT,TEARDOWN}.)

Ideally each init function would have a corresponding free/teardown function, with symmetric code, and the caml_shutdown function would call the free_foo functions for each init_foo function called in caml_startup_*, and nothing more. The proposed patch does something that seems correct but is also abstract-breaking in this sense -- we cleanup directly in caml_shutdown a resource that was allocated deep inside startup chain. It may be nice to have caml_free_domains in domain.c, and caml_free_gc in gc_ctrl.c, and call caml_free_gc from caml_shutdown.

I've created a new proposition in https://github.com/azewierzejew/ocaml/tree/signal-stack-fix-alternative
It is more inline with your comment and additionally it unregisters ocaml signal handlers before exiting.
By that I mean it resets all signal handlers to default that were changed.

@gasche
Copy link
Member

gasche commented Apr 10, 2022

Your "alternative" PR is sensibly more invasive in terms of change, and I don't feel competent enough to review it by myself. It combines three things:

  • it restructures the cleanup functions, I unconditionally like this part
  • it calls terminate_domain on the main domain, I think this is right but I need more work to think throug the implications
  • it contains signal cleanup code that I have no expertise about and would need another review

I would be in favor of un-drafting the present PR and merging it to fix the leak, and then discussing restructurations as a second step.

@sadiqj
Copy link
Contributor

sadiqj commented Apr 11, 2022

caml_free_signal_handling on your alternative tree looks fine. I'm not sure I fully understand the need for caml_free_gc as it just seems to redirect to caml_free_domains at the moment. I probably also need to think through the implications of domain_terminate being called during shutdown.

Agree with @gasche , we should un-draft this PR, merge it and then follow up with a restructuring.

@gasche
Copy link
Member

gasche commented Apr 11, 2022

I'm not sure I fully understand the need for caml_free_gc as it just seems to redirect to caml_free_domains at the moment.

This mirrors the structure of the initialization code, where caml_init_domains is, somewhat weirdly, called by caml_init_gc. (One justification might be that caml_init_domains also initializes the minor GC for each domain, and maybe caml_init_gc means to initialize all GCs, not just the major GC. I would still find it clearer if the two were separated.)

I probably also need to think through the implications of domain_terminate being called during shutdown.

We're in the same boat, it's not a trivial change, but I have a good feeling about it: I think it's more likely to be right (and remain right as the code evolves) as the current approach -- we should ensure that create_domain is always followed by domain_terminate on successful termination.

@azewierzejew azewierzejew marked this pull request as ready for review April 11, 2022 13:15
@azewierzejew
Copy link
Contributor Author

I undrafted the PR.

As for using domain_terminate I've tried to read into and noticed that error handling code in create_domain looks kind of leaky.
For example it "leaks" the caml_num_running_domains counter, because on fail the counter is not decreased, but also the domain_terminate isn't called after that as domains_self is set to NULL.

Similar things happens to domain_self->ephe_state. It is allocated in caml_init_major_gc but isn't freed in caml_teardown_major_gc but manually in domain_terminate. So any error past initiation of major gc will leak that (but it would be pooled).

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm approving the current minimal fix, hoping that we can discuss restructuration of the cleanup code next.

@azewierzejew
Copy link
Contributor Author

azewierzejew commented Apr 11, 2022

Also calling domain_terminate currently won't work perfectly because there is assertion that backup thread is running, which isn't true on domain 0 if no other domains were spawned (for domain 0 it is spawned lazily in caml_domain_spawn).

@sadiqj
Copy link
Contributor

sadiqj commented Apr 11, 2022

@azewierzejew if you could do an update of Changes with reviewers, that should also give appveyor a kick and once it's all green I can merge.

@sadiqj sadiqj merged commit 804b2b4 into ocaml:trunk Apr 12, 2022
@sadiqj
Copy link
Contributor

sadiqj commented Apr 12, 2022

Thanks @azewierzejew !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants