Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix caml pooled mode and recursive startup #10304

Closed
wants to merge 7 commits into from
Closed

Conversation

gadmm
Copy link
Contributor

@gadmm gadmm commented Mar 20, 2021

Terminology:

  • pooled mode: the option that ensures that all memory allocated with caml_stat_alloc* is freed upon shutdown.
  • recursive startup: the ability to call caml_startup* several times (not technically recursively but whatever) described in the manual.

This PR fixes a crash in the instrumented runtime in pooled mode, and fixes some leaks in the normal runtime in pooled mode or when calling caml_startup* several times. It also factors some duplicated startup code and adds some comments to avoid similar issues in the future.

This affects all OCaml versions starting from 4.10.

@gadmm
Copy link
Contributor Author

gadmm commented Mar 20, 2021

In addition, the last commit prevents reparsing OCAMLRUNPARAM on subsequent calls to caml_startup*. In rare conditions, e.g. when using setenv/putenv to change the value of caml_use_huge_pages after OCaml has already started, this can lead to UB (a segfault most probably). This is unlikely to happen, but it shows that it does not make sense to reparse OCAMLRUNPARAM.

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are various things I don't understand with this change. I guess the main point is that there are several changes and it is not clear what does what.

The description announces that the PR fixes a crash, but how do we observe the crash? (Is it something that we should consider adding in our testsuite as a regression test?)

More importantly: before caml_startup_aux had a very simple role, it would count the number of times it was called to cutoff initialization, and also call caml_stat_create_pool which is related to the startup/shutdown logic. With your PR, a lot of the "init" logic is moved inside this function (why? this is not explained; is it to fix a bug or for code factorization?), but some of the logic remains in startup_byt and startup_nat. Looking at the diff there is no clear reason for what is moved where.

My guess would be that what stays in startup_nat is exactly, in your design, the initialization functions that need to be implemented differently in the bytecode and native runtime. Here is maybe a naming scheme that could make this simpler to follow/understand:

  • caml_starting_aux is left as it was before, and possibly renamed into caml_on_first_startup
  • the new stuff you moved in it moves to a different caml_init_aux function
  • the remaining init_* bits in the byt/nat caml_startup_* functions are moved to caml_init_{byt,nat} functions, that start by calling caml_init_aux (so as a function they are in charge of all initialization)

runtime/startup_aux.c Outdated Show resolved Hide resolved
runtime/startup_byt.c Outdated Show resolved Hide resolved
@gasche
Copy link
Member

gasche commented Mar 21, 2021

Ah, I realize now that some of my questions are answered by reviewing commit-by-commit, I had only read the whole diff. So if I understand correctly, the main fix is to ensure that caml_init_domain is not called several times. I think that my comments generally still apply.

@gadmm
Copy link
Contributor Author

gadmm commented Mar 21, 2021

Mea culpa, I forgot to describe the main bug and fix: the second commit moves functions that use caml_stat_alloc after the initialisation of the pool. Initialising the pool after something has been allocated will prevent the latter from being deallocated on shutdown, and will cause UB (most likely a crash) when attempting to free it manually. The logic of startup is a bit convoluted and recent evolutions forgot to take this logic into account (also it is hard to see how this can be caught by testing, but I think adding comments in the source will avoid these issues in the future).

@gasche
Copy link
Member

gasche commented Mar 21, 2021

also it is hard to see how this can be caught by testing

We could expose caml_stat_alloc_unpooled and caml_stat_alloc_outside_pool. The unpooled version would set a global flag that unpooled allocations happened (this is bad if the runtime intends to be pooled), possibly only in the debug runtime. The outside_pool version is for allocations we explicitly want to happen outside the pool even in pooled mode (for example #10250, #10266). The pool == NULL fallback of caml_stat_alloc would call caml_stat_alloc_unpooled.

Then a test could a C program starting the pooled runtime, interpreting a toy bytecode program, shutting down the runtime and cleaning the pool, then checking that the unpooled flag is not set.

@gadmm
Copy link
Contributor Author

gadmm commented Mar 21, 2021

I started implementing the idea of a flag but then realised that for people who call caml_stat_alloc*_noexc functions without the runtime lock, this flag would need to be synchronised. The documentation of some _noexc functions explicitly mention that the runtime lock is not needed. The pooled mode was not made the default because of backwards compatibility concerns (with lots of positives found in the original PR), but in fact the new requirements were not added to the documentation (again: in pooled mode you need the runtime lock to call the _noexc functions) and are not enforced in any way with default OCaml settings, so library writers are unlikely to have started to follow the new discipline. These requirements are not even necessarily easy to follow, think about needing to acquire the runtime lock (and thus knowing whether you already have it) if you need to free a block given to foreign code during clean-up after an error condition (when the documentation of caml_stat_free does not even mention needing the runtime lock).

It seems to me that a better course of action would be to:

  • introduce caml_stat_alloc_pooled* functions (they would fail if the pool is not initialised),
  • use pooled functions everwhere in the runtime (except at the few places where one needs explicit management),
  • remove the optional pooled mode which is no longer necessary to prevent leaks when shutting down the runtime.

The caml_stat_alloc* functions would recover their original semantics compatible with the current documentation, and libraries could start using the pooled allocations explicitly if they wanted. And the memory allocated by the runtime will be freed for everybody.

I want to stress that to take advantage of the pooled allocations the program or library must be explicitly designed for it. Indeed, for a library to correctly release memory allocated with caml_stat_alloc*, it must still call caml_stat_free explicitly because pooling is not the default behaviour. In this scenario, when activating the pooled mode, either the pooled mode has no useful effect (if the deletions are all done before shutdown and with runtime lock held), or it always causes UB (otherwise). So it does not make sense to have a switch to choose whether allocations are pooled or not.

Anyway, these comments go fairly beyond the purpose of this PR. For the current PR I prefer to stop at its current state I think.

@gasche
Copy link
Member

gasche commented Mar 21, 2021

Sounds reasonable, thanks. Do you have an opinion on the refactoring of the init functions proposed in my review comment?

@gadmm
Copy link
Contributor Author

gadmm commented Mar 22, 2021

Yes, I do not see what is wrong about moving stuff to startup_aux. Even if it had a specific purpose I do not see the issue with repurposing it.

@gasche
Copy link
Member

gasche commented Mar 22, 2021

The function moves from "one purpose" to "two purposes" (what it was doing now + some of the runtime initialization logic), so it is getting strictly more complex, which is bad.

@gasche
Copy link
Member

gasche commented Mar 24, 2021

cc @murmour: do you still care about pooled allocations? What do you think of @gadmm's different API proposal? (Would you want to implement it?)

@xavierleroy
Copy link
Contributor

(I also failed to find evidence that anyone uses the OCAMLRUNPARAM c but this is harder to make sure of.)

For the record: the "c" parameter is tested as part of the CI at Inria, in the "other configs" script:

OCAMLRUNPARAM="c=1" ${main}

and also in the "sanitizers" script, in conjunction with a leak detector:

# Build the system. We want to check for memory leaks, hence
# 1- force ocamlrun to free memory before exiting
# 2- add an exception for ocamlyacc, which doesn't free memory
OCAMLRUNPARAM="c=1" \
LSAN_OPTIONS="suppressions=$(pwd)/tools/ci/inria/sanitizers/lsan-suppr.txt" \
make $jobs

So I'm reasonably confident that the "c" mode works and is maintained. The "pooled" API is another matter, though.

@murmour
Copy link
Contributor

murmour commented Mar 25, 2021

cc @murmour: do you still care about pooled allocations? What do you think of @gadmm's different API proposal? (Would you want to implement it?)

Hello, Gabriel, Guillaume, and Monsieur Leroy!

I still use the pooled (unloadable) runtime in 4.06, and totally care about it.

This PR is quite interesting, and I will do a proper review a bit later (in a few days).

@gadmm
Copy link
Contributor Author

gadmm commented Mar 25, 2021

This got me puzzled as to why it did not catch that Caml_state was allocated before pool initialisation. Digging deeper, the tool considers global variables as live at the end of the program, so it did not consider Caml_state as leaking. I did not know, but it does not catch similar leaks (including anything reachable from globals, thread-local variables, etc.)

@gadmm
Copy link
Contributor Author

gadmm commented Mar 25, 2021

I have pushed some simplifications (make static what does not need to be extern, etc.). I have also found a hidden function that could be factored (init_atom_table). Its place inside init_static in nat was nice and appropriate but I think it is even better to factor the call between nat and byt and move it to the common file.

As discussed with @gasche I do not see how to easily improve the organisation further but if you want to show me what you have in mind I will be happy to consider it as a patch.

@gasche
Copy link
Member

gasche commented Mar 25, 2021

See gadmm#1 for a modest proposal.

@gadmm
Copy link
Contributor Author

gadmm commented Mar 28, 2021

I added @gasche's suggestion and take advantage of this open PR to introduce a minor clarification to the documentation (closes #10179). This is hopefully the last addition to this PR, further reviews welcome!

Copy link
Contributor

@murmour murmour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General comment: this PR lacks structure, which made it difficult to review. It is good that the changes are split into commits, but these commits lack proper messages (with explanations), and aligning these commits with the PR comments required effort.

Changes entry could be more detailed. Perhaps it should be split into 2-3 entries.


In addition, the last commit prevents reparsing OCAMLRUNPARAM on subsequent calls to caml_startup*. In rare conditions, e.g. when using setenv/putenv to change the value of caml_use_huge_pages after OCaml has already started, this can lead to UB (a segfault most probably). This is unlikely to happen, but it shows that it does not make sense to reparse OCAMLRUNPARAM.
[...]
Mea culpa, I forgot to describe the main bug and fix: the second commit moves functions that use caml_stat_alloc after the initialisation of the pool. Initialising the pool after something has been allocated will prevent the latter from being deallocated on shutdown, and will cause UB (most likely a crash) when attempting to free it manually. The logic of startup is a bit convoluted and recent evolutions forgot to take this logic into account (also it is hard to see how this can be caught by testing, but I think adding comments in the source will avoid these issues in the future).

Correct. Both of these problems (an unlikely UB and a few minor leaks) were introduced by me in cef4a94, while adding caml_cleanup_on_exit.


See gadmm#1 for a modest proposal.

What strikes me wrong with this approach is that caml_{startup,shutdown}_needed are named like predicates, but actually increment/decrement global counters on calls, which is completely unobvious.

How about just renaming caml_stratup_aux to caml_startup_common and documenting it properly?


I [...] take advantage of this open PR to introduce a minor clarification to the documentation (closes #10179).

This change is correct and fixes a mistake that I added in af5899f.

Comment on lines 171 to 172
if (pooling)
caml_cleanup_on_exit = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the meaning in this change.

Previously, if the user set OCAMLRUNPARAM=c, then caml_cleanup_on_exit became 1, which enabled pooling.

Now, if pooling is enabled (via a call to caml_startup_pooled), it sets caml_cleanup_on_exit to 1, which later forces caml_shutdown on exit. But the user didn't ask for this! Leak detection mode (caml_cleanup_on_exit) isn't the same as ability to unload the runtime manually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed this change, which stemmed from my misunderstanding. I also add a clarification of the difference between the two features in the documentation if you accept, to explain what you just wrote.

manual/src/cmds/intf-c.etex Show resolved Hide resolved
@gadmm
Copy link
Contributor Author

gadmm commented Apr 8, 2021

I will come back to this PR in a moment. In the meanwhile, I'd like to know what you think @gasche :

See gadmm#1 for a modest proposal.

What strikes me wrong with this approach is that caml_{startup,shutdown}_needed are named like predicates, but actually increment/decrement global counters on calls, which is completely unobvious.

How about just renaming caml_stratup_aux to caml_startup_common and documenting it properly?

I tend to agree with @murmour, for this reason and because in the end the code to call the two functions ends up duplicated between the three call sites. Would this alternative be fine with you?

@gadmm
Copy link
Contributor Author

gadmm commented Apr 8, 2021

@murmour :

Both of these problems (an unlikely UB and a few minor leaks) were introduced by me in cef4a94, while adding caml_cleanup_on_exit.

To make sure I did not miss anything, the bug with Caml_state and the event buffer being allocated before pool initialization is more recent, right?

@gasche
Copy link
Member

gasche commented Apr 8, 2021

Would this alternative be fine with you?

I think; I timed out on this problem and I am okay with a code proposal that you and @murmour would both like.

@murmour
Copy link
Contributor

murmour commented Apr 8, 2021

@murmour :

Both of these problems (an unlikely UB and a few minor leaks) were introduced by me in cef4a94, while adding caml_cleanup_on_exit.

To make sure I did not miss anything, the bug with Caml_state and the event buffer being allocated before pool initialization is more recent, right?

Right. It was introduced in eac4860 (#8713, 4.10.0).

@murmour
Copy link
Contributor

murmour commented Apr 8, 2021

@gadmm, regarding your ideas on caml_stat_* API: it's important stuff, and I'll give thorough feedback on it a bit later (running short of time right now). It is out of scope of this PR anyway.

We move duplicated code to startup_aux to simplify further
modifications. This should not change the meaning of the code.
caml_init_domain and CAML_EVENTLOG_INIT call caml_stat_alloc
functions, so they are moved after the pool initialisation.

Suubsequently, the call to caml_record_backtrace, which accesses the
Caml_state, must be moved later.

We document that the pool is only available after some point, to hope
avoid similar bugs in the future.
This commit prevents reparsing OCAMLRUNPARAM on subsequent calls to
caml_startup*. In rare conditions, e.g. when using setenv/putenv to
change the value of caml_use_huge_pages after OCaml has already
started, this can lead to UB (a segfault most probably). This is
unlikely to happen, but it shows that it does not make sense to
reparse OCAMLRUNPARAM. It is unlikely that the reparsing of
OCAMLRUNPARAM was relied on in practice.
The calls to caml_init_atom_table can be factored.

Then, make static what can be.
Fix ocaml#10179 (wrong example of thing that is released on shutdown), and
clarify the difference between startup_pooled and the
"cleanup_on_exit" function (at least to me).

Fix typo.
Also the previous caml_startup_common should be static; rename it.
@gadmm
Copy link
Contributor Author

gadmm commented Apr 9, 2021

Here is a rebased version. Comments have been added in the logs, otherwise only the last three commits have changed.

Thanks for the review. I took your comments into account.

You are right that I could have made your task easier by including detailed commit logs and mentioning that it is best reviewed commit-per-commit. I will be more careful next time.

I do not see how to improve the Changes entry. I think all the user-visible changes have been described.

@gadmm
Copy link
Contributor Author

gadmm commented Apr 10, 2021

The failing test appears unrelated.

@gadmm
Copy link
Contributor Author

gadmm commented Apr 14, 2022

Happy belated birthday, PR#10304!

I am closing this PR since I no longer have time not interest in this problem in light of the absence of review and the work needed to rebase it on top of multicore (and duplication of work to fix the issues in both OCaml 4 and 5).

I have tried to use the pooled mode for our library, and as explained in #10304 (comment), this mode needs a rework before it can be useful, so we are going to do without it. For now I simply recommend to stay clear from the "c" runparam.

But anyone who cares about these issues should feel free to adopt it and/or open an issue about it. I am not deleting the branch so the code will remain available for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants