Fix caml pooled mode and recursive startup #10304

gadmm · 2021-03-20T22:45:08Z

Terminology:

pooled mode: the option that ensures that all memory allocated with caml_stat_alloc* is freed upon shutdown.
recursive startup: the ability to call caml_startup* several times (not technically recursively but whatever) described in the manual.

This PR fixes a crash in the instrumented runtime in pooled mode, and fixes some leaks in the normal runtime in pooled mode or when calling caml_startup* several times. It also factors some duplicated startup code and adds some comments to avoid similar issues in the future.

This affects all OCaml versions starting from 4.10.

gadmm · 2021-03-20T23:15:25Z

In addition, the last commit prevents reparsing OCAMLRUNPARAM on subsequent calls to caml_startup*. In rare conditions, e.g. when using setenv/putenv to change the value of caml_use_huge_pages after OCaml has already started, this can lead to UB (a segfault most probably). This is unlikely to happen, but it shows that it does not make sense to reparse OCAMLRUNPARAM.

gasche

There are various things I don't understand with this change. I guess the main point is that there are several changes and it is not clear what does what.

The description announces that the PR fixes a crash, but how do we observe the crash? (Is it something that we should consider adding in our testsuite as a regression test?)

More importantly: before caml_startup_aux had a very simple role, it would count the number of times it was called to cutoff initialization, and also call caml_stat_create_pool which is related to the startup/shutdown logic. With your PR, a lot of the "init" logic is moved inside this function (why? this is not explained; is it to fix a bug or for code factorization?), but some of the logic remains in startup_byt and startup_nat. Looking at the diff there is no clear reason for what is moved where.

My guess would be that what stays in startup_nat is exactly, in your design, the initialization functions that need to be implemented differently in the bytecode and native runtime. Here is maybe a naming scheme that could make this simpler to follow/understand:

caml_starting_aux is left as it was before, and possibly renamed into caml_on_first_startup
the new stuff you moved in it moves to a different caml_init_aux function
the remaining init_* bits in the byt/nat caml_startup_* functions are moved to caml_init_{byt,nat} functions, that start by calling caml_init_aux (so as a function they are in charge of all initialization)

runtime/startup_aux.c

runtime/startup_byt.c

gasche · 2021-03-21T06:03:42Z

Ah, I realize now that some of my questions are answered by reviewing commit-by-commit, I had only read the whole diff. So if I understand correctly, the main fix is to ensure that caml_init_domain is not called several times. I think that my comments generally still apply.

gadmm · 2021-03-21T13:31:46Z

Mea culpa, I forgot to describe the main bug and fix: the second commit moves functions that use caml_stat_alloc after the initialisation of the pool. Initialising the pool after something has been allocated will prevent the latter from being deallocated on shutdown, and will cause UB (most likely a crash) when attempting to free it manually. The logic of startup is a bit convoluted and recent evolutions forgot to take this logic into account (also it is hard to see how this can be caught by testing, but I think adding comments in the source will avoid these issues in the future).

gasche · 2021-03-21T15:22:54Z

also it is hard to see how this can be caught by testing

We could expose caml_stat_alloc_unpooled and caml_stat_alloc_outside_pool. The unpooled version would set a global flag that unpooled allocations happened (this is bad if the runtime intends to be pooled), possibly only in the debug runtime. The outside_pool version is for allocations we explicitly want to happen outside the pool even in pooled mode (for example #10250, #10266). The pool == NULL fallback of caml_stat_alloc would call caml_stat_alloc_unpooled.

Then a test could a C program starting the pooled runtime, interpreting a toy bytecode program, shutting down the runtime and cleaning the pool, then checking that the unpooled flag is not set.

gadmm · 2021-03-21T17:52:05Z

I started implementing the idea of a flag but then realised that for people who call caml_stat_alloc*_noexc functions without the runtime lock, this flag would need to be synchronised. The documentation of some _noexc functions explicitly mention that the runtime lock is not needed. The pooled mode was not made the default because of backwards compatibility concerns (with lots of positives found in the original PR), but in fact the new requirements were not added to the documentation (again: in pooled mode you need the runtime lock to call the _noexc functions) and are not enforced in any way with default OCaml settings, so library writers are unlikely to have started to follow the new discipline. These requirements are not even necessarily easy to follow, think about needing to acquire the runtime lock (and thus knowing whether you already have it) if you need to free a block given to foreign code during clean-up after an error condition (when the documentation of caml_stat_free does not even mention needing the runtime lock).

It seems to me that a better course of action would be to:

introduce caml_stat_alloc_pooled* functions (they would fail if the pool is not initialised),
use pooled functions everwhere in the runtime (except at the few places where one needs explicit management),
remove the optional pooled mode which is no longer necessary to prevent leaks when shutting down the runtime.

The caml_stat_alloc* functions would recover their original semantics compatible with the current documentation, and libraries could start using the pooled allocations explicitly if they wanted. And the memory allocated by the runtime will be freed for everybody.

I want to stress that to take advantage of the pooled allocations the program or library must be explicitly designed for it. Indeed, for a library to correctly release memory allocated with caml_stat_alloc*, it must still call caml_stat_free explicitly because pooling is not the default behaviour. In this scenario, when activating the pooled mode, either the pooled mode has no useful effect (if the deletions are all done before shutdown and with runtime lock held), or it always causes UB (otherwise). So it does not make sense to have a switch to choose whether allocations are pooled or not.

Anyway, these comments go fairly beyond the purpose of this PR. For the current PR I prefer to stop at its current state I think.

gasche · 2021-03-21T19:26:00Z

Sounds reasonable, thanks. Do you have an opinion on the refactoring of the init functions proposed in my review comment?

gadmm · 2021-03-22T19:28:59Z

Yes, I do not see what is wrong about moving stuff to startup_aux. Even if it had a specific purpose I do not see the issue with repurposing it.

gasche · 2021-03-22T19:35:42Z

The function moves from "one purpose" to "two purposes" (what it was doing now + some of the runtime initialization logic), so it is getting strictly more complex, which is bad.

gasche · 2021-03-24T14:02:29Z

cc @murmour: do you still care about pooled allocations? What do you think of @gadmm's different API proposal? (Would you want to implement it?)

xavierleroy · 2021-03-24T16:12:32Z

(I also failed to find evidence that anyone uses the OCAMLRUNPARAM c but this is harder to make sure of.)

For the record: the "c" parameter is tested as part of the CI at Inria, in the "other configs" script:

ocaml/tools/ci/inria/other-configs/script

Line 46 in 414bdec

OCAMLRUNPARAM="c=1" ${main}

and also in the "sanitizers" script, in conjunction with a leak detector:

ocaml/tools/ci/inria/sanitizers/script

Lines 88 to 94 in 414bdec

    
           # Build the system.  We want to check for memory leaks, hence 
        
           # 1- force ocamlrun to free memory before exiting 
        
           # 2- add an exception for ocamlyacc, which doesn't free memory 
        
           OCAMLRUNPARAM="c=1" \ 
        
           LSAN_OPTIONS="suppressions=$(pwd)/tools/ci/inria/sanitizers/lsan-suppr.txt" \ 
        
           make $jobs

So I'm reasonably confident that the "c" mode works and is maintained. The "pooled" API is another matter, though.

murmour · 2021-03-25T08:33:24Z

cc @murmour: do you still care about pooled allocations? What do you think of @gadmm's different API proposal? (Would you want to implement it?)

Hello, Gabriel, Guillaume, and Monsieur Leroy!

I still use the pooled (unloadable) runtime in 4.06, and totally care about it.

This PR is quite interesting, and I will do a proper review a bit later (in a few days).

gadmm · 2021-03-25T16:36:07Z

This got me puzzled as to why it did not catch that Caml_state was allocated before pool initialisation. Digging deeper, the tool considers global variables as live at the end of the program, so it did not consider Caml_state as leaking. I did not know, but it does not catch similar leaks (including anything reachable from globals, thread-local variables, etc.)

gadmm · 2021-03-25T18:34:32Z

I have pushed some simplifications (make static what does not need to be extern, etc.). I have also found a hidden function that could be factored (init_atom_table). Its place inside init_static in nat was nice and appropriate but I think it is even better to factor the call between nat and byt and move it to the common file.

As discussed with @gasche I do not see how to easily improve the organisation further but if you want to show me what you have in mind I will be happy to consider it as a patch.

gasche · 2021-03-25T21:13:18Z

See gadmm#1 for a modest proposal.

gadmm · 2021-03-28T17:56:01Z

I added @gasche's suggestion and take advantage of this open PR to introduce a minor clarification to the documentation (closes #10179). This is hopefully the last addition to this PR, further reviews welcome!

murmour

General comment: this PR lacks structure, which made it difficult to review. It is good that the changes are split into commits, but these commits lack proper messages (with explanations), and aligning these commits with the PR comments required effort.

Changes entry could be more detailed. Perhaps it should be split into 2-3 entries.

In addition, the last commit prevents reparsing OCAMLRUNPARAM on subsequent calls to caml_startup*. In rare conditions, e.g. when using setenv/putenv to change the value of caml_use_huge_pages after OCaml has already started, this can lead to UB (a segfault most probably). This is unlikely to happen, but it shows that it does not make sense to reparse OCAMLRUNPARAM.
[...]
Mea culpa, I forgot to describe the main bug and fix: the second commit moves functions that use caml_stat_alloc after the initialisation of the pool. Initialising the pool after something has been allocated will prevent the latter from being deallocated on shutdown, and will cause UB (most likely a crash) when attempting to free it manually. The logic of startup is a bit convoluted and recent evolutions forgot to take this logic into account (also it is hard to see how this can be caught by testing, but I think adding comments in the source will avoid these issues in the future).

Correct. Both of these problems (an unlikely UB and a few minor leaks) were introduced by me in cef4a94, while adding caml_cleanup_on_exit.

See gadmm#1 for a modest proposal.

What strikes me wrong with this approach is that caml_{startup,shutdown}_needed are named like predicates, but actually increment/decrement global counters on calls, which is completely unobvious.

How about just renaming caml_stratup_aux to caml_startup_common and documenting it properly?

I [...] take advantage of this open PR to introduce a minor clarification to the documentation (closes #10179).

This change is correct and fixes a mistake that I added in af5899f.

murmour · 2021-04-04T19:18:26Z

runtime/startup_aux.c

+  if (pooling)
+    caml_cleanup_on_exit = 1;


I don't see the meaning in this change.

Previously, if the user set OCAMLRUNPARAM=c, then caml_cleanup_on_exit became 1, which enabled pooling.

Now, if pooling is enabled (via a call to caml_startup_pooled), it sets caml_cleanup_on_exit to 1, which later forces caml_shutdown on exit. But the user didn't ask for this! Leak detection mode (caml_cleanup_on_exit) isn't the same as ability to unload the runtime manually.

I have removed this change, which stemmed from my misunderstanding. I also add a clarification of the difference between the two features in the documentation if you accept, to explain what you just wrote.

manual/src/cmds/intf-c.etex

gadmm · 2021-04-08T11:59:24Z

I will come back to this PR in a moment. In the meanwhile, I'd like to know what you think @gasche :

See gadmm#1 for a modest proposal.

What strikes me wrong with this approach is that caml_{startup,shutdown}_needed are named like predicates, but actually increment/decrement global counters on calls, which is completely unobvious.

How about just renaming caml_stratup_aux to caml_startup_common and documenting it properly?

I tend to agree with @murmour, for this reason and because in the end the code to call the two functions ends up duplicated between the three call sites. Would this alternative be fine with you?

gadmm · 2021-04-08T12:03:04Z

@murmour :

Both of these problems (an unlikely UB and a few minor leaks) were introduced by me in cef4a94, while adding caml_cleanup_on_exit.

To make sure I did not miss anything, the bug with Caml_state and the event buffer being allocated before pool initialization is more recent, right?

gasche · 2021-04-08T12:06:52Z

Would this alternative be fine with you?

I think; I timed out on this problem and I am okay with a code proposal that you and @murmour would both like.

murmour · 2021-04-08T13:48:01Z

@murmour :

Both of these problems (an unlikely UB and a few minor leaks) were introduced by me in cef4a94, while adding caml_cleanup_on_exit.

To make sure I did not miss anything, the bug with Caml_state and the event buffer being allocated before pool initialization is more recent, right?

Right. It was introduced in eac4860 (#8713, 4.10.0).

murmour · 2021-04-08T14:02:25Z

@gadmm, regarding your ideas on caml_stat_* API: it's important stuff, and I'll give thorough feedback on it a bit later (running short of time right now). It is out of scope of this PR anyway.

We move duplicated code to startup_aux to simplify further modifications. This should not change the meaning of the code.

caml_init_domain and CAML_EVENTLOG_INIT call caml_stat_alloc functions, so they are moved after the pool initialisation. Suubsequently, the call to caml_record_backtrace, which accesses the Caml_state, must be moved later. We document that the pool is only available after some point, to hope avoid similar bugs in the future.

This commit prevents reparsing OCAMLRUNPARAM on subsequent calls to caml_startup*. In rare conditions, e.g. when using setenv/putenv to change the value of caml_use_huge_pages after OCaml has already started, this can lead to UB (a segfault most probably). This is unlikely to happen, but it shows that it does not make sense to reparse OCAMLRUNPARAM. It is unlikely that the reparsing of OCAMLRUNPARAM was relied on in practice.

The calls to caml_init_atom_table can be factored. Then, make static what can be.

Fix ocaml#10179 (wrong example of thing that is released on shutdown), and clarify the difference between startup_pooled and the "cleanup_on_exit" function (at least to me). Fix typo.

Also the previous caml_startup_common should be static; rename it.

gadmm · 2021-04-09T22:24:37Z

Here is a rebased version. Comments have been added in the logs, otherwise only the last three commits have changed.

Thanks for the review. I took your comments into account.

You are right that I could have made your task easier by including detailed commit logs and mentioning that it is best reviewed commit-per-commit. I will be more careful next time.

I do not see how to improve the Changes entry. I think all the user-visible changes have been described.

gadmm · 2021-04-10T10:50:10Z

The failing test appears unrelated.

gadmm · 2022-04-14T23:40:27Z

Happy belated birthday, PR#10304!

I am closing this PR since I no longer have time not interest in this problem in light of the absence of review and the work needed to rebase it on top of multicore (and duplication of work to fix the issues in both OCaml 4 and 5).

I have tried to use the pooled mode for our library, and as explained in #10304 (comment), this mode needs a rework before it can be useful, so we are going to do without it. For now I simply recommend to stay clear from the "c" runparam.

But anyone who cares about these issues should feel free to adopt it and/or open an issue about it. I am not deleting the branch so the code will remain available for now.

gasche reviewed Mar 21, 2021

View reviewed changes

runtime/startup_aux.c Outdated Show resolved Hide resolved

runtime/startup_byt.c Outdated Show resolved Hide resolved

gadmm force-pushed the pooled branch from 46dfe94 to fdcf77e Compare March 21, 2021 14:59

gadmm force-pushed the pooled branch from c9ac766 to 5dd8eae Compare March 28, 2021 17:32

murmour suggested changes Apr 4, 2021

View reviewed changes

gadmm added 6 commits April 9, 2021 23:59

Factor startup code

309c7a4

We move duplicated code to startup_aux to simplify further modifications. This should not change the meaning of the code.

More factoring and simplifications

2291368

The calls to caml_init_atom_table can be factored. Then, make static what can be.

Clarify documentation for caml_shutdown

78cbba0

Fix ocaml#10179 (wrong example of thing that is released on shutdown), and clarify the difference between startup_pooled and the "cleanup_on_exit" function (at least to me). Fix typo.

Better name for caml_startup_aux

a02e384

Also the previous caml_startup_common should be static; rename it.

Changes

9af24d8

gadmm force-pushed the pooled branch from 4515fc3 to 9af24d8 Compare April 9, 2021 22:20

gadmm mentioned this pull request Jan 31, 2022

Code and documentation are inconsistent for caml_shutdown #10179

Closed

gadmm closed this Apr 14, 2022

gadmm mentioned this pull request Apr 3, 2024

Force all domain threads to exit before the main thread #13010

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix caml pooled mode and recursive startup #10304

Fix caml pooled mode and recursive startup #10304

gadmm commented Mar 20, 2021 •

edited

gadmm commented Mar 20, 2021

gasche left a comment

gasche commented Mar 21, 2021

gadmm commented Mar 21, 2021

gasche commented Mar 21, 2021

gadmm commented Mar 21, 2021 •

edited

gasche commented Mar 21, 2021

gadmm commented Mar 22, 2021

gasche commented Mar 22, 2021

gasche commented Mar 24, 2021

xavierleroy commented Mar 24, 2021

murmour commented Mar 25, 2021

gadmm commented Mar 25, 2021

gadmm commented Mar 25, 2021

gasche commented Mar 25, 2021

gadmm commented Mar 28, 2021

murmour left a comment

murmour Apr 4, 2021

gadmm Apr 9, 2021

gadmm commented Apr 8, 2021

gadmm commented Apr 8, 2021

gasche commented Apr 8, 2021

murmour commented Apr 8, 2021

murmour commented Apr 8, 2021

gadmm commented Apr 9, 2021

gadmm commented Apr 10, 2021

gadmm commented Apr 14, 2022

Fix caml pooled mode and recursive startup #10304

Fix caml pooled mode and recursive startup #10304

Conversation

gadmm commented Mar 20, 2021 • edited

gadmm commented Mar 20, 2021

gasche left a comment

Choose a reason for hiding this comment

gasche commented Mar 21, 2021

gadmm commented Mar 21, 2021

gasche commented Mar 21, 2021

gadmm commented Mar 21, 2021 • edited

gasche commented Mar 21, 2021

gadmm commented Mar 22, 2021

gasche commented Mar 22, 2021

gasche commented Mar 24, 2021

xavierleroy commented Mar 24, 2021

murmour commented Mar 25, 2021

gadmm commented Mar 25, 2021

gadmm commented Mar 25, 2021

gasche commented Mar 25, 2021

gadmm commented Mar 28, 2021

murmour left a comment

Choose a reason for hiding this comment

murmour Apr 4, 2021

Choose a reason for hiding this comment

gadmm Apr 9, 2021

Choose a reason for hiding this comment

gadmm commented Apr 8, 2021

gadmm commented Apr 8, 2021

gasche commented Apr 8, 2021

murmour commented Apr 8, 2021

murmour commented Apr 8, 2021

gadmm commented Apr 9, 2021

gadmm commented Apr 10, 2021

gadmm commented Apr 14, 2022

gadmm commented Mar 20, 2021 •

edited

gadmm commented Mar 21, 2021 •

edited