Weak array improvements #11743

Merged
merged 3 commits into ocaml:trunk on Nov 29, 2022

Conversation

@kayceesrk (Contributor) commented Nov 22, 2022

In #11733, it was observed that there were time and memory regressions with frama-c. Investigation pointed to inefficiencies in the weak array blit and get_field* operations. This PR addresses some of the inefficiencies with weak arrays.

Optimize get_field*

The first commit 25e7f08 optimises the get_field and get_field_copy operations. On trunk, the Some value is allocated on the major heap. This was necessary with the earlier concurrent minor GC scheme, where domains could not directly access the minor heap arenas of other domains. OCaml now uses a parallel minor GC, with no restrictions on pointers between minor heap arenas. Hence, these values may be allocated on the minor heap.
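
For a user-level view, Weak.get and Weak.get_copy are the stdlib entry points that (as far as I understand the mapping) end up in these primitives. A minimal, runnable sketch; the minor- vs major-heap placement of the returned Some cell is internal to the runtime and not observable here:

(* Each call to Weak.get allocates a fresh [Some] cell; with this
   change that cell may be allocated on the minor heap rather than
   the major heap. *)
let () =
  let w = Weak.create 1 in
  let v = ref 42 in
  Weak.set w 0 (Some v);
  (match Weak.get w 0 with
   | Some r -> Printf.printf "alive: %d\n" !r
   | None -> print_endline "collected");
  ignore (Sys.opaque_identity v)  (* keep [v] alive across the get *)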

Optimize blit operation

The ephemeron / weak array implementation on trunk uses two per-domain lists to keep track of ephemerons.

struct caml_ephe_info {
  value todo; /* These are ephemerons which need to be marked and swept in
                 the current cycle. If the ephemeron is alive, after marking,
                 they go into the live list after cleaning them off the
                 unreachable keys and releasing values if any of the keys
                 are unreachable. */
  value live; /* These are ephemerons which are alive in the current cycle,
                 whose keys and data are live (or not set). */

In particular, the assumption is that all of the ephemerons on the live list are marked, and that each of their keys and the data field is either marked, unset, or a primitive value. In order to preserve this invariant, the fields being copied during a blit operation are marked, since the destination of a blit operation may be an ephemeron on the live list. It was pointed out in the PR that introduced this marking (ocaml-multicore/ocaml-multicore#316) that it may cause a noticeable difference in performance, which is what was observed with the frama-c benchmarks.

The second commit relaxes the invariant on the live list: the ephemerons on the live list are marked and the data field is marked (or unset, or a primitive value), but the keys may be unmarked (due to a blit operation that copies unmarked keys into the destination ephemeron). During a blit operation, the keys are no longer darkened. As a result, the live list has to be swept during Phase_sweep_ephe to release the dead keys. This is done by moving the ephemerons on the live list to the todo list at the beginning of Phase_sweep_ephe.
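
At the user level, the operation affected is Weak.blit. A minimal sketch of a program exercising this path (illustrative only; the darkening behaviour itself is internal to the runtime):

(* Weak.blit copies key fields between weak arrays. With this change
   the copied keys are not darkened at blit time; if they are in fact
   dead, they are released when the live list is re-swept during
   Phase_sweep_ephe. *)
let () =
  let src = Weak.create 4 in
  let dst = Weak.create 4 in
  for i = 0 to 3 do Weak.set src i (Some (ref i)) done;
  Weak.blit src 0 dst 0 4;
  Gc.full_major ();  (* nothing retains the keys *)
  match Weak.get dst 0 with
  | None -> print_endline "blitted key was released"
  | Some r -> Printf.printf "blitted key still alive: %d\n" !r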

Domain creation and termination

Care must be taken to ensure that the new protocol works with domain creation and termination. On trunk, the number of domains that will perform ephemeron sweeping is determined at the start of the new cycle. This count is maintained in the num_domains_to_ephe_sweep variable. Domains newly created during a cycle will only have ephemerons on their live list, and the strong invariant on the live list obviates the need to mark or sweep these ephemerons in that cycle.

Since we have relaxed the invariant, newly created domains may have ephemerons on the live list with unmarked keys. Hence, the number of domains that must perform ephemeron sweeping can only be determined at the beginning of Phase_sweep_ephe. To that end, num_domains_to_ephe_sweep is now computed at the beginning of Phase_sweep_ephe in major_gc.c:try_complete_gc_phase. During domain termination, the terminating domain ensures that the num_domains_to_ephe_sweep count is decremented appropriately.
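
As a user-level illustration of why newly created domains matter (the counting itself happens inside the runtime, so this sketch only shows the scenario, not the mechanism):

(* A domain spawned in the middle of a major cycle builds up its own
   per-domain ephemeron lists. Under the relaxed invariant its live
   list may hold unmarked keys, so it too must be counted in
   num_domains_to_ephe_sweep at the start of Phase_sweep_ephe. *)
let () =
  let d =
    Domain.spawn (fun () ->
        let w = Weak.create 8 in
        Weak.set w 0 (Some (ref 0));
        Gc.major () (* drive this domain through the GC phases *))
  in
  Domain.join d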

Performance

I repeated the experiments suggested in #11733. There are 3 variants in the experiment: the 4.12 baseline, 5.0.0_beta1 + pr11673, and this PR.

The raw results are available in this sheet. Here are the time and memory overheads of the two latter variants compared to the baseline:

Time

[chart: per-benchmark time overhead relative to the baseline]

The numbers in parentheses next to the benchmark names give the execution time in seconds for the baseline variant. On average, the time increase from 4.12 to 5.0.0_beta1 + pr11673 was 85%; this drops to 59% with this PR.

Memory

[chart: per-benchmark memory overhead relative to the baseline]

The numbers in parentheses next to the benchmark names give the maximum resident set size in kilobytes for the baseline variant. On average, the memory increase from 4.12 to 5.0.0_beta1 + pr11673 was 108%, which drops to 28% with this PR. On debie1.eva, the memory increase with 5.0.0_beta1 + pr11673 was 913%, which comes down to 76% with this PR.

@kayceesrk changed the title from "Weak arrays improvements" to "Weak array improvements" on Nov 22, 2022
@gasche (Member) commented Nov 22, 2022

This looks like a non-trivial change to a very tricky part of the runtime. @stedolan would be an excellent reviewer as he shared your ephemeron travels, but is there someone else who could review this? (@bobot?)

I have two other "meta" questions:

  • how confident are you that the change does not introduce bugs? (do we have a very good ephemeron testsuite by now?)
  • do we want to consider getting this change in 5.0? It sounds like Frama-C will not consider moving without something similar, but this introduces a correctness risk at the last minute. Then again, Frama-C may be the only project to strongly depend on the correctness of ephemerons anyway, so it may be okay?

@kayceesrk (Contributor, Author):

In addition to the two reviewers you have suggested, @sadiqj would be a great person to review this.

how confident are you that the change does not introduce bugs? (do we have a very good ephemeron testsuite by now?)

The ephemeron tests in the testsuite are reasonably good. We have a number of parallel tests too. I'm fairly confident of the correctness of this change.

do we want to consider getting this change in 5.0?

In #11733, the frama-c folks mention:

Patch #11673 indeed fixes the most critical memory issue, but when running several tests in parallel, we are still having some issues with high memory usage. These are not critical for OCaml 5.0, but we hope they might be fixed for 5.1.

Hence, they may be OK waiting for this change until 5.1? In any case, the PR only affects programs that use weak arrays and ephemerons, which is a small minority.

@maroneze (Contributor):

Thank you for the quick and thorough work, @kayceesrk!

I confirm, it's not a problem if this patch comes after 5.0. For now, we will "officially" remain on 4.14 (but still perform tests with 5.x), and when the difference in a 5.x release is small enough, we'll migrate to it.

@kayceesrk (Contributor, Author):

when the difference in a 5.x release is small enough, we'll migrate to it.

It would be great to know, from the frama-c team's perspective, what the acceptable performance difference will be in terms of time and memory usage.

I would also be curious to know whether frama-c intends to use parallelism. The added benefit of parallelism may offset the impact of the sequential performance hit.

@bobot (Contributor) left a comment

I think the changes are sound, and should indeed improve the memory consumption, while keeping the advantage that blit_* does not darken unnecessarily.

However, I think we can reuse part of the strategy used in 4.14, which cleans the blitted keys in order to avoid having to move the live ephemerons back to the todo list.

/* [ed] may be in [Caml_state->ephe_info-live] list. The data value may be
unmarked. The ephemerons on the live list are not scanned during ephemeron
marking. Hence, unconditionally darken the data value. */
return Val_unit;
}
@bobot (Contributor) commented Nov 22, 2022

The point of the blit_* functions is to avoid marking values as much as possible. A possible optimization would be to not mark the data if the previous data was unmarked (as in 4.14).

runtime/weak.c (outdated; resolved)
@sadiqj self-assigned this on Nov 27, 2022
@sadiqj (Contributor) left a comment

This looks good; only one comment, on whether we could avoid marking the data in blit as well.

runtime/weak.c (outdated)
ephe_blit_field (es, CAML_EPHE_DATA_OFFSET, ed, CAML_EPHE_DATA_OFFSET, 1);
caml_darken(0, Field(ed, CAML_EPHE_DATA_OFFSET), 0);
/* [ed] may be in [Caml_state->ephe_info-live] list. The data value may be

Suggested change
/* [ed] may be in [Caml_state->ephe_info-live] list. The data value may be
/* [ed] may be in [Caml_state->ephe_info->live] list. The data value may be

runtime/weak.c (outdated)
Comment on lines 436 to 440
ephe_blit_field (es, CAML_EPHE_DATA_OFFSET, ed, CAML_EPHE_DATA_OFFSET, 1);
caml_darken(0, Field(ed, CAML_EPHE_DATA_OFFSET), 0);
/* [ed] may be in [Caml_state->ephe_info-live] list. The data value may be
unmarked. The ephemerons on the live list are not scanned during ephemeron
marking. Hence, unconditionally darken the data value. */

Could we avoid doing this if we're in Phase_sweep_and_mark_main or Phase_mark_final? We'll have to go through Phase_sweep_ephe before the cycle ends which means all domains will have moved their live list into todo.

@kayceesrk (Contributor, Author):

Thanks for the reviews. Both @bobot and @sadiqj have raised the question of whether we can avoid marking the data field unconditionally. The challenge is that marking happens asynchronously, and there is no synchronization between the data field blit and the decision to mark. Here is a problematic example.

Suppose the destination ephemeron ed is reachable, the keys are empty, and the data field is unmarked. During the marking phase, this data field will be marked and the ephemeron will be moved from the todo list to the live list. A concurrent blit may observe that the old data field is unmarked, and choose not to mark the new data field. The ephe_mark procedure would now move the ephemeron from the marking domain's todo list to the live list. You then have an ephemeron ed on the live list with an unmarked data field. While it is true that at the beginning of Phase_sweep_ephe the ephemerons are moved back to the todo list, we cannot perform any more marking; the marking phase is done when we reach Phase_sweep_ephe. All we can do at this point is release the unmarked data. But this is semantically incorrect -- the ephemeron is reachable and all the keys are empty, yet we're releasing the data!
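
For flavour, here is the general shape of the contention, sketched with the user-level Weak.blit (the race described above concerns the ephemeron data field at the level of the C primitives; this sketch only shows blits running concurrently with marking, it does not reproduce the bug):

(* One domain repeatedly blits while the other drives major
   collections, so blit operations race with concurrent marking. *)
let () =
  let src = Weak.create 1 in
  let dst = Weak.create 1 in
  Weak.set src 0 (Some (ref 0));
  let blitter =
    Domain.spawn (fun () ->
        for _ = 1 to 10_000 do Weak.blit src 0 dst 0 1 done)
  in
  for _ = 1 to 50 do Gc.major () done;
  Domain.join blitter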

One could come up with a more complicated scheme to go back to the marking phase from Phase_sweep_ephe. But I would rather not implement it now, since we're hoping to simplify the ephemeron interface in the near future anyway: #10737. Given that ephemerons will then be immutable, we will no longer have the case where we overwrite old data values that are not empty.

@bobot (Contributor) left a comment

OK, optimizations can come once the new API is used. (Why is it not used yet?)

@kayceesrk (Contributor, Author):

Thanks @bobot. Currently, the new/proposed API is implemented over the old mutable ephemeron interface. Eventually, the plan is to deprecate the mutable ephemeron interface and separate the implementation of immutable ephemerons from (mutable) weak arrays. This is a bit of work and a breaking change for ephemeron users. Hence, it was planned to be done after 5.0.

@maroneze (Contributor):

It would be great to know, from the frama-c team's perspective, what the acceptable performance difference will be in terms of time and memory usage.

Ideally, we would like the difference to be < 5%, but it's not clear whether this will be achievable. I believe it will depend, in the end, on how much effort is necessary to compensate for the remaining X%.

I would also be curious to know whether frama-c intends to use parallelism. The added benefit of parallelism may offset the impact of the sequential performance hit.

In the short term, this analysis in particular (Eva) is not going to use parallelism, since that requires major re-engineering efforts and most industrial partners are not that interested in sponsoring such development. Some analyzers already use some form of parallelism (e.g. WP calls solvers in parallel when it can), and there is work on related parts that could make use of parallelism in the future. But it will take a while.

By the way, for the averages and graphs, it would be clearer if the .parse and .eva targets were separated.

Notably, the time increase in the .parse targets is probably largely related to this:

There is still a regression in time complexity when loading plugins with many modules. However, this should be less noticeable in user code and a hypothetical fix can wait for 5.1

However, for the .eva targets, this is likely not the main cause.

For the memory increase, the behavior is also very different; I'm afraid that mixing both kinds of targets in the same graph will skew the averages and possibly lead to less relevant numbers. For instance, when running our largest dataset, SATE 6, the overhead in Eva analysis time is very minor compared to the loading of dynamic modules. On the other hand, for our long-running industrial use cases, the time and memory consumption of the .eva part is the main issue.

@kayceesrk (Contributor, Author):

Thanks for the clarification. I've added separate sheets splitting out the .eva and .parse targets.

For .eva targets, compared to 4.14 baseline:

  • Time increase for 5.0_beta1+pr11673 = 60.15%
  • Time increase for this PR = 32.20%
  • Memory increase for 5.0_beta1+pr11673 = 220.11%
  • Memory increase for this PR = 61.67%

For .parse targets, compared to 4.14 baseline:

  • Time increase for 5.0_beta1+pr11673 = 109.77%
  • Time increase for this PR = 85.31%
  • Memory increase for 5.0_beta1+pr11673 = -4.59%
  • Memory increase for this PR = -4.59%

@kayceesrk (Contributor, Author):

We've got 2 approvals on this PR. If there's nothing further needed, this PR can be merged.

@sadiqj merged commit 6978b36 into ocaml:trunk on Nov 29, 2022
@sadiqj (Contributor) commented Nov 29, 2022

Thanks @kayceesrk !

@nojb (Contributor) commented Mar 15, 2023

Just a note to say that some internal LexiFi code that uses weak arrays heavily was taking 20% longer to run on 5.0, but is level with 4.14 on trunk (which includes this patch). Thanks!

@kayceesrk (Contributor, Author):

That's great to hear @nojb!
