Weak array improvements #11743
Conversation
This looks like a non-trivial change to a very tricky part of the runtime. @stedolan would be an excellent reviewer as he shared your ephemeron travels, but is there someone else who could review this? (@bobot?) I have two other "meta" questions:
In addition to the two reviewers you have suggested, @sadiqj would be a great person to review this.
The ephemeron tests in the testsuite are reasonably good. We have a number of parallel tests too. I'm fairly confident of the correctness of this change.
In #11733, the
Hence, it may be ok for them to wait for this change until 5.1? In any case, the PR only affects programs that use weak arrays and ephemerons, which is a small minority.
Force-pushed from 16a5a1e to 53c28c7.
Thank you for the quick and thorough work, @kayceesrk! I confirm, it's not a problem if this patch comes after 5.0. For now, we will "officially" remain on 4.14 (but still perform tests with 5.x), and when the difference in a 5.x release is small enough, we'll migrate to it.
It would be great to know, from the frama-c team's perspective, what the acceptable performance difference will be in terms of time and memory usage. I would also be curious to know whether frama-c intends to use parallelism. The added benefit of parallelism may offset the impact of the sequential performance hit.
I think the changes are sound, and should indeed improve memory consumption, while keeping the advantage of blit_* not darkening unnecessarily.
However, I think we can reuse part of the strategy used in 4.14, which cleans the blitted keys in order to avoid needing to move the live ephemerons back to the todo list.
/* [ed] may be in [Caml_state->ephe_info-live] list. The data value may be
   unmarked. The ephemerons on the live list are not scanned during ephemeron
   marking. Hence, unconditionally darken the data value. */
return Val_unit;
}
The blit_* functions have the advantage of avoiding marking the values as much as possible. A possible optimization would be to not mark the data if the previous data was unmarked (as in 4.14).
This looks good, only one comment on whether we could avoid marking the data in blit as well.
runtime/weak.c (Outdated)
ed, CAML_EPHE_DATA_OFFSET, 1);
ephe_blit_field (es, CAML_EPHE_DATA_OFFSET, ed, CAML_EPHE_DATA_OFFSET, 1);
caml_darken(0, Field(ed, CAML_EPHE_DATA_OFFSET), 0);
/* [ed] may be in [Caml_state->ephe_info-live] list. The data value may be
Suggested change:
- /* [ed] may be in [Caml_state->ephe_info-live] list. The data value may be
+ /* [ed] may be in [Caml_state->ephe_info->live] list. The data value may be
runtime/weak.c (Outdated)
ephe_blit_field (es, CAML_EPHE_DATA_OFFSET, ed, CAML_EPHE_DATA_OFFSET, 1);
caml_darken(0, Field(ed, CAML_EPHE_DATA_OFFSET), 0);
/* [ed] may be in [Caml_state->ephe_info-live] list. The data value may be
   unmarked. The ephemerons on the live list are not scanned during ephemeron
   marking. Hence, unconditionally darken the data value. */
Could we avoid doing this if we're in Phase_sweep_and_mark_main or Phase_mark_final? We'll have to go through Phase_sweep_ephe before the cycle ends, which means all domains will have moved their live list into todo.
Thanks for the reviews. Both @bobot and @sadiqj have raised the question of whether we can avoid marking the data field unconditionally. The challenge is that marking happens asynchronously and there is no synchronization between the data field blit and the decision to mark. Here is a problematic example. Suppose the destination ephemeron One could come up with a more complicated scheme to go back to the marking phase from
OK, optimizations can come once the new API is used. (Why is it not used yet?)
Thanks @bobot. Currently, the new/proposed API is implemented over the old mutable ephemeron interface. Eventually, the plan is to deprecate the mutable ephemeron interface and separate the implementation of immutable ephemerons from (mutable) weak arrays. This is a bit of work and a breaking change for ephemeron users. Hence, it was planned to be done after 5.0.
Ideally, we would like the difference to be < 5%, but it's not clear whether this will be achievable. I believe it will depend, in the end, on how much effort is necessary to compensate for the remaining X%.
In the short term, this analysis in particular (Eva) is not going to use parallelism, since that requires major re-engineering efforts and most industrial partners are not that interested in sponsoring such development. Some analyzers already use some form of parallelism (e.g. WP calls solvers in parallel when it can), and there is work on related parts that could, in the future, make use of parallelism. But it will take a while. By the way, for the averages and graphs, it would be clearer if the .parse and .eva targets were separated. Notably, the time increase in the
However, for the For the memory increase, it's also a very different behavior; I'm afraid that mixing both kinds of targets in the same graph will affect the averages and possibly lead to less relevant numbers. For instance, when running our largest dataset, which is SATE 6, the overhead in Eva analysis time is very minor compared to the loading of dynamic modules. On the other hand, for our long-running industrial use cases, the time and memory consumption of the
Thanks for the clarification. I've added separate sheets for separating out For
For
We've got 2 approvals on this PR. If there's nothing further needed, this PR can be merged.
Force-pushed from bc527c6 to b89c33e.
Thanks @kayceesrk!
Just a note to say that some internal LexiFi code that uses weak arrays heavily was taking 20% longer to run on 5.0, but it is level with 4.14 on
That's great to hear @nojb!
In #11733, it was observed that there were time and memory regressions with frama-c. On investigation, this pointed to inefficiencies with the weak array blit and get_field* operations. This PR addresses some of the inefficiencies with weak arrays.

Optimize get_field*
The first commit 25e7f08 optimises the get_field and get_field_copy operations. On trunk, the Some value is allocated on the major heap. This was necessary with the earlier concurrent minor GC scheme, where domains could not directly access the minor heap arenas of other domains. Currently, OCaml uses a parallel minor GC, where there are no restrictions on pointers between minor heap arenas. Hence, these values may be allocated on the minor heap.
Optimize blit operation

The ephemeron / weak array implementation on trunk uses two per-domain lists to keep track of ephemerons.
ocaml/runtime/caml/weak.h
Lines 31 to 38 in ee75054
In particular, the assumption is that all of the ephemerons on the live list are marked, and all of the keys and the data fields are either marked, unset, or primitive values. In order to preserve this invariant, the fields being copied during a blit operation are marked, since the destination of a blit operation may be an ephemeron on the live list. It was pointed out in the PR that introduced the marking (ocaml-multicore/ocaml-multicore#316) that this may cause a noticeable difference in performance, which is what was observed with the frama-c benchmarks.

The second commit relaxes the invariant on the live list. For the ephemerons on the live list, the ephemeron and its data field are marked (or the data is unset or a primitive value), but the keys may be unmarked (due to a blit operation that copies unmarked keys into the destination ephemeron). During a blit operation, the keys are no longer darkened. As a result, the live list has to be swept during Phase_sweep_ephe to release the dead keys. This is done by moving the ephemerons on the live list to the todo list at the beginning of Phase_sweep_ephe.
Domain creation and termination
Care must be taken to ensure that the new protocol works with domain creation and termination. On trunk, the number of domains that must perform ephemeron sweeping is determined at the start of a new cycle. This count is maintained in the num_domains_to_ephe_sweep variable. Newly created domains in a cycle will only have ephemerons on their live list, and under the strong invariant those ephemerons need neither marking nor sweeping in that cycle.

Since we have relaxed the invariant, newly created domains may have ephemerons on the live list with unmarked keys. Hence, the number of domains that must perform ephemeron sweeping can only be determined at the beginning of Phase_sweep_ephe. To that end, num_domains_to_ephe_sweep is computed at the beginning of Phase_sweep_ephe in major_gc.c:try_complete_gc_phase. During domain termination, the terminating domain ensures that the num_domains_to_ephe_sweep count is decremented appropriately.

Performance
I repeated the experiments suggested in #11733. There are 3 variants in the experiment:
The raw results are available in this sheet. Here are the time and memory overheads of the two latter variants compared to the baseline:
Time
The numbers in parentheses next to the benchmark names are the execution times in seconds for the baseline variant. On average, the time increase from 4.12 to 5.0.0_beta1 + pr11673 was 85%; this drops to 59% with this PR.
Memory
The numbers in parentheses next to the benchmark names are the maximum resident set sizes in kilobytes for the baseline variant. On average, the memory increase from 4.12 to 5.0.0_beta1 + pr11673 was 108%, which drops to 28% with this PR. On debie1.eva, the memory increase on 5.0.0_beta1 + pr11673 was 913%, which comes down to 76% with this PR.