Weak array improvements #11743

Merged
merged 3 commits into ocaml:trunk on Nov 29, 2022

Conversation

@kayceesrk (Contributor) commented Nov 22, 2022

In #11733, it was observed that there were time and memory regressions with frama-c. Investigation pointed to inefficiencies in the weak array blit and get_field* operations. This PR addresses some of the inefficiencies with weak arrays.

Optimize get_field*

The first commit 25e7f08 optimises the get_field and get_field_copy operations. On trunk, the Some value is allocated on the major heap. This was necessary with the earlier concurrent minor GC scheme, where domains could not directly access the minor heap arenas of other domains. OCaml now uses a parallel minor GC, with no restrictions on pointers between minor heap arenas. Hence, these values may be allocated on the minor heap.
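
For a user-level view, Weak.get and Weak.get_copy are the stdlib entry points that (as far as I understand the mapping) end up in these primitives. A minimal, runnable sketch; the minor- vs major-heap placement of the returned Some cell is internal to the runtime and not observable here:

(* Each call to Weak.get allocates a fresh [Some] cell; with this
   change that cell may be allocated on the minor heap rather than
   the major heap. *)
let () =
  let w = Weak.create 1 in
  let v = ref 42 in
  Weak.set w 0 (Some v);
  (match Weak.get w 0 with
   | Some r -> Printf.printf "alive: %d\n" !r
   | None -> print_endline "collected");
  ignore (Sys.opaque_identity v)  (* keep [v] alive across the get *)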

Optimize blit operation

The ephemeron / weak array implementation on trunk uses two per-domain lists to keep track of ephemerons.

struct caml_ephe_info {
  value todo; /* These are ephemerons which need to be marked and swept in
                 the current cycle. If the ephemeron is alive, after marking,
                 they go into the live list after cleaning them off the
                 unreachable keys and releasing values if any of the keys
                 are unreachable. */
  value live; /* These are ephemerons which are alive in the current cycle,
                 whose keys and data are live (or not set). */

In particular, the assumption is that all of the ephemerons on the live list are marked, and that each of their keys and the data field is either marked, unset, or a primitive value. In order to preserve this invariant, the fields being copied during a blit operation are marked, since the destination of a blit operation may be an ephemeron on the live list. It was pointed out in the PR that introduced this marking (ocaml-multicore/ocaml-multicore#316) that it may cause a noticeable difference in performance, which is what was observed with the frama-c benchmarks.

The second commit relaxes the invariant on the live list: the ephemerons on the live list are marked and the data field is marked (or unset, or a primitive value), but the keys may be unmarked (due to a blit operation that copies unmarked keys into the destination ephemeron). During a blit operation, the keys are no longer darkened. As a result, the live list has to be swept during Phase_sweep_ephe to release the dead keys. This is done by moving the ephemerons on the live list to the todo list at the beginning of Phase_sweep_ephe.
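
At the user level, the operation affected is Weak.blit. A minimal sketch of a program exercising this path (illustrative only; the darkening behaviour itself is internal to the runtime):

(* Weak.blit copies key fields between weak arrays. With this change
   the copied keys are not darkened at blit time; if they are in fact
   dead, they are released when the live list is re-swept during
   Phase_sweep_ephe. *)
let () =
  let src = Weak.create 4 in
  let dst = Weak.create 4 in
  for i = 0 to 3 do Weak.set src i (Some (ref i)) done;
  Weak.blit src 0 dst 0 4;
  Gc.full_major ();  (* nothing retains the keys *)
  match Weak.get dst 0 with
  | None -> print_endline "blitted key was released"
  | Some r -> Printf.printf "blitted key still alive: %d\n" !r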

Domain creation and termination

Care must be taken to ensure that the new protocol works with domain creation and termination. On trunk, the number of domains that will perform ephemeron sweeping is determined at the start of the new cycle. This count is maintained in the num_domains_to_ephe_sweep variable. Domains newly created during a cycle will only have ephemerons on their live list, and the strong invariant on the live list obviates the need to mark or sweep these ephemerons in that cycle.

Since we have relaxed the invariant, newly created domains may have ephemerons on the live list with unmarked keys. Hence, the number of domains that must perform ephemeron sweeping can only be determined at the beginning of Phase_sweep_ephe. To that end, num_domains_to_ephe_sweep is now computed at the beginning of Phase_sweep_ephe in major_gc.c:try_complete_gc_phase. During domain termination, the terminating domain ensures that the num_domains_to_ephe_sweep count is decremented appropriately.
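
As a user-level illustration of why newly created domains matter (the counting itself happens inside the runtime, so this sketch only shows the scenario, not the mechanism):

(* A domain spawned in the middle of a major cycle builds up its own
   per-domain ephemeron lists. Under the relaxed invariant its live
   list may hold unmarked keys, so it too must be counted in
   num_domains_to_ephe_sweep at the start of Phase_sweep_ephe. *)
let () =
  let d =
    Domain.spawn (fun () ->
        let w = Weak.create 8 in
        Weak.set w 0 (Some (ref 0));
        Gc.major () (* drive this domain through the GC phases *))
  in
  Domain.join d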

Performance

I repeated the experiments suggested in #11733. There are 3 variants in the experiment: the 4.12 baseline, 5.0.0_beta1 + pr11673, and this PR.

The raw results are available in this sheet. Here are the time and memory overheads of the two latter variants compared to the baseline:

Time

[chart: per-benchmark time overhead relative to the baseline]

The numbers in parentheses next to the benchmark names give the execution time in seconds for the baseline variant. On average, the time increase from 4.12 to 5.0.0_beta1 + pr11673 was 85%; this drops to 59% with this PR.

Memory

[chart: per-benchmark memory overhead relative to the baseline]

The numbers in parentheses next to the benchmark names give the maximum resident set size in kilobytes for the baseline variant. On average, the memory increase from 4.12 to 5.0.0_beta1 + pr11673 was 108%, which drops to 28% with this PR. On debie1.eva, the memory increase with 5.0.0_beta1 + pr11673 was 913%, which comes down to 76% with this PR.

@kayceesrk changed the title from "Weak arrays improvements" to "Weak array improvements" on Nov 22, 2022
@gasche (Member) commented Nov 22, 2022

This looks like a non-trivial change to a very tricky part of the runtime. @stedolan would be an excellent reviewer as he shared your ephemeron travels, but is there someone else who could review this? (@bobot?)

I have two other "meta" questions:

  • how confident are you that the change does not introduce bugs? (do we have a very good ephemeron testsuite by now?)
  • do we want to consider getting this change in 5.0? It sounds like Frama-C will not consider moving without something similar, but this introduces a correctness risk at the last minute. Then again, Frama-C may be the only project to strongly depend on the correctness of ephemerons anyway, so it may be okay?

@kayceesrk (Contributor, Author):

In addition to the two reviewers you have suggested, @sadiqj would be a great person to review this.

how confident are you that the change does not introduce bugs? (do we have a very good ephemeron testsuite by now?)

The ephemeron tests in the testsuite are reasonably good. We have a number of parallel tests too. I'm fairly confident of the correctness of this change.

do we want to consider getting this change in 5.0?

In #11733, the frama-c folks mention:

Patch #11673 indeed fixes the most critical memory issue, but when running several tests in parallel, we are still having some issues with high memory usage. These are not critical for OCaml 5.0, but we hope they might be fixed for 5.1.

Hence, they may be OK waiting for this change until 5.1? In any case, the PR only affects programs that use weak arrays and ephemerons, which is a small minority.

@maroneze (Contributor):

Thank you for the quick and thorough work, @kayceesrk!

I confirm, it's not a problem if this patch comes after 5.0. For now, we will "officially" remain on 4.14 (but still perform tests with 5.x), and when the difference in a 5.x release is small enough, we'll migrate to it.

@kayceesrk (Contributor, Author):

when the difference in a 5.x release is small enough, we'll migrate to it.

It would be great to know, from the frama-c team's perspective, what the acceptable performance difference will be in terms of time and memory usage.

I would also be curious to know whether frama-c intends to use parallelism. The added benefit of parallelism may offset the impact of the sequential performance hit.

@bobot (Contributor) left a comment

I think the changes are sound, and should indeed improve the memory consumption, while keeping the advantage that blit_* does not darken unnecessarily.

However, I think we can reuse part of the strategy used in 4.14, which cleans the blitted keys in order to avoid having to move the live ephemerons back to the todo list.

/* [ed] may be in [Caml_state->ephe_info-live] list. The data value may be
unmarked. The ephemerons on the live list are not scanned during ephemeron
marking. Hence, unconditionally darken the data value. */
return Val_unit;
}
@bobot (Contributor) commented Nov 22, 2022

The point of the blit_* functions is to avoid marking values as much as possible. A possible optimization would be to not mark the data if the previous data was unmarked (as in 4.14).

runtime/weak.c (outdated; resolved)
@sadiqj self-assigned this on Nov 27, 2022
@sadiqj (Contributor) left a comment

This looks good; only one comment, on whether we could avoid marking the data in blit as well.

runtime/weak.c (outdated)
ephe_blit_field (es, CAML_EPHE_DATA_OFFSET, ed, CAML_EPHE_DATA_OFFSET, 1);
caml_darken(0, Field(ed, CAML_EPHE_DATA_OFFSET), 0);
/* [ed] may be in [Caml_state->ephe_info-live] list. The data value may be

Suggested change
/* [ed] may be in [Caml_state->ephe_info-live] list. The data value may be
/* [ed] may be in [Caml_state->ephe_info->live] list. The data value may be

runtime/weak.c (outdated)
Comment on lines 436 to 440
ephe_blit_field (es, CAML_EPHE_DATA_OFFSET, ed, CAML_EPHE_DATA_OFFSET, 1);
caml_darken(0, Field(ed, CAML_EPHE_DATA_OFFSET), 0);
/* [ed] may be in [Caml_state->ephe_info-live] list. The data value may be
unmarked. The ephemerons on the live list are not scanned during ephemeron
marking. Hence, unconditionally darken the data value. */

Could we avoid doing this if we're in Phase_sweep_and_mark_main or Phase_mark_final? We'll have to go through Phase_sweep_ephe before the cycle ends which means all domains will have moved their live list into todo.

@kayceesrk (Contributor, Author):

Thanks for the reviews. Both @bobot and @sadiqj have raised the question of whether we can avoid marking the data field unconditionally. The challenge is that marking happens asynchronously, and there is no synchronization between the data field blit and the decision to mark. Here is a problematic example.

Suppose the destination ephemeron ed is reachable, the keys are empty, and the data field is unmarked. During the marking phase, this data field will be marked and the ephemeron will be moved from the todo list to the live list. A concurrent blit may observe that the old data field is unmarked, and choose not to mark the new data field. The ephe_mark procedure would now move the ephemeron from the marking domain's todo list to the live list. You then have an ephemeron ed on the live list with an unmarked data field. While it is true that at the beginning of Phase_sweep_ephe the ephemerons are moved back to the todo list, we cannot perform any more marking; the marking phase is done when we reach Phase_sweep_ephe. All we can do at this point is release the unmarked data. But this is semantically incorrect -- the ephemeron is reachable and all the keys are empty, yet we're releasing the data!
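
For flavour, here is the general shape of the contention, sketched with the user-level Weak.blit (the race described above concerns the ephemeron data field at the level of the C primitives; this sketch only shows blits running concurrently with marking, it does not reproduce the bug):

(* One domain repeatedly blits while the other drives major
   collections, so blit operations race with concurrent marking. *)
let () =
  let src = Weak.create 1 in
  let dst = Weak.create 1 in
  Weak.set src 0 (Some (ref 0));
  let blitter =
    Domain.spawn (fun () ->
        for _ = 1 to 10_000 do Weak.blit src 0 dst 0 1 done)
  in
  for _ = 1 to 50 do Gc.major () done;
  Domain.join blitter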

One could come up with a more complicated scheme to go back to the marking phase from Phase_sweep_ephe. But I would rather not implement it now, since we're hoping to simplify the ephemeron interface in the near future anyway: #10737. Given that ephemerons will then be immutable, we will no longer have the case where we overwrite old data values that are not empty.

@bobot (Contributor) left a comment

OK, optimizations can come once the new API is used. (Why is it not used yet?)

@kayceesrk (Contributor, Author):

Thanks @bobot. Currently, the new/proposed API is implemented over the old mutable ephemeron interface. Eventually, the plan is to deprecate the mutable ephemeron interface and separate the implementation of immutable ephemerons from (mutable) weak arrays. This is a bit of work and a breaking change for ephemeron users. Hence, it was planned to be done after 5.0.

@maroneze (Contributor):

It would be great to know, from the frama-c team's perspective, what the acceptable performance difference will be in terms of time and memory usage.

Ideally, we would like the difference to be < 5%, but it's not clear whether this will be achievable. I believe it will depend, in the end, on how much effort is necessary to compensate for the remaining X%.

I would also be curious to know whether frama-c intends to use parallelism. The added benefit of parallelism may offset the impact of the sequential performance hit.

In the short term, this analysis in particular (Eva) is not going to use parallelism, since that requires major re-engineering efforts and most industrial partners are not that interested in sponsoring such development. Some analyzers already use some form of parallelism (e.g. WP calls solvers in parallel when it can), and there is work on related parts that could make use of parallelism in the future. But it will take a while.

By the way, for the averages and graphs, it would be clearer if the .parse and .eva targets were separated.

Notably, the time increase in the .parse targets is probably largely related to this:

There is still a regression in time complexity when loading plugins with many modules. However, this should be less noticeable in user code and a hypothetical fix can wait for 5.1

However, for the .eva targets, this is likely not the main cause.

For the memory increase, the behavior is also very different; I'm afraid that mixing both kinds of targets in the same graph will skew the averages and possibly lead to less relevant numbers. For instance, when running our largest dataset, SATE 6, the overhead in Eva analysis time is very minor compared to the loading of dynamic modules. On the other hand, for our long-running industrial use cases, the time and memory consumption of the .eva part is the main issue.

@kayceesrk (Contributor, Author):

Thanks for the clarification. I've added separate sheets splitting out the .eva and .parse targets.

For .eva targets, compared to 4.14 baseline:

  • Time increase for 5.0_beta1+pr11673 = 60.15%
  • Time increase for this PR = 32.20%
  • Memory increase for 5.0_beta1+pr11673 = 220.11%
  • Memory increase for this PR = 61.67%

For .parse targets, compared to 4.14 baseline:

  • Time increase for 5.0_beta1+pr11673 = 109.77%
  • Time increase for this PR = 85.31%
  • Memory increase for 5.0_beta1+pr11673 = -4.59%
  • Memory increase for this PR = -4.59%

@kayceesrk (Contributor, Author):

We've got 2 approvals on this PR. If there's nothing further needed, this PR can be merged.

@sadiqj merged commit 6978b36 into ocaml:trunk on Nov 29, 2022
@sadiqj (Contributor) commented Nov 29, 2022

Thanks @kayceesrk !

@nojb (Contributor) commented Mar 15, 2023

Just a note to say that some internal LexiFi code that uses weak arrays heavily was taking 20% longer to run on 5.0, but is level with 4.14 on trunk (which includes this patch). Thanks!

@kayceesrk (Contributor, Author):

That's great to hear @nojb!
