
Memory consumption of Frama-C in OCaml 5 beta1 is about 10x larger #11662

Closed
maroneze opened this issue Oct 20, 2022 · 10 comments
Labels: Performance (PR or issues affecting runtime performance of the compiled programs)
Milestone: 5.0
@maroneze
Contributor

maroneze commented Oct 20, 2022

My colleague just noticed that, when running Frama-C in a switch with OCaml 5.0.0~beta1, its memory consumption is about 10x larger when loading dynamic modules (i.e., any of Frama-C's plug-ins).

I created a Dockerfile to test it in an isolated environment and obtained the same results.

When comparing the same code compiled under OCaml 4.14.0 and OCaml 5.0.0~beta1, I obtained the following differences:

| Plugin | Memory 4.14.0 (KB) | Memory 5.0.0~beta1 (KB) | Difference |
| --- | --- | --- | --- |
| none | 23724 | 24980 | 5% |
| all | 76424 | 967336 | 1166% |
| aorai | 36364 | 339072 | 832% |
| callgraph | 35084 | 317140 | 804% |
| dive | 35380 | 338568 | 857% |
| e-acsl | 27280 | 143444 | 426% |
| eva | 34568 | 296416 | 757% |
| from | 36104 | 344628 | 855% |
| impact | 40264 | 482276 | 1098% |
| inout | 36944 | 366408 | 892% |
| instantiate | 25284 | 63284 | 150% |
| loop-analysis | 35208 | 315164 | 795% |
| markdown-report | 25804 | 49444 | 92% |
| metrics | 35876 | 322344 | 798% |
| nonterm | 35072 | 305268 | 770% |
| obfuscator | 24968 | 42404 | 70% |
| occurrence | 35004 | 306828 | 777% |
| pdg | 37288 | 384804 | 932% |
| postdominators | 35092 | 306764 | 774% |
| reduc | 37348 | 384940 | 931% |
| report | 25096 | 45068 | 80% |
| rtegen | 25084 | 45000 | 79% |
| scope | 38160 | 419024 | 998% |
| server | 25452 | 61524 | 142% |
| slicing | 39924 | 467248 | 1070% |
| sparecode | 38684 | 431408 | 1015% |
| studia | 34948 | 310884 | 790% |
| users | 35172 | 325944 | 827% |
| variadic | 25204 | 65292 | 159% |

The first two lines are the most important ones; the others are provided only as possible clues for further investigation.

Memory usage is obtained by running /usr/bin/time -f "%M" (it reports maximum resident set size). Note that this is not just a theoretical issue (e.g. memory reserved but unused in practice): my colleague had his PC start swapping heavily when running a few dozen parallel instances of Frama-C, which doesn't happen with OCaml <= 4.14. This is when we decided to measure it.

I compiled and installed Frama-C, then ran it under the following configurations:

  • Without any modules dynamically loaded: frama-c -no-autoload-plugins
  • Default (all modules): frama-c
  • A single module per execution: frama-c -no-autoload-plugins -load-module <plugin>

We notice that, without any modules, memory usage is the same in both versions. But when all modules are loaded, it is almost 1GB in OCaml 5, versus less than 100 MB in OCaml 4.14.

Here's a hackish Dockerfile to help reproduce it:

# We don't actually use the Frama-C installed in this image, it serves only
# as an opam-ready base image
FROM framac/frama-c:dev-stripped.fedora

RUN opam update && opam switch create 5.0.0~beta1 -y
# why3 installation may fail, hence the "|| true" below
RUN opam install --deps-only frama-c --confirm-level=unsafe-yes || true
RUN opam install dune-site --confirm-level=unsafe-yes
RUN sudo dnf install time -y
RUN git clone https://git.frama-c.com/pub/frama-c.git --depth=1
RUN cd frama-c && eval $(opam env) && make && make install
RUN eval $(opam env) && /usr/bin/time -f "Memory: %M" frama-c -no-autoload-plugins
RUN eval $(opam env) && /usr/bin/time -f "Memory: %M" frama-c
RUN eval $(opam env) && for p in aorai callgraph dive e-acsl eva from impact inout instantiate loop-analysis markdown-report metrics nonterm obfuscator occurrence pdg postdominators reduc report rtegen scope server slicing sparecode studia users variadic; do /usr/bin/time -f "Plugin $p: Memory: %M" frama-c -no-autoload-plugins -load-module $p; done

In this Dockerfile, all we do is (1) install a new opam switch, (2) install Frama-C dependencies, then (3) clone and install Frama-C itself (currently, it is not possible to run opam install frama-c on a 5.0 switch due to the lack of a compatible Why3). Finally, (4) run Frama-C and measure memory usage.

I simply replaced 5.0.0~beta1 with 4.14.0 in the first RUN to compare it with OCaml 4.14.

If you have any clues on what could be causing it and how to further investigate, please tell us. We'll try to look into it ourselves.

@maroneze
Contributor Author

Sorry, I meant to create this issue on https://github.com/ocaml/ocaml. Is there a way to migrate it, or should I just close it here and copy the contents there?

@mseri
Member

mseri commented Oct 20, 2022

I don't think there is, unfortunately. You have to copy it there.

@maroneze
Contributor Author

maroneze commented Oct 20, 2022

Apparently there is a way, since both are owned by the same organization, but it requires someone with write access to both repositories.

@mseri
Member

mseri commented Oct 20, 2022

Oh. Interesting. Ping @kit-ty-kate

dra27 transferred this issue from ocaml/opam-repository Oct 24, 2022
dra27 added this to the 5.0 milestone Oct 24, 2022
@stedolan
Contributor

This looks like a memory leak in OCaml 5's frametable handling during native Dynlink (i.e. plugin loading), which happily looks easy to fix.

The frame table stores the information about the stack and register contents needed by the GC, kept in a large hashtable (about 2MB here). The frametable is completely rebuilt every time a new module is loaded, which in frama-c -no-autoload-plugins -load-module impact is about 200 times. (Note that this is once per module, not once per loaded cmxs file. A single plugin.cmxs can contain many modules).

When the frame table is rebuilt, the old ones are not immediately freed, as they may still be in use by GC happening on other domains. Instead, they are added to a list to be freed at the next GC cycle, when we can be certain they are no longer in use.

Here's the bug: when the next GC cycle happens, only the first old frametable gets freed, even though the frametable may have been rebuilt many times in a single cycle.
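
To make the failure mode concrete, here is a small OCaml sketch of the deferred-free pattern described above. It is only an illustrative model with invented names; the real logic lives in the runtime's C code for frame-table handling.

```ocaml
(* Hypothetical model of the deferred-free pattern; names are invented
   for illustration and do not match the runtime's actual code. *)

(* Old frame tables queued to be freed at the next GC cycle. *)
let pending_old_tables : bytes list ref = ref []

let retire_frame_table old_table =
  (* The old table cannot be freed right away: a GC running on another
     domain may still be reading it, so it is queued instead. *)
  pending_old_tables := old_table :: !pending_old_tables

(* Buggy cleanup: releases only one queued table per GC cycle, so if the
   table was rebuilt many times within a single cycle (one rebuild per
   dynlinked module), the remaining tables are never freed. *)
let cleanup_buggy () =
  match !pending_old_tables with
  | [] -> ()
  | _ :: rest -> pending_old_tables := rest

(* Intended behaviour: once the cycle guarantees no reader remains,
   every queued table can be released at once. *)
let cleanup_fixed () =
  pending_old_tables := []
```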

@Octachron
Member

The memory leak has been fixed by #11673. There is still a regression in time complexity when loading plugins with many modules. However, this should be less noticeable in user code, and a hypothetical fix can wait for 5.1.

@gasche
Member

gasche commented Feb 24, 2023

The quadratic-time behavior of plugin loading has been fixed by #11980, which is now merged in trunk. Updated performance results for Frama-C are available at #11980 (comment). Some large slowdowns, in particular those affecting short-running programs (the dynlink issue was only a large startup cost), are now gone: for example, basic-cwe-examples was 75% slower in 5.0 than in 4.14.1, and is now 12% faster in trunk than in 4.14.1.

There remain performance slowdowns of up to 20% for longer-running examples. Those would require further performance analysis to identify the cause. I would hazard the guess that ephemerons / weak pointers are responsible -- the 5.0 implementation is trickier and probably slower. But there may be other sources of slowdown that we haven't discovered yet.

@maroneze
Contributor Author

Thank you very much for these improvements!

Concerning the slowdowns, as suggested by @bobot, I performed some tests using Frama-C's -deterministic option, which changes whether weak hash tables are used for hashconsing:

module Hashconsing_tbl =
 (val if Cmdline.deterministic
   then (module Hashconsing_tbl_not_weak: Hashconsing_tbl)
   else (module Hashconsing_tbl_weak: Hashconsing_tbl))

These tables are notably used by the Eva plug-in, in a performance-critical part of the analysis.
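
For readers less familiar with the pattern, here is a small self-contained sketch of why this switch matters, using the stdlib's Weak.Make for the weak variant and a plain Hashtbl for the non-weak one. This is only an illustration under my own assumptions; the type and functions below are invented for the example, and Frama-C's actual Hashconsing_tbl_weak / Hashconsing_tbl_not_weak modules are more elaborate.

```ocaml
(* Illustrative hashconsing sketch; not Frama-C's real code. *)

type term = { node : string; id : int }

(* Weak variant: stored terms can be reclaimed by the GC once nothing
   else references them, saving memory at the cost of extra GC work on
   weak pointers. *)
module WeakTbl = Weak.Make (struct
  type t = term
  let equal a b = a.node = b.node
  let hash t = Hashtbl.hash t.node
end)

let weak_tbl = WeakTbl.create 256
let counter = ref 0

let hashcons node =
  let candidate = { node; id = !counter } in
  (* [merge] returns the already-stored equal term if any, otherwise
     inserts and returns [candidate]. *)
  let t = WeakTbl.merge weak_tbl candidate in
  if t == candidate then incr counter;
  t

(* Non-weak variant (roughly what -deterministic selects): a plain
   Hashtbl keeps every term alive for the whole run, so the GC has no
   weak pointers to scan, but memory can only grow. *)
let strong_tbl : (string, term) Hashtbl.t = Hashtbl.create 256

let hashcons_strong node =
  match Hashtbl.find_opt strong_tbl node with
  | Some t -> t
  | None ->
      let t = { node; id = !counter } in
      incr counter;
      Hashtbl.add strong_tbl node t;
      t
```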

I tried several configurations: 4.14.1 and 5.0.1, with and without the -deterministic option. As far as I can tell, the 5.0.1 version I used included both branches mentioned in our discussions (so the one just merged was also included).

Here's what I got as results (I only compared some long-running analyses, which seemed the most relevant for the remaining slowdown):

(Benchmark results table attached as an image named "deterministic" in the original issue; the "lines" and "columns" mentioned below refer to rows and columns of that table.)

Some results are repeated several times, just to put them next to each other for relative comparisons.

If we look at the last line, we have the "current" version we are using (4.14.1, without -deterministic), compared to the 5.0.1 branch (including the latest branch), this time using -deterministic, that is, without weak hash tables.

We see that, overall, the analysis time is equivalent to our "status quo", or even faster. However, memory consumption is very high, more than triple in one case.

If we look at the middle line, normal 4.14.1 versus -deterministic 4.14.1, we see why this option is not currently enabled by default: while it improves analysis time, it causes a large increase in memory consumption. The third column shows that, when using it in 5.0.1, everything is degraded compared to 4.14.1: analysis time increases, and so does memory consumption.

Finally, maybe the most interesting comparison is between normal 5.0.1 and -deterministic 5.0.1 (first line, last columns): except for 2 data points where memory consumption is slightly higher, -deterministic wins everywhere. This means that, if we were forced to move to 5.0.1 immediately, enabling -deterministic by default would be a no-brainer.

However, the extra memory consumption is still a bit high in some cases, and it can be problematic for some of our industrial partners who have larger code bases (and analyses that run much longer). So, for now, we still think staying on 4.14.1 and not using -deterministic is the best approach; but I wanted to include these results in the discussion, since they were surprising: I did not expect -deterministic to improve both analysis time and memory consumption in 5.0.1.

(With the caveat that I may have mis-configured something when evaluating; if these numbers seem illogical, I can try to redo the analyses using the current trunk.)

@gasche
Member

gasche commented Feb 24, 2023

Thanks for the numbers, very interesting!

I think that the numbers confirm our intuition that weak hash tables / ephemerons dominate the performance differences between 4.x and 5.x. This would be the single most important point of improvement to recover Frama-C performance in the non-deterministic variant. I don't know much about weak/ephemerons; my understanding was that #10737 proposed an API that could be implemented in a more efficient way later, but I don't know if this is still relevant and whether there is any work in this direction (@kayceesrk would know).

Another very interesting comparison in your numbers is "deterministic 4.14.1" versus "deterministic 5.0.1". If I understand correctly, this gives an accurate picture of the performance impact of OCaml 5 when weak hashtables are out of the picture. (Completely? Or do you still use them in other parts of the code, just not for hash-consing, which is their main usage?) From the point of view of "finding performance regressions in OCaml 5 that we can fix promptly", the lowest-hanging fruit, if there is any, is probably there.
In particular:

  • chrony-regress has a 20% time regression
  • debie1 has a 42% memory regression

It would be interesting to profile these specific examples to understand where the regressions come from. For memory it is tricky as we don't have good tools for OCaml 5 memory profiling, but for time it may be that just comparing perf profiles on chrony-regress (or the profiler of your choice) could lead us in the right direction. Would you be able to give it a try?

@kayceesrk
Contributor

kayceesrk commented Feb 25, 2023

The work proposed in #10737 is still relevant.

While the current API for ephemerons makes them immutable, they are still implemented over a mutable low-level representation that is also used by weak references. The goal would be to have separate low-level representations for weak arrays and ephemerons. Ephemerons would be implemented on top of an implementation of the Bucket module directly in the runtime, whereas weak references could continue to use the current representation, dropping the data field.

(I am hazy on the details, but) I believe that the separation can help implement optimisations for weak arrays and hash sets on OCaml 5, which weren't possible because the low-level implementation was shared with ephemerons; we had to assume that everything is potentially an ephemeron at runtime, which forbids some optimisations.
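
For reference, here is a minimal sketch of the immutable ephemeron API that #10737 introduced, as exposed by the standard library's Ephemeron.K1 in recent releases; the exact surface may differ slightly between versions.

```ocaml
(* A key/data pair where the key is held weakly: the data is only kept
   alive (and retrievable) as long as the key itself is alive. *)
let demo () =
  let key = ref 42 in
  let eph = Ephemeron.K1.make key "payload" in
  (* [query] returns the data only when called with the ephemeron's key
     and that key has not been collected. *)
  (match Ephemeron.K1.query eph key with
   | Some data -> print_endline data
   | None -> print_endline "key was collected or does not match");
  (* Keep [key] live until here so the ephemeron's data stays reachable. *)
  ignore (Sys.opaque_identity key)
```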

I should say that this is not currently being actively looked at by anyone at Tarides.

kayceesrk added the Performance label Jul 3, 2023