Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Time/memory regression in Frama-C with OCaml 5.0 #11733

Open
maroneze opened this issue Nov 16, 2022 · 26 comments
Open

Time/memory regression in Frama-C with OCaml 5.0 #11733

maroneze opened this issue Nov 16, 2022 · 26 comments
Assignees
Labels
Performance PR or issues affecting runtime performance of the compiled programs

Comments

@maroneze
Copy link
Contributor

This is a follow-up to #11662.

Patch #11673 indeed fixes the most critical memory issue, but when running several tests in parallel, we are still having some issues with high memory usage. These are not critical for OCaml 5.0, but we hope they might be fixed for 5.1.

When running our set of analyses, we observed the following differences between OCaml 4.14 and OCaml 5.0.0+trunk (ocaml --version gives 5.0.0+dev8-2022-10-12; note that we were unable to test with 5.1.0+trunk since the ppxlib opam package is currently incompatible with it):

Screenshot_20221116_092330

Overall, the startup time for Frama-C is about 1.2s longer, which, for short benchmarks, can increase execution by 300% or 400%. This is mentioned in #11662. Even when only a subset of dynamic modules are loaded, there's still about 100%-200% increase in execution time. For large benchmarks such as SATE 6, which consist in >40k runs of Frama-C with a minimal set of plug-ins, this can double or triple total execution time. It was not entirely clear to us if the regression in time complexity mentioned in #11662 accounts for this difference. Otherwise, we'd like to help investigate further if there are other possible causes.

The main issue I'd like to mention here, however, is memory consumption: as seen in the table above, large case studies still show a significant increase in memory usage, leading to some heavy swapping (before #11662, swapping happened almost instantly; now it takes several minutes, but still happens).

I tried to obtain a minimal reproducible example that was very robust, but I was unable to, since memory usage is not entirely deterministic, at least in OCaml 5. In OCaml <= 4.14, we obtain a consistent value for maximum memory usage across different runs, but in OCaml 5, sometimes simply changing whitespace in the source file, or re-running Frama-C, can result in as much as 3x memory usage.

The file attached below (which is a C file, but Github does not recognize it, so I renamed it to .txt) is currently the best minimal example I could come up with. It is based on debie1, which had a 934% increase in memory usage in the table.

a-v5.txt

The command-line I used with it is the following:

/usr/bin/time -f 'user_time=%U\nmemory=%M' frama-c a-v5.c -eva -eva-msg-key=-final-states,-initial-state -no-autoload-plugins -load-module inout,eva,scope

In this "reduced" version, I get less than 200 MB of memory usage in OCaml <= 4.14, but > 1 GB in OCaml 5.

There is a "loop unroll" annotation in the code (line 2603) which can be modified to increase the memory usage difference between versions of the compiler. If we set it to 100, for instance, the OCaml 4 version uses only 100 MB, while the OCaml 5 version uses 150 MB. If I set it to 30000, the OCaml 4 version uses 392 MB while the OCaml 5 version uses 3.7 GB. Thus, this annotation could possibly help as a parameter during debugging.

My colleagues pointed out that there are possibly multiple issues; for instance, we know that analyses with -eva make heavy use of caches based on hashconsed values. These have been optimized over the years to perform as efficiently as possible, and it is likely that some of the heuristics used up to OCaml 4.14 may requires changes. There are some Frama-C options, such as -deterministic, which influence the behavior of such values. Playing with them might provide some insight.

In any case, please tell me if there are other tests that I could perform, or different setups to help understand the causes of the regression.

@kayceesrk
Copy link
Contributor

Thanks @maroneze. I'll start to investigate some of these. Could you share the instructions for running the set of analysis which produced the table above?

In the table, there are two sets of results with the same set of benchmarks. Could you tell us the difference between the two?

@maroneze
Copy link
Contributor Author

maroneze commented Nov 17, 2022

Ok, here's a set of commands for OCaml 4.14:

# install Frama-C (I used the current beta release, but you can also use the development master branch)
opam pin add https://git.frama-c.com/pub/frama-c.git\#stable/iron
# Clone the OSCS repository
git clone --depth 1 https://git.frama-c.com/pub/open-source-case-studies.git

cd open-source-case-studies
# clean versioned analyses (unintuitive, but currently this is necessary)
make clean
# run analyses sequentially (this may take hours; running `make -j all` is faster but will impact timing measurements)
make all

(Note that OSCS has some written and unwritten dependencies, such as a system-specific time command other than a shell builtin, Python 3.7, GNU make 4, etc)

After the analyses are run, in each target directory (all but frama-c: 2048, basic-cwe-examples, etc), there will be a .frama-c folder with one or more <target>.parse directories, as well as corresponding <target>.eva directories. Each of them contains a stats.txt file with a user_time and a memory line denoting user time in seconds and maximum memory usage in kB.

Then I installed a 5.0.0+trunk switch, and had to manually compile Frama-C (since some of its opam dependencies are missing in OCaml 5). The following should work:

git clone --branch stable/iron https://git.frama-c.com/pub/frama-c --depth 1
cd frama-c
opam install ocamlgraph zarith ppx_import ppx_deriving_yojson dune-site dune-configurator
make # will run dune to build Frama-C
make install

You should then be able to enter the open-source-case-studies directory and run make all with Frama-C compiled with OCaml 5.

If you want to run only a specific case study, e.g. debie1, you can either run make debie1 from the root directory of OSCS, or cd debie1/.frama-c && make.

In the table, there are two sets of results with the same set of benchmarks. Could you tell us the difference between the two?

Each target is split in 2 calls to Frama-C: the first one performs only parsing (*.parse), which is why it is fast most of the time (< 10s for most programs), while the second one (*.eva) performs the lengthy analysis with the Eva plug-in. The latter depends on the former: in the first Frama-C run (parse), we parse the code and save it to a session file. The second run (eva) loads this file and applies the analysis.

I then simply grepped all user_time/memory lines from all stats.txt files and put then in a CSV file, to produce the above table (filtering short case studies). I used the following script, with hardcoded directory names:

printf "dir;target;Time (4.14);Memory (4.14);Time (5.0);Memory (5.0)\n"
cd oscs-ocaml-4.14
for d in 2048 basic-cwe-examples bench-moerman2018 cerberus chrony c-testsuite debie1 genann gzip124 hiredis icpc ioccc itc-benchmarks jsmn kgflags khash kilo libmodbus libspng libyaml microstrain mini-gmp miniz monocypher papabench polarssl qlz safestringlib semver solitaire tsvc tweetnacl-usable verisec x509-parser
do
    cd $d
    cd .frama-c
    for t in *.{parse,eva}
    do
        cd $t
        time=$(grep "^user_time" stats.txt | cut -d= -f2)
        mem=$(grep "^memory" stats.txt | cut -d= -f2)
        stats5=/tmp/compare/oscs-ocaml-5/$d/.frama-c/$t/stats.txt
        time5=$(grep "^user_time" $stats5 | cut -d= -f2)
        mem5=$(grep "^memory" $stats5 | cut -d= -f2)
        target_no_suffix=${t%.eva}
        target_no_suffix=${target_no_suffix%.parse}
        printf "$d;$t;$time;$mem;$time5;$mem5\n"
        cd ..
    done
    cd ../..
done

For more details, only in case you need to tweak the commands executed by OSCS: each .frama-c directory contains a GNUmakefile which contains targets such as 2048.parse and 2048.eva. If you enter such .frama-c directories and run make 2048.parse, for instance, you will get a Command: line showing the line run by Frama-C (minus a few options), which can then be copied/pasted and adapted to be re-run manually (for instance, to add option -deterministic). The "minus a few options" part means that the command shown omits the -save and -load options used by Frama-C to save/load sessions. Thus the <target>.parse command actually contains a -save <target>.parse/framac.sav option, and the <target>.eva command has a -load <target>.parse/framac.sav option. This allows reloading the result of parsing and not having to repeat lots of command-line options when only the analysis needs to be tweaked.

In any case, just tell me if you want me to prepare some source codes or command-lines to test Frama-C.

@kayceesrk
Copy link
Contributor

Thanks for the instructions. I was able to build and run the a-v5.c program on 4.14 and 5.0.0+trunk with PR 11673 patch applied. I confirm that I can also see the behaviour reported in the original message:

I get less than 200 MB of memory usage in OCaml <= 4.14, but > 1 GB in OCaml 5.

I ran perf on the two runs and observed that they look different

On 4.14:

Samples: 55K of event 'cycles', Event count (approx.): 39781651203                                                                                                                 
Overhead  Command  Shared Object           Symbol                                                                                                                                  
  10.72%  frama-c  frama-c                 [.] do_some_marking                                                                                                                     
   5.21%  frama-c  frama-c                 [.] caml_page_table_lookup                                                                                                              
   3.93%  frama-c  frama-c                 [.] camlStdlib__Map__merge_697                                                                                                          
   2.92%  frama-c  frama-c                 [.] caml_oldify_mopup                                                                                                                   
   2.64%  frama-c  frama-c                 [.] camlStdlib__Weak__find_or_781                                                                                                       
   2.62%  frama-c  frama-c                 [.] caml_oldify_one                                                                                                                     
   2.57%  frama-c  frama-c                 [.] sweep_slice                                                                                                                         
   2.33%  frama-c  frama-c                 [.] caml_apply2                                                                                                                         
   1.84%  frama-c  frama-c                 [.] clean_slice                                                                                                                         
   1.84%  frama-c  frama-c                 [.] bf_allocate                                                                                                                         
   1.83%  frama-c  frama-c                 [.] bf_merge_block                                                                                                                      
   1.58%  frama-c  frama-c                 [.] caml_alloc_shr_for_minor_gc                                                                                                         
   1.27%  frama-c  frama-c                 [.] camlStdlib__Weak__loop_791                                                                                                          
   1.10%  frama-c  frama-c                 [.] caml_modify                                                                                                                         
   1.09%  frama-c  frama-c                 [.] caml_oldify_local_roots                                                                                                             
   1.03%  frama-c  frama-c                 [.] bf_insert_block                                                                                                                     
   1.01%  frama-c  frama-c                 [.] camlStdlib__Hashtbl__find_1059                                                                                                      
   0.93%  frama-c  frama-c                 [.] caml_ephemeron_get_key_copy                                                                                                         
   0.80%  frama-c  eva.cmxs                [.] camlEva__Trace_partitioning__join_4762                                                                                              
   0.80%  frama-c  frama-c                 [.] caml_next_frame_descriptor 
<snip>

On 5.0

Samples: 73K of event 'cycles', Event count (approx.): 53507477874                                                                                                                 
Overhead  Command  Shared Object           Symbol                                                                                                                                  
  19.40%  frama-c  frama-c                 [.] ephe_mark                                                                                                                          ◆
   3.30%  frama-c  frama-c                 [.] caml_apply2                                                                                                                        ▒
   3.28%  frama-c  frama-c                 [.] caml_shared_try_alloc                                                                                                              ▒
   3.03%  frama-c  frama-c                 [.] mark                                                                                                                               ▒
   2.82%  frama-c  frama-c                 [.] camlStdlib__Map__merge_698                                                                                                         ▒
   2.40%  frama-c  frama-c                 [.] camlStdlib__Weak__find_or_775                                                                                                      ▒
   2.03%  frama-c  frama-c                 [.] oldify_one                                                                                                                         ▒
   2.03%  frama-c  frama-c                 [.] pool_sweep                                                                                                                         ▒
   1.95%  frama-c  frama-c                 [.] oldify_mopup                                                                                                                       ▒
   1.31%  frama-c  frama-c                 [.] mark_stack_push                                                                                                                    ▒
   1.29%  frama-c  frama-c                 [.] build_frame_descriptors                                                                                                            ▒
   1.26%  frama-c  frama-c                 [.] caml_darken.part.0                                                                                                                 ▒
   1.26%  frama-c  frama-c                 [.] ephe_get_field_copy                                                                                                                ▒
   1.04%  frama-c  frama-c                 [.] camlStdlib__Weak__loop_785                                                                                                         ▒
   0.89%  frama-c  frama-c                 [.] caml_apply3                                                                                                                        ▒
   0.88%  frama-c  frama-c                 [.] caml_modify                                                                                                                        ▒
   0.77%  frama-c  frama-c                 [.] ephe_check_field                                                                                                                   ▒
   0.77%  frama-c  frama-c                 [.] camlStdlib__Hashtbl__find_1081                                                                                                     ▒
   0.71%  frama-c  [unknown]               [k] 0xffffffff8fc8cf47                                                                                                                 ▒
   0.65%  frama-c  frama-c                 [.] caml_hash                                                                                                                          ▒
   0.64%  frama-c  frama-c                 [.] alloc_shared.isra.0                                                                                                                ▒
   0.58%  frama-c  eva.cmxs                [.] camlEva__Trace_partitioning__join_4745                                                                                             ▒
   0.53%  frama-c  eva.cmxs                [.] camlEva__Partition__compare_4190                                                                                                   ▒
   0.53%  frama-c  frama-c                 [.] camlStdlib__Weak__loop_764                                                                                                         ▒
   0.49%  frama-c  frama-c                 [.] camlFrama_c_kernel__Hptmap__merge_5380                                                                                             ▒
   0.47%  frama-c  frama-c                 [.] camlStdlib__Hashtbl__key_index_1050

Does frama-c make use of ephemerons heavily? Ephemerons are hard to implement (or even to specify) in a multicore setting (see #10737). OCaml 5.0 avoids some of the complexity by not freeing data during the minor GC. For example, see https://github.com/ocaml/ocaml/blob/trunk/runtime/minor_gc.c#L414.

I shall investigate more tomorrow.

@alainfrisch
Copy link
Contributor

Ephemerons are hard to implement (or even to specify) in a multicore setting (see #10737). OCaml 5.0 avoids some of the complexity by not freeing data during the minor GC.

Out of curiosity, does this also apply to good old weak arrays/tables?

@kayceesrk
Copy link
Contributor

Weak arrays/tables are well specified under parallelism. Ephemerons as they are defined now, where mutations are used to initialise and modify keys and data, is problematic.

In OCaml 5.0, weak arrays are implemented as ephemerons with empty data fields as in OCaml 4.14. Going forward, the plan is to implement weak arrays separately from ephemerons with a simpler API.

@yakobowski
Copy link
Contributor

Frama-C Eva, which is the plugin being tested in Andre's benchmark, uses weak hash tables in an absolutely essential way. A very large part of the data structures being used are hash-consed. If weak hash sets have been reimplemented on top ephemerons for 5.00, this is probably the issue

Note however that those sets are used in a purely monotone way: the only one modifying a key is the GC, to remove a dead one. And IIRC correctly keys are not reused. Perhaps this makes this use-case easier to support?

@kayceesrk
Copy link
Contributor

If weak hash sets have been reimplemented on top ephemerons for 5.00, this is probably the issue

To be clear, weak arrays have always been implemented on top of ephemerons in OCaml -- a weak array is just an ephemeron without the data field. 5.00 does not change this.

I'm continuing to investigate this to confirm whether weak arrays are the cause of the regression.

@kayceesrk
Copy link
Contributor

kayceesrk commented Nov 22, 2022

I have opened #11743 for improvements to weak arrays that significantly improves the performance on these benchmarks.

@nojb
Copy link
Contributor

nojb commented Mar 15, 2023

What is the status of this issue? Were the performance issues fixed by #11743?

@kayceesrk
Copy link
Contributor

There is also #12131 which improves performance. Would be useful to know from @maroneze whether the performance issues are improved.

@maroneze
Copy link
Contributor Author

Sorry, I didn't have much time to check it. Which specific version of the compiler should I try? 5.1.0~alpha1? 5.2.0+trunk? Or another version?

@gasche
Copy link
Member

gasche commented May 15, 2023

Right now the answer is "neither", #12131 has not been merged at all. I will try to merge today so that it appears in trunk (but maybe not the opam switch, I don't know how up-to-date that is). It is not part of alpha1.

@maroneze
Copy link
Contributor Author

Ok, thanks.

I believe the numbers in #11662 should still be relevant, but if you believe there have been other changes that might impact it, I can try re-running the benchmarks.

The conclusions from February were that there was still a time/memory regression, minimized when enabling our -deterministic option, but it still greatly increases memory consumption.

So, it seemed that overall, the performance issues were not entirely fixed by #11743, even if things definitely improved.

From our side, keeping this issue open would still be relevant (we are still remaining on 4.14 for the time being), but if you prefer to close it, I can try re-running the benchmarks when OCaml 5.1 will be released, and create a new issue if necessary. Whatever best suits you in terms of managing issues is fine for me.

@gasche
Copy link
Member

gasche commented May 15, 2023

#12131 is now merged in trunk (and the in-development 5.1 branch, so it should be part of the next beta release). This is a relatively small change that should improve the time performance of code using weak pointers (so, no change in deterministic mode) and made a 5-10% improvement on a weak-using benchmark (so, may not explain all the time difference between normal-4.14 and normal-5.x).

@kayceesrk
Copy link
Contributor

kayceesrk commented Jul 3, 2023

I can try re-running the benchmarks when OCaml 5.1 will be released,

Since the 5.1 alpha versions have been released, I would be curious to see the updated results. FWIW the beta version should be released sometime this week.

@kayceesrk kayceesrk added the Performance PR or issues affecting runtime performance of the compiled programs label Jul 3, 2023
@maroneze
Copy link
Contributor Author

Sorry for the delay. Here's some tests done with 5.2~beta2, compared to 4.14.2. The raw data is here: stats.ods

Screenshot_20240515_103100

The yellow lines (".eva" tasks) are the important ones, since they take the longest time and consume the most memory.

@gasche
Copy link
Member

gasche commented May 15, 2024

(I would be interested in a geometric average of the memory ratio (new / old) and of the time ratio (new / old).)

Skimming the results suggests that the obvious time regressions have been solved, 5.2 is in fact noticeably a few percent faster than 4.14 on average, but that the memory usage remains much higher, probably around 60% higher on average and up to 100% higher (double memory usage) in some cases.

I have no idea what would explain the higher memory usage, I guess that one should look deeper at GC statistics to get an idea. This might be simply a difference of GC pacing (it spends less time collecting memory), maybe it is possible to change the space_overhead GC parameter to go slower but use significantly less memory.

@maroneze
Copy link
Contributor Author

Just in case, there's some context in #11662.

In particular, there's option -deterministic, which applied to 4.14, also resulted in increased memory usage. Compared to 4.14 with -deterministic and 5.2~beta2 without it, 5.2 is better than 5.0.1 in terms of memory usage (e.g. debie1 was at +215%, now it's at +92%). But I think most of the fixes were already in 5.1.

I did not use option -deterministic at all in this latest benchmarking.

@kayceesrk
Copy link
Contributor

It would be useful to check whether GC compactions are triggered in 4.14 on the benchmarks using a lot of memory. In 5.2, GC compaction support has been added, but needs to be triggered manually with Gc.compact.

@maroneze
Copy link
Contributor Author

So, this means patching Frama-C to call Gc.stat at the end of its execution and print the compactions field? And then re-run the benchs with OCaml 4.14? Or is there another way to do it?

@gasche
Copy link
Member

gasche commented May 15, 2024

There is an OCaml setting to have it print some Gc statistics at the end using a runtime variable, no need to change the code. OCAMLRUNPARAM="v=0x400" should do the trick, I think. (See OCaml manual > runtime.)

@maroneze
Copy link
Contributor Author

Thanks! I just tried it with 4.14, and it does print compactions: 0 everywhere.

@kayceesrk
Copy link
Contributor

Useful to know. Thank you.

@maroneze
Copy link
Contributor Author

A colleague mentioned me this issue, and suggested I try OCAMLRUNPARAM="o=40". I just did a single run so far, but results are interesting: slightly slower than without it (as expected), with a much lower memory overhead (I added the blue columns at the end).

Screenshot_20240516_182020

In the worst case, instead of 100%, it's now 30%; on the other hand, the 5% speedup practically disappeared.

@gasche
Copy link
Member

gasche commented May 16, 2024

Yes, feel free to play with the space_overhead value (the o setting in OCAMLRUNPARAM) to see if you find a point that you like better on the time/memory tradeoff. (That also applies to OCaml 4.x, of course. The point is that the GCs are different enough that the time/memory curves look different, the behavior with similar space_overhead settings can be different, so you may need to adjust again.)

My optimistic feeling from your number is that we are looking at what is probably a "case closed": the numbers are now good enough to run Frama-C in production with OCaml 5, and we could consider closing the issue.

(The 30% and 11% examples are outliers in your delta graph, maybe there is still something interesting happening there, in the sense of a regression that could be fixed in 5.x to improve all programs? Unfortunately it is much harder to profile for memory overheads than for time overheads, so I wouldn't know how to look to find a root cause if it exists.)

@jberdine
Copy link
Contributor

I am not sure if I have followed developments closely enough, so apologies if this is just noise. My understanding is that 4.14 included a number of optimizations to prevent weak pointers and ephemerons from keeping values they refer to from being collected for 2-3 GC cycles after the point at which they would ideally be considered dead. If I understand correctly, some but not all of these have been reimplemented for 5.x (one concrete example of the sort of thing I have in mind is mentioned in this comment by @bobot), and also that some independent optimizations have been implemented for 5.x (e.g. @kayceesrk's #11743 and @NickBarnes's #12131). Is that correct? Would there be some value in tracking any remaining such missing optimizations? I worry that some value in exploring the design space and finding worthwhile optimizations that has already been done before might be lost, even though the bottom line benchmark numbers have currently caught up, but perhaps for not quite the same reasons.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance PR or issues affecting runtime performance of the compiled programs
Projects
None yet
Development

No branches or pull requests

7 participants