
Prefetching optimisations for sweeping #9934

Merged: 3 commits into ocaml:trunk on Feb 3, 2021

Conversation

stedolan
Contributor

This PR contains two patches that optimise sweep_slice: a small refactoring that moves some globals to locals, and a use of prefetching. The goal is to reduce cache misses during GC.

Sweeping is a linear traversal of memory, which should already be fast. However, it is not a normal linear traversal: the next pointer is known only once you've loaded the length from the current one, making the algorithm more like a linked list traversal. This defeats some hardware prefetching mechanisms: the address dependencies mean that the next load is not exposed until the current one returns data (meaning out-of-order execution doesn't help), and the stride is irregular since not all objects are the same size. Stream prefetching does help somewhat by noticing sequential accesses, but (on Intel) doesn't cross 4k page boundaries and doesn't always prefetch data all the way to L1. See the Intel optimisation manual for more details on hardware prefetching. (Currently, this code hasn't been benchmarked on AMD processors, and is a no-op on non-x86 architectures.)

The prefetching in this patch is very straightforward: it prefetches 4k ahead of the sweep pointer.
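To make the mechanism concrete, here is a minimal sketch of the idea (hypothetical names and a simplified header encoding, not the patch itself): each header load determines the address of the next one, so issuing a prefetch well ahead of the sweep pointer hides latency that the hardware prefetchers cannot.

#include <stdint.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define caml_prefetch(p) __builtin_prefetch((p), 1, 3)
#else
#define caml_prefetch(p)
#endif

typedef uintptr_t header_t;
/* Simplified decoding: block size in words, header included. */
#define Whsize_hd(hd) (((hd) >> 10) + 1)

static void sweep_sketch(header_t *hp, header_t *limit)
{
  while (hp < limit) {
    /* Request memory ~4k ahead now; by the time the sweep pointer
       reaches it, the cache line should already be in L1. */
    caml_prefetch((char *)hp + 4096);
    header_t hd = *hp;    /* this load determines the next address... */
    hp += Whsize_hd(hd);  /* ...so the stride is data-dependent */
  }
}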

On a small benchmark, this speeds up sweeping by around 25%. (Sweeping is about a quarter of the runtime of this benchmark, leading to a more modest overall improvement of a few percent).

This is a prelude to a more complicated patch that adds prefetching to marking, where it causes a more dramatic improvement.

(joint work with Will Hasenplaugh)


#ifdef CAML_INTERNALS
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define caml_prefetch(p) __builtin_prefetch((p), 1, 3)
Member

I think (but I have not experimented) that the MSVC equivalent is #include <winnt.h> and PreFetchCacheLine((p), PF_NON_TEMPORAL_LEVEL_ALL) (I'm not sure about the constant).

Contributor

Just curious: why only on x86? __builtin_prefetch exists on all GCC-supported platforms, even though it can be a no-op. And I'm sure ARM and others would benefit too.

Contributor

Also, I would document a bit:

#define caml_prefetch(p) __builtin_prefetch((p), 1, 3)
/* 1 = intent to write; 3 = all cache levels */

Contributor Author

> Just curious: why only on x86? __builtin_prefetch exists on all GCC-supported platforms, even though it can be a no-op. And I'm sure ARM and others would benefit too.

I left others out because I don't really know anything about non-x86 memory hierarchies. We can turn it on for ARM if you like, but I can't judge how much / whether it'll help, and I don't have the expertise / time to do any serious benchmarking.

(I'll add the comments)
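For reference, the macro with the suggested documentation, plus a guess at an MSVC branch, might look like the sketch below. Note that winnt.h's PreFetchCacheLine takes the cache level first, and (as noted above) the right constant is unconfirmed; PF_TEMPORAL_LEVEL_1 may be closer to GCC's locality hint of 3. The fallback is a no-op.

#ifdef CAML_INTERNALS
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* 1 = intent to write; 3 = high temporal locality (keep in all levels) */
#define caml_prefetch(p) __builtin_prefetch((p), 1, 3)
#elif defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <winnt.h>
/* Constant unconfirmed: PF_TEMPORAL_LEVEL_1 vs PF_NON_TEMPORAL_LEVEL_ALL */
#define caml_prefetch(p) PreFetchCacheLine(PF_TEMPORAL_LEVEL_1, (p))
#else
#define caml_prefetch(p)
#endif
#endif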

@stedolan
Contributor Author

Incidentally, when writing this I noticed another possible place for optimisation / cleanup, but haven't attempted it:

The caml_fl_merge_block function does a certain amount of work to determine whether the most recently found free block is mergeable with the current one (i.e. is it still free, and is it adjacent to the current block?). This information is known to the sweeper, as it processes blocks in order. We could possibly shave some more time off sweeping by changing the interface, and having the sweeper pass the previous block or NULL to caml_fl_merge_block, rather than the latter redetecting it. (This interface is somewhat subtle and difficult to debug, though, so this could be a delicate change)

@xavierleroy (Contributor) left a comment

Sounds interesting. Thanks for looking into this.



@shubhamkumar13
Contributor

The two graphs below show the normalized running time of the Sandmark benchmarks when #9934 is run with trunk as the baseline.

The first one is on an Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz.
[graph: normalized-turing-9934]
The following table gives trunk's raw running times on the same Intel system.

name time_secs gc.major_collections gc.major_words
0 lu-decomposition. 1.32719 11 7775014
1 levinson-durbin. 2.74204 1894 250063925
2 menhir.sql-parser 6.98978 23 59727559
3 fft. 4.37065 41 140412857
4 setrip.-enc_-rseed_1067894368 1.52092 9 13301
5 yojson_ydump.sample.json 0.76486 17 2979936
6 revcomp2. 2.9139 25 49759392
7 test_decompress.64_524_288 4.1468 555 105774397
8 LU_decomposition.1024 4.00685 4 4194470
9 game_of_life.256 11.927 2 2101398
10 regexredux2. 18.9592 32 240256398
11 grammatrix. 93.5887 26 68529245
12 evolutionary_algorithm.10000_10000 70.5337 32 1300975436
13 floyd_warshall.512 4.47551 22 6396941
14 quicksort.4000000 2.91457 0 4000146
15 mandelbrot6.16_000 40.6443 0 0
16 matrix_multiplication.1024 9.70372 6 3152023
17 fasta6.25_000_000 5.81388 1 25417192
18 qr-decomposition. 2.38781 63 2696897
19 cubicle.szymanski_at.cub 495.709 1532 6372966728
20 lexifi-g2pp. 17.1283 8 183647
21 durand-kerner-aberth. 0.1383 92 237481
22 nbody.50_000_000 7.47168 0 0
23 fannkuchredux2.12 94.0485 0 0
24 fannkuchredux.12 83.9848 0 0
25 knucleotide3. 43.9427 7 34073208
26 cpdf.blacktext 4.31274 14 27976325
27 bdd.26 5.3187 12 2471440
28 kb_no_exc. 2.5074 227 23320680
29 menhir.ocamly 235.341 35 1147662209
30 cpdf.scale 13.7828 33 94564877
31 zarith_pi.5000 1.47259 642 392716558
32 spectralnorm2.5_500 7.50886 5 121482
33 binarytrees5.21 11.8288 63 270360189
34 cubicle.german_pfs.cub 224.007 377 3254161011
35 cpdf.squeeze 16.3786 37 140393871
36 minilight.roomfront 22.2329 68 10639641
37 kb. 3.97839 329 24449437
38 pidigits5.10_000 6.02947 2691 1664137387
39 naive-multilayer. 4.17674 242 1293140
40 menhir.sysver 86.3455 63 467046511
41 sequence_cps.10000 1.58179 779 219311
42 knucleotide. 43.8687 13 50878638
43 fasta3.25_000_000 7.48009 0 551

The second graph is on an AMD EPYC 7702 64-Core Processor
[graph: normalized-sherwood-9934]

Similarly, trunk's raw running times on the AMD machine:

name time_secs gc.major_collections gc.major_words
0 menhir.sql-parser 4.95038 23 59727559
1 spectralnorm2.5_500 5.02338 5 121482
2 test_decompress.64_524_288 2.55676 555 105774397
3 minilight.roomfront 13.6939 68 10648042
4 sequence_cps.10000 1.25108 777 219410
5 lexifi-g2pp. 7.1579 8 183441
6 durand-kerner-aberth. 0.0961349 92 237481
7 cpdf.blacktext 3.01654 14 27976325
8 cpdf.squeeze 11.6939 37 140393871
9 knucleotide3. 27.4065 7 34073208
10 LU_decomposition.1024 2.77154 4 4194470
11 grammatrix. 50.1481 26 68529245
12 menhir.ocamly 173.388 35 1147662209
13 lu-decomposition. 0.850682 11 7775014
14 levinson-durbin. 1.57413 1894 250063925
15 kb. 2.58478 329 24449437
16 zarith_pi.5000 0.787027 642 392716558
17 setrip.-enc_-rseed_1067894368 0.771906 9 13307
18 binarytrees5.21 7.58026 63 270360189
19 matrix_multiplication.1024 4.2848 6 3152023
20 fannkuchredux.12 50.0576 0 0
21 fannkuchredux2.12 47.999 0 0
22 qr-decomposition. 1.27351 63 2696897
23 kb_no_exc. 1.63776 227 23320680
24 knucleotide. 27.7144 13 50878638
25 pidigits5.10_000 3.21054 2691 1664137387
26 nbody.50_000_000 4.87069 0 0
27 evolutionary_algorithm.10000_10000 44.0215 33 1300975511
28 yojson_ydump.sample.json 0.530261 17 2979936
29 revcomp2. 1.88539 25 49759392
30 menhir.sysver 58.3835 63 466977775
31 mandelbrot6.16_000 21.3379 0 0
32 quicksort.4000000 1.64895 0 4000146
33 cubicle.german_pfs.cub 183.453 376 3249735172
34 bdd.26 2.99953 12 2471440
35 naive-multilayer. 2.78721 239 1285722
36 floyd_warshall.512 2.68485 22 6396941
37 cubicle.szymanski_at.cub 317.303 1535 6373523793
38 regexredux2. 11.9367 32 240256398
39 fasta3.25_000_000 5.02575 0 551
40 game_of_life.256 9.40374 2 2101398
41 fft. 2.57991 41 140412857
42 fasta6.25_000_000 3.76762 1 25417192
43 cpdf.scale 9.81784 32 95330715

@xavierleroy
Contributor

Thank you for the benchmarking work and the nice graphics! But I don't know what to conclude from these. A 10% speedup on some tests is very nice indeed, but a 6% slowdown on some other tests is a concern.

Also, I'm surprised that the effect on performance can be that strong: typically, GC takes 30% of execution time, and sweeping takes less time than marking, so the whole sweeping phase should be 10% or so of total execution time, and improving the sweeping phase can hardly improve the total running time by 10%.
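(Concretely: if sweeping is a fraction f of the total time and is sped up by a factor s, the overall speedup is 1 / ((1 - f) + f/s); with f = 0.10, even s → ∞ gives only 1 / 0.9 ≈ 1.11, so larger swings in either direction must come from somewhere else.)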

@lpw25
Contributor

lpw25 commented Sep 30, 2020

@damiendoligez (Member) left a comment

This all looks good to me.

- if (caml_gc_sweep_hp < sweep_limit){
-   hp = caml_gc_sweep_hp;
+ if (sweep_hp < limit){
+   caml_prefetch(sweep_hp + 4096);
Member

IIUC this will prefetch the cache line at sweep_hp + 4k. Isn't this likely to conflict with the cache line that contains sweep_hp itself? I know caches are associative, but the associativity is rather low, so what is the probability of evicting sweep_hp itself when we still need it? Would it be hard to benchmark some variation of this number (for example 4032)?

Contributor Author

Excellent point, I'll try that.

@damiendoligez
Member

> The caml_fl_merge_block function does a certain amount of work to determine whether the most recently found free block is mergeable with the current one (i.e. is it still free, and is it adjacent to the current block?). This information is known to the sweeper, as it processes blocks in order. We could possibly shave some more time off sweeping by changing the interface, and having the sweeper pass the previous block or NULL to caml_fl_merge_block, rather than the latter redetecting it. (This interface is somewhat subtle and difficult to debug, though, so this could be a delicate change)

You have to be careful because caml_fl_merge (the "most recently found free block") is also modified by the allocation functions. If you find a more efficient API for this function, I'll gladly review the PR.

@stedolan
Contributor Author

stedolan commented Oct 5, 2020

> You have to be careful because caml_fl_merge (the "most recently found free block") is also modified by the allocation functions. If you find a more efficient API for this function, I'll gladly review the PR.

Yeah, that's the tricky subtlety I was referring to. An observation that might help is that allocation cannot occur during sweep_slice, so the synchronisation with the allocator only needs to happen at the start and end of that function, not in the inner loop of sweeping.
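A sketch of the shape this might take (hypothetical names; the real invariants in major_gc.c and freelist.c are subtler): snapshot the shared state into locals on entry, let the sweeper thread its own adjacency knowledge through the loop, and write the state back once on exit.

#include <stdint.h>

typedef uintptr_t header_t;
typedef long intnat;

extern header_t *caml_gc_sweep_hp;    /* name borrowed from major_gc.c */
extern header_t *sweep_limit_sketch;  /* hypothetical */

/* Hypothetical hinted merge: *prev_free is the previous free block if it
   is known to be free and adjacent, else NULL; updated as blocks are swept. */
extern header_t *sweep_one_block(header_t *hp, header_t **prev_free);

void sweep_slice_sketch(intnat work)
{
  /* Allocation cannot run inside this function, so the global is read
     once on entry... */
  header_t *sweep_hp = caml_gc_sweep_hp;
  header_t *prev_free = NULL;

  while (work-- > 0 && sweep_hp < sweep_limit_sketch)
    sweep_hp = sweep_one_block(sweep_hp, &prev_free);

  /* ...and written back once on exit, where the allocator can see it. */
  caml_gc_sweep_hp = sweep_hp;
}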

@gasche
Member

gasche commented Nov 25, 2020

What's the status of this PR? The CI failure is a missing Changes entry (and indeed this probably deserves one). Could we move ahead and eventually merge?

@xavierleroy
Contributor

Performance evaluation is inconclusive, to me at least. This is supposed to make programs run faster, and it's unclear it does.

@gasche
Member

gasche commented Nov 25, 2020

Ah, indeed, Damien's approval should probably be interpreted as "approval for correctness".

I looked at the graphs again. Several of the speedup results are found (in varying proportions) on both machines, but the most striking slowdown, bdd, is only found on the Intel machine. To me this suggests that the numbers may be partially noisy due to processor-specific code-cache effects. (Here the changes are in the runtime, so we probably cannot use the random-nop-padding approach used in another PR to avoid those.)

Looking at the change, it is of course possible that the addition of one prefetching instruction would result in wide variation, but there is a more invasive refactoring in sweep_slice that moves computations around to determine what to prefetch; this more invasive change is the likelier cause of processor-specific variation, if there is indeed any. This suggests that one could:

  • write a PR with the same reordering changes but without the prefetching instructions, and compare that one with this one (so only the prefetching instructions differ), and that one with trunk (so we see if there is a cost to the reordering)
  • consider benchmarking the change that adds prefetching to freelist.c by itself, as it requires no refactoring. This may be more stable, but my understanding is that Stephen expects less of an improvement from that one change; sweeping was the motivation for this PR.

@dra27
Member

dra27 commented Nov 25, 2020

@stedolan says it's a prelude to another PR - is it also a prerequisite?

@lpw25
Contributor

lpw25 commented Nov 25, 2020

I would like to raise a concern I have with using the Sandmark benchmark suite for assessing the performance of changes to the GC: why do we think this is a good and representative set of benchmarks for these kinds of changes? I've seen lots of great work from OCamlLabs on how to get reliable numbers out of these benchmarks, but I have yet to see any assessment of the quality of this particular set of benchmarks.

For example, most of these benchmarks seem to do very few major cycles and allocate very few major words. That does not resemble most of the real programs that I deal with.

It would also be good to see some analysis of the noise in these benchmarks, both between runs and between changes to code layout.

@gasche
Member

gasche commented Nov 25, 2020

Parroting the usual answer from Sandmark maintainers, and I think they have a point: if you think that a particular workflow is missing from their benchmark suite, you should probably contribute a benchmark to the suite.

@lpw25
Contributor

lpw25 commented Nov 25, 2020

> if you think that a particular workflow is missing from their benchmark suite, you should probably contribute a benchmark to the suite.

That doesn't help with assessing the quality of the benchmarks that are already in there. If the Sandmark maintainers are going to add benchmark results to other people's PRs then they need to provide some context as to why they think the numbers are relevant.

@xavierleroy
Contributor

The Sandmark numbers are better than nothing. (Many performance-oriented PRs came with no benchmarking whatsoever until recently.) But the numbers need to be interpreted! As I wrote earlier, the sweep phase is at most 10% of the total running time of the program, so variations of -10% / +6% in total running time don't just come from the sweep phase.

@kayceesrk
Contributor

kayceesrk commented Nov 27, 2020

I agree that it may not have been the best idea to run Sandmark on a PR not submitted by Multicore OCaml folks, especially when it remains difficult for the wider community to easily run the benchmarks on their end. But I am hoping that the process gets easier. We will refrain from running Sandmark on PRs not related to multicore.

That said, I wanted to bring the performance questions to a conclusion. I suspect that the original numbers were run on this PR and trunk as they stood at that point in time, and trunk may have included unrelated commits. So I reran Sandmark on this PR (commit b419956) and the commit that this PR is based on (7d9e60d). The commit history is here: https://github.com/stedolan/ocaml/commits/sweep-optimisation. The normalized running time graph is here:

[graph: normalized running time of this PR against the 7d9e60d baseline]

The baseline is 7d9e60d. The graph shows the performance impact of this PR against the baseline. Lower is better. The numbers in parentheses are the running times in seconds for the baseline version. Overall, there is a positive improvement.

I analysed the outliers in detail.

bdd

bdd is 2.6% slower in this PR according to the Sandmark run. bdd does not spend a significant amount of time in the GC; the function sweep_slice takes 0.15% of the running time as reported by perf. Here is the perf stat output:

$ perf stat ./_build/4.12.0+7d9e60d_1/benchmarks/bdd/bdd.exe 26 # BASELINE

 Performance counter stats for './_build/4.12.0+7d9e60d_1/benchmarks/bdd/bdd.exe 26':

       5169.389420      task-clock (msec)         #    1.000 CPUs utilized
                 9      context-switches          #    0.002 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             4,728      page-faults               #    0.915 K/sec
   11,34,59,02,093      cycles                    #    2.195 GHz
   22,49,12,29,020      instructions              #    1.98  insn per cycle
    4,95,24,82,879      branches                  #  958.040 M/sec
       5,87,93,810      branch-misses             #    1.19% of all branches

       5.169854280 seconds time elapsed

$ perf stat ./_build/4.12.0+b419956_1/benchmarks/bdd/bdd.exe 26 # THIS PR 

 Performance counter stats for './_build/4.12.0+b419956_1/benchmarks/bdd/bdd.exe 26':

       5470.182940      task-clock (msec)         #    1.000 CPUs utilized          
                10      context-switches          #    0.002 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
             4,727      page-faults               #    0.864 K/sec                  
   12,00,60,98,103      cycles                    #    2.195 GHz                    
   22,48,76,42,055      instructions              #    1.87  insn per cycle         
    4,95,23,07,655      branches                  #  905.328 M/sec                  
       7,80,00,231      branch-misses             #    1.58% of all branches        

       5.470739910 seconds time elapsed

There are more branch misses in the PR. However, the slowdown cannot be explained just by the changes introduced in this PR.

revcomp2

On the other hand, we see a 6.8% improvement on revcomp2. This improvement is real. On the baseline, the program spends ~10% of its time sweeping:

  47.64%  revcomp2.exe  revcomp2.exe      [.] camlDune__exe__Revcomp2__wr_238
  10.05%  revcomp2.exe  revcomp2.exe      [.] sweep_slice
   7.33%  revcomp2.exe  revcomp2.exe      [.] mark_slice
   6.71%  revcomp2.exe  revcomp2.exe      [.] caml_input_scan_line
   3.08%  revcomp2.exe  revcomp2.exe      [.] caml_page_table_lookup
   2.79%  revcomp2.exe  revcomp2.exe      [.] caml_oldify_one
   1.73%  revcomp2.exe  revcomp2.exe      [.] caml_alloc_string
   1.55%  revcomp2.exe  revcomp2.exe      [.] caml_alloc_shr_for_minor_gc

With this PR, sweeping takes 5.3% of the total time:

  52.14%  revcomp2.exe  revcomp2.exe      [.] camlDune__exe__Revcomp2__wr_238
   8.13%  revcomp2.exe  revcomp2.exe      [.] mark_slice
   6.85%  revcomp2.exe  revcomp2.exe      [.] caml_input_scan_line
   5.30%  revcomp2.exe  revcomp2.exe      [.] sweep_slice
   3.28%  revcomp2.exe  revcomp2.exe      [.] caml_page_table_lookup
   2.47%  revcomp2.exe  revcomp2.exe      [.] caml_oldify_one
   1.93%  revcomp2.exe  revcomp2.exe      [.] caml_alloc_shr_for_minor_gc

Conclusions

I'm doing an experiment to quantify the noise in Sandmark. This is especially fiddly to quantify accurately due to the microarchitectural optimisations on modern processors. See the work in #10039.

Given the overall improvement, I am for accepting this PR.

@lpw25
Contributor

lpw25 commented Nov 27, 2020

Thank you very much for the analysis KC. That makes things much clearer.

> I agree that it may not have been the best idea to run Sandmark on a PR not submitted by Multicore OCaml folks, especially when it remains difficult for the wider community to easily run the benchmarks on their end. But I am hoping that the process gets easier. We will refrain from running Sandmark on PRs not related to multicore.

I don't want to dissuade you too much from adding benchmark results to PRs.

I think my concern is mostly that dropping a benchmark results graph in a comment, without the much more involved work needed to investigate the results and provide context for people who are not familiar with the nature of the particular benchmarks, is as likely to do harm as it is to do good.

When that additional work is done -- as you have very helpfully done in your previous comment -- then the results start to become very useful and greatly appreciated.

@kayceesrk
Contributor

Thanks Leo. We'll make sure we provide an interpretation of the numbers and not just the raw results.

@stedolan
Contributor Author

I am unable to reproduce the bdd slowdown: on several Intel machines with different build settings, PIE vs. non-PIE, etc, I can detect no difference in the performance of bdd before and after this patch.

I suspect the observed slowdown may be caused by Intel's workaround for their JCC bug. Many recent Intel processors have a serious bug in their decoded instruction cache. Intel's workaround, distributed in a microcode update, is to disable the decoded instruction cache around jump instructions crossing a 32-byte boundary. This has a performance cost, which is usually low but has been observed to cause +/- 20% performance swings, particularly in microbenchmarks.

Configuring OCaml as follows inserts padding bytes to ensure that no jumps cross 32-byte boundaries, so the workaround never triggers:

./configure CC='gcc -Wa,-mbranches-within-32B' AS='as -mbranches-within-32B'

It might be worth building OCaml like this in future Sandmark runs on Intel processors.

@damiendoligez I played around with the offset number a bit and didn't notice any of the cache aliasing you mentioned. I've left it at 4000 just in case. The performance is not strongly affected by this parameter: it needs to be big enough that the prefetch has time to complete before the data is needed, and small enough that the data hasn't already fallen out of cache by the time sweeping gets there. I saw good results anywhere from 1k to 100k, and I left the parameter close to the bottom end of the range (very large values cause additional cache pollution, by prefetching many kilobytes beyond the end of the region being swept).
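Captured as code, that tuning amounts to something like the following (SWEEP_PREFETCH_DISTANCE is a made-up name for illustration):

/* Prefetch distance in bytes. Large enough for the prefetch to complete
   before the sweep pointer arrives; small enough that the line has not
   been evicted again by then (anything from ~1k to ~100k measured well).
   4000 rather than 4096 so the prefetched line avoids the exact cache
   set of sweep_hp itself; no aliasing was observed, kept just in case. */
#define SWEEP_PREFETCH_DISTANCE 4000

caml_prefetch((char *)sweep_hp + SWEEP_PREFETCH_DISTANCE);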

@stedolan
Contributor Author

@damiendoligez Does your "approved" still stand? (There was discussion since, but as far as I'm concerned this is ready to merge)

@damiendoligez
Member

Sure. Merging now. I'll make a note to do some benchmarking on other architectures.

@damiendoligez damiendoligez merged commit 8a90546 into ocaml:trunk Feb 3, 2021
garrigue pushed a commit to garrigue/ocaml that referenced this pull request Mar 3, 2021
smuenzel pushed a commit to smuenzel/ocaml that referenced this pull request Mar 30, 2021
stedolan added a commit to stedolan/ocaml that referenced this pull request Aug 10, 2021
poechsel pushed a commit to ocaml-flambda/ocaml that referenced this pull request Sep 3, 2021
chambart pushed a commit to chambart/ocaml-1 that referenced this pull request Sep 9, 2021
stedolan added a commit to stedolan/ocaml that referenced this pull request Oct 5, 2021
stedolan added a commit to stedolan/ocaml that referenced this pull request Dec 13, 2021
chambart pushed a commit to chambart/ocaml-1 that referenced this pull request Feb 1, 2022