
Restore prefetching for GC marking #11827

Merged (1 commit) on Feb 2, 2023

Conversation

@fabbing (Contributor) commented Dec 20, 2022

This PR restores the prefetching used in 4.14 to speed up the GC marking. See the initial PR Speed up GC by prefetching during marking #10195 for more explanation.

The tracing algorithm with prefetching

  • Attempt to pop a block from the prefetch_buffer with pb_pop if it is filled enough (pb_above_waterline) for the prefetching to have completed.
    • The popped block is marked.
    • A mark_entry is constructed from the block's fields.
  • Otherwise, pop a mark_entry from the mark_stack.
  • Otherwise, the mark_stack being empty, set the prefetch_buffer to draining mode with pb_drain and try again. Marking is complete if the prefetch_buffer is already in draining mode.
  • Attempt to prefetch each field of the mark_entry, adding them to the prefetch_buffer with pb_push. Fields that can't be prefetched because the prefetch_buffer is full (pb_full) are pushed back onto the mark_stack.

Changes to 4.14 implementation

The prefetch_buffer_t type is introduced to hold implementation details for the prefetch ring buffer.
prefetch_buffer_t operations have their own set of associated inline functions (pb_*) to increase readability.
A waterline member is added to the prefetch_buffer_t struct, playing the role of min_pb in 4.14.
Calling the function pb_drain to set the prefetch_buffer_t in draining mode is equivalent to setting the waterline (min_pb previously) to 0. Likewise, calling pb_fill resets it to filling mode, setting the waterline to PREFETCH_BUFFER_MIN (Pb_min previously).
The prefetch_buffer_t is reset to filling mode (pb_fill) as soon as it holds more than PREFETCH_BUFFER_MIN elements.
Also, Chunk_size was increased from 0x400 to 0x4000 to make better use of the prefetch_buffer.

Performance

Using @stedolan's micro-benchmark (with GC.major rather than GC.full_major) yields performance as good as 4.14:

  • Results on 4.14:
$ perf stat -- ./4.14.out
0.292 s/gc

 Performance counter stats for './4.14.out':

          6 882,45 msec task-clock                       #    1,000 CPUs utilized          
                21      context-switches                 #    3,051 /sec                   
                 4      cpu-migrations                   #    0,581 /sec                   
           204 288      page-faults                      #   29,682 K/sec                  
    31 417 903 050      cycles                           #    4,565 GHz                    
    79 234 045 125      instructions                     #    2,52  insn per cycle         
    15 569 722 764      branches                         #    2,262 G/sec                  
         1 747 089      branch-misses                    #    0,01% of all branches        
   156 756 988 485      slots                            #   22,776 G/sec                  
    62 702 795 394      topdown-retiring                 #     39,9% Retiring              
    10 450 465 899      topdown-bad-spec                 #      6,7% Bad Speculation       
     7 024 075 924      topdown-fe-bound                 #      4,5% Frontend Bound        
    76 841 661 022      topdown-be-bound                 #     48,9% Backend Bound         

       6,883408392 seconds time elapsed

       6,778832000 seconds user
       0,100007000 seconds sys
  • Results with this PR's latest commit:
0.297 s/gc

 Performance counter stats for './a.out':

          6 975,48 msec task-clock                       #    0,996 CPUs utilized          
               381      context-switches                 #   54,620 /sec                   
                 6      cpu-migrations                   #    0,860 /sec                   
           202 655      page-faults                      #   29,052 K/sec                  
    31 338 550 747      cycles                           #    4,493 GHz                    
    97 787 430 050      instructions                     #    3,12  insn per cycle         
    26 528 909 898      branches                         #    3,803 G/sec                  
         3 306 533      branch-misses                    #    0,01% of all branches        
   156 123 397 625      slots                            #   22,382 G/sec                  
    68 973 229 344      topdown-retiring                 #     39,0% Retiring              
    46 530 894 978      topdown-bad-spec                 #     26,3% Bad Speculation       
     9 527 351 720      topdown-fe-bound                 #      5,4% Frontend Bound        
    51 699 140 857      topdown-be-bound                 #     29,3% Backend Bound         

       7,002423462 seconds time elapsed

       6,727727000 seconds user
       0,239205000 seconds sys
  • Results on trunk (actually the parent commit to this PR):
$ perf stat -- ./97aa649.out 
1.051 s/gc

 Performance counter stats for './97aa649.out':

         22 742,08 msec task-clock                       #    1,000 CPUs utilized          
                54      context-switches                 #    2,374 /sec                   
                 3      cpu-migrations                   #    0,132 /sec                   
           202 708      page-faults                      #    8,913 K/sec                  
   104 298 566 270      cycles                           #    4,586 GHz                    
   101 710 911 014      instructions                     #    0,98  insn per cycle         
    24 616 971 674      branches                         #    1,082 G/sec                  
         3 696 047      branch-misses                    #    0,02% of all branches        
   521 433 112 175      slots                            #   22,928 G/sec                  
    85 977 421 213      topdown-retiring                 #     16,3% Retiring              
    36 807 043 212      topdown-bad-spec                 #      7,0% Bad Speculation       
     6 860 349 204      topdown-fe-bound                 #      1,3% Frontend Bound        
   397 861 864 068      topdown-be-bound                 #     75,4% Backend Bound         

      22,743431435 seconds time elapsed

      22,556285000 seconds user
       0,179963000 seconds sys

Acknowledgements

Thanks to @stedolan for his help, @ctk21 and @sadiqj for their advice, and @Engil for the initial bootstrap.

@gasche (Member) commented Dec 20, 2022

Could we also see the result of trunk on your micro-benchmark? (It's an optimization on top of the current runtime, so I would be curious to know how much it actually optimizes the current runtime.)

(cc @gadmm who is interested in prefetching)

@gasche (Member) left a review comment:

One obvious change in the PR compared to the same logic in mark_slice_darken is that mark_slice_darken does something careful and complex with Lazy/Forcing tags to account for concurrent update from a mutator forcing the thunk:

ocaml/runtime/major_gc.c

Lines 714 to 725 in e6340ce

```c
again:
  if (Tag_hd(chd) == Lazy_tag || Tag_hd(chd) == Forcing_tag) {
    if (!atomic_compare_exchange_strong(Hp_atomic_val(child), &chd,
          With_status_hd(chd, caml_global_heap_state.MARKED))) {
      chd = Hd_val(child);
      goto again;
    }
  } else {
    atomic_store_relaxed(
      Hp_atomic_val(child),
      With_status_hd(chd, caml_global_heap_state.MARKED));
  }
```

Your PR, on the other hand, does nothing of the sort: it handles Lazy/Forcing blocks like any other.

Can you explain why the Lazy/Forcing logic is not necessary anymore, or maybe was not needed in the first place?

@fabbing (Contributor, author) commented Dec 20, 2022

> Could we also see the result of trunk on your micro-benchmark? (It's an optimization on top of the current runtime, so I would be curious to know how much it actually optimizes the current runtime.)

Sure, I added results for the parent commit (97aa649e3c26e5f9ba2e3e5fc88115bc60afa5f9) to the benchmark above. We go from 22.742 seconds wall time to 6.889 (a ~70% improvement).

> Your PR on the other hand does nothing of the sort, it handles Lazy/Forcing blocks like any other.
>
> Can you explain why the Lazy/Forcing logic is not necessary anymore, or maybe was not needed in the first place?

Yes, indeed, that's something I forgot to mention: we dropped the Lazy/Forcing handling to simplify the prefetching code, as its performance gain may not be obvious.

@gasche (Member) commented Dec 20, 2022

> we dropped the lazy/forcing optimization to simplify the prefetching code as the performance gain may not be obvious.

Was it really an optimization, or was it there for correctness? The trunk code seems to be designed to avoid the issue where a concurrent update to the block tag (from Lazy to Forcing, or from Forcing to Forward) would "reset" the mark bits from MARKED to UNMARKED. Now with the PR code, I suppose that this could happen. Would it be unsafe if it happened? For example, could it result in the value being collected?

(For reference, the concurrent tag update would come from this code:

ocaml/runtime/obj.c

Lines 198 to 217 in e6340ce

```c
static int obj_update_tag (value blk, int old_tag, int new_tag)
{
  header_t hd;
  tag_t tag;
  SPIN_WAIT {
    hd = Hd_val(blk);
    tag = Tag_hd(hd);
    if (tag != old_tag) return 0;
    if (caml_domain_alone()) {
      Tag_val (blk) = new_tag;
      return 1;
    }
    if (atomic_compare_exchange_strong(Hp_atomic_val(blk), &hd,
                                       (hd & ~0xFF) | new_tag))
      return 1;
  }
}
```

)

@fabbing (Contributor, author) commented Dec 20, 2022

> Was it really an optimization, or was it there for correctness? The trunk code seems to be designed to avoid the issue where a concurrent update to the block tag (from Lazy to Forcing, or from Forcing to Forward) would "reset" the mark bits from MARKED to UNMARKED. Now with the PR code, I suppose that this could happen. Would it be unsafe if it happened, for example, could it result in the value being collected?
>
> (For reference, the concurrent tag update would come from obj_update_tag in runtime/obj.c, lines 198 to 217, quoted above.)

You may actually be very right; this doesn't seem to be an optimization!
If another thread is allowed to call obj_update_tag concurrently, the thread doing the marking could lose the new tag, since the hd passed to With_status_hd in the atomic_store_relaxed could be the old, stale value.

While trying to fix this, I went back to a refactoring of the code (to avoid code duplication) that I had previously deemed invalid and harmful to performance. It turns out performance only drops a little: we lose the optimization of caching the caml_global_heap_state global locally.

New benchmark results:

0.297 s/gc

 Performance counter stats for './a.out':

          6 975,48 msec task-clock                       #    0,996 CPUs utilized          
               381      context-switches                 #   54,620 /sec                   
                 6      cpu-migrations                   #    0,860 /sec                   
           202 655      page-faults                      #   29,052 K/sec                  
    31 338 550 747      cycles                           #    4,493 GHz                    
    97 787 430 050      instructions                     #    3,12  insn per cycle         
    26 528 909 898      branches                         #    3,803 G/sec                  
         3 306 533      branch-misses                    #    0,01% of all branches        
   156 123 397 625      slots                            #   22,382 G/sec                  
    68 973 229 344      topdown-retiring                 #     39,0% Retiring              
    46 530 894 978      topdown-bad-spec                 #     26,3% Bad Speculation       
     9 527 351 720      topdown-fe-bound                 #      5,4% Frontend Bound        
    51 699 140 857      topdown-be-bound                 #     29,3% Backend Bound         

       7,002423462 seconds time elapsed

       6,727727000 seconds user
       0,239205000 seconds sys

@fabbing (Contributor, author) commented Dec 20, 2022

I also updated the results for the 4.14 benchmark, as the comparison wasn't fair: that compiler hadn't been built with --disable-naked-pointers.

@gasche (Member) left a review comment:

Thanks for the change. The new version in fact looks easier to review.

(Inline review threads on runtime/major_gc.c: resolved.)
@shindere (Contributor) commented Dec 21, 2022 via email

@fabbing (Contributor, author) commented Dec 21, 2022

> Uninformed suggestion. If things are what you describe, wouldn't it be even clearer to have a queue_state variable and to define an enum with the possible values with meaningful names, e.g. FILL_MODE and DRAIN_MODE?

Ideally, probably yes. But this is performance-sensitive code, and an explicit mode would involve one more branch (testing the mode first, then the number of elements against the waterline).

@shindere (Contributor) commented Dec 21, 2022 via email

@sadiqj (Contributor) left a review comment:

This looks good @fabbing, and I think the implementation matches your description.

One question: do you have Sandmark (macro) benchmark results for runtime and latency? I'm interested in the latter because Chunk_size has increased significantly.

@fabbing (Contributor, author) commented Jan 16, 2023

Here are the pausetime_seq results from a Sandmark run, with 3 versions considered:

  • trunk: this PR parent commit (97aa649)
  • like414: this PR last commit (2f6bd79)
  • chunksize: this PR last commit and reverting the chunk_size to its previous value (0x400)

Normalized values are computed against the trunk value (i.e. max_normalized_like414 = max_like414 / max_trunk).

Full table
Benchmark Version Mean Max Max normalized
alt-ergo.fill.why chunksize 0 2,809855 0,89264798151075
alt-ergo.fill.why trunk 0 3,147775 1
alt-ergo.fill.why like414 0 2,463743 0,782693489846002
alt-ergo.yyll.why chunksize 0 5,062655 0,202622918906207
alt-ergo.yyll.why trunk 0 24,985599 1
alt-ergo.yyll.why like414 0 5,890047 0,235737674329921
bdd.26 chunksize 0 2,635775 0,587134944528164
bdd.26 trunk 0 4,489215 1
bdd.26 like414 0 2,922495 0,651003571893973
binarytrees5.21 chunksize 0 5,931007 0,647584934687462
binarytrees5.21 trunk 0 9,158655 1
binarytrees5.21 like414 0 4,407295 0,481216401316569
chameneos_redux_lwt.600000 chunksize 0 0,201727 0,981319958942826
chameneos_redux_lwt.600000 trunk 0 0,205567 1
chameneos_redux_lwt.600000 like414 0 0,207615 1,00996268856383
crout-decomposition. chunksize 0 0,563199 0,867507682354844
crout-decomposition. trunk 0 0,649215 1
crout-decomposition. like414 0 0,571391 0,880125998321049
cubicle.german_pfs.cub chunksize 0 7,065599 0,325226213284371
cubicle.german_pfs.cub trunk 0 21,725183 1
cubicle.german_pfs.cub like414 0 7,098367 0,32673450897974
cubicle.szymanski_at.cub chunksize 0 6,549503 0,44840153262266
cubicle.szymanski_at.cub trunk 0 14,606335 1
cubicle.szymanski_at.cub like414 0 4,665343 0,319405449758615
durand-kerner-aberth. chunksize 0 0,104127 0,796766319526808
durand-kerner-aberth. trunk 0 0,130687 1
durand-kerner-aberth. like414 0 0,144895 1,10871777606036
evolutionary_algorithm.10000_10000 chunksize 0 22,282239 1,30518235952673
evolutionary_algorithm.10000_10000 trunk 0 17,072127 1
evolutionary_algorithm.10000_10000 like414 0 17,055743 0,999040307045514
fannkuchredux.12 chunksize 0 0,047999 0,983606221438964
fannkuchredux.12 trunk 0 0,048799 1
fannkuchredux.12 like414 0 0,047295 0,969179696305252
fasta3.25_000_000 chunksize 0 0,127679 1,03475131897788
fasta3.25_000_000 trunk 0 0,123391 1
fasta3.25_000_000 like414 0 0,128575 1,04201278861505
fasta6.25_000_000 chunksize 0 0,143615 1,06049194006926
fasta6.25_000_000 trunk 0 0,135423 1
fasta6.25_000_000 like414 0 0,140799 1,03969783567046
fft. chunksize 0 5,201919 1,08269396435546
fft. trunk 0 4,804607 1
fft. like414 0 5,009407 1,0426257548224
floyd_warshall.512 chunksize 0 1,783807 0,62979018021221
floyd_warshall.512 trunk 0 2,832383 1
floyd_warshall.512 like414 0 1,686527 0,595444542634241
game_of_life.256 chunksize 0 0,907263 0,569408470115976
game_of_life.256 trunk 0 1,593343 1
game_of_life.256 like414 0 0,805887 0,505783751521173
grammatrix. chunksize 0 1,215487 1,01539778873987
grammatrix. trunk 0 1,197055 1
grammatrix. like414 0 1,266687 1,05816942412838
graph500_seq_kernel1.edges_data chunksize 2 316,932095 0,145873551730425
graph500_seq_kernel1.edges_data trunk 6 2172,649471 1
graph500_seq_kernel1.edges_data like414 2 254,148607 0,116976350944924
hamming.1_000_000 chunksize 0 3,414015 0,187051122425079
hamming.1_000_000 trunk 0 18,251775 1
hamming.1_000_000 like414 0 1,622015 0,0888688908339052
kb_no_exc. chunksize 0 0,817663 0,840526151856653
kb_no_exc. trunk 0 0,972799 1
kb_no_exc. like414 0 0,699903 0,719473395840251
kb. chunksize 0 0,486911 0,636972063604194
kb. trunk 0 0,764415 1
kb. like414 0 0,425727 0,556931771354565
knucleotide. chunksize 0 3,266559 1,02704443863831
knucleotide. trunk 0 3,180543 1
knucleotide. like414 0 3,147775 0,989697356709216
knucleotide3. chunksize 0 3,241983 0,973554727605947
knucleotide3. trunk 0 3,330047 1
knucleotide3. like414 0 3,203071 0,961869607245784
levinson-durbin. chunksize 0 0,081151 0,803037949631389
levinson-durbin. trunk 0 0,101055 1
levinson-durbin. like414 0 0,085375 0,84483696996685
lexifi-g2pp. chunksize 0 0,335871 0,988696274185999
lexifi-g2pp. trunk 0 0,339711 1
lexifi-g2pp. like414 0 0,335615 0,987942692465066
LU_decomposition.1024 chunksize 0 0,771071 1,15049678681363
LU_decomposition.1024 trunk 0 0,670207 1
LU_decomposition.1024 like414 0 0,662527 0,988540853795917
mandelbrot6.16_000 chunksize 0 0,047711 0,993999874997396
mandelbrot6.16_000 trunk 0 0,047999 1
mandelbrot6.16_000 like414 0 0,047743 0,994666555553241
matrix_multiplication.1024 chunksize 0 1,471487 0,474884166289349
matrix_multiplication.1024 trunk 0 3,098623 1
matrix_multiplication.1024 like414 0 1,406975 0,454064595789807
menhir.ocamly chunksize 0 378,535935 0,401111110476506
menhir.ocamly trunk 0 943,718399 1
menhir.ocamly like414 0 337,117183 0,35722222154111
menhir.sql-parser chunksize 0 18,644991 2,53029481579051
menhir.sql-parser trunk 0 7,368703 1
menhir.sql-parser like414 0 20,283391 2,75264059360243
menhir.sysver chunksize 0 35,684351 1,05933852316232
menhir.sysver trunk 0 33,685503 1
menhir.sysver like414 0 9,371647 0,278210095304203
mergesort.67108864 chunksize 0 64,782335 0,374148368830531
mergesort.67108864 trunk 0 173,146111 1
mergesort.67108864 like414 0 57,671679 0,333080995391228
minilight.roomfront chunksize 0 0,244351 0,609513713067013
minilight.roomfront trunk 0 0,400895 1
minilight.roomfront like414 0 0,208639 0,52043303109293
naive-multilayer. chunksize 0 0,196607 1,00721315170672
naive-multilayer. trunk 0 0,195199 1
naive-multilayer. like414 0 0,195583 1,00196722319274
nbody.50_000_000 chunksize 0 0,136959 1,0907244737869
nbody.50_000_000 trunk 0 0,125567 1
nbody.50_000_000 like414 0 0,125503 0,999490311945017
nqueens.14 chunksize 0 0,065215 1,01191676881779
nqueens.14 trunk 0 0,064447 1
nqueens.14 like414 0 0,066687 1,03475724238522
pidigits5.10_000 chunksize 0 0,188671 1,06889089065271
pidigits5.10_000 trunk 0 0,176511 1
pidigits5.10_000 like414 0 0,178687 1,01232784359048
qr-decomposition. chunksize 0 0,324351 1,12024798383615
qr-decomposition. trunk 0 0,289535 1
qr-decomposition. like414 0 0,296703 1,02475693784862
quicksort.4000000 chunksize 0 0,128959 1,0974953830966
quicksort.4000000 trunk 0 0,117503 1
quicksort.4000000 like414 0 0,121279 1,03213534973575
regexredux2. chunksize 1 22,904831 0,193521587870496
regexredux2. trunk 1 118,358015 1
regexredux2. like414 1 23,134207 0,195459572382994
revcomp2. chunksize 0 1,273855 0,322613936330439
revcomp2. trunk 0 3,948543 1
revcomp2. like414 0 1,259519 0,318983230016743
sequence_cps.10000 chunksize 0 0,053951 0,961779124699171
sequence_cps.10000 trunk 0 0,056095 1
sequence_cps.10000 like414 0 0,053055 0,945806221588377
soli.2000 chunksize 0 0,084735 0,927819812322752
soli.2000 trunk 0 0,091327 1
soli.2000 like414 0 0,082751 0,906095678167464
spectralnorm2.5_500 chunksize 0 0,111871 0,965745560648832
spectralnorm2.5_500 trunk 0 0,115839 1
spectralnorm2.5_500 like414 0 0,113791 0,982320289367139
test_decompress.64_524_288 chunksize 0 0,279807 1,10404081455498
test_decompress.64_524_288 trunk 0 0,253439 1
test_decompress.64_524_288 like414 0 0,245375 0,968181692636098
test_lwt.200 chunksize 0 0,627199 0,511695508053588
test_lwt.200 trunk 0 1,225727 1
test_lwt.200 like414 0 0,614399 0,501252725933263
thread_ring_lwt_mvar.20_000 chunksize 0 0,305663 1,00251889994916
thread_ring_lwt_mvar.20_000 trunk 0 0,304895 1
thread_ring_lwt_mvar.20_000 like414 0 0,313343 1,02770789944079
thread_ring_lwt_stream.20_000 chunksize 0 0,693759 0,970630331402125
thread_ring_lwt_stream.20_000 trunk 0 0,714751 1
thread_ring_lwt_stream.20_000 like414 0 1,620991 2,26791008337169
yojson_ydump.sample.json chunksize 0 0,454911 1,0434528301454
yojson_ydump.sample.json trunk 0 0,435967 1
yojson_ydump.sample.json like414 0 0,452863 1,03875522688644
zarith_pi.10_000 chunksize 0 0,224639 1,00400459455715
zarith_pi.10_000 trunk 0 0,223743 1
zarith_pi.10_000 like414 0 0,169087 0,755719732013962

The average over all benchmarks:

| Statistic | Version | Average |
| --- | --- | --- |
| Mean | like414 | 0,986666666666667 |
| Mean | chunksize | 0,986666666666667 |
| Max | like414 | 0,820132697954935 |
| Max | chunksize | 0,840627800426733 |

And some graphs to better visualize the results per benchmark:

[graph: normalized Mean pause time per benchmark]

[graph: normalized Max pause time per benchmark]

@gasche (Member) commented Jan 16, 2023

The results look good overall; there are a few regressions in "minimum times", but those quantities are so small that it is not really concerning. On the other hand, menhir.sql-parser has a strong regression in "maximum" pause time. I wonder what is going on?

@stedolan (Contributor) commented:

An increase in Chunk_size is necessary for performance. The prefetch buffer can hold 256 elements, but if it's only allowed to mark 1024 words then it barely has time to fill before it starts draining again, and spends very little time in the efficient steady-state. It's necessary to do significantly more work before draining to get good performance.

Having said that, @fabbing and I didn't put much thought into the new value (at least originally, perhaps @fabbing has since then). So there might be a better value to choose. It might also be worth moving the test for interrupts directly into the marking loop - it's pretty cheap, and it would save draining and refilling the prefetch buffer just to check for interrupts.

(As a general note: this is one bit of code where performance probably matters more than avoiding duplication. Many OCaml programs spend a double-digit percentage of their time in this function, so optimisation here is more beneficial than almost anywhere else. Refactoring that preserves performance is of course great, but if that turns out to be hard I'd choose the fast and ugly version here)

@fabbing (Contributor, author) commented Jan 23, 2023

I edited my previous comment to remove some values from the table and graph: I had tried to get more accurate figures but got them wrong. Sorry about that.

Inlining performance

My last commit (1d30b83) yields better performance than the previous refactoring, but not as good as manually inlining mark_slice_darken_only inside do_some_marking: the manually inlined code is ~10% better in the micro-benchmark.
In the following table and graph, lazy_fix is the code of this PR's first commit, including the proper handling of lazy values and correct work accounting (diff).
@stedolan's micro-benchmark is used with 20 iterations of Gc.major and run 10 times under perf:

Raw table

| | trunk | like414 | lazy_fix |
| --- | --- | --- | --- |
| GC time (s/gc) | 1,7079 | 0,7739 | 0,6685 |
| Task-clock (ms) | 38315,979974 | 18003,776654 | 16155,050673 |
| Cycles | 68968490094 | 32406647395 | 29078944642 |

Normalized table

| | trunk | like414 / trunk | lazy_fix / trunk | like414 / lazy_fix |
| --- | --- | --- | --- | --- |
| GC time (s/gc) | 1 | 45,31 % | 39,14 % | 115,77 % |
| Task-clock (ms) | 1 | 46,99 % | 42,16 % | 111,44 % |
| Cycles | 1 | 46,99 % | 42,16 % | 111,44 % |
[graph: GC time, task-clock and cycles for trunk, like414 and lazy_fix]

Impact of Chunk_size

To try to pinpoint a better Chunk_size value, I ran the micro-benchmark under perf with different Chunk_size values to visualize their impact on the benchmark time.
The benefit of a bigger Chunk_size starts to diminish around 4864 (0x1300).

Raw table
Chunk size GC time Task clock Cycles Time
1024 0,6979 16805,328572 30249448592 16,806483024
1280 0,6951 16732,013482 30117479617 16,733136599
1536 0,6937 16706,811324 30072127855 16,707922556
1792 0,6871 16567,36066 29821102200 16,568566201
2048 0,6841 16495,469809 29691704798 16,496622062
2304 0,6828 16466,350823 29639305615 16,467323184
2560 0,6814 16445,11087 29601059354 16,44638652
2816 0,6834 16474,487396 29653951267 16,475394348
3072 0,681 16421,606056 29558744516 16,42291328
3328 0,6793 16395,230906 29511263906 16,396623443
3584 0,6769 16342,616445 29416559047 16,343734888
3840 0,6792 16384,124633 29491275896 16,385201654
4096 0,6786 16379,60165 29483131708 16,380887947
4352 0,6762 16324,52771 29384012068 16,325613344
4608 0,6773 16351,347737 29432286122 16,352412785
4864 0,6746 16294,414584 29329812773 16,295357699
5120 0,6766 16334,798273 29402493093 16,335836607
5376 0,6777 16348,556629 29427252052 16,349665815
5632 0,6752 16299,866932 29339602207 16,301248907
5888 0,6736 16261,006062 29269664349 16,262259327
6144 0,6755 16325,16908 29385158645 16,326185925
6400 0,6751 16294,153145 29329342761 16,295169579
6656 0,674 16276,200831 29297010143 16,277461059
6912 0,6747 16280,793755 29305284957 16,282081765
7168 0,6747 16291,833534 29325172093 16,292688599
7424 0,6716 16219,457316 29194876488 16,220700768
7680 0,6762 16321,729245 29378970165 16,322965445
7936 0,6742 16276,702341 29297925168 16,277860709
8192 0,673 16248,306974 29246795417 16,24975623
8704 0,6732 16258,789049 29265670274 16,260099216
9216 0,6767 16322,196059 29379783405 16,324041907
9728 0,6716 16228,873859 29211847207 16,229761962
10240 0,6715 16213,640809 29184423046 16,214532721
10752 0,6733 16249,577139 29249100762 16,250830843
11264 0,6715 16220,96639 29197603817 16,222014844
11776 0,6727 16239,63162 29231212658 16,240520013
12288 0,6736 16260,950064 29269576454 16,262168886
12800 0,6704 16195,032308 29150928408 16,195974763
13312 0,6721 16230,044766 29213949460 16,230952626
13824 0,6731 16244,866726 29240625767 16,245814942
14336 0,6709 16202,85012 29164995133 16,203966744
14848 0,6705 16195,705349 29152148467 16,196400558
15360 0,6725 16241,234658 29234093471 16,242182077
15872 0,671 16209,758984 29177429382 16,210854804
16384 0,6699 16183,95254 29130970062 16,185223053
16896 0,673 16246,022189 29242698579 16,247179946
17408 0,6701 16175,515303 29115786722 16,176731544
17920 0,6717 16221,438155 29198430208 16,223005145
18432 0,6733 16256,313798 29261245569 16,257071129
18944 0,6693 16174,134664 29113311102 16,175170861
19456 0,6709 16209,219319 29176459096 16,210471309
19968 0,6728 16237,37891 29227151440 16,23837567
[graph chunk_size_cycles: GC time and cycle count as a function of Chunk_size]

I also ran the micro-benchmarks under olly to get GC pause times. Unlike the benchmarks under perf, these are from a single run, so they are probably noisier.
Oddly enough, increasing Chunk_size doesn't increase the various GC pause times.

Raw table
chunk_size Gc time (s/gc) mean (ms) stddev (ms) min (ms) max (ms)
1024 0,691 19,7 106,48 0 726,66
1280 0,686 19,52 105,63 0 720,37
1536 0,685 19,51 105,56 0 719,85
1792 0,677 19,33 104,35 0 710,93
2048 0,676 19,3 104,19 0 710,41
2304 0,675 19,37 103,92 0 707,79
2560 0,674 19,23 103,86 0 708,31
2816 0,676 19,27 104,09 0 708,84
3072 0,673 19,2 103,68 0 706,74
3328 0,671 19,15 103,32 0 704,64
3584 0,669 19,08 102,97 0 701,5
3840 0,671 19,14 103,32 0 704,12
4096 0,672 19,17 103,46 0 704,64
4352 0,667 19,05 102,79 0 700,45
4608 0,669 19,09 103,06 0 702,02
4864 0,666 19,02 102,56 0 699,92
5120 0,667 19,06 102,75 0 699,92
5376 0,67 19,11 103,15 0 703,07
5632 0,667 19,15 102,66 0 699,4
5888 0,665 19 102,46 0 698,35
6144 0,667 19,05 102,76 0 699,92
6400 0,666 19,03 102,66 0 699,92
6656 0,665 18,99 102,42 0 697,3
6912 0,667 19,05 102,75 0 700,45
7168 0,667 19,09 102,77 0 699,92
7424 0,666 19,02 102,59 0 699,92
7680 0,67 19,11 103,19 0 702,55
7936 0,668 19,07 102,91 0 701,5
8192 0,667 19,05 102,77 0 700,45
8704 0,666 19,03 102,64 0 699,4
9216 0,67 19,12 103,28 0 703,59
9728 0,667 19,16 102,73 0 700,45
10240 0,665 19 102,47 0 702,55
10752 0,668 19,07 102,92 0 702,55
11264 0,664 18,98 102,34 0 697,3
11776 0,665 18,98 102,4 0 697,83
12288 0,666 19,02 102,6 0 698,88
12800 0,662 18,91 101,9 0 694,68
13312 0,664 18,96 102,23 0 696,25
13824 0,664 18,97 102,33 0 697,3
14336 0,663 19,25 102,13 0 696,78
14848 0,664 18,98 102,29 0 697,3
15360 0,666 19,03 102,63 0 699,4
15872 0,662 18,92 101,97 0 694,68
16384 0,661 18,9 101,83 0 694,68
16896 0,664 18,97 102,35 0 697,3
17408 0,661 18,89 101,83 0 693,63
17920 0,662 18,93 102,02 0 695,21
18432 0,666 19 102,54 0 698,88
18944 0,661 18,9 101,81 0 694,16
19456 0,662 18,91 101,95 0 694,16
19968 0,664 18,96 102,25 0 696,78
[graph chunk_size_max_latency: GC time and max latency as a function of Chunk_size]

[graph chunk_size_mean_latency: GC time and mean latency as a function of Chunk_size]

Pathological GC max pause time

I also re-ran the menhir.sql_parser and thread_ring_lwt_stream.20_000 benchmarks locally and confirmed the considerable increase in max pause time. It affects the prefetching code with either the new Chunk_size or the previous, unmodified one.
I'll investigate to understand what happens here.

@sadiqj (Contributor) commented Jan 30, 2023

Thanks for running all of these, @fabbing. From the GC pause time and runtime data, it looks like there's little reason not to stick with 0x4000?

@fabbing (Contributor, author) commented Jan 31, 2023

I re-ran the benchmarks on a different host: the results for menhir.sql_parser and thread_ring_lwt_stream.20_000 now show the usual maximum latency, while another benchmark now shows unexpectedly high latency. So this doesn't seem to be related to this PR.
If increasing Chunk_size has a negative effect on maximum latency, it seems to be globally compensated by the gain from prefetching.
Should I rebase on trunk and squash into one atomic commit?

Co-authored-by: Stephen Dolan <sdolan@janestreet.com>
Co-authored-by: Gabriel Scherer <gasche@github>
@fabbing (Contributor, author) commented Feb 2, 2023

I think it's ready to merge. Thanks to all the reviewers.
I've rebased on trunk and squashed into an atomic commit (backup of the original branch here).

@sadiqj sadiqj merged commit 70f89a4 into ocaml:trunk Feb 2, 2023
@sadiqj (Contributor) commented Feb 2, 2023

Thanks @fabbing and @stedolan , good to finally have this in trunk.

@gasche (Member) commented Feb 2, 2023

Has someone done a full review of the patch? (Maybe @sadiqj? In that case it is best to use GitHub's "Approve" feature to make this explicit.)

@sadiqj (Contributor) commented Feb 3, 2023

Sorry @gasche, I did a review but it seems I left it as a comment rather than an approval.
