
Restore prefetching for GC marking #11827

Merged (1 commit) on Feb 2, 2023

Conversation

@fabbing (Contributor) commented Dec 20, 2022

This PR restores the prefetching used in 4.14 to speed up the GC marking. See the initial PR Speed up GC by prefetching during marking #10195 for more explanation.

The tracing algorithm with prefetching

  • Attempt to pop a block from the prefetch_buffer with pb_pop if it is filled enough (pb_above_waterline) for the prefetching to have completed.
    • The popped block is marked.
    • A mark_entry is constructed from the block's fields.
  • Otherwise, pop a mark_entry from the mark_stack.
  • Otherwise, the mark_stack being empty, set the prefetch_buffer to draining mode with pb_drain and try again. Marking is complete if the prefetch_buffer is already in draining mode.
  • Attempt to prefetch each field of the mark_entry, adding them to the prefetch_buffer with pb_push. Fields that can't be prefetched because the prefetch_buffer is full (pb_full) are pushed back onto the mark_stack.

Changes to 4.14 implementation

The prefetch_buffer_t type is introduced to hold implementation details for the prefetch ring buffer.
prefetch_buffer_t operations have their own set of associated inline functions (pb_*) to increase readability.
A waterline member is added to the prefetch_buffer_t struct, playing the role of min_pb in 4.14.
Calling the function pb_drain to set the prefetch_buffer_t in draining mode is equivalent to setting the waterline (min_pb previously) to 0. Likewise, calling pb_fill resets it to filling mode, setting the waterline to PREFETCH_BUFFER_MIN (Pb_min previously).
The prefetch_buffer_t is reset to filling mode (pb_fill) as soon as it holds more than PREFETCH_BUFFER_MIN elements.
Also, Chunk_size was increased from 0x400 to 0x4000 to make better use of the prefetch_buffer.

Performance

Using @stedolan's micro-benchmark (with GC.major rather than GC.full_major) yields performance as good as 4.14:

  • Results on 4.14:
$ perf stat -- ./4.14.out
0.292 s/gc

 Performance counter stats for './4.14.out':

          6 882,45 msec task-clock                       #    1,000 CPUs utilized          
                21      context-switches                 #    3,051 /sec                   
                 4      cpu-migrations                   #    0,581 /sec                   
           204 288      page-faults                      #   29,682 K/sec                  
    31 417 903 050      cycles                           #    4,565 GHz                    
    79 234 045 125      instructions                     #    2,52  insn per cycle         
    15 569 722 764      branches                         #    2,262 G/sec                  
         1 747 089      branch-misses                    #    0,01% of all branches        
   156 756 988 485      slots                            #   22,776 G/sec                  
    62 702 795 394      topdown-retiring                 #     39,9% Retiring              
    10 450 465 899      topdown-bad-spec                 #      6,7% Bad Speculation       
     7 024 075 924      topdown-fe-bound                 #      4,5% Frontend Bound        
    76 841 661 022      topdown-be-bound                 #     48,9% Backend Bound         

       6,883408392 seconds time elapsed

       6,778832000 seconds user
       0,100007000 seconds sys
  • Results with this PR's latest commit:
0.297 s/gc

 Performance counter stats for './a.out':

          6 975,48 msec task-clock                       #    0,996 CPUs utilized          
               381      context-switches                 #   54,620 /sec                   
                 6      cpu-migrations                   #    0,860 /sec                   
           202 655      page-faults                      #   29,052 K/sec                  
    31 338 550 747      cycles                           #    4,493 GHz                    
    97 787 430 050      instructions                     #    3,12  insn per cycle         
    26 528 909 898      branches                         #    3,803 G/sec                  
         3 306 533      branch-misses                    #    0,01% of all branches        
   156 123 397 625      slots                            #   22,382 G/sec                  
    68 973 229 344      topdown-retiring                 #     39,0% Retiring              
    46 530 894 978      topdown-bad-spec                 #     26,3% Bad Speculation       
     9 527 351 720      topdown-fe-bound                 #      5,4% Frontend Bound        
    51 699 140 857      topdown-be-bound                 #     29,3% Backend Bound         

       7,002423462 seconds time elapsed

       6,727727000 seconds user
       0,239205000 seconds sys
  • Results on trunk (actually the parent commit to this PR):
$ perf stat -- ./97aa649.out 
1.051 s/gc

 Performance counter stats for './97aa649.out':

         22 742,08 msec task-clock                       #    1,000 CPUs utilized          
                54      context-switches                 #    2,374 /sec                   
                 3      cpu-migrations                   #    0,132 /sec                   
           202 708      page-faults                      #    8,913 K/sec                  
   104 298 566 270      cycles                           #    4,586 GHz                    
   101 710 911 014      instructions                     #    0,98  insn per cycle         
    24 616 971 674      branches                         #    1,082 G/sec                  
         3 696 047      branch-misses                    #    0,02% of all branches        
   521 433 112 175      slots                            #   22,928 G/sec                  
    85 977 421 213      topdown-retiring                 #     16,3% Retiring              
    36 807 043 212      topdown-bad-spec                 #      7,0% Bad Speculation       
     6 860 349 204      topdown-fe-bound                 #      1,3% Frontend Bound        
   397 861 864 068      topdown-be-bound                 #     75,4% Backend Bound         

      22,743431435 seconds time elapsed

      22,556285000 seconds user
       0,179963000 seconds sys

Acknowledgements

Thanks to @stedolan for his help, @ctk21 and @sadiqj for their advice, and @Engil for the initial bootstrap.

@gasche (Member) commented Dec 20, 2022

Could we also see the result of trunk on your micro-benchmark? (It's an optimization on top of the current runtime, so I would be curious to know how much it actually optimizes the current runtime.)

(cc @gadmm who is interested in prefetching)

@gasche (Member) left a review comment:

One obvious change in the PR compared to the same logic in mark_slice_darken is that mark_slice_darken does something careful and complex with Lazy/Forcing tags to account for concurrent update from a mutator forcing the thunk:

ocaml/runtime/major_gc.c

Lines 714 to 725 in e6340ce

```c
again:
  if (Tag_hd(chd) == Lazy_tag || Tag_hd(chd) == Forcing_tag) {
    if (!atomic_compare_exchange_strong(Hp_atomic_val(child), &chd,
          With_status_hd(chd, caml_global_heap_state.MARKED))) {
      chd = Hd_val(child);
      goto again;
    }
  } else {
    atomic_store_relaxed(
      Hp_atomic_val(child),
      With_status_hd(chd, caml_global_heap_state.MARKED));
  }
```

Your PR, on the other hand, does nothing of the sort: it handles Lazy/Forcing blocks like any other.

Can you explain why the Lazy/Forcing logic is not necessary anymore, or maybe was not needed in the first place?

@fabbing (Contributor, author) commented Dec 20, 2022

> Could we also see the result of trunk on your micro-benchmark? (It's an optimization on top of the current runtime, so I would be curious to know how much it actually optimizes the current runtime.)

Sure, I added results for the parent commit (97aa649e3c26e5f9ba2e3e5fc88115bc60afa5f9) to the benchmark above. We go from 22.742 seconds wall time to 6.889 (a ~70% improvement).

> Your PR on the other hand does nothing of the sort, it handles Lazy/Forcing blocks like any other.
>
> Can you explain why the Lazy/Forcing logic is not necessary anymore, or maybe was not needed in the first place?

Yes, indeed, that's something I forgot to mention: we dropped the Lazy/Forcing handling to simplify the prefetching code, as its performance gain may not be obvious.

@gasche (Member) commented Dec 20, 2022

> we dropped the lazy/forcing optimization to simplify the prefetching code as the performance gain may not be obvious.

Was it really an optimization, or was it there for correctness? The trunk code seems to be designed to avoid the issue where a concurrent update to the block tag (from Lazy to Forcing, or from Forcing to Forward) would "reset" the mark bits from MARKED to UNMARKED. Now with the PR code, I suppose that this could happen. Would it be unsafe if it happened? For example, could it result in the value being collected?

(For reference, the concurrent tag update would come from this code:

ocaml/runtime/obj.c

Lines 198 to 217 in e6340ce

```c
static int obj_update_tag (value blk, int old_tag, int new_tag)
{
  header_t hd;
  tag_t tag;
  SPIN_WAIT {
    hd = Hd_val(blk);
    tag = Tag_hd(hd);
    if (tag != old_tag) return 0;
    if (caml_domain_alone()) {
      Tag_val (blk) = new_tag;
      return 1;
    }
    if (atomic_compare_exchange_strong(Hp_atomic_val(blk), &hd,
                                       (hd & ~0xFF) | new_tag))
      return 1;
  }
}
```

)

@fabbing (Contributor, author) commented Dec 20, 2022

> Was it really an optimization, or was it there for correctness? The trunk code seems to be designed to avoid the issue where a concurrent update to the block tag (from Lazy to Forcing, or from Forcing to Forward) would "reset" the mark bits from MARKED to UNMARKED. Now with the PR code, I suppose that this could happen. Would it be unsafe if it happened, for example, could it result in the value being collected?
>
> (For reference, the concurrent tag update would come from obj_update_tag in runtime/obj.c, lines 198 to 217, quoted above.)

You may actually be very right; this doesn't seem to be an optimization!
If another thread is allowed to call obj_update_tag concurrently, the thread doing the marking could lose the new tag, since the hd passed to With_status_hd in the atomic_store_relaxed could be the old, stale value.

While trying to fix this, I went back to a refactoring of the code (to avoid code duplication) that I had previously deemed invalid and harmful to performance. It turns out performance only drops a little: we lose the optimization of caching the caml_global_heap_state global locally.

New benchmark results:

0.297 s/gc

 Performance counter stats for './a.out':

          6 975,48 msec task-clock                       #    0,996 CPUs utilized          
               381      context-switches                 #   54,620 /sec                   
                 6      cpu-migrations                   #    0,860 /sec                   
           202 655      page-faults                      #   29,052 K/sec                  
    31 338 550 747      cycles                           #    4,493 GHz                    
    97 787 430 050      instructions                     #    3,12  insn per cycle         
    26 528 909 898      branches                         #    3,803 G/sec                  
         3 306 533      branch-misses                    #    0,01% of all branches        
   156 123 397 625      slots                            #   22,382 G/sec                  
    68 973 229 344      topdown-retiring                 #     39,0% Retiring              
    46 530 894 978      topdown-bad-spec                 #     26,3% Bad Speculation       
     9 527 351 720      topdown-fe-bound                 #      5,4% Frontend Bound        
    51 699 140 857      topdown-be-bound                 #     29,3% Backend Bound         

       7,002423462 seconds time elapsed

       6,727727000 seconds user
       0,239205000 seconds sys

@fabbing (Contributor, author) commented Dec 20, 2022

I also updated the results for the 4.14 benchmark, as the comparison wasn't fair: that compiler hadn't been built with --disable-naked-pointers.

@gasche (Member) left a review comment:

Thanks for the change. The new version in fact looks easier to review.

(Inline review threads on runtime/major_gc.c: resolved.)
@shindere (Contributor) commented Dec 21, 2022 via email

@fabbing (Contributor, author) commented Dec 21, 2022

> Uninformed suggestion. If things are what you describe, wouldn't it be even clearer to have a queue_state variable and to define an enum with the possible values with meaningful names, e.g. FILL_MODE and DRAIN_MODE?

Ideally, probably yes. But this is performance-sensitive code, and an explicit mode would involve one more branch (testing the mode first, then the number of elements against the waterline).

@shindere (Contributor) commented Dec 21, 2022 via email

@sadiqj (Contributor) left a review comment:

This looks good @fabbing, and I think the implementation matches your description.

One question: do you have Sandmark (macro) benchmark results for runtime and latency? I'm interested in the latter because Chunk_size has increased significantly.

@fabbing (Contributor, author) commented Jan 16, 2023

Here are the pausetime_seq results from a Sandmark run, with 3 versions considered:

  • trunk: this PR parent commit (97aa649)
  • like414: this PR last commit (2f6bd79)
  • chunksize: this PR last commit and reverting the chunk_size to its previous value (0x400)

Normalized values are computed against the trunk value (i.e. max_normalized_like414 = max_like414 / max_trunk).

Full table
Benchmark Version Mean Max Max normalized
alt-ergo.fill.why chunksize 0 2,809855 0,89264798151075
alt-ergo.fill.why trunk 0 3,147775 1
alt-ergo.fill.why like414 0 2,463743 0,782693489846002
alt-ergo.yyll.why chunksize 0 5,062655 0,202622918906207
alt-ergo.yyll.why trunk 0 24,985599 1
alt-ergo.yyll.why like414 0 5,890047 0,235737674329921
bdd.26 chunksize 0 2,635775 0,587134944528164
bdd.26 trunk 0 4,489215 1
bdd.26 like414 0 2,922495 0,651003571893973
binarytrees5.21 chunksize 0 5,931007 0,647584934687462
binarytrees5.21 trunk 0 9,158655 1
binarytrees5.21 like414 0 4,407295 0,481216401316569
chameneos_redux_lwt.600000 chunksize 0 0,201727 0,981319958942826
chameneos_redux_lwt.600000 trunk 0 0,205567 1
chameneos_redux_lwt.600000 like414 0 0,207615 1,00996268856383
crout-decomposition. chunksize 0 0,563199 0,867507682354844
crout-decomposition. trunk 0 0,649215 1
crout-decomposition. like414 0 0,571391 0,880125998321049
cubicle.german_pfs.cub chunksize 0 7,065599 0,325226213284371
cubicle.german_pfs.cub trunk 0 21,725183 1
cubicle.german_pfs.cub like414 0 7,098367 0,32673450897974
cubicle.szymanski_at.cub chunksize 0 6,549503 0,44840153262266
cubicle.szymanski_at.cub trunk 0 14,606335 1
cubicle.szymanski_at.cub like414 0 4,665343 0,319405449758615
durand-kerner-aberth. chunksize 0 0,104127 0,796766319526808
durand-kerner-aberth. trunk 0 0,130687 1
durand-kerner-aberth. like414 0 0,144895 1,10871777606036
evolutionary_algorithm.10000_10000 chunksize 0 22,282239 1,30518235952673
evolutionary_algorithm.10000_10000 trunk 0 17,072127 1
evolutionary_algorithm.10000_10000 like414 0 17,055743 0,999040307045514
fannkuchredux.12 chunksize 0 0,047999 0,983606221438964
fannkuchredux.12 trunk 0 0,048799 1
fannkuchredux.12 like414 0 0,047295 0,969179696305252
fasta3.25_000_000 chunksize 0 0,127679 1,03475131897788
fasta3.25_000_000 trunk 0 0,123391 1
fasta3.25_000_000 like414 0 0,128575 1,04201278861505
fasta6.25_000_000 chunksize 0 0,143615 1,06049194006926
fasta6.25_000_000 trunk 0 0,135423 1
fasta6.25_000_000 like414 0 0,140799 1,03969783567046
fft. chunksize 0 5,201919 1,08269396435546
fft. trunk 0 4,804607 1
fft. like414 0 5,009407 1,0426257548224
floyd_warshall.512 chunksize 0 1,783807 0,62979018021221
floyd_warshall.512 trunk 0 2,832383 1
floyd_warshall.512 like414 0 1,686527 0,595444542634241
game_of_life.256 chunksize 0 0,907263 0,569408470115976
game_of_life.256 trunk 0 1,593343 1
game_of_life.256 like414 0 0,805887 0,505783751521173
grammatrix. chunksize 0 1,215487 1,01539778873987
grammatrix. trunk 0 1,197055 1
grammatrix. like414 0 1,266687 1,05816942412838
graph500_seq_kernel1.edges_data chunksize 2 316,932095 0,145873551730425
graph500_seq_kernel1.edges_data trunk 6 2172,649471 1
graph500_seq_kernel1.edges_data like414 2 254,148607 0,116976350944924
hamming.1_000_000 chunksize 0 3,414015 0,187051122425079
hamming.1_000_000 trunk 0 18,251775 1
hamming.1_000_000 like414 0 1,622015 0,0888688908339052
kb_no_exc. chunksize 0 0,817663 0,840526151856653
kb_no_exc. trunk 0 0,972799 1
kb_no_exc. like414 0 0,699903 0,719473395840251
kb. chunksize 0 0,486911 0,636972063604194
kb. trunk 0 0,764415 1
kb. like414 0 0,425727 0,556931771354565
knucleotide. chunksize 0 3,266559 1,02704443863831
knucleotide. trunk 0 3,180543 1
knucleotide. like414 0 3,147775 0,989697356709216
knucleotide3. chunksize 0 3,241983 0,973554727605947
knucleotide3. trunk 0 3,330047 1
knucleotide3. like414 0 3,203071 0,961869607245784
levinson-durbin. chunksize 0 0,081151 0,803037949631389
levinson-durbin. trunk 0 0,101055 1
levinson-durbin. like414 0 0,085375 0,84483696996685
lexifi-g2pp. chunksize 0 0,335871 0,988696274185999
lexifi-g2pp. trunk 0 0,339711 1
lexifi-g2pp. like414 0 0,335615 0,987942692465066
LU_decomposition.1024 chunksize 0 0,771071 1,15049678681363
LU_decomposition.1024 trunk 0 0,670207 1
LU_decomposition.1024 like414 0 0,662527 0,988540853795917
mandelbrot6.16_000 chunksize 0 0,047711 0,993999874997396
mandelbrot6.16_000 trunk 0 0,047999 1
mandelbrot6.16_000 like414 0 0,047743 0,994666555553241
matrix_multiplication.1024 chunksize 0 1,471487 0,474884166289349
matrix_multiplication.1024 trunk 0 3,098623 1
matrix_multiplication.1024 like414 0 1,406975 0,454064595789807
menhir.ocamly chunksize 0 378,535935 0,401111110476506
menhir.ocamly trunk 0 943,718399 1
menhir.ocamly like414 0 337,117183 0,35722222154111
menhir.sql-parser chunksize 0 18,644991 2,53029481579051
menhir.sql-parser trunk 0 7,368703 1
menhir.sql-parser like414 0 20,283391 2,75264059360243
menhir.sysver chunksize 0 35,684351 1,05933852316232
menhir.sysver trunk 0 33,685503 1
menhir.sysver like414 0 9,371647 0,278210095304203
mergesort.67108864 chunksize 0 64,782335 0,374148368830531
mergesort.67108864 trunk 0 173,146111 1
mergesort.67108864 like414 0 57,671679 0,333080995391228
minilight.roomfront chunksize 0 0,244351 0,609513713067013
minilight.roomfront trunk 0 0,400895 1
minilight.roomfront like414 0 0,208639 0,52043303109293
naive-multilayer. chunksize 0 0,196607 1,00721315170672
naive-multilayer. trunk 0 0,195199 1
naive-multilayer. like414 0 0,195583 1,00196722319274
nbody.50_000_000 chunksize 0 0,136959 1,0907244737869
nbody.50_000_000 trunk 0 0,125567 1
nbody.50_000_000 like414 0 0,125503 0,999490311945017
nqueens.14 chunksize 0 0,065215 1,01191676881779
nqueens.14 trunk 0 0,064447 1
nqueens.14 like414 0 0,066687 1,03475724238522
pidigits5.10_000 chunksize 0 0,188671 1,06889089065271
pidigits5.10_000 trunk 0 0,176511 1
pidigits5.10_000 like414 0 0,178687 1,01232784359048
qr-decomposition. chunksize 0 0,324351 1,12024798383615
qr-decomposition. trunk 0 0,289535 1
qr-decomposition. like414 0 0,296703 1,02475693784862
quicksort.4000000 chunksize 0 0,128959 1,0974953830966
quicksort.4000000 trunk 0 0,117503 1
quicksort.4000000 like414 0 0,121279 1,03213534973575
regexredux2. chunksize 1 22,904831 0,193521587870496
regexredux2. trunk 1 118,358015 1
regexredux2. like414 1 23,134207 0,195459572382994
revcomp2. chunksize 0 1,273855 0,322613936330439
revcomp2. trunk 0 3,948543 1
revcomp2. like414 0 1,259519 0,318983230016743
sequence_cps.10000 chunksize 0 0,053951 0,961779124699171
sequence_cps.10000 trunk 0 0,056095 1
sequence_cps.10000 like414 0 0,053055 0,945806221588377
soli.2000 chunksize 0 0,084735 0,927819812322752
soli.2000 trunk 0 0,091327 1
soli.2000 like414 0 0,082751 0,906095678167464
spectralnorm2.5_500 chunksize 0 0,111871 0,965745560648832
spectralnorm2.5_500 trunk 0 0,115839 1
spectralnorm2.5_500 like414 0 0,113791 0,982320289367139
test_decompress.64_524_288 chunksize 0 0,279807 1,10404081455498
test_decompress.64_524_288 trunk 0 0,253439 1
test_decompress.64_524_288 like414 0 0,245375 0,968181692636098
test_lwt.200 chunksize 0 0,627199 0,511695508053588
test_lwt.200 trunk 0 1,225727 1
test_lwt.200 like414 0 0,614399 0,501252725933263
thread_ring_lwt_mvar.20_000 chunksize 0 0,305663 1,00251889994916
thread_ring_lwt_mvar.20_000 trunk 0 0,304895 1
thread_ring_lwt_mvar.20_000 like414 0 0,313343 1,02770789944079
thread_ring_lwt_stream.20_000 chunksize 0 0,693759 0,970630331402125
thread_ring_lwt_stream.20_000 trunk 0 0,714751 1
thread_ring_lwt_stream.20_000 like414 0 1,620991 2,26791008337169
yojson_ydump.sample.json chunksize 0 0,454911 1,0434528301454
yojson_ydump.sample.json trunk 0 0,435967 1
yojson_ydump.sample.json like414 0 0,452863 1,03875522688644
zarith_pi.10_000 chunksize 0 0,224639 1,00400459455715
zarith_pi.10_000 trunk 0 0,223743 1
zarith_pi.10_000 like414 0 0,169087 0,755719732013962

The average over all benchmarks:

| Statistic | Version | Average |
| --- | --- | --- |
| Mean | like414 | 0,986666666666667 |
| Mean | chunksize | 0,986666666666667 |
| Max | like414 | 0,820132697954935 |
| Max | chunksize | 0,840627800426733 |

And some graphs to better visualize the results per benchmark:

[graph: normalized Mean pause time per benchmark]

[graph: normalized Max pause time per benchmark]

@gasche (Member) commented Jan 16, 2023

The results look good overall; there are a few regressions in "minimum times", but those quantities are so small that it is not really concerning. On the other hand, menhir.sql-parser has a strong regression in "maximum" pause time. I wonder what is going on?

@stedolan (Contributor) commented:

An increase in Chunk_size is necessary for performance. The prefetch buffer can hold 256 elements, but if it's only allowed to mark 1024 words then it barely has time to fill before it starts draining again, and spends very little time in the efficient steady-state. It's necessary to do significantly more work before draining to get good performance.

Having said that, @fabbing and I didn't put much thought into the new value (at least originally, perhaps @fabbing has since then). So there might be a better value to choose. It might also be worth moving the test for interrupts directly into the marking loop - it's pretty cheap, and it would save draining and refilling the prefetch buffer just to check for interrupts.

(As a general note: this is one bit of code where performance probably matters more than avoiding duplication. Many OCaml programs spend a double-digit percentage of their time in this function, so optimisation here is more beneficial than almost anywhere else. Refactoring that preserves performance is of course great, but if that turns out to be hard I'd choose the fast and ugly version here)

@fabbing (Contributor, author) commented Jan 23, 2023

I edited my previous comment to remove some values from the table and graph: I had tried to get more accurate figures but got them wrong. Sorry about that.

Inlining performance

My last commit (1d30b83) yields better performance than the previous refactoring, but not as good as manually inlining mark_slice_darken_only inside do_some_marking: the manually inlined code is ~10% better in the micro-benchmark.
In the following table and graph, lazy_fix is the code of this PR's first commit, including the proper handling of lazy values and correct work accounting (diff).
@stedolan's micro-benchmark is used with 20 iterations of Gc.major and run 10 times under perf:

Raw table

| | trunk | like414 | lazy_fix |
| --- | --- | --- | --- |
| GC time (s/gc) | 1,7079 | 0,7739 | 0,6685 |
| Task-clock (ms) | 38315,979974 | 18003,776654 | 16155,050673 |
| Cycles | 68968490094 | 32406647395 | 29078944642 |

Normalized table

| | trunk | like414 / trunk | lazy_fix / trunk | like414 / lazy_fix |
| --- | --- | --- | --- | --- |
| GC time (s/gc) | 1 | 45,31 % | 39,14 % | 115,77 % |
| Task-clock (ms) | 1 | 46,99 % | 42,16 % | 111,44 % |
| Cycles | 1 | 46,99 % | 42,16 % | 111,44 % |
[graph: GC time, task-clock and cycles for trunk, like414 and lazy_fix]

Impact of Chunk_size

To try to pinpoint a better Chunk_size value, I ran the micro-benchmark under perf with different Chunk_size values to visualize their impact on the benchmark time.
The benefit of a bigger Chunk_size starts to diminish around 4864 (0x1300).

Raw table
Chunk size GC time Task clock Cycles Time
1024 0,6979 16805,328572 30249448592 16,806483024
1280 0,6951 16732,013482 30117479617 16,733136599
1536 0,6937 16706,811324 30072127855 16,707922556
1792 0,6871 16567,36066 29821102200 16,568566201
2048 0,6841 16495,469809 29691704798 16,496622062
2304 0,6828 16466,350823 29639305615 16,467323184
2560 0,6814 16445,11087 29601059354 16,44638652
2816 0,6834 16474,487396 29653951267 16,475394348
3072 0,681 16421,606056 29558744516 16,42291328
3328 0,6793 16395,230906 29511263906 16,396623443
3584 0,6769 16342,616445 29416559047 16,343734888
3840 0,6792 16384,124633 29491275896 16,385201654
4096 0,6786 16379,60165 29483131708 16,380887947
4352 0,6762 16324,52771 29384012068 16,325613344
4608 0,6773 16351,347737 29432286122 16,352412785
4864 0,6746 16294,414584 29329812773 16,295357699
5120 0,6766 16334,798273 29402493093 16,335836607
5376 0,6777 16348,556629 29427252052 16,349665815
5632 0,6752 16299,866932 29339602207 16,301248907
5888 0,6736 16261,006062 29269664349 16,262259327
6144 0,6755 16325,16908 29385158645 16,326185925
6400 0,6751 16294,153145 29329342761 16,295169579
6656 0,674 16276,200831 29297010143 16,277461059
6912 0,6747 16280,793755 29305284957 16,282081765
7168 0,6747 16291,833534 29325172093 16,292688599
7424 0,6716 16219,457316 29194876488 16,220700768
7680 0,6762 16321,729245 29378970165 16,322965445
7936 0,6742 16276,702341 29297925168 16,277860709
8192 0,673 16248,306974 29246795417 16,24975623
8704 0,6732 16258,789049 29265670274 16,260099216
9216 0,6767 16322,196059 29379783405 16,324041907
9728 0,6716 16228,873859 29211847207 16,229761962
10240 0,6715 16213,640809 29184423046 16,214532721
10752 0,6733 16249,577139 29249100762 16,250830843
11264 0,6715 16220,96639 29197603817 16,222014844
11776 0,6727 16239,63162 29231212658 16,240520013
12288 0,6736 16260,950064 29269576454 16,262168886
12800 0,6704 16195,032308 29150928408 16,195974763
13312 0,6721 16230,044766 29213949460 16,230952626
13824 0,6731 16244,866726 29240625767 16,245814942
14336 0,6709 16202,85012 29164995133 16,203966744
14848 0,6705 16195,705349 29152148467 16,196400558
15360 0,6725 16241,234658 29234093471 16,242182077
15872 0,671 16209,758984 29177429382 16,210854804
16384 0,6699 16183,95254 29130970062 16,185223053
16896 0,673 16246,022189 29242698579 16,247179946
17408 0,6701 16175,515303 29115786722 16,176731544
17920 0,6717 16221,438155 29198430208 16,223005145
18432 0,6733 16256,313798 29261245569 16,257071129
18944 0,6693 16174,134664 29113311102 16,175170861
19456 0,6709 16209,219319 29176459096 16,210471309
19968 0,6728 16237,37891 29227151440 16,23837567
[graph chunk_size_cycles: GC time and cycle count as a function of Chunk_size]

I also ran the micro-benchmarks under olly to get GC pause times. Unlike the benchmarks under perf, these are from a single run, so they are probably noisier.
Oddly enough, increasing Chunk_size doesn't increase the various GC pause times.

Raw table
chunk_size Gc time (s/gc) mean (ms) stddev (ms) min (ms) max (ms)
1024 0,691 19,7 106,48 0 726,66
1280 0,686 19,52 105,63 0 720,37
1536 0,685 19,51 105,56 0 719,85
1792 0,677 19,33 104,35 0 710,93
2048 0,676 19,3 104,19 0 710,41
2304 0,675 19,37 103,92 0 707,79
2560 0,674 19,23 103,86 0 708,31
2816 0,676 19,27 104,09 0 708,84
3072 0,673 19,2 103,68 0 706,74
3328 0,671 19,15 103,32 0 704,64
3584 0,669 19,08 102,97 0 701,5
3840 0,671 19,14 103,32 0 704,12
4096 0,672 19,17 103,46 0 704,64
4352 0,667 19,05 102,79 0 700,45
4608 0,669 19,09 103,06 0 702,02
4864 0,666 19,02 102,56 0 699,92
5120 0,667 19,06 102,75 0 699,92
5376 0,67 19,11 103,15 0 703,07
5632 0,667 19,15 102,66 0 699,4
5888 0,665 19 102,46 0 698,35
6144 0,667 19,05 102,76 0 699,92
6400 0,666 19,03 102,66 0 699,92
6656 0,665 18,99 102,42 0 697,3
6912 0,667 19,05 102,75 0 700,45
7168 0,667 19,09 102,77 0 699,92
7424 0,666 19,02 102,59 0 699,92
7680 0,67 19,11 103,19 0 702,55
7936 0,668 19,07 102,91 0 701,5
8192 0,667 19,05 102,77 0 700,45
8704 0,666 19,03 102,64 0 699,4
9216 0,67 19,12 103,28 0 703,59
9728 0,667 19,16 102,73 0 700,45
10240 0,665 19 102,47 0 702,55
10752 0,668 19,07 102,92 0 702,55
11264 0,664 18,98 102,34 0 697,3
11776 0,665 18,98 102,4 0 697,83
12288 0,666 19,02 102,6 0 698,88
12800 0,662 18,91 101,9 0 694,68
13312 0,664 18,96 102,23 0 696,25
13824 0,664 18,97 102,33 0 697,3
14336 0,663 19,25 102,13 0 696,78
14848 0,664 18,98 102,29 0 697,3
15360 0,666 19,03 102,63 0 699,4
15872 0,662 18,92 101,97 0 694,68
16384 0,661 18,9 101,83 0 694,68
16896 0,664 18,97 102,35 0 697,3
17408 0,661 18,89 101,83 0 693,63
17920 0,662 18,93 102,02 0 695,21
18432 0,666 19 102,54 0 698,88
18944 0,661 18,9 101,81 0 694,16
19456 0,662 18,91 101,95 0 694,16
19968 0,664 18,96 102,25 0 696,78
[graph chunk_size_max_latency: GC time and max latency as a function of Chunk_size]

[graph chunk_size_mean_latency: GC time and mean latency as a function of Chunk_size]

Pathological GC max pause time

I also re-ran the menhir.sql_parser and thread_ring_lwt_stream.20_000 benchmarks locally and confirmed the considerable increase in max pause time. It affects the prefetching code with either the new Chunk_size or the previous, unmodified one.
I'll investigate to understand what happens here.

@sadiqj (Contributor) commented Jan 30, 2023

Thanks for running all of these, @fabbing. From the GC pause time and runtime data, it looks like there's little reason not to stick with 0x4000?

@fabbing (Contributor, author) commented Jan 31, 2023

I re-ran the benchmarks on a different host: the results for menhir.sql_parser and thread_ring_lwt_stream.20_000 now show the usual maximum latency, while another benchmark now shows unexpectedly high latency. So this doesn't seem to be related to this PR.
If increasing Chunk_size has a negative effect on maximum latency, it seems to be globally compensated by the gain from prefetching.
Should I rebase on trunk and squash into one atomic commit?

Co-authored-by: Stephen Dolan <sdolan@janestreet.com>
Co-authored-by: Gabriel Scherer <gasche@github>
@fabbing (Contributor, author) commented Feb 2, 2023

I think it's ready to merge. Thanks to all the reviewers.
I've rebased on trunk and squashed into an atomic commit (backup of the original branch here).

@sadiqj sadiqj merged commit 70f89a4 into ocaml:trunk Feb 2, 2023
@sadiqj (Contributor) commented Feb 2, 2023

Thanks @fabbing and @stedolan , good to finally have this in trunk.

@gasche (Member) commented Feb 2, 2023

Has someone done a full review of the patch? (Maybe @sadiqj? In that case it is best to use GitHub's "Approve" feature to make this explicit.)

@sadiqj (Contributor) commented Feb 3, 2023

Sorry @gasche, I did a review but it seems I left it as a comment rather than an approval.
