Make List.{map,mapi,map2} TRMC #11362
Conversation
One more related discussion at #10528.
Approved. This definitely needs a Changes entry.
@nojb you need to provide performance numbers for short lists too. In symbolic computation applications, most lists have length <= 2.
Here are the results of using …
Thanks, assuming no blockers arise, I was planning to add other functions to this PR:
There are a few other functions in List that are currently implemented with the usual idiom of tail recursion followed by a reverse at the end, resulting in two traversals:
I think so; I am not sure yet how far we wish to go with applying TRMC in the stdlib, but they certainly seem like candidates for it.
I had gotten my tests mixed up again! I amended the note above with the correct benchmark results: TRMC appears around 20-25% slower for these lengths.
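For reference, the non-unrolled TRMC version being benchmarked here presumably looks like the following sketch (the `[@tail_mod_cons]` annotation, available since OCaml 4.14, is the mechanism this PR relies on; the exact code under test may differ):

```ocaml
(* Natural recursive style; [@tail_mod_cons] lets the compiler turn the
   "cons, then recurse" shape into a constant-stack loop by passing the
   destination cell to the recursive call. *)
let[@tail_mod_cons] rec map f = function
  | [] -> []
  | x :: xs -> f x :: map f xs
```

With this definition, `map succ (List.init 1_000_000 Fun.id)` runs in constant stack space, unlike the classic non-tail-recursive `map`.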
If you need to win against the non-tail-recursive version on small inputs, a trick that works surprisingly well is manual unrolling:

```ocaml
let[@tail_mod_cons] rec map_unrolled f = function
  | [] -> []
  | [x0] ->
      let y0 = f x0 in
      [y0]
  | [x0; x1] ->
      let y0 = f x0 in
      let y1 = f x1 in
      [y0; y1]
  | [x0; x1; x2] ->
      let y0 = f x0 in
      let y1 = f x1 in
      let y2 = f x2 in
      [y0; y1; y2]
  | x0 :: x1 :: x2 :: xs ->
      let y0 = f x0 in
      let y1 = f x1 in
      let y2 = f x2 in
      y0 :: y1 :: y2 :: map_unrolled f xs
```
I tried this with 4.14 (nojb/list_trmc@55cf03d) and indeed the unrolled TRMC version is faster on short lengths.
There are interesting GC effects related to the write barrier that create a performance disadvantage for TRMC when: …

This may very well be what you observe in your benchmark. I would try writing the result of … For more details, see the discussion of "Promotion" in the second TRMC PR.
Thanks for pointing me in the direction of that interesting discussion, which I had missed! Indeed, by doing this, I find that the (unrolled) TRMC'ed List.map is always faster than the non-tail-recursive List.map (numbers below). Given this, are we happy with switching to this implementation (unrolled + TRMC)? If yes, are we happy with switching the other amenable functions in …
For me, using TMC is a good idea, but manual unrolling is less clear. (I wonder if flambda's unrolling control suffices -- for flambda users.) I would not suggest, as a general policy, systematically unrolling stdlib functions, especially since the code is going to get more complex for other functions than for … My proposal would be as follows:
Thanks @gasche. After spending some time benchmarking different functions from …, I updated the PR making …
I also updated the docstrings to stop describing these functions as non-tail-recursive. @gasche, do you mind taking another pass at this PR?
```ocaml
      let r1 = f a1 b1 in
      let r2 = f a2 b2 in
      let r3 = f a3 b3 in
      r1 :: r2 :: r3 :: map2 f l1 l2
  | (_, _) -> invalid_arg "List.map2"
```
I wonder what the impact of unrolling on code size is in this case of a partial pattern matching followed by a catch-all. Risks I see include (1) noticeably more pattern-matching-related code being produced (because the unrolled cases are not an exhaustive covering) and (2) performance overhead. In your place I would (1) ensure that you have a benchmark for this one compared to the non-unrolled variant, (2) eyeball the -dlambda or -dcmm output to ensure that there is no blowup, and (3) consider not unrolling this one.
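For comparison, a non-unrolled TRMC map2 would be a one-call-per-step version along these lines (a sketch, keeping the stdlib convention of raising Invalid_argument on lists of different lengths):

```ocaml
(* Non-unrolled TRMC map2: one application of f per recursive step.
   Smaller generated code than the unrolled variant, at some speed cost. *)
let[@tail_mod_cons] rec map2 f l1 l2 =
  match l1, l2 with
  | [], [] -> []
  | a1 :: l1, b1 :: l2 ->
      let r1 = f a1 b1 in
      r1 :: map2 f l1 l2
  | _, _ -> invalid_arg "List.map2"
```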
Good point. Performance numbers below (roughly: unrolled TRMC > stdlib > TRMC; bigger is faster). As for code size, there does indeed seem to be quite an increase. I wasn't sure how to quantify it, but the linear code is 107 lines for TRMC (no unrolling) vs 420 lines for TRMC (with unrolling).
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 1 │ 5.78ns │ 3.00w │ 67.52% │
│ map2 trmc 1 │ 8.56ns │ 3.00w │ 100.00% │
│ map2 stdlib 1 │ 7.09ns │ 3.00w │ 82.92% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 2 │ 8.73ns │ 6.00w │ 72.10% │
│ map2 trmc 2 │ 12.11ns │ 6.00w │ 100.00% │
│ map2 stdlib 2 │ 10.32ns │ 6.00w │ 85.21% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 3 │ 10.80ns │ 9.00w │ 64.55% │
│ map2 trmc 3 │ 16.73ns │ 9.00w │ 100.00% │
│ map2 stdlib 3 │ 13.61ns │ 9.00w │ 81.33% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 4 │ 15.35ns │ 12.00w │ 74.18% │
│ map2 trmc 4 │ 20.69ns │ 12.00w │ 100.00% │
│ map2 stdlib 4 │ 17.10ns │ 12.00w │ 82.67% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 5 │ 17.18ns │ 15.00w │ 71.54% │
│ map2 trmc 5 │ 24.01ns │ 15.00w │ 100.00% │
│ map2 stdlib 5 │ 19.55ns │ 15.00w │ 81.42% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 6 │ 20.89ns │ 18.00w │ 74.09% │
│ map2 trmc 6 │ 28.19ns │ 18.00w │ 100.00% │
│ map2 stdlib 6 │ 23.11ns │ 18.00w │ 81.96% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 7 │ 23.94ns │ 21.00w │ 73.87% │
│ map2 trmc 7 │ 32.41ns │ 21.00w │ 100.00% │
│ map2 stdlib 7 │ 27.28ns │ 21.00w │ 84.17% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 8 │ 28.64ns │ 24.00w │ 78.83% │
│ map2 trmc 8 │ 36.34ns │ 24.00w │ 100.00% │
│ map2 stdlib 8 │ 31.61ns │ 24.00w │ 87.00% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 9 │ 31.31ns │ 27.00w │ 72.45% │
│ map2 trmc 9 │ 43.22ns │ 27.00w │ 100.00% │
│ map2 stdlib 9 │ 35.82ns │ 27.00w │ 82.88% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌───────────────────────┬──────────┬─────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ Percentage │
├───────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 10 │ 33.03ns │ 30.00w │ 74.05% │
│ map2 trmc 10 │ 44.60ns │ 30.00w │ 100.00% │
│ map2 stdlib 10 │ 36.78ns │ 30.00w │ 82.47% │
└───────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌────────────────────────┬──────────┬─────────┬──────────┬──────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├────────────────────────┼──────────┼─────────┼──────────┼──────────┼────────────┤
│ map2 trmc unrolled 100 │ 327.20ns │ 300.00w │ 0.67w │ 0.67w │ 74.92% │
│ map2 trmc 100 │ 436.73ns │ 300.00w │ 0.68w │ 0.68w │ 100.00% │
│ map2 stdlib 100 │ 388.96ns │ 300.00w │ 0.51w │ 0.51w │ 89.06% │
└────────────────────────┴──────────┴─────────┴──────────┴──────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌─────────────────────────┬──────────┬─────────┬──────────┬──────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├─────────────────────────┼──────────┼─────────┼──────────┼──────────┼────────────┤
│ map2 trmc unrolled 1000 │ 3.84us │ 3.00kw │ 69.90w │ 69.90w │ 76.97% │
│ map2 trmc 1000 │ 4.99us │ 3.00kw │ 69.90w │ 69.90w │ 100.00% │
│ map2 stdlib 1000 │ 4.52us │ 3.00kw │ 52.22w │ 52.22w │ 90.57% │
└─────────────────────────┴──────────┴─────────┴──────────┴──────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────────┬──────────┬─────────┬──────────┬──────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├──────────────────────────┼──────────┼─────────┼──────────┼──────────┼────────────┤
│ map2 trmc unrolled 10000 │ 96.55us │ 30.00kw │ 7.54kw │ 7.54kw │ 93.08% │
│ map2 trmc 10000 │ 103.73us │ 30.00kw │ 7.47kw │ 7.47kw │ 100.00% │
│ map2 stdlib 10000 │ 98.35us │ 30.00kw │ 5.57kw │ 5.57kw │ 94.81% │
└──────────────────────────┴──────────┴─────────┴──────────┴──────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌───────────────────────────┬──────────┬──────────┬──────────┬──────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├───────────────────────────┼──────────┼──────────┼──────────┼──────────┼────────────┤
│ map2 trmc unrolled 100000 │ 2.76ms │ 300.00kw │ 299.97kw │ 299.97kw │ 78.35% │
│ map2 trmc 100000 │ 2.89ms │ 300.00kw │ 300.08kw │ 300.08kw │ 82.08% │
│ map2 stdlib 100000 │ 3.52ms │ 300.00kw │ 300.09kw │ 300.09kw │ 100.00% │
└───────────────────────────┴──────────┴──────────┴──────────┴──────────┴────────────┘
I think that's okay: the 3-unrolling is not much more than 3x the size of the no-unrolling version, and the performance profile is on par with the map1 versions. I think your choice of using the same approach for map2 is justified by these numbers.
I haven't had time to check all of them :-) but it looks like List.map is now plenty fast, both for very short and very long lists. I agree with @gasche that List.map2 should perhaps not use the unrolling trick, because it makes the code bigger and the speed benefit is probably lower (or nil) compared with List.map. Generally speaking, the deal with TRMC was to support the efficient execution of functions written in "natural" recursive style -- no List.rev, no CPS, etc. Manual unrolling makes functions less "natural" and should be done sparingly.
I agree. I pushed a new commit reducing unrolling to its bare minimum (i.e. making two calls to …). Personally, I think this is a good middle ground where we beat, or stay essentially equal to, the non-TRMC versions, while at the same time gaining tail recursion. On that note, I consider this PR ready for a formal review and/or approval :) (Benchmark and test results at: nojb/list_trmc@b95ea8a)
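The "bare minimum" unrolling described above (two applications of f per recursive step) can be sketched as follows; this illustrates the shape being discussed, not necessarily the exact committed code:

```ocaml
(* Minimally unrolled TRMC map: two calls to f per recursive step.
   Enough to amortize the TRMC destination-passing overhead on short
   lists without the code-size blowup of deeper unrolling. *)
let[@tail_mod_cons] rec map f = function
  | [] -> []
  | [a1] ->
      let r1 = f a1 in
      [r1]
  | a1 :: a2 :: l ->
      let r1 = f a1 in
      let r2 = f a2 in
      r1 :: r2 :: map f l
```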
Looks good to me, thanks for the careful measurements.
I think we may want to roll back the unrolling (aha) if/when we start assuming that most user code uses an flambda-style optimizer, and replace the manual transformation by an unrolling annotation. This is for later.
Thanks for the approval @gasche. As it is the weekend, I will wait a few more days before merging, to give people more time to chime in, just in case.
No-one else has chimed in; planning to rebase and merge by end of day today. |
For future reference, beware that the backtrace_dynlink.ml test has a flambda reference file that needs to be updated too, as done in 8acd34e. Maybe it would be better to use an independent module rather than …
Right, thanks for the reminder! As for the second question, as far as I can see this comes from a use of …
Now that we have TRMC we can make `List.map` (and friends) tail-recursive without harming performance for short lists (the vast majority of uses). Note that owing to the heap-allocated stacks of 5.0, it is much harder to trigger a stack overflow even with the non-tail-recursive version; however, there seems to be a noticeable performance gain in the case of long lists (see below).

The initial version of this PR modifies only `List.map` for purposes of discussion. If there is consensus, we would also adapt other similar functions in the `List` module.

I did a micro-benchmark (https://github.com/nojb/list_trmc/blob/bc6e38cb7a23bb4653b3c36d236c1b821c3bc8a6/test.ml) with the current `trunk` (dd87ff5) on an AMD (Fedora) machine. (`.bc` is bytecode, `.exe` is native code, `n` is the length of the list being mapped; each measurement is averaged over 100 runs.)

(EDIT: I edited out a note about not observing the above slowdown under WSL/Windows, but it turns out that I was comparing against an already patched version of OCaml -- doh!)
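A minimal timing harness in the spirit of the linked micro-benchmark might look like this (a sketch: `time_map` and the repetition count are illustrative, not the actual test.ml, which uses a proper benchmarking setup):

```ocaml
(* Average CPU time of one List.map over a list of length n, across 100
   repetitions. Sys.opaque_identity prevents the compiler from optimizing
   the mapped result away. *)
let time_map n =
  let l = List.init n (fun i -> i) in
  let t0 = Sys.time () in
  for _ = 1 to 100 do
    ignore (Sys.opaque_identity (List.map succ l))
  done;
  (Sys.time () -. t0) /. 100.
```

In practice a dedicated tool such as core_bench (whose output tables appear above) gives more trustworthy numbers than such a hand-rolled loop.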
See also related discussion at #10388 #867 #2199 #7205 #8010.