Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make List.{map,mapi,map2} TRMC #11362

Merged
merged 6 commits into from Jul 5, 2022
Merged

Make List.{map,mapi,map2} TRMC #11362

merged 6 commits into from Jul 5, 2022

Conversation

nojb
Copy link
Contributor

@nojb nojb commented Jun 26, 2022

Now that we have TRMC we can make List.map (and friends) tail-recursive without harming performance for short lists (the vast majority of uses). Note that owing to the heap-allocated stacks of 5.0, it is much harder to trigger a Stack Overflow even with the non-tail recursive version; however there seems to be a noticeable performance gain in the case of long lists (see below).

The initial version of this PR modifies only List.map for purposes of discussion. If there is consensus, we would also adapt other similar functions in the List module.

I did a micro-benchmark (https://github.com/nojb/list_trmc/blob/bc6e38cb7a23bb4653b3c36d236c1b821c3bc8a6/test.ml) with the current trunk (dd87ff5) on an AMD (Fedora) machine:

$ LD_LIBRARY_PATH=~/ocaml/local/lib/ocaml/stublibs PATH=~/ocaml/local/bin:$PATH make
./test.bc
       n    List.map   trmc map
      10    0.00       0.00
     100    0.00       0.00
    1000    0.00       0.00
   10000    0.02       0.03
  100000    0.37       0.39
 1000000    24.57       4.78
./test.exe
       n    List.map   trmc map
      10    0.00       0.00
     100    0.00       0.00
    1000    0.00       0.00
   10000    0.01       0.00
  100000    0.16       0.14
 1000000    6.03       2.37

(.bc is bytecode, .exe is native-code, n is the length of list being mapped, each list is averaged over 100 times.)

(EDIT: I edited out a note about not observing the above slowdown under WSL/Windows, but it turns out that I was comparing against an already patched version of OCaml -- doh!)

See also related discussion at #10388 #867 #2199 #7205 #8010.

@ferminr
Copy link
Contributor

ferminr commented Jun 26, 2022

One more related discussion at #10528.

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. This definitely needs a Changes entry.

@gasche
Copy link
Member

gasche commented Jun 26, 2022

Also map.mli probably needs an update.

@xavierleroy
Copy link
Contributor

@nojb you need to provide performance numbers for short lists too. In symbolic computation applications, most lists have length <= 2.

@nojb
Copy link
Contributor Author

nojb commented Jun 26, 2022

@nojb you need to provide performance numbers for short lists too. In symbolic computation applications, most lists have length <= 2.

Here are the results of using core_bench to run an updated benchmark (https://github.com/nojb/list_trmc/blob/e1dd4a0523b7d78155b151781ac8bb80a4a0aa1a/test.ml) for list of length 1 <= n <= 10 (TRMC appears around ~20/25% slower for these lengths).

Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│ stdlib 1 │   4.99ns │   3.00w │     79.22% │
│   trmc 1 │   6.30ns │   3.00w │    100.00% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│ stdlib 2 │   7.50ns │   6.00w │     78.51% │
│   trmc 2 │   9.55ns │   6.00w │    100.00% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│ stdlib 3 │   9.99ns │   9.00w │     75.70% │
│   trmc 3 │  13.20ns │   9.00w │    100.00% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│ stdlib 4 │  12.50ns │  12.00w │     75.18% │
│   trmc 4 │  16.62ns │  12.00w │    100.00% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│ stdlib 5 │  15.00ns │  15.00w │     74.79% │
│   trmc 5 │  20.05ns │  15.00w │    100.00% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│ stdlib 6 │  17.52ns │  18.00w │     74.99% │
│   trmc 6 │  23.36ns │  18.00w │    100.00% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│ stdlib 7 │  20.00ns │  21.00w │     74.18% │
│   trmc 7 │  26.96ns │  21.00w │    100.00% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│ stdlib 8 │  22.52ns │  24.00w │     72.57% │
│   trmc 8 │  31.03ns │  24.00w │    100.00% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│ stdlib 9 │  25.01ns │  27.00w │     72.54% │
│   trmc 9 │  34.48ns │  27.00w │    100.00% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌───────────┬──────────┬─────────┬────────────┐
│ Name      │ Time/Run │ mWd/Run │ Percentage │
├───────────┼──────────┼─────────┼────────────┤
│ stdlib 10 │  27.53ns │  30.00w │     72.96% │
│   trmc 10 │  37.73ns │  30.00w │    100.00% │
└───────────┴──────────┴─────────┴────────────┘

@nojb
Copy link
Contributor Author

nojb commented Jun 26, 2022

Approved. This definitely needs a Changes entry.

Thanks, assuming no blockers arise, I was planning to add other functions to this PR:

  • List.mapi
  • List.map2
  • List.append and ( @ )
  • List.init (it currently has a more complex version switching to a tail-rec version based on a threshold)
  • List.remove_assoc
  • List.remove_assq
  • List.merge

@ferminr
Copy link
Contributor

ferminr commented Jun 26, 2022

There are a few other functions in List that are currently implemented as the usual idiom of tail recursion then reverse at the end, resulting in two traversals: (find_all, filteri, filter_map, concat_map) . Would these benefit from TRMC too? (they'd certainly be more readable, but I don't know about performance)

@nojb
Copy link
Contributor Author

nojb commented Jun 26, 2022

There are a few other functions in List that are currently implemented as the usual idiom of tail recursion then reverse at the end, resulting in two traversals: (find_all, filteri, filter_map, concat_map) . Would these benefit from TRMC too? (they'd certainly be more readable, but I don't know about performance)

I think so, am not sure yet how far we wish to go with the application of TRMC in the stdlib, but they certainly seem candidates for it.

@nojb
Copy link
Contributor Author

nojb commented Jun 26, 2022

@nojb you need to provide performance numbers for short lists too. In symbolic computation applications, most lists have length <= 2.

Here are the results of using core_bench to run an updated benchmark (https://github.com/nojb/list_trmc/blob/7b383dd1f7f3620205d3e7b0e94b745f9f185934/test.ml) for list of length 1 <= n <= 10 (TRMC appears to win consistently)

I had gotten my tests mixed up again! I amended the note above with the correct benchmark results: TRMC appears around ~20/25% slower for these lengths.

@gasche
Copy link
Member

gasche commented Jun 26, 2022

If you need to win against the non-tail-recursive version on small inputs, a trick that works surprisingly well is manual unrolling:

let[@tail_mod_cons] rec map_unrolled f = function
 | [] -> []
 | [x0] ->
   let y0 = f x0 in
   [y0]
 | [x0; x1] ->
   let y0 = f x0 in
   let y1 = f x1 in
   [y0; y1]
 | [x0; x1; x2] ->
   let y0 = f x0 in
   let y1 = f x1 in
   let y2 = f x2 in
   [y0; y1; y2]
 | x0 :: x1 :: x2 :: xs ->
   let y0 = f x0 in
   let y1 = f x1 in
   let y2 = f x2 in
   y0 :: y1 :: y2 :: map_unrolled f xs

@nojb
Copy link
Contributor Author

nojb commented Jun 28, 2022

If you need to win against the non-tail-recursive version on small inputs, a trick that works surprisingly well is manual unrolling:

I tried this with 4.14 (nojb/list_trmc@55cf03d) and indeed the unrolled TRMC version is faster on short lengths (1 <= n <= 10, n = 100, 1000, 10000) than stock List.map, but it becomes slower for lengths n = 10000 and n = 100000 (assuming I didn't both the benchmark). If you would like to reproduce it is enough to clone the repo above and do

opam install core_bench core_unix
make
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│ stdlib 1 │   4.65ns │   3.00w │    100.00% │
│   trmc 1 │   3.81ns │   3.00w │     81.85% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│ stdlib 2 │   7.32ns │   6.00w │    100.00% │
│   trmc 2 │   5.33ns │   6.00w │     72.81% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│ stdlib 3 │   9.85ns │   9.00w │    100.00% │
│   trmc 3 │   7.26ns │   9.00w │     73.77% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│ stdlib 4 │  13.19ns │  12.00w │    100.00% │
│   trmc 4 │  11.24ns │  12.00w │     85.22% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│ stdlib 5 │  16.43ns │  15.00w │    100.00% │
│   trmc 5 │  13.54ns │  15.00w │     82.39% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│ stdlib 6 │  19.59ns │  18.00w │    100.00% │
│   trmc 6 │  15.87ns │  18.00w │     81.03% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│ stdlib 7 │  22.95ns │  21.00w │    100.00% │
│   trmc 7 │  18.85ns │  21.00w │     82.12% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│ stdlib 8 │  27.04ns │  24.00w │    100.00% │
│   trmc 8 │  20.90ns │  24.00w │     77.27% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│ stdlib 9 │  30.63ns │  27.00w │    100.00% │
│   trmc 9 │  23.72ns │  27.00w │     77.42% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌───────────┬──────────┬─────────┬────────────┐
│ Name      │ Time/Run │ mWd/Run │ Percentage │
├───────────┼──────────┼─────────┼────────────┤
│ stdlib 10 │  32.79ns │  30.00w │    100.00% │
│   trmc 10 │  26.66ns │  30.00w │     81.29% │
└───────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌────────────┬──────────┬─────────┬──────────┬──────────┬────────────┐
│ Name       │ Time/Run │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├────────────┼──────────┼─────────┼──────────┼──────────┼────────────┤
│ stdlib 100 │ 348.80ns │ 300.00w │    0.17w │    0.17w │    100.00% │
│   trmc 100 │ 265.58ns │ 300.00w │    0.33w │    0.33w │     76.14% │
└────────────┴──────────┴─────────┴──────────┴──────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌─────────────┬──────────┬─────────┬──────────┬──────────┬────────────┐
│ Name        │ Time/Run │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├─────────────┼──────────┼─────────┼──────────┼──────────┼────────────┤
│ stdlib 1000 │   3.83us │  3.00kw │   17.25w │   17.25w │    100.00% │
│   trmc 1000 │   3.17us │  3.00kw │   34.51w │   34.51w │     82.72% │
└─────────────┴──────────┴─────────┴──────────┴──────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────────┬──────────┬─────────┬──────────┬──────────┬────────────┐
│ Name         │ Time/Run │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├──────────────┼──────────┼─────────┼──────────┼──────────┼────────────┤
│ stdlib 10000 │  61.81us │ 30.00kw │   1.82kw │   1.82kw │     98.65% │
│   trmc 10000 │  62.65us │ 30.00kw │   3.61kw │   3.61kw │    100.00% │
└──────────────┴──────────┴─────────┴──────────┴──────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌───────────────┬──────────┬──────────┬──────────┬──────────┬────────────┐
│ Name          │ Time/Run │  mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├───────────────┼──────────┼──────────┼──────────┼──────────┼────────────┤
│ stdlib 100000 │   2.37ms │ 300.00kw │ 173.69kw │ 173.69kw │     89.47% │
│   trmc 100000 │   2.65ms │ 300.00kw │ 300.08kw │ 300.08kw │    100.00% │
└───────────────┴──────────┴──────────┴──────────┴──────────┴────────────┘

@gasche
Copy link
Member

gasche commented Jun 28, 2022

There are interesting GC effects related to the write barrier, that create a performance disadvantage for TRMC when:

  • you ignore the result of List.map right away instead of keeping it alive until the next minor GC
  • the allocation work of the List.map size is comparable to the minor heap size

This may very well be what you observe in your benchmark. I would try writing the result of List.map to a global reference instead of ignore-ing it right away. (This corresponds to the case where the lifetime of the result is extended until at least the next minor collection, which is a reasonable assumption.)

For more details, see the discussion of "Promotion" in the second TRMC PR.

@nojb
Copy link
Contributor Author

nojb commented Jun 28, 2022

I would try writing the result of List.map to a global reference instead of ignore-ing it right away.

Thanks for pointing me in the direction of that interesting discussion which I had missed! Indeed, by doing this, I get that the (unrolled) version of TRMC'ed List.map is always faster than the non tail-recursive List.map (numbers below).

Given this, are we happy with switching to this implementation (unrolled + TRMC)? If yes, are we happy with switching the other amenable functions in List to this style as well?

Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│   trmc 1 │   5.96ns │   3.00w │     91.45% │
│ stdlib 1 │   6.52ns │   3.00w │    100.00% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│   trmc 2 │   8.23ns │   6.00w │     85.85% │
│ stdlib 2 │   9.59ns │   6.00w │    100.00% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│   trmc 3 │   9.63ns │   9.00w │     78.48% │
│ stdlib 3 │  12.27ns │   9.00w │    100.00% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│   trmc 4 │  13.87ns │  12.00w │     84.26% │
│ stdlib 4 │  16.46ns │  12.00w │    100.00% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│   trmc 5 │  15.65ns │  15.00w │     80.86% │
│ stdlib 5 │  19.35ns │  15.00w │    100.00% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│   trmc 6 │  17.93ns │  18.00w │     77.29% │
│ stdlib 6 │  23.19ns │  18.00w │    100.00% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│   trmc 7 │  21.84ns │  21.00w │     82.35% │
│ stdlib 7 │  26.52ns │  21.00w │    100.00% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│   trmc 8 │  24.45ns │  24.00w │     78.07% │
│ stdlib 8 │  31.33ns │  24.00w │    100.00% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────┬──────────┬─────────┬────────────┐
│ Name     │ Time/Run │ mWd/Run │ Percentage │
├──────────┼──────────┼─────────┼────────────┤
│   trmc 9 │  26.55ns │  27.00w │     79.31% │
│ stdlib 9 │  33.48ns │  27.00w │    100.00% │
└──────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌───────────┬──────────┬─────────┬────────────┐
│ Name      │ Time/Run │ mWd/Run │ Percentage │
├───────────┼──────────┼─────────┼────────────┤
│   trmc 10 │  29.70ns │  30.00w │     82.90% │
│ stdlib 10 │  35.82ns │  30.00w │    100.00% │
└───────────┴──────────┴─────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌────────────┬──────────┬─────────┬──────────┬──────────┬────────────┐
│ Name       │ Time/Run │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├────────────┼──────────┼─────────┼──────────┼──────────┼────────────┤
│   trmc 100 │ 290.50ns │ 300.00w │    0.67w │    0.67w │     76.80% │
│ stdlib 100 │ 378.24ns │ 300.00w │    0.51w │    0.51w │    100.00% │
└────────────┴──────────┴─────────┴──────────┴──────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌─────────────┬──────────┬─────────┬──────────┬──────────┬────────────┐
│ Name        │ Time/Run │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├─────────────┼──────────┼─────────┼──────────┼──────────┼────────────┤
│   trmc 1000 │   3.72us │  3.00kw │   69.91w │   69.91w │     81.33% │
│ stdlib 1000 │   4.58us │  3.00kw │   52.26w │   52.26w │    100.00% │
└─────────────┴──────────┴─────────┴──────────┴──────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌──────────────┬──────────┬─────────┬──────────┬──────────┬────────────┐
│ Name         │ Time/Run │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├──────────────┼──────────┼─────────┼──────────┼──────────┼────────────┤
│   trmc 10000 │  98.56us │ 30.00kw │   7.49kw │   7.49kw │     99.03% │
│ stdlib 10000 │  99.52us │ 30.00kw │   5.61kw │   5.61kw │    100.00% │
└──────────────┴──────────┴─────────┴──────────┴──────────┴────────────┘
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌───────────────┬──────────┬──────────┬──────────┬──────────┬────────────┐
│ Name          │ Time/Run │  mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├───────────────┼──────────┼──────────┼──────────┼──────────┼────────────┤
│   trmc 100000 │   3.00ms │ 300.00kw │ 300.09kw │ 300.09kw │     80.05% │
│ stdlib 100000 │   3.75ms │ 300.00kw │ 300.26kw │ 300.26kw │    100.00% │
└───────────────┴──────────┴──────────┴──────────┴──────────┴────────────┘

@gasche
Copy link
Member

gasche commented Jun 28, 2022

For me, using TMC is a good idea, but manual unrolling is less clear. (I wonder if flambda's unrolling control suffices -- for flambda users.) I would maybe not suggest as a general policy to systematically unroll stdlib functions, especially since the code is going to get more complex for other functions than for map.

My proposal would be as follows:

  • if the function is written to be tail-rec with an accumulator, only move it to TMC, no unrolling
  • if the function was kept non-tail-rec for performance, maybe consider unrolling to be performance-preserving

@nojb nojb changed the title Make List.map TRMC Make List.{map,mapi,map2} TRMC Jun 29, 2022
@nojb
Copy link
Contributor Author

nojb commented Jun 29, 2022

My proposal would be as follows:

Thanks @gasche. After spending some time benchmarking different functions from List, I decided to focus only on map (and friends) so as not to rush it (each function needs to be looked at separately).

I updated the PR making List.map, List.mapi and List.map2 TRMC, with an unroll depth of 3 to make sure to also improve the case of short lists (but note that I didn't experiment with different depths to see what was best). With this, the functions seem to be uniformly faster than stock versions (tested on 4.14 with lenghts 1 .. 10, 100, 1000, 10000, 100000). Full results at

I also updated the docstrings to stop indicating these functions as being non-tail-recursive.

@gasche do you mind giving another pass at this PR?
@xavierleroy are you convinced by these numbers?

let r1 = f a1 b1 in
let r2 = f a2 b2 in
let r3 = f a3 b3 in
r1::r2::r3::map2 f l1 l2
| (_, _) -> invalid_arg "List.map2"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder what is the impact of unrolling on code size in this case of partial pattern matching followed by a catch-all. Risks I see include (1) sensibly more pattern-matching-related code being produced (due to unrolled cases not being an exhaustive covering) and (2) performance overhead. In your stead I would (1) ensure that you have a benchmark for this one compared to the non-unrolled variant, (2) eyeball the -dlambda or -dcmm output to ensure that there is no blowup, (3) consider not unrolling this one.

Copy link
Contributor Author

@nojb nojb Jun 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Performance numbers below (roughly unrolled TRMC > stdlib > TRMC, bigger is faster). For code size indeed there seems to be quite an increase. I wasn't sure how to quantify it, but the linear code is 107 lines for TRMC (no unrolling) vs 420 lines for TRMC (with unrolling).

Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name                 │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 1 │   5.78ns │   3.00w │     67.52% │
│ map2 trmc          1 │   8.56ns │   3.00w │    100.00% │
│ map2 stdlib        1 │   7.09ns │   3.00w │     82.92% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name                 │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 2 │   8.73ns │   6.00w │     72.10% │
│ map2 trmc          2 │  12.11ns │   6.00w │    100.00% │
│ map2 stdlib        2 │  10.32ns │   6.00w │     85.21% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name                 │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 3 │  10.80ns │   9.00w │     64.55% │
│ map2 trmc          3 │  16.73ns │   9.00w │    100.00% │
│ map2 stdlib        3 │  13.61ns │   9.00w │     81.33% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name                 │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 4 │  15.35ns │  12.00w │     74.18% │
│ map2 trmc          4 │  20.69ns │  12.00w │    100.00% │
│ map2 stdlib        4 │  17.10ns │  12.00w │     82.67% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name                 │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 5 │  17.18ns │  15.00w │     71.54% │
│ map2 trmc          5 │  24.01ns │  15.00w │    100.00% │
│ map2 stdlib        5 │  19.55ns │  15.00w │     81.42% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name                 │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 6 │  20.89ns │  18.00w │     74.09% │
│ map2 trmc          6 │  28.19ns │  18.00w │    100.00% │
│ map2 stdlib        6 │  23.11ns │  18.00w │     81.96% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name                 │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 7 │  23.94ns │  21.00w │     73.87% │
│ map2 trmc          7 │  32.41ns │  21.00w │    100.00% │
│ map2 stdlib        7 │  27.28ns │  21.00w │     84.17% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name                 │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 8 │  28.64ns │  24.00w │     78.83% │
│ map2 trmc          8 │  36.34ns │  24.00w │    100.00% │
│ map2 stdlib        8 │  31.61ns │  24.00w │     87.00% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name                 │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 9 │  31.31ns │  27.00w │     72.45% │
│ map2 trmc          9 │  43.22ns │  27.00w │    100.00% │
│ map2 stdlib        9 │  35.82ns │  27.00w │     82.88% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌───────────────────────┬──────────┬─────────┬────────────┐
│ Name                  │ Time/Run │ mWd/Run │ Percentage │
├───────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 10 │  33.03ns │  30.00w │     74.05% │
│ map2 trmc          10 │  44.60ns │  30.00w │    100.00% │
│ map2 stdlib        10 │  36.78ns │  30.00w │     82.47% │
└───────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌────────────────────────┬──────────┬─────────┬──────────┬──────────┬────────────┐
│ Name                   │ Time/Run │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├────────────────────────┼──────────┼─────────┼──────────┼──────────┼────────────┤
│ map2 trmc unrolled 100 │ 327.20ns │ 300.00w │    0.67w │    0.67w │     74.92% │
│ map2 trmc          100 │ 436.73ns │ 300.00w │    0.68w │    0.68w │    100.00% │
│ map2 stdlib        100 │ 388.96ns │ 300.00w │    0.51w │    0.51w │     89.06% │
└────────────────────────┴──────────┴─────────┴──────────┴──────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌─────────────────────────┬──────────┬─────────┬──────────┬──────────┬────────────┐
│ Name                    │ Time/Run │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├─────────────────────────┼──────────┼─────────┼──────────┼──────────┼────────────┤
│ map2 trmc unrolled 1000 │   3.84us │  3.00kw │   69.90w │   69.90w │     76.97% │
│ map2 trmc          1000 │   4.99us │  3.00kw │   69.90w │   69.90w │    100.00% │
│ map2 stdlib        1000 │   4.52us │  3.00kw │   52.22w │   52.22w │     90.57% │
└─────────────────────────┴──────────┴─────────┴──────────┴──────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────────┬──────────┬─────────┬──────────┬──────────┬────────────┐
│ Name                     │ Time/Run │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├──────────────────────────┼──────────┼─────────┼──────────┼──────────┼────────────┤
│ map2 trmc unrolled 10000 │  96.55us │ 30.00kw │   7.54kw │   7.54kw │     93.08% │
│ map2 trmc          10000 │ 103.73us │ 30.00kw │   7.47kw │   7.47kw │    100.00% │
│ map2 stdlib        10000 │  98.35us │ 30.00kw │   5.57kw │   5.57kw │     94.81% │
└──────────────────────────┴──────────┴─────────┴──────────┴──────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌───────────────────────────┬──────────┬──────────┬──────────┬──────────┬────────────┐
│ Name                      │ Time/Run │  mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├───────────────────────────┼──────────┼──────────┼──────────┼──────────┼────────────┤
│ map2 trmc unrolled 100000 │   2.76ms │ 300.00kw │ 299.97kw │ 299.97kw │     78.35% │
│ map2 trmc          100000 │   2.89ms │ 300.00kw │ 300.08kw │ 300.08kw │     82.08% │
│ map2 stdlib        100000 │   3.52ms │ 300.00kw │ 300.09kw │ 300.09kw │    100.00% │
└───────────────────────────┴──────────┴──────────┴──────────┴──────────┴────────────┘

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's okay: the 3-unrolling is not much more than 3x the size of the no-unrolling version, and the performance profile is on par with the map1 versions. I think that your choice of using the same approach for map2 is justified by these numbers.

@xavierleroy
Copy link
Contributor

@xavierleroy are you convinced by these numbers?

I haven't had time to check all of them :-) but it looks like List.map is now plenty fast, both for very short and very long lists.

I agree with @gasche that List.map2 should perhaps not use the unrolling trick because it makes the code bigger and the speed benefit is probably lower (or null) compared with List.map.

Generally speaking, the deal with TRMC was to support the efficient execution of functions written in "natural" recursive style -- no List.rev, no CPS, etc. Manual unrolling makes functions less "natural" and should be done sparingly.

@nojb
Copy link
Contributor Author

nojb commented Jul 2, 2022

Generally speaking, the deal with TRMC was to support the efficient execution of functions written in "natural" recursive style -- no List.rev, no CPS, etc. Manual unrolling makes functions less "natural" and should be done sparingly.

I agree. I pushed a new commit reducing unrolling to its bare minimum (ie making two calls to f in each iteration instead of one). This is still enough to beat or remain within 1-2% of stock List.map, List.mapi and List.map2 in all cases. With this small amount of unrolling, the increase in code size is also much less (0.75x instead of 4x).

Personally I think this is a good middle ground where we get to beat or stay essentially equal with non-TRMC versions, while at the same time gaining tail-recursivity. On that note, I am considering that this PR is ready for a formal review and/or approval :)

(Benchmark and test results at: nojb/list_trmc@b95ea8a)

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks for the careful measurements.

I think we may want to roll back the unrolling (aha) if/when we start assuming that most user code uses an flambda-style optimizer, and replace the manual transformation by an unrolling annotation. This is for later.

@nojb
Copy link
Contributor Author

nojb commented Jul 2, 2022

Thanks for the approval @gasche. As it the weekend, I will wait a few more days before merging to give people more time to chime in, just in case.

@nojb
Copy link
Contributor Author

nojb commented Jul 5, 2022

No-one else has chimed in; planning to rebase and merge by end of day today.

@nojb nojb merged commit cefde00 into ocaml:trunk Jul 5, 2022
@nojb nojb deleted the list_trmc branch July 5, 2022 17:20
@Octachron
Copy link
Member

For future references, beware that the backtrace_dynlink.ml test has a flambda reference file that needs to be updated too as done in 8acd34e. Maybe it would be better to try to use an independent module rather than List in this test?

@nojb
Copy link
Contributor Author

nojb commented Jul 6, 2022

For future references, beware that the backtrace_dynlink.ml test has a flambda reference file that needs to be updated too as done in 8acd34e. Maybe it would be better to try to use an independent module rather than List in this test?

Right, thanks for the reminder! As far the second question, as I can see this comes from a use of List inside Dynlink, not sure we can do much about it...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants