Make List.{map,mapi,map2} TRMC #11362
Conversation
One more related discussion at #10528.
Approved. This definitely needs a Changes entry.
@nojb you need to provide performance numbers for short lists too. In symbolic computation applications, most lists have length <= 2.
Here are the results of using …
Thanks, assuming no blockers arise, I was planning to add other functions to this PR:
There are a few other functions in List that are currently implemented with the usual idiom of tail recursion followed by a reverse at the end, resulting in two traversals:
I think so; I am not sure yet how far we wish to go with applying TRMC in the stdlib, but they certainly seem like candidates for it.
I had gotten my tests mixed up again! I amended the note above with the correct benchmark results: TRMC appears around 20-25% slower for these lengths.
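For reference, the non-unrolled TRMC version being benchmarked here presumably looks like the following sketch (the `[@tail_mod_cons]` annotation, available since OCaml 4.14, is the mechanism this PR relies on; the exact code under test may differ):

```ocaml
(* Natural recursive style; [@tail_mod_cons] lets the compiler turn the
   "cons, then recurse" shape into a constant-stack loop by passing the
   destination cell to the recursive call. *)
let[@tail_mod_cons] rec map f = function
  | [] -> []
  | x :: xs -> f x :: map f xs
```

With this definition, `map succ (List.init 1_000_000 Fun.id)` runs in constant stack space, unlike the classic non-tail-recursive `map`.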
If you need to win against the non-tail-recursive version on small inputs, a trick that works surprisingly well is manual unrolling:

```ocaml
let[@tail_mod_cons] rec map_unrolled f = function
  | [] -> []
  | [x0] ->
      let y0 = f x0 in
      [y0]
  | [x0; x1] ->
      let y0 = f x0 in
      let y1 = f x1 in
      [y0; y1]
  | [x0; x1; x2] ->
      let y0 = f x0 in
      let y1 = f x1 in
      let y2 = f x2 in
      [y0; y1; y2]
  | x0 :: x1 :: x2 :: xs ->
      let y0 = f x0 in
      let y1 = f x1 in
      let y2 = f x2 in
      y0 :: y1 :: y2 :: map_unrolled f xs
```
I tried this with 4.14 (nojb/list_trmc@55cf03d) and indeed the unrolled TRMC version is faster on short lengths.
There are interesting GC effects related to the write barrier that create a performance disadvantage for TRMC when: …

This may very well be what you observe in your benchmark. I would try writing the result of … For more details, see the discussion of "Promotion" in the second TRMC PR.
Thanks for pointing me in the direction of that interesting discussion, which I had missed! Indeed, by doing this, I find that the (unrolled) TRMC'ed List.map is always faster than the non-tail-recursive List.map (numbers below). Given this, are we happy with switching to this implementation (unrolled + TRMC)? If yes, are we happy with switching the other amenable functions in …
For me, using TMC is a good idea, but manual unrolling is less clear. (I wonder if flambda's unrolling control suffices -- for flambda users.) I would not suggest, as a general policy, systematically unrolling stdlib functions, especially since the code is going to get more complex for other functions than for … My proposal would be as follows:
Thanks @gasche. After spending some time benchmarking different functions from …, I updated the PR making …
I also updated the docstrings to stop describing these functions as non-tail-recursive. @gasche, do you mind taking another pass at this PR?
```ocaml
      let r1 = f a1 b1 in
      let r2 = f a2 b2 in
      let r3 = f a3 b3 in
      r1 :: r2 :: r3 :: map2 f l1 l2
  | (_, _) -> invalid_arg "List.map2"
```
I wonder what the impact of unrolling on code size is in this case of a partial pattern matching followed by a catch-all. Risks I see include (1) noticeably more pattern-matching-related code being produced (because the unrolled cases are not an exhaustive covering) and (2) performance overhead. In your place I would (1) ensure that you have a benchmark for this one compared to the non-unrolled variant, (2) eyeball the -dlambda or -dcmm output to ensure that there is no blowup, and (3) consider not unrolling this one.
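For comparison, a non-unrolled TRMC map2 would be a one-call-per-step version along these lines (a sketch, keeping the stdlib convention of raising Invalid_argument on lists of different lengths):

```ocaml
(* Non-unrolled TRMC map2: one application of f per recursive step.
   Smaller generated code than the unrolled variant, at some speed cost. *)
let[@tail_mod_cons] rec map2 f l1 l2 =
  match l1, l2 with
  | [], [] -> []
  | a1 :: l1, b1 :: l2 ->
      let r1 = f a1 b1 in
      r1 :: map2 f l1 l2
  | _, _ -> invalid_arg "List.map2"
```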
Good point. Performance numbers below (roughly: unrolled TRMC > stdlib > TRMC; bigger is faster). As for code size, there does indeed seem to be quite an increase. I wasn't sure how to quantify it, but the linear code is 107 lines for TRMC (no unrolling) vs 420 lines for TRMC (with unrolling).
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 1 │ 5.78ns │ 3.00w │ 67.52% │
│ map2 trmc 1 │ 8.56ns │ 3.00w │ 100.00% │
│ map2 stdlib 1 │ 7.09ns │ 3.00w │ 82.92% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 2 │ 8.73ns │ 6.00w │ 72.10% │
│ map2 trmc 2 │ 12.11ns │ 6.00w │ 100.00% │
│ map2 stdlib 2 │ 10.32ns │ 6.00w │ 85.21% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 3 │ 10.80ns │ 9.00w │ 64.55% │
│ map2 trmc 3 │ 16.73ns │ 9.00w │ 100.00% │
│ map2 stdlib 3 │ 13.61ns │ 9.00w │ 81.33% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 4 │ 15.35ns │ 12.00w │ 74.18% │
│ map2 trmc 4 │ 20.69ns │ 12.00w │ 100.00% │
│ map2 stdlib 4 │ 17.10ns │ 12.00w │ 82.67% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 5 │ 17.18ns │ 15.00w │ 71.54% │
│ map2 trmc 5 │ 24.01ns │ 15.00w │ 100.00% │
│ map2 stdlib 5 │ 19.55ns │ 15.00w │ 81.42% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 6 │ 20.89ns │ 18.00w │ 74.09% │
│ map2 trmc 6 │ 28.19ns │ 18.00w │ 100.00% │
│ map2 stdlib 6 │ 23.11ns │ 18.00w │ 81.96% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 7 │ 23.94ns │ 21.00w │ 73.87% │
│ map2 trmc 7 │ 32.41ns │ 21.00w │ 100.00% │
│ map2 stdlib 7 │ 27.28ns │ 21.00w │ 84.17% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 8 │ 28.64ns │ 24.00w │ 78.83% │
│ map2 trmc 8 │ 36.34ns │ 24.00w │ 100.00% │
│ map2 stdlib 8 │ 31.61ns │ 24.00w │ 87.00% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────┬──────────┬─────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ Percentage │
├──────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 9 │ 31.31ns │ 27.00w │ 72.45% │
│ map2 trmc 9 │ 43.22ns │ 27.00w │ 100.00% │
│ map2 stdlib 9 │ 35.82ns │ 27.00w │ 82.88% │
└──────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌───────────────────────┬──────────┬─────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ Percentage │
├───────────────────────┼──────────┼─────────┼────────────┤
│ map2 trmc unrolled 10 │ 33.03ns │ 30.00w │ 74.05% │
│ map2 trmc 10 │ 44.60ns │ 30.00w │ 100.00% │
│ map2 stdlib 10 │ 36.78ns │ 30.00w │ 82.47% │
└───────────────────────┴──────────┴─────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌────────────────────────┬──────────┬─────────┬──────────┬──────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├────────────────────────┼──────────┼─────────┼──────────┼──────────┼────────────┤
│ map2 trmc unrolled 100 │ 327.20ns │ 300.00w │ 0.67w │ 0.67w │ 74.92% │
│ map2 trmc 100 │ 436.73ns │ 300.00w │ 0.68w │ 0.68w │ 100.00% │
│ map2 stdlib 100 │ 388.96ns │ 300.00w │ 0.51w │ 0.51w │ 89.06% │
└────────────────────────┴──────────┴─────────┴──────────┴──────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌─────────────────────────┬──────────┬─────────┬──────────┬──────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├─────────────────────────┼──────────┼─────────┼──────────┼──────────┼────────────┤
│ map2 trmc unrolled 1000 │ 3.84us │ 3.00kw │ 69.90w │ 69.90w │ 76.97% │
│ map2 trmc 1000 │ 4.99us │ 3.00kw │ 69.90w │ 69.90w │ 100.00% │
│ map2 stdlib 1000 │ 4.52us │ 3.00kw │ 52.22w │ 52.22w │ 90.57% │
└─────────────────────────┴──────────┴─────────┴──────────┴──────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌──────────────────────────┬──────────┬─────────┬──────────┬──────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├──────────────────────────┼──────────┼─────────┼──────────┼──────────┼────────────┤
│ map2 trmc unrolled 10000 │ 96.55us │ 30.00kw │ 7.54kw │ 7.54kw │ 93.08% │
│ map2 trmc 10000 │ 103.73us │ 30.00kw │ 7.47kw │ 7.47kw │ 100.00% │
│ map2 stdlib 10000 │ 98.35us │ 30.00kw │ 5.57kw │ 5.57kw │ 94.81% │
└──────────────────────────┴──────────┴─────────┴──────────┴──────────┴────────────┘
Estimated testing time 30s (3 benchmarks x 10s). Change using '-quota'.
┌───────────────────────────┬──────────┬──────────┬──────────┬──────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├───────────────────────────┼──────────┼──────────┼──────────┼──────────┼────────────┤
│ map2 trmc unrolled 100000 │ 2.76ms │ 300.00kw │ 299.97kw │ 299.97kw │ 78.35% │
│ map2 trmc 100000 │ 2.89ms │ 300.00kw │ 300.08kw │ 300.08kw │ 82.08% │
│ map2 stdlib 100000 │ 3.52ms │ 300.00kw │ 300.09kw │ 300.09kw │ 100.00% │
└───────────────────────────┴──────────┴──────────┴──────────┴──────────┴────────────┘
I think that's okay: the 3-unrolling is not much more than 3x the size of the no-unrolling version, and the performance profile is on par with the map1 versions. I think your choice of using the same approach for map2 is justified by these numbers.
I haven't had time to check all of them :-) but it looks like List.map is now plenty fast, both for very short and very long lists. I agree with @gasche that List.map2 should perhaps not use the unrolling trick, because it makes the code bigger and the speed benefit is probably lower (or nil) compared with List.map. Generally speaking, the deal with TRMC was to support the efficient execution of functions written in "natural" recursive style -- no List.rev, no CPS, etc. Manual unrolling makes functions less "natural" and should be done sparingly.
I agree. I pushed a new commit reducing unrolling to its bare minimum (i.e. making two calls to …). Personally, I think this is a good middle ground where we beat, or stay essentially equal to, the non-TRMC versions, while at the same time gaining tail recursion. On that note, I consider this PR ready for a formal review and/or approval :) (Benchmark and test results at: nojb/list_trmc@b95ea8a)
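The "bare minimum" unrolling described above (two applications of f per recursive step) can be sketched as follows; this illustrates the shape being discussed, not necessarily the exact committed code:

```ocaml
(* Minimally unrolled TRMC map: two calls to f per recursive step.
   Enough to amortize the TRMC destination-passing overhead on short
   lists without the code-size blowup of deeper unrolling. *)
let[@tail_mod_cons] rec map f = function
  | [] -> []
  | [a1] ->
      let r1 = f a1 in
      [r1]
  | a1 :: a2 :: l ->
      let r1 = f a1 in
      let r2 = f a2 in
      r1 :: r2 :: map f l
```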
Looks good to me, thanks for the careful measurements.
I think we may want to roll back the unrolling (aha) if/when we start assuming that most user code uses an flambda-style optimizer, and replace the manual transformation by an unrolling annotation. This is for later.
Thanks for the approval @gasche. As it is the weekend, I will wait a few more days before merging, to give people more time to chime in, just in case.
No-one else has chimed in; planning to rebase and merge by end of day today. |
For future reference, beware that the backtrace_dynlink.ml test has a flambda reference file that needs to be updated too, as done in 8acd34e. Maybe it would be better to use an independent module rather than …
Right, thanks for the reminder! As for the second question, as far as I can see this comes from a use of …
Now that we have TRMC we can make `List.map` (and friends) tail-recursive without harming performance for short lists (the vast majority of uses). Note that owing to the heap-allocated stacks of 5.0, it is much harder to trigger a stack overflow even with the non-tail-recursive version; however, there seems to be a noticeable performance gain in the case of long lists (see below).

The initial version of this PR modifies only `List.map` for purposes of discussion. If there is consensus, we would also adapt other similar functions in the `List` module.

I did a micro-benchmark (https://github.com/nojb/list_trmc/blob/bc6e38cb7a23bb4653b3c36d236c1b821c3bc8a6/test.ml) with the current `trunk` (dd87ff5) on an AMD (Fedora) machine. (`.bc` is bytecode, `.exe` is native code, `n` is the length of the list being mapped; each measurement is averaged over 100 runs.)

(EDIT: I edited out a note about not observing the above slowdown under WSL/Windows, but it turns out that I was comparing against an already patched version of OCaml -- doh!)
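A minimal timing harness in the spirit of the linked micro-benchmark might look like this (a sketch: `time_map` and the repetition count are illustrative, not the actual test.ml, which uses a proper benchmarking setup):

```ocaml
(* Average CPU time of one List.map over a list of length n, across 100
   repetitions. Sys.opaque_identity prevents the compiler from optimizing
   the mapped result away. *)
let time_map n =
  let l = List.init n (fun i -> i) in
  let t0 = Sys.time () in
  for _ = 1 to 100 do
    ignore (Sys.opaque_identity (List.map succ l))
  done;
  (Sys.time () -. t0) /. 100.
```

In practice a dedicated tool such as core_bench (whose output tables appear above) gives more trustworthy numbers than such a hand-rolled loop.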
See also related discussion at #10388 #867 #2199 #7205 #8010.