Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: tail recursion modulo constructors #181

Closed
wants to merge 2 commits into from
Closed

Conversation

let-def
Copy link
Contributor

@let-def let-def commented May 6, 2015

This pull request introduces tail-recursion modulo constructor, which allows to write
a version List.map that is both tail-recursive and natural.

Using TRMC

Getting the benefits of TRMC is opt-in. Since it turns stack allocation into mutation and heap-allocation, it might have an impact on the program profile.

However, because of the nature of TRMC mutations, they almost never cross heaps and stay in the fast-path (I don't have recent benchmarks yet, but the performance cost, if any is negligible).

Attributes

A few syntaxes have been tried, the two most satisfying ones are:

  • annotating bindings
let rec map f = function
 | [] -> []
 | x :: xs ->
   let y = f x in
   y :: map f xs
 [@@trmc]
  • annotating patterns
let rec map[@trmc] f = function
 | [] -> []
 | x :: xs ->
   let y = f x in
   y :: map f xs

The first one applies only to global bindings (item attributes are not yet allowed on local-let).
The second one required updating the parser, as attributes were not allowed on the identifier of function binding, and proved quite tricky. It is the one implemented in current revision.

Finally, the -force-trmc flags enable automatic rewriting of calls to trmc style.

TRMC cases

TRMC applies to all tail-positions consisting of value constructors, possibly nested, applied to the result of a recursive call to a "trmc candidate", that is a recursive function being defined which have the [@trmc] attribute specified above.

Examples:

  • y :: map f xs from above,
  • y :: z :: map f xs would qualify also
  • in case of ambiguity, for instance when mapping both subtrees of a tree,
    e.g. Node (l,r) -> Node (tree_map f l, tree_map f r), one of the call will
    be rewritten in TRMC style but which one is not specified.

In the last case, use an explicit let-binding to enforce the order. TRMC rewritting won't cross let-bindings, so

let l = tree_map f l in
let r = tree_map f r in
Node (l, r)

will disable TRMC.

let l = tree_map f l in
Node (l, tree_map f r)

will force TRMC on the right subtree.

Also, you shouldn't rely on the evaluation order in the expression in tail-position, as trmc rewritting might change the ordering (trmc call evaluated last). But since afaik evaluation order of arguments is not strongly specified in OCaml, it should already be the case.

Warnings

Three new warnings are introduced:

  • 51, [@trmc] attribute on a function which doesn't satisfy trmc requirements
  • 52, [@trmc] attribute on a function which is never applied in trmc position
  • 53, a call has been detected to be in trmc-position, the function could benefit of [@trmc] attribute (useful to detect trmc opportunities in existing code base).

However, in the last case, making this a warning is not satisfying, as plenty of codes with -warn-error A will break.

Cons

There are a few cases where observable behavior will change or break.

When using delim-cc or hansei. Since the continuation captures the stack and not the heap, and trmc move values to the heap and out of the stack really early, one will be able to observe values getting mutated. TRMC shouldn't be used in such cases.

It is unclear how TRMC will behave with multicore GCs and alternative runtimes (like js_of_ocaml, although it is more likely to be fine). The mutations introduced will touch values which are not expected to be mutated.
As far as OCaml semantics and the classic OCaml runtime are concerned, this is fine since these values are not observable. But different runtimes might have different expectations.

let rec l = 1 :: l in map succ l might prove to be a bit memory hungry instead of failing quickly.

@gasche
Copy link
Member

gasche commented May 6, 2015

For the record, I have discussed this feature extensively with @def-lkb , and I think it is a nice thing to have opt-in.

I am tired of having to explain to beginners that the nice elegant way to write map is not the actual way you have to implement it in a reasonable library. Whenever possible, the elegant code should also be the best code, and TRMC brings us closer to this ideal. (I realize this sounds like a philosophical point, but I think it is a strong argument and it took me some years to realize it.)

On the other hand TRMC can be problematic (I do think Hansei is an interesting use of the language, and thanks Oleg for pointing out that destination-passing-style List.map breaks it), so it is important that it is only opt-in.

Finally, there are also some important use-cases where TRMC does not suffice to obtain good stack usage (it is not clear how to implement mapM in a strict language without blowing the stack on large lists). It is not the be-all end-all of "OS stack vs. heap for callstack allocation" debates.

@alainfrisch
Copy link
Contributor

This is a great proposal.

I'm not sure I buy @gasche's argument about beginners, though. If TRMC is not automatic, having to tell beginners to enable it explicitly is not natural and doesn't achieve the goal of least-surprise.

Also, it would be quite useful to benchmark various versions of List.map. Last time I checked, a manual encoding of TRMC for this case leads to significantly worse performances than the naive non tailrec version. It would be interesting to compare the TRMC version with the "defunctionalized CPS" version (i.e. using an accumulator and a List.rev), because that one could also be mechanized. (There are several ways to achieve good performance for common sizes and still have a version that works on long list (do a non-tailrec version for the first N elements; unroll the loop) -- but that's another story.)

@dbuenzli
Copy link
Contributor

dbuenzli commented May 6, 2015

Le mercredi, 6 mai 2015 à 17:13, gasche a écrit :

I am tired of having to explain to beginners that the nice elegant way to write map is not the actual way you have to implement it in a reasonable library. Whenever possible, the elegant code should also be the best code, and TRMC brings us closer to this ideal. (I realize this sounds like a philosophical point, but I think it is a strong argument and it took me some years to realize it.)

It's always funny how people pull out the beginners argument when they find it handy to supposedly make a point. You brought nothing for the beginners since you now have to explain to them both what tail-recursion and tail-recursion modulo constructors is and that you need to add an attribute to get the latter, I have seen more beginner friendly introductions…

(And as thinking about beginners when you design is usually a bad idea, always design for the working programmer. API and PL designers should really read more hci books)

Best,

Daniel

@lpw25
Copy link
Contributor

lpw25 commented May 6, 2015

I do like this idea.

However, I am concerned about the addition of more mutations of immutable things since these are more costly in multicore (if the mutated block is on the major heap then you must allocate the pointed-to value on the major heap as well).

I am also concerned about the interaction with delimited continuations (which -- spoiler alert -- may also be relevant to multicore). It is possible that this interaction can be fixed. Essentially copying a stack must also cause any TRMC values being built to be copied (I believe Gabriel discussed such a scheme on the caml-list). I think that this should probably be worked out properly before considering TRMC for inclusion.

@yallop
Copy link
Member

yallop commented May 6, 2015

It sounds like this only applies to self calls. Can it be generalised to support arbitrary tail calls?

@let-def
Copy link
Contributor Author

let-def commented May 6, 2015

@yallop this applies to all recursive functions defined in the let-bindings parents of the call-site.

let rec f =and g =
  let rec g' = … x :: f () in
  g' …
in

Here, that call to f from g' is rewritable. Note that the drawback is that this might generate an amount of code exponential to the number of nested lets.
This could maybe be worked around by doing trmc rewriting after lambda lifting, but that would be much more involved.

@bluddy
Copy link
Contributor

bluddy commented May 6, 2015

@lpw25 will multicore be purely a runtime or will it also require compiling programs differently? If the latter, then this optimization can be turned off. Perhaps ideally, both code paths can be included, and the runtime gets to choose which one it wants by setting a global, or via a preprocessing phase much like dynamic linking?

@dbuenzli
Copy link
Contributor

dbuenzli commented May 6, 2015

Le mercredi, 6 mai 2015 à 18:16, Yotam Barnoy a écrit :

@lpw25 (https://github.com/lpw25) will multicore be purely a runtime or will it also require compiling programs differently? If the latter, then this optimization can be turned off.

Well no, it's an all or nothing thing. Otherwise you may blow the stack depending on the runtime and you really don't want that.

Daniel

@bluddy
Copy link
Contributor

bluddy commented May 6, 2015

@dbuenzli Right, well, I think it's better to view this optimization not as a surefire way to fix stack overflows, but as a hint to the compiler. Even without stack overflow protection, this optimization can give you a performance boost, so it's still worthwhile.

Otherwise, this really does get messy for multicore, which is coming up... any day now?

@dbuenzli
Copy link
Contributor

dbuenzli commented May 6, 2015

@dbuenzli (https://github.com/dbuenzli) Right, well, I think it's better to view this optimization not as a surefire way to fix stack overflows, but as a hint to the compiler.
This proposal is not an optimization (reread Frédéric's very first sentence) and it seems it may even impact performance negatively in certain scenarios.

Even without stack overflow protection, this optimization can give you a performance boost, so it's still worthwhile.

Personally I'm first interested in writing robust and correct programs and part of this task is making sure that you do not blow the stack whenever your data becomes arbitrarily large.

@alainfrisch
Copy link
Contributor

Even without stack overflow protection, this optimization can give you a performance boost, so it's still worthwhile.

I think this is wrong. Most tricks to avoid blowing the stack (CPS, extra accumulator, mutation) tend to produce code much slower than the non-tailrec version.

@bluddy
Copy link
Contributor

bluddy commented May 6, 2015

My bad. I guess it's only a performance optimization as compared to reversing the reverse-mapped list, which in itself is done to avoid stack overflow.

@let-def
Copy link
Contributor Author

let-def commented May 6, 2015

I think this is wrong. Most tricks to avoid blowing the stack (CPS, extra accumulator, mutation) tend to produce code much slower than the non-tailrec version.

This doesn't match my observations, but my benchmarks were not made with latest version, so take what I am about to say with a pinch of salt.

Generally, code with and without trmc were performing similarly.
On a quick test with operf-micro, there was no difference (around 1%, generally in favor of trmc, but likely due to external factors), although operf-micro is probably not stressing much trmc cases.
On a variety of tests involving List.map:

  • for small lists and next to no work in the applied function, (List.map succ [1;2]), naive version was 10% faster,
  • as the list got larger, performance shifted progressively in favor of trmc version, with some pathological cases where trmc version was twice as fast as naive one (when list size got close to minor heap, my guesstimate: the only work was to build the list, in the naive case it was built twice, once on the stack, once in heap, or directly in heap for trmc version)
  • unrolling List.map some times (directly matching lists with less than 5 elements) had much more impact than trmc rewriting.

The unrolled version was faster than Core_list.map one, quite significantly on large lists (when Core switches to the array representation).

@alainfrisch
Copy link
Contributor

as the list got larger, performance shifted progressively in favor of trmc version, with some pathological cases where trmc version was twice as fast as naive one

I would be interested to see such cases, and how they would behave when the partially built list does not fit in the minor heap during the entire List.map (either because the list is very long, or just because the mapped function allocates).

@let-def
Copy link
Contributor Author

let-def commented May 12, 2015

Ok, I fixed a bug I introduced while rebasing and now the worst case is 30% slower with TRMC with operf-micro, and 1 to 3% faster in other cases. I haven't had time to investigate the cases where TRMC was slower.

@let-def
Copy link
Contributor Author

let-def commented May 12, 2015

After discussion with @lpw25, in order to make TRMC behaves well multicore gc and delim-cc, we decided to restrict TRMC rewriting to recursive calls occurring at the last position of the block being built.

That is for (::) : 'a * 'a list -> 'a list, the last position is the list value.
This applies to nested calls to, so that in x :: (y :: f xs), the call f xs could be rewritten, but not in x :: (f xs :: ys).

@alainfrisch
Copy link
Contributor

  1. Do we agree that the two versions below behave the same:
let rec map1[@trmc] f = function [] -> [] | hd :: tl -> f hd :: map1 f tl;;
let rec map2[@trmc] f = function [] -> [] | hd :: tl -> let r = f hd in r :: map2 f tl;;

?

(i.e. that the trmc attribute can change the evaluation order)

  1. In terms of user interface, I'm wondering whether it would be preferable to put the attribute on the call:
let rec map f = function [] -> [] | hd :: tl -> f hd :: (map f tl [@trmc])

or:

let rec map f = function [] -> [] | hd :: tl -> f hd :: (map[@trmc] f tl)

This would be more in line with the [@tailcall] attribute, and makes it more explicit where users expect a call to be turned into a tail one.

@alainfrisch
Copy link
Contributor

as the list got larger, performance shifted progressively in favor of trmc version, with some pathological cases where trmc version was twice as fast as naive one

I'm still interested to see concrete cases where the naive version trmc's map is faster than the non-trmc version, for the same evaluation order.

I ran the following benchmark:

let rec map_trmc [@trmc] f = function
  | [] -> []
  | hd :: tl -> let r = f hd in r :: map_trmc f tl

let rec map f = function
  | [] -> []
  | hd :: tl -> let r = f hd in r :: map f tl

let n = 10000000

let test len =
  let l = ref [] in
  for i = 1 to len do l := i :: !l done;
  let l = !l in

  Gc.compact ();
  let t0 = Unix.gettimeofday () in
  for j = 1 to n / len do ignore (map succ l) done;
  let x1 = Unix.gettimeofday () -. t0 in

  Gc.compact ();
  let t0 = Unix.gettimeofday () in
  for j = 1 to n / len do ignore (map_trmc succ l) done;
  let x2 = Unix.gettimeofday () -. t0 in

  Printf.printf "len = %08i:   ratio = % 6.02f%%\n%!" len (100. *. x2 /. x1)


let () =
  test 10;
  test 100;
  test 1000;
  test 10000;
  test 100000

and I get:

len = 00000010:   ratio =  104.93%
len = 00000100:   ratio =  104.29%
len = 00001000:   ratio =  99.68%
len = 00010000:   ratio =  116.31%
len = 00100000:   ratio =  144.07%

changing the implementation to:

let rec map_trmc [@trmc] f = function
  | [] -> []
  | hd :: tl -> f hd :: map_trmc f tl

let rec map f = function
  | [] -> []
  | hd :: tl -> f hd :: map f tl

give very different results:

len = 00000010:   ratio =  106.70%
len = 00000100:   ratio =  55.58%
len = 00001000:   ratio =  50.80%
len = 00010000:   ratio =  73.18%
len = 00100000:   ratio =  128.37%

but here the semantics is different (the trmc version always evaluate left-to-right).

Now, using:

let rec map f l = List.rev (List.rev_map f l)

I get:

len = 00000010:   ratio =  73.83%
len = 00000100:   ratio =  79.02%
len = 00001000:   ratio =  77.45%
len = 00010000:   ratio =  71.45%
len = 00100000:   ratio =  82.75%

@let-def
Copy link
Contributor Author

let-def commented Jun 17, 2015

(i.e. that the trmc attribute can change the evaluation order)

Yes, trmc force the trmc-call (right most) to be evaluated last.

let rec map f = function [] -> [] | hd :: tl -> f hd :: (map f tl [@TRMC])
or:
let rec map f = function [] -> [] | hd :: tl -> f hd :: (map[@TRMC] f tl)

Nice, I also find that better and more predictible. Unless people express other opinions, I think I will apply your suggestion.

@let-def
Copy link
Contributor Author

let-def commented Jun 17, 2015

About your benchmark: I likely made the mistake of having wrong evaluation order in my tests.
Your results seem to sum up what I observed.

In a second run, I unrolled the map function a few times and the difference was much larger than between TRMC and non-TRMC versions. And this time I took care of having the appropriate evaluation order. (Which makes me think that this rewriting might conflict with flambda unrolling)

@gasche
Copy link
Member

gasche commented Jun 17, 2015

In terms of user interface, I'm wondering whether it would be preferable to put the attribute on the call [...]. This would be more in line with the [@tailcall] attribute, and makes it more explicit where users expect a call to be turned into a tail one.

Is TRMC a global or local transform? My previous mental model was of TRMC as whole-function transform, and I was just writing a message explaining that I dislike your proposal for this reason. Thinking about it more, it looks like (TRMC indeed requires a whole-function change but) the possibilty for each recursive call to be TRMC-ed is independent from the rest of the recursive function body. If that is really the case, then maybe it should indeed be at the level of the call.

This would also have the advantage of easily allowing the user to specify which case should be trmc-ed if there are two recursive calls under the constructors, which can be helpful when the recursion depth is uneven. (But as noted by @def-lkb this can already done by using let explicitly.)

(I think in this case it would be both fairly reasonable and intuitive to allow both positions.)

@bobzhang
Copy link
Member

I like this idea, can we pass all attributes to function declarations downto lambda layer in a modular way instead of the ad-hoc passing down, now it seems if we want to add one optimization, we need change lambda interface, this is already the case with [@taillcall], this will quickly makes lambda layer bloat

@bobzhang bobzhang mentioned this pull request Aug 14, 2015
@let-def
Copy link
Contributor Author

let-def commented Nov 15, 2015

@def-lkb you should add tests.

@gasche
Copy link
Member

gasche commented Nov 19, 2015

We discussed it at the developer meeting today, and the consensus was that there were too many changes in for 4.03, so this could be postponed.

Also, the idea of only ever mutating the last field (as a way to create a possibility of a workaround to support multi-shot continuations) proved unpopular. I don't see it suggested in the PR (and I don't understand the full details of it: if the last field is not a self-call but some other is, then no TRMC is performed on this use-site?). On the other hand, I do think that supporting multi-shot continuations was a valid move (for HANSEI for example); we may need to think harder about (1) less ad-hoc designs for the multi-shot case or (2) whether a clear specification exists that encompasses both choices of behaviour, to at least be able to give a stable specification that allows changing our minds about this specific design detail.

@damiendoligez damiendoligez added this to the 4.04-or-later milestone Jan 25, 2016
@mshinwell mshinwell removed this from the 4.04 milestone Sep 7, 2016
@mshinwell
Copy link
Contributor

@let-def What is the plan for this pull request?

@let-def
Copy link
Contributor Author

let-def commented Dec 27, 2016

@mshinwell Unless there is a consensus that this is better implemented in flambda than lambda (potentially better optimizing but no longer benefiting bytecode interpreter), I will port to latest version.
Then I rerun tests and provide uptodate figures.

@trefis trefis mentioned this pull request Dec 7, 2018
@gasche
Copy link
Member

gasche commented Dec 28, 2018

@let-def in #2199 (comment) you announce that you rebased this PR on trunk. Would you update the branch so that the PR itself remains up-to-date?

@sbriais
Copy link
Contributor

sbriais commented Apr 25, 2019

Any news about this PR?

@nojb
Copy link
Contributor

nojb commented Apr 25, 2019

Any news about this PR?

On today's developer meeting it was decided that some version of this will be integrated into the compiler, in principle in time for 4.10.

@hongchangwu
Copy link

hongchangwu commented Apr 25, 2019

Have we considered calling it something like [@tailrec]? I know that could potentially cause some confusion with [@tailcall], but it's much more readable IMO.

@ksromanov
Copy link
Contributor

Have we considered calling it something like [@tailrec]? I know that could potentially cause some confusion with [@tailcall], but it's much more readable IMO.

  1. Is it possible to use [@tailcall] in constructor calls as well as in usual calls?
  2. If not, may be some derivative of [@tailcall], such as [@tailcall_cnstr] or smth like that will work? It should be close to [@tailcall], but slightly different + it should be obvious which attribute refers to which situation.

@let-def

  1. I rebased the branch against current trunk (May 8) - in my ocaml fork/trmc-2019; it is done absolutely automagical after rebase in March.

  2. Installation in opam-2.0.4/fresh opam database was also very smooth (ocaml/trmc-2019 in $HOME, no hand-written opam file necessary):

cd
opam switch create ocaml-trmc --empty
opam pin add ocaml-variants.4.10.0+trmc ./ocaml/
eval $(opam env)

Result:

        OCaml version 4.10.0+dev0-2019-04-23

# let rec map f = function
 | [] -> []
 | x :: xs ->
   let y = f x in
   y :: map f xs
 [@@trmc]          ;;
val map : ('a -> 'b) -> 'a list -> 'b list = <fun>
# map ((+) 1) [1;2;3;4];;
- : int list = [2; 3; 4; 5]

OPAM switch and installed packages:

#  switch      compiler             description
   default     ocaml-system.4.05.0  default
      *        ocaml-trmc                       ocaml-trmc
# Packages matching: installed
# Name         # Installed # Synopsis
base-bigarray  base
base-threads   base
base-unix      base
ocaml          4.10.0      The OCaml compiler (virtual package)
ocaml-config   1           OCaml Switch Configuration
ocaml-variants 4.10.0+trmc pinned to version 4.10.0+trmc at git+file:///home/kromanov1/ocaml#trmc-2019

@gasche
Copy link
Member

gasche commented May 9, 2019

He considered using something like [@tailcall {mod,modulo,under}_constr] during the developer meeting, with no strong consensus either way. My plan is to leave the choice to @let-def when he has the time to do it. In addition to rebasing the branch (thanks @ksromanov for your help), there is the question of merging this or not with the tailrec/tailcall support (and: warnings and error messages, etc.), doing a fresh round of benchmarks could be nice (but maybe not necessary), and this feature needs to be properly documented.

@gasche
Copy link
Member

gasche commented Jun 21, 2019

@let-def: gentle ping. I've been wishing for this feature again (I'm thinking of sending a List.concat_map to the standard library and trmc would definitely help avoid annoying design choices.) Let us know if there anything we can do to help -- for example, meeting to discuss the specification of the feature and how to document it.

@ksromanov
Copy link
Contributor

ksromanov commented Jun 21, 2019

@gasche Thank you very much for keeping track of this PR.

Various list traversal functions naturally appear in our work from time to time. Just recently I needed fold_left with two list elements (in Haskell it is a composition of drop and zip, but in ocaml stdlib, unfortunately, all *2 functions require list length to be equal).

And, of course, non-tail recursive List.map is always a source of surprises on real data.

P.S.
I can always rebase this PR, if necessary. Or test on OPAM.

@pmetzger
Copy link
Member

@let-def Ping?

@let-def
Copy link
Contributor Author

let-def commented Aug 12, 2019

@pmetzger I should have time to look at that next month.

@pmetzger
Copy link
Member

@let-def Thank you!

@ksromanov
Copy link
Contributor

@let-def I am rebasing your branch from time to time - everything works well so far (rebased version in my fork/trmc-2019). The OPAM switch recipe above also works (checked today).

@ksromanov
Copy link
Contributor

ksromanov commented Nov 30, 2019

I just rebased on current trunk again - https://github.com/ksromanov/ocaml/tree/trmc-2019-Nov

Unfortunately there were significant changes in drivers, so, may be I messed up with warning defaults.

Unfortunately, since several files are compiled with -w A, which implies "all warnings are on", I had to add [@@trmc] annotations to several Stdlib files. Also I had to disable warning 70 - Potential_trmc_call in Makefiles.

And the last - ksromanov@9d1101b fixes a small bug - warning 70 is issued even when function is applied partially (there is no call => no tail call optimization).


I also checked that trmc-enabled compiler can be installed as OPAM switch (with a small hack) and compile dune/lwt/react and some other packages.

I also checked that it can compile small test with map above. The resulting map does not produce stack overflow (I used List.map as a control).

@gasche
Copy link
Member

gasche commented Apr 28, 2021

This has been subsumed by #9760, closing.

(There are some features in this PR not implemented in #9760, around the "driver", in particular the ability to ask the compiler to let us know about potential use-cases for the opt-in attribute.)

@gasche gasche closed this Apr 28, 2021
chambart pushed a commit to chambart/ocaml-1 that referenced this pull request Oct 5, 2021
@gasche
Copy link
Member

gasche commented Nov 2, 2021

The follow-up PR #9760 was just merged, six years after this first iteration from @let-def. The implementation has changed a lot, but the overall design and in particular the code-production strategy is very close to the original proposal. Thanks!

EmileTrotignon pushed a commit to EmileTrotignon/ocaml that referenced this pull request Jan 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet