
Discussing the design of Lazy under Multicore #750

Open · gasche opened this issue Nov 21, 2021 · 47 comments

Comments

@gasche (Contributor) commented Nov 21, 2021

This issue is collecting my understanding of the design of Lazy under Multicore.

Sequential Lazy

With Sequential OCaml, trying to force a value that is currently being forced raises Undefined. This can be caused by a thunk trying to force itself recursively, or two systhreads racing to force a value.

This is implemented by mutating the lazy block twice during forcing: once to replace the thunk closure with fun () -> raise Undefined before computation (the tag remains Lazy_tag), and once to set the result (the tag remains Lazy_tag in case of an exception, and becomes Forward_tag otherwise).

(Note: there is no allocation/yield between setting the thunk to fun () -> raise Undefined and invoking the thunk closure, so if two systhreads race to force a value, exactly one of them will get the real thunk closure.)
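
For illustration, the sequential behaviour can be modelled in plain OCaml with a ref standing in for the real tag mutation (a sketch, not the actual implementation):

(* Minimal model of sequential forcing: a ref stands in for the real
   Lazy_tag/Forward_tag block mutation. *)
exception Undefined

type 'a state = Thunk of (unit -> 'a) | Value of 'a
type 'a t = 'a state ref

let force (l : 'a t) : 'a =
  match !l with
  | Value v -> v
  | Thunk f ->
    (* first mutation: any re-entrant force now raises Undefined *)
    l := Thunk (fun () -> raise Undefined);
    (match f () with
     | v -> l := Value v; v
     | exception e -> l := Thunk (fun () -> raise e); raise e)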

Concurrency problems

With Multicore OCaml, synchronization was introduced to avoid two potential issues:

  • duplicated computations: if two domains race to force a not-yet-computed thunk, they could both invoke the thunk closure
  • memory unsafety: if a domain that mutates a thunk with a computed value races with a domain that forces the thunk, the forcing domain could try to invoke the returned value as if it were a not-yet-computed thunk, which is unsound

(This is in contrast with other mutable-state libraries like Stack where lack of synchronization can break programs but not break memory-safety.)

Note: it is clear that "memory unsafety" is a blocking issue, and that solving it requires some synchronization. However, whether "duplicated computations" is a blocking issue or not is a design choice: we could have decided that it is okay to have two domains racing. This being said, I don't think that relaxing this design constraint would lead to a substantially simpler implementation, as we need synchronization anyway to avoid memory unsafety.

Current approach (November 2021)

The solution used in Multicore OCaml is basically to treat Lazy_tag values as atomic references, with the state indicated in the tag:

  • Lazy_tag indicates (1) a thunk that is not yet forced, or (2) a thunk that was forced in the past and raised an exception, or whose forcing failed with Undefined. The block argument stores a function closure that contains the thunk computation, or is of the form fun () -> raise e in case (2).
  • Forcing_tag indicates a thunk that is in the process of being forced by a thread. The block argument stores a marker of the forcing domain, to distinguish "recursive forcing" from "concurrent forcing by another domain".
  • Forward_tag or any other tag is used for thunks that have been computed to a value (not an exception).

Forcing a value performs two compare-and-swap instructions on the tag, moving once from Lazy_tag to Forcing_tag, then from Forcing_tag to Forward_tag (if the thunk returns a value) or Lazy_tag (if the thunk raises an exception).

The Forcing_tag has a lock invariant: only one domain can succeed in setting it concurrently, and this ownership is required to (1) call the thunk closure and (2) mutate the block argument. (Setting the tag back to Lazy_tag or Forward_tag releases the lock.)

In particular, this lock invariant avoids both problems of duplicated computations and memory unsafety: two domains cannot race in calling the thunk closure, or one in calling the closure and another in writing a result.
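
For illustration, the state machine can be modelled with an Atomic.t standing in for the header tag (a sketch with hypothetical names; the real implementation manipulates the tag itself in C, and also uses a CAS for the release transition):

exception Undefined
exception RacyLazy

type 'a state =
  | Lazy of (unit -> 'a)  (* Lazy_tag: thunk, or raising closure *)
  | Forcing               (* Forcing_tag: "lock" held by one forcer *)
  | Forward of 'a         (* Forward_tag: computed value *)

type 'a t = 'a state Atomic.t

let force (l : 'a t) : 'a =
  match Atomic.get l with
  | Forward v -> v
  | Forcing -> raise RacyLazy
  | Lazy f as seen ->
    (* first CAS: Lazy -> Forcing takes the lock *)
    if not (Atomic.compare_and_set l seen Forcing) then raise RacyLazy;
    (match f () with
     | v -> Atomic.set l (Forward v); v  (* release: -> Forward *)
     | exception e ->                    (* release: -> Lazy *)
       Atomic.set l (Lazy (fun () -> raise e)); raise e)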

Note: the compare-and-swap parts had to be written in C to avoid atomic-breakage bugs caused by poll points. See #483 / #501.

Some problems with racy forcing

If someone tries to force a thunk that is already in the Forcing_tag state, forcing fails with an exception. The Multicore implementation tries to distinguish "access from the same domain", which fails with Undefined, from "access from another domain", which fails with RacyLazy.

Problem 1: fragility of the implementation

To make this distinction, the racy forcer looks at the block argument, which is a marker of the forcer's domain. This is however fairly fragile, as this lookup is racy (when doing the lookup, we don't hold the Forcing lock). This fragility caused bug #504, and the fix is sophisticated (using a "magic" value created by the GC that is supposed to remain private to the Lazy code in unspecified ways).

Problem 2: domains as an incomplete abstraction

Domains are not meant to be the final granularity of concurrency for user-level OCaml code, but instead to be populated by lightweight abstractions like systhreads, fibers, etc. For a user of those abstractions, "from this or another domain" is not the right distinction: it does not suffice to distinguish concurrent forcing from recursive forcing.

Problem 3: poor programmability

Fixing problem 2 could be left to users: if they are using systhreads or fibers or tasks, etc., they can catch the forcing exceptions and implement their own concurrency logic on top.

The problem is that users cannot really tell, on receiving a RacyLazy exception, whether the Lazy.force call they made was racy (in which case they may want to try again later), or whether they correctly forced a thunk that itself raised RacyLazy as part of its computation (in which case they will forever get this exception back).
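
For illustration (placeholder computation; this is the same shape as KC's example further down the thread):

(* Why RacyLazy is ambiguous for the caller. *)
let expensive () = 42  (* placeholder computation *)
let inner = lazy (expensive ())
let outer = lazy (Lazy.force inner + 1)

(* If [Lazy.force outer] raises RacyLazy, it can mean either:
   - [outer] itself was being forced concurrently, so retrying later
     may succeed; or
   - [inner] lost a race during [outer]'s computation and [outer] now
     re-raises RacyLazy forever, so retrying never succeeds. *)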

try_force

@gadmm identified "problem 3" above, and proposed (in #338) a try_force function that would return an option, with None unambiguously indicating that the thunk is being forced concurrently. All concurrent users of thunks should use try_force, and RacyLazy can be treated as a programming error that should never be caught.

This proposed implementation initially removed the distinction between Undefined and RacyLazy in force: both conditions (same-domain forcing and other-domain forcing) would raise Undefined, while try_force would return None on other-domain forcing and raise Undefined on same-domain forcing. This part was reverted, and the current code still raises RacyLazy.
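
For illustration, a concurrent user following this discipline could spin as follows (a sketch assuming the proposed try_force : 'a t -> 'a option; the back-off via Domain.cpu_relax is illustrative):

(* None unambiguously means "being forced by someone else right now",
   so back off and retry. *)
let rec force_retry (l : 'a Lazy.t) : 'a =
  match Lazy.try_force l with
  | Some v -> v
  | None -> Domain.cpu_relax (); force_retry l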

@gasche (Contributor, author) commented Nov 21, 2021

This comment collects some remarks/changes/suggestions I had while reviewing the Lazy implementation and bug history.

Repeated-computation bug in CamlinternalLazy.do_force_block?

CamlinternalLazy.do_force_block is called from the domain that "owns the Forcing lock" (given by update_to_forcing to force_gen_lazy_block, which calls do_force_block when taking the lock succeeded).

let do_force_block blk =
  let b = Obj.repr blk in
  let closure = (Obj.obj (Obj.field b 0) : unit -> 'arg) in
  (* We hold the Forcing lock: replace the closure with the domain's
     unique token, used to detect recursive forcing. *)
  Obj.set_field b 0 (domain_unique_token ());
  try
    let result = closure () in
    Obj.set_field b 0 (Obj.repr result);
    update_to_forward b;
    result
  with e ->
    begin match e with
    (* A RacyLazy raised while forcing *another* thunk inside
       [closure ()]: reinstate the original closure. *)
    | RacyLazy -> Obj.set_field b 0 (Obj.repr closure)
    | _ -> Obj.set_field b 0 (Obj.repr (fun () -> raise e))
    end;
    reset_to_lazy b;
    raise e

The | RacyLazy -> case is rather suspicious: we never enter do_force_block if the present thunk is being forced, so this RacyLazy does not come from the forcing of the current thunk; it comes from the forcing of some other thunk in the middle of the closure () call. In this case the closure is reverted to its pre-forcing state, instead of being replaced by an exception-raising closure.

I believe that this creates the duplicated-computation issue that the design is trying to avoid: concurrently forcing lazy e may compute parts of e several times (concurrently), as long as some of the executions fail at some point by forcing another thunk.

The rationale for this special treatment was given by KC in #338 (comment), but I suppose that it was not realized at the time that it broke an explicit design choice.

Suggestion: handling of failed thunks

When forcing a thunk fails with an exception e, the current implementation sets the thunk to (fun () -> raise e), as a way to ensure that subsequent forcing will directly raise the exception. Such "failed thunks" are however fairly inefficient, as subsequent forcing calls will always:

  • fetch the domain-unique-token to use during forcing
  • set the block argument to this token
  • invoke the closure that raises an exception
  • allocate a new closure that raises this exception
  • set the block argument to this closure

A possibly more direct logic would be to set the block arg to be the exception e directly, rather than a fresh closure (fun () -> raise e). Then a Lazy_tag block would have an argument that is a closure if the thunk is not yet computed, or an exception otherwise (note: some builtin exceptions have tag 0). If we take the Forcing lock and notice that we have an exception, we can reset_to_lazy right away and raise the exception, without any of those extra steps. (But still two atomic CASes; to avoid those, one would have to use yet another tag, which is not desirable.)
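
Modelled with variants (a sketch; the real representation would discriminate on the untagged block argument instead, hence the tag-0 caveat above):

(* A failed thunk stores the exception itself, not a raising closure. *)
type 'a contents =
  | Thunk of (unit -> 'a)  (* Lazy_tag, not yet computed *)
  | Failed of exn          (* Lazy_tag, previous forcing raised *)

let force_contents (c : 'a contents) : 'a =
  match c with
  | Failed e -> raise e    (* no token fetch, no closure call or alloc *)
  | Thunk f -> f ()        (* take the Forcing lock and run, as today *)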

Suggestion: Completely merging Undefined and RacyLazy

We could forget about the distinction between RacyLazy and Undefined, and behave in the same way whether the racy forcer comes from our domain or another (as explained in "Problem 2", this is not a very robust distinction):

  • force would raise Undefined on both, as in @gadmm's proposal
  • try_force would return None on both

This would substantially simplify the implementation by removing the complex logic to identify the forcing domain in a datarace-safe way.

The downside is that some code that is recursively forcing a thunk could start looping if the authors used try_force in this situation without noticing the issue.

@Gbury (Contributor) commented Nov 21, 2021

A possibly more direct logic would be to set the block tag to be the exception e directly,

Wouldn't that prevent distinguishing between a lazy value that raises an exception and a lazy value that returns an exception?

@gasche (Contributor, author) commented Nov 21, 2021

The representation I have in mind is a block of tag Lazy_tag with an exception in the first block argument. Forced thunks that successfully returned an exception value are either a block of tag Forward_tag with the exception in the first block argument, or directly the exception value.

Edit: I realize there was a typo in my previous message (block tag => block arg), now fixed, thanks!

@gasche (Contributor, author) commented Nov 22, 2021

Allowing repeated computations on force-force races?

What if we decided that we are okay with redundant computations if two domains race to force the same thunk? For example, we could use the following implementation scheme:

  • read (atomically) the block argument
  • read (atomically) the tag
  • if the tag was Lazy, then the block argument is a closure, compute its value
  • once the computation is done, compare-and-swap the Lazy tag to Forward_tag and then write the value
    (what happens if the Forward_tag points to an immediate and shortcutting happens?)
    if the Lazy tag is not there anymore, a value was computed in the meantime,
    return that value instead (atomic read of the block argument)

(Returning the value-in-the-meantime instead of the value we computed ourselves guarantees that forcing the same thunk always returns the same value, although effects may differ.)
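
A sketch of this scheme, with an Atomic.t standing in for the tag and block argument (hypothetical names; the shortcutting question above is glossed over):

(* Race-tolerant forcing: no Forcing state, duplication allowed.
   The first CAS winner's value is the one everybody returns. *)
type 'a state = Lazy of (unit -> 'a) | Forward of 'a
type 'a t = 'a state Atomic.t

let force (l : 'a t) : 'a =
  match Atomic.get l with
  | Forward v -> v
  | Lazy f as seen ->
    let v = f () in  (* may run concurrently in several domains *)
    if Atomic.compare_and_set l seen (Forward v) then v
    else
      (* lost the race: return the winner's value instead of ours *)
      (match Atomic.get l with
       | Forward v' -> v'
       | Lazy _ -> v (* unreachable: the state never reverts to Lazy *))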

With this scheme, there is no detection of self-loops anymore in the implementation (which would previously raise Undefined), but otherwise thread-local usage of lazy has exactly the same behavior as Sequential OCaml or Multicore OCaml today.

This approach would be more expressive, in that it would be compatible with non-thread-local usage (if we decided to allow users to do this): users would have to use a concurrency-control mechanism in their thunks, which could be aware of their desired concurrency abstraction to provide the semantics they want (the lazy thunk could block if entered concurrently, and/or include accurate self-loop-detection logic).

If the only form of concurrency-control that we want is "fail on concurrent forcing", then this proposal of having a lock in user code is probably less efficient than the current Multicore approach. (But then this proposal may be more efficient than the current Multicore approach for thread-local usage, as it only uses one atomic store instead of two.)

I think that this version of Lazy.force may not be the right choice by default (repeated computations and self-loops sound like debugging nightmares to be had), but it may be a good separate primitive to provide for people willing to use Lazy in concurrent scenarios (Lazy.unchecked_force?). I'm not sure that try_force currently succeeds in providing a flexible enough primitive for this (one cannot actually block on the in-progress thunk, just yield), whereas this unchecked_force is evidently flexible enough.

(cc @gadmm who I suspect has thought about this stuff)

@xavierleroy (Contributor):

When forcing a thunk fails with an exception e, the current implementation sets the thunk to (fun () -> raise e), as a way to ensure that subsequent forcing will directly raise the exception. Such "failed thunks" are however fairly inefficient, [...]

I would not worry too much about inefficiencies in this case. A lazy value that raises an exception is almost always a programming error. So, I'd rather keep the implementation as simple as possible.

@xavierleroy (Contributor):

We could forget about the distinction between RacyLazy and Undefined, and behave in the same way whether the racy forcer comes from our domain or another (as explained in "Problem 2", this is not a very robust distinction): force would raise Undefined on both

I can understand the rationale for the current design: Undefined corresponds to a self-referencing lazy value, which cannot be evaluated at all, while RacyLazy is just a temporary failure and a suggestion to try again. If you try again on a self-referencing lazy value, you get non-termination instead of a clean Undefined exception.

@gasche (Contributor, author) commented Nov 22, 2021

But Undefined is only an approximation of self-reference detection. It doesn't work for systhreads, or in fact any other domain-local concurrency mechanism: any cooperative-concurrency monad (or effect-handler) could lead to false positives if the computation of a thunk yields control to other fibers.

(For example: a logging resource is stored in a lazy thunk to be initialized only on demand; two fibers are trying to perform a logging call, the first gets control and forces the logging thunk, the logging initialization logic itself yields, and the second fiber gets control: Undefined.)

The current design is in an awkward situation where:

  • we don't want to make thunks "really thread-safe" by blocking the domain on concurrent forcing, to leave room for domain-local concurrency mechanisms
  • but the implementation is unsafe for domain-local concurrency mechanisms, even when using try_force

Unless I am missing something, there is currently no robust way to manage a lazy thunk as a shared resource (it must be synchronized/protected externally). The only safe way to use a thunk is with exclusive ownership, and RacyLazy should be understood as signalling a programming error, for what could be undefined behavior with a less-safe implementation. try_force is an improvement, but remains incompatible with domain-local concurrency.

@bluddy (Contributor) commented Nov 22, 2021

This seems like a very complex implementation for something that should never happen. Lazy shouldn't ever be used for synchronization. It seems like the only reason we're forced to use synchronization is because the abstraction is a relatively high-level one and we want to hide our magic from the user. Therefore, why can we not just loop when we encounter a half-forced lazy value? We could use a spinlock while making sure to also check for GC work once in a while.

@gadmm (Contributor) commented Nov 22, 2021

Thanks for the ping. I made some remarks about some of these issues recently at #713 and #520 and was waiting for feedback, it is probably good to centralize it with an issue. In order of your remarks:

  • Sequential Lazy: on the contrary, our conclusion was that racing to force a Lazy value concurrently with systhreads was memory-unsafe in stock OCaml.
  • My interest was in implementing double-checked locking with a mutex efficiently (for when the computation can take a long time, but the value is going to be accessed many times once computed). But my conclusion is now that the current interface is not sufficient to do it as efficiently as possible. Yet the synchronisation cost of double-checked locking is already paid for the sake of memory safety; apart from the question "how do I store a mutex until the thunk is forced", it does not seem more costly to provide synchronisation. Storing the mutex separately and freeing its memory by mutating a record field would likely negate the performance benefits of Lazy. (But maybe other people do not see there a good use of Lazy?)
  • One could have a default, most-generic implementation that uses a mutex, because it is a very useful pattern to have (see e.g. François Pottier's intuition that Seq.memoize was automatically going to be thread-safe). Such a default implementation does not preclude alternative "try_force"/"create" primitives to allow custom implementations that avoid using a mutex. This takes the point of view that domains are not an "incomplete abstraction" but a most generic one. I am still not sure about what is desirable as a default. (Note that an error-checking mutex would let you distinguish recursive forcing from forcing from two systhreads concurrently.)
  • One clearly wants to reliably react to Undefined to deal with concurrent forcing from two fibers. So try_force should not raise, but should it return a more precise enum? I do not see when you would want to tell the CPU to relax when forcing in parallel from two domains but do something else when forcing concurrently on one domain. Instead, there could be a specialized function that calls cpu_relax in one case and raises otherwise, and try_force could return None for both conditions for other use-cases. But a bigger issue is that it is not clear that the interface is expressive enough for efficient implementations of locking (see the second point); I think it would be better for interface design to start with some concrete examples (even in pseudo-code).
  • At the other extreme of short (and pure) computations, it seems that repeating the computation on separate domains can improve performance; see Harris, Marlow and Peyton Jones, “Haskell on a Shared-Memory Multiprocessor”. This is the default behaviour for Haskell, but it would not make sense to make it the default behaviour for OCaml because it would only be useful for specific programming styles. (As you note, the current interface does not allow this.)
  • My overall conclusion is that it is clear that you want a diversity of synchronisation mechanisms, not clear that the current interface is useful for this, and not clear that you won't prefer to implement some of the behaviours natively. For interface design, there is some exploration of the space to do, starting from examples. For upstreaming, it might be better to keep what one is sure about: for instance, remove try_force for now, have the specification of force be forward-compatible with some default synchronisation mechanism if you want to introduce one later on, and maybe introduce a variant that calls cpu_relax on inter-domain races.

On a personal note, I quickly found that this problem was much more involved than I thought (i.e. than the time I was willing to spend on it), and I am no longer actively looking at exploring better designs, though I volunteered to clean up the interface a bit before upstreaming.

@gadmm (Contributor) commented Nov 22, 2021

I wonder how much of these various behaviours could be factored into one by storing inside the lazy block not the thunk that is going to be forced, but a closure made of a forcing function (with suitable synchronisation or non-synchronisation) together with the thunk as an argument, and expecting the lazy block as a second argument (e.g. so that the lazy keyword in pattern-matching always does the right thing).

@lpw25 (Contributor) commented Nov 22, 2021

But Undefined is only an approximation of self-reference detection. It doesn't work for systhreads, or in fact any other domain-local concurrency mechanism: any cooperative-concurrency monad (or effect-handler) could lead to false positives if the computation of a thunk yields control to other fibers.

Not sure it helps with any of the other issues mentioned, but it seems like this could be solved by storing a user-provided token from an extensible variant in the lazy whilst it is being forced rather than storing the domain id. This token would also be returned to anyone who attempts to force the lazy concurrently.

This would allow concurrency libraries to store whatever notion of identity for a unit of concurrency that they want to and thus distinguish between racy and ill-founded forcing.
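
Sketched as code (hypothetical names, not a worked-out API):

(* Forcing identities as a single extensible variant. *)
type forcer = ..                     (* each concurrency library adds
                                        its own constructor *)
type forcer += Domain_forcer of int  (* e.g. a default, domain-based id *)

(* A fiber library would declare its own identity... *)
type forcer += Fiber_forcer of int

(* ...and, on a failed force, compare the token stored in the lazy
   with its own, to tell recursive forcing (same token) from racy
   forcing (different token). *)
let is_recursive ~(me : forcer) ~(seen : forcer) = (me = seen)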

@kayceesrk (Contributor) commented Nov 22, 2021

Repeated computation bug

Neither of the two available choices in the current design for handling the nested lazy computation bug is satisfying. I reproduce here the example from #338 (comment):

let l1 = lazy ...
let l2 = lazy (Lazy.force l1)
let domain1 = Domain.spawn (fun _ -> ignore (Lazy.force l1))
let domain2 = Domain.spawn (fun _ -> ignore (Lazy.force l2))

The problem is that there could be a race on forcing l1, which is lost by domain2.

Solution 1: Restore the computation, but suffer recomputation

As @gasche points out in #750 (comment), the RacyLazy case restores the original computation and raises the RacyLazy exception. Since the inner lazy raised RacyLazy, the outer lazy computation could not complete. We raise RacyLazy here to indicate that forcing the outer lazy again may complete the computation. But as pointed out, this duplicates the computation.

Solution 2: Update the lazy to raise RacyLazy, and suffer non-termination

I guess the way to look at this is that, since the RacyLazy was originally raised at the inner lazy value, the concurrency control should have been done there, and the exception should not have been allowed to bubble up to the outer lazy. However, naively updating the outer lazy with a closure that always raises RacyLazy, and raising RacyLazy, gives the client the wrong idea that forcing the outer lazy again may complete the outer computation, which will never be the case now. We've lost the original computation.

We can't make the outer lazy always raise Undefined as there were no recursive invocations of the lazy. Are there other solutions?

Merging Undefined & RacyLazy

Simplifies the interface. We will just document that if you are doing Lazy.force concurrently, expect to see Undefined. Recommend Lazy.try_force for concurrent use.

This option is memory-safe, avoids recomputation, and has all the rest of the nice properties that we want to have. I'd be happy to go with this option. The downside is that when users port their code to multicore they may see the Undefined exception, but that's the expectation. This fixes the repeated computation bug above in an expected way: there is a concurrent Lazy.force l1, so the code is not correct.

I am leaning towards this choice for MVP. We may explore richer semantics later.

@gadmm (Contributor) commented Nov 23, 2021

@gasche suggested having a primitive Lazy without locks and with repeated forcing on races, the idea being that one can then implement the desired synchronisation behaviour inside the thunk itself. There are more ideas (and motivations) in the Haskell paper mentioned above for making it memory-safe without CAS, along similar lines as his suggestion. One may need to pass the lazy block as an additional argument to the thunk, depending on how one deals with updating it, but I believe this lets one implement all the behaviours one might want. This also solves the question of where to store an atomic value containing your custom thread id, or a mutex, etc.

@kayceesrk (Contributor):

it seems like this could be solved by storing a user-provided token from an extensible variant in the lazy whilst it is being forced rather than storing the domain id. This token would also be returned to anyone who attempts to force the lazy concurrently.

@lpw25 This sounds like a good idea.

The only usability issue is that, if the value is made user-visible, then there is the possibility that Lazy.try_force may raise Undefined where it should have returned None, if the token is the result of a lazy computation. The crux of the problem is the same as #504. Assume that fiber 1 forced a lazy which returned v, and v happens to be the token with which fiber 2 forces the same lazy. Due to the order of the writes for updating the lazy with the final value, fiber 2 may see the Forcing tag and v as the id in the lazy, which is the same as the one used by fiber 2. There is a race here, and try_force should return None, but it will raise Undefined since it looks like recursive forcing.

That said, this is no worse than using the same "unique" token for two different fibers.

@gasche (Contributor, author) commented Nov 24, 2021

if the value is made user-visible [...]

This is an issue with the current choice of implementation, but there are race-free implementations of Leo's idea (obviously: we could extend Lazy blocks with another argument that would be a single-writer multi-reader lock, done; but we are trying to avoid this for memory-efficiency reasons). Staying close to the current implementation: we could store in the Forcing_tag argument a pair of a single unique/magic value (never leaked outside) and the user-provided forcing token. If we atomically read the lazy block argument and it's a pair with the magic value, it must come from the Forcing state. (This means allocating when entering the Forcing state, which I believe is fine.)

I thought along lines similar to Leo's proposal ("forcing witnesses"), but I didn't think of extensible variants, which are kind of his thing. In my mind a proper API for this would have ('v, 'tok) Lazy.t, which would be a big API change. Fixing 'tok to a single extensible type is one way to avoid this, but another option could be to define a two-parameter type (Lazy.t2?), and provide a single instance with t (with the extensible variant or just Domain.id as a fixed payload).

Note: if you propose this sort of nicety, then I think we could consider dropping Undefined completely and letting users compare forcing tokens to decide if they are self-looping -- instead of insisting on doing an equality check on the witness on the Lazy side, as I think KC has in mind in his example.

type ('a, 'tok) t2
val from_fun : (unit -> 'a) -> ('a, 'tok) t2
val try_force : ('a, 'tok) t2 -> ('a, 'tok) result

type 'a t = ('a, Domain.id) t2

(One advantage of the extensible-variant as opposed to this parametric API is that we can also return the extensible-variant token as an exception argument for force : 'a t -> 'a, whereas we cannot have a parametrized type as exception argument.)
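
For instance (hypothetical declarations):

(* With a single extensible token type, the racy-force exception can
   carry the token of the current forcer: *)
type forcer = ..
exception RacyLazy of forcer

(* The parametric version cannot do this: an exception declaration
   cannot mention a type parameter such as 'tok. *)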

@gasche (Contributor, author) commented Nov 24, 2021

I wonder how much of these various behaviours could be factored into one by storing inside the lazy block not the thunk that is going to be forced, but a closure made of a forcing function (with suitable synchronisation or non-synchronisation) together with the thunk as an argument, and expecting the lazy block as a second argument (e.g. so that the lazy keyword in pattern-matching always does the right thing).

If I understand correctly @gadmm's suggestion, it corresponds to a sort of object-oriented approach where thunks decide, at creation time, how they are going to be forced. Consumers then need no extra information on the "forcing" side, which has the advantage of giving a more flexible semantics to lazy patterns (which otherwise default to Lazy.force, which may or may not be the correct approach).

type 'a t = 'a lazy_state ref
and 'a lazy_state =
  | Lazy of ('a t -> 'a)
  | Forcing of ('a t -> 'a)
  | Forward of 'a
  | Exn of exn

let force (thunk : 'a t) =
  match !thunk with
  | Lazy force -> force thunk
  | Forcing race -> race thunk
  | Forward v -> v
  | Exn exn -> raise exn

let from_fun (f : unit -> 'a) : 'a t =
  let closure (self : 'a t) =
    self := Forcing (fun _ -> raise Undefined);
    match f () with
    | exception exn -> self := Exn exn; raise exn
    | v -> self := Forward v; v
  in
  ref (Lazy closure)

I find this suggestion interesting, but also substantially more complex than the "relevant forcing" proposal (we have a lot more flexibility but it's not clear what to do with it, and what the performance cost is), so I would be tempted to explore "relevant forcing" first.

@gasche (Contributor, author) commented Nov 24, 2021

Note on "relevant forcing": in addition to a magic value and the forcing token, the Forcing state could also store the original closure (and return it to try_force along the token). This allows a purely user-side implementation of the duplicated-computation approach (what I called unchecked_force).

@gasche (Contributor, author) commented Nov 24, 2021

I read @gadmm's reference, "Haskell on a Shared-Memory Multiprocessor", Tim Harris, Simon Marlow, Simon Peyton Jones, 2005. It's interesting, and the amount of sophistication is impressive (duplicate computations are detected "lazily" when a thread scans its call stack, and in that case the computation is aborted to block on the first forcer), but most aspects of it are very different from the current OCaml design -- and thankfully we don't have the same performance constraints.

Some remarks:

  • They remark that CAS-ing twice on each force, as Multicore does as well, has an important cost in practice: +50% mean slowdown (and +200% on some benchmarks) compared to Sequential GHC. I would expect Lazy-heavy programs in OCaml to suffer similar slowdowns, but thankfully Lazy-heavy programs are rare, with forcing nowhere near as pervasive as in Haskell code. Their lock-free solution is super-complex and not something we could accept in OCaml (it necessarily opens the door to duplicate computation), so I believe our CASes are here to stay.
  • They use several block arguments per thunk. Maybe we shouldn't worry about adding more arguments to a Lazy block.
  • When they move a thunk to the Forcing state (they call this "blackholing" the thunk), they basically write a function pointer that, on invocation, will block the caller to wait on the result. So instead of a data "token" on force, they push "code" that is invoked on racy forces.

@gasche (Contributor, author) commented Nov 24, 2021

We could simplify the logic a lot by having several block arguments per lazy thunk. For example

  1. the first argument stores the Lazy closure
  2. the second argument stores the Forcing payload
  3. the third argument stores the Forward value

With this design we don't have the problem of being unable to interpret the block argument due to races. We can still observe a Forcing payload or a Lazy closure that changed since we started forcing, but this looks semantically correct. (It corresponds to the expected semantics for a different concurrent scheduling.)

We could even remove that last confusion by storing Forward exceptions as a fourth block argument, so that we never go back from Forcing to Lazy.
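
As a record, the layout would look like this (a model; the real thing would be untagged block fields next to the header tag):

(* Each state owns its own field, so a racy reader never
   reinterprets a field meant for another state. *)
type tag = Lazy | Forcing | Forward | Failed

type 'a t = {
  state : tag Atomic.t;
  mutable closure : (unit -> 'a) option;  (* field 1: Lazy *)
  mutable forcer : Domain.id option;      (* field 2: Forcing *)
  mutable value : 'a option;              (* field 3: Forward *)
  mutable error : exn option;             (* field 4: failed; never
                                             goes back to Lazy *)
}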

@xavierleroy (Contributor):

Watch for memory leaks. The closure must become unreachable from the lazy value as soon as we start evaluating it, otherwise leaks occur. In other words, blackholing is not just to detect self-referential lazy values, but to ensure timely garbage collection of the closure.

@bluddy (Contributor) commented Nov 24, 2021

Are there any methods that would allow us to ditch the atomic after we force the value? After all, we only want to limit access to the unforced value -- once it's forced, we don't care about synchronization.

So we could have an additional, first boolean word that is always initialized to false, indicating whether the value should be accessed atomically. The first thread to force the value will first indicate Forcing atomically in the second word, and then write the full value of the second word after it is done. Once forcing is complete, the first boolean is set to true by the same thread. Any thread accessing the value will first check, using the first boolean, whether it needs to read the second word atomically. Even if it misreads the boolean before it's been fully written and therefore reads the second word atomically, it's not a big deal.

Is this kind of thing doable?

@gasche (Contributor, author) commented Nov 24, 2021

@bluddy if forcing a value completes successfully, the tag becomes Forward_tag or some other tag (not Lazy or Forcing), and no atomic reads are performed by forcing that value. I think that this implements something similar to your suggestion.

@xavierleroy Good point! Note that the current Multicore implementation keeps the closure reachable, because it is reinstated if it fails with RacyLazy. Another argument against this reinstatement.
(I realize now that the Sequential implementation does not keep the closure live after calling it, but a comment to indicate that this is an important property might have helped the Multicore people :-)

@kayceesrk (Contributor):

Multicore implementation keeps the closure reachable, because it is reinstated if it fails with RacyLazy

Agreed. Restoring the lazy computation is a mistake for several reasons.

@gadmm (Contributor) commented Nov 24, 2021

I think there was a misunderstanding of my suggestion and/or this Haskell paper; I will try to clarify.

The lock-less implementation of Lazy is actually much simpler than the current implementation of Lazy in multicore (if there is any doubt, I have included some pseudo-code in my examples below). The main idea is to write the thunk and the result in separate fields to avoid a race; it requires no CAS and only needs two states instead of three (no need for a forcing tag), which does not count as "super-complex" to me. Duplicate evaluation is a feature: this is what lets you decouple the synchronisation mechanism. (The Haskell paper only becomes complicated when it talks about stopping the duplicated evaluation before it ends, where the GC must be involved. Obviously you do not want or need this in OCaml, especially if what is interrupted is some synchronisation code.)

I claim the following: the default implementation (as well as its variant with a "custom" thread id, and other ideas I see floating around) can be factored through this primitive lock-less lazy. For performance, I only see two differences: 1) I only need (I think) one CAS instead of two (as an additional benefit of storing the value and the thunk separately); 2) there can be extra allocations; in particular the atomic containing the thread id is allocated separately (but if this becomes important then you could let the creation function request extra fields for its own use when allocating the lazy, for instance). In addition, it supports any custom synchronisation mechanism you want.

The following pseudo-code fleshes out my suggestion better (I omitted exception-handling for simplicity, though); I wrote down six different realistic, memory-safe examples (all of which correctly free any thunk or atomic as early as possible).

Examples
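
For readers who do not follow the link, a rough sketch of the two-field lock-less idea described above (hypothetical names; synchronisation details glossed over, the linked file is authoritative):

(* Thunk and result live in separate fields, so no reader ever
   reinterprets one as the other; duplicate evaluation is accepted. *)
type 'a t = {
  mutable thunk : (unit -> 'a) option;  (* cleared once forced *)
  result : 'a option Atomic.t;          (* Some v once computed *)
}

let make f = { thunk = Some f; result = Atomic.make None }

let rec force (l : 'a t) : 'a =
  match Atomic.get l.result with
  | Some v -> v
  | None ->
    match l.thunk with
    | Some f ->
      let v = f () in  (* may be duplicated on races *)
      ignore (Atomic.compare_and_set l.result None (Some v));
      l.thunk <- None;  (* free the closure early *)
      (match Atomic.get l.result with Some v' -> v' | None -> v)
    | None ->
      (* thunk already cleared by a racing forcer: the result is
         (about to be) published; retry *)
      force l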

@gadmm (Contributor) commented Nov 24, 2021

They remark that CAS-ing twice on each force, as Multicore does as well, has an important cost in practice: +50% mean slowdown (and +200% on some benchmarks) compared to Sequential GHC. I would expect Lazy-heavy programs in OCaml to suffer similar slowdowns, but thankfully Lazy-heavy programs are rare, with forcing nowhere near as pervasive as in Haskell code. Their lock-free solution is super-complex and not something we could accept in OCaml (it necessarily opens the door to duplicate computation), so I believe our CASes are here to stay.

I added a sixth example (the simplest of all), a lock-less one that never fails on races but duplicates the computation if needed. The motivating use-case is somebody wanting to implement some of Okasaki's “Purely functional data structures” (1996) which use lazy evaluation (where the thunk is pure) to achieve some amortized complexity. Such persistent data structures are a good candidate for wanting to share them across threads, but the cost of synchronizing every forcing of thunks would probably be prohibitive. (Bonus: a library of such lock-less, persistent data structures would be universal for multicore OCaml, not dependent on whichever threading library one uses.)

@kayceesrk (Contributor):

They remark that CAS-ing twice on each force, as Multicore does as well, has an important cost in practice: +50% mean slowdown (and +200% on some benchmarks) compared to Sequential GHC.

Anecdotally, CASes have gotten much cheaper over the years, especially uncontended ones. The numbers reported may not stand the test of time. FWIW we use a CAS on each object to promote them in the parallel minor GC. @ctk21 or @stedolan may be able to confirm whether this is true.

There are quite a few ideas flying around, and it has become hard for me to keep track of the various options since lazy is quite subtle due to its interaction with concurrent mutator and GC threads. It would be nice to summarise the various options in a design space document with the tradeoffs.

I am also a little bit worried about redesigning the lazy implementation from scratch at this stage, though my worry may stem from my lack of understanding of the various proposals. Having implemented Lazy in various broken ways previously, I would err on the side of forwards-compatible implementation with simple semantics.

Let's discuss this today in WG4 meeting.

@bluddy (Contributor) commented Nov 25, 2021

I have to admit that I didn't even know about the existence of the Undefined exception for recursive forces. This is an extreme edge case though, and I have to ask -- do we even need it (for multicore)? Why not just block when encountering the Forcing tag? Recursive forcing would cause non-termination, and racy forcing would just cause one thread to wait for another, unless the user caused a deadlock, in which case all bets are off, and that's fine. I don't see a reason to burden the user with handling Undefined and RacyLazy. If the user wants full usage of multicore laziness, he'd better add some synchronization. Otherwise, he'll get the bare minimum, which is blocking to prevent unsoundness, and he needs to be careful.

I do also think we need to try and make sure the CAS is only required when forcing the lazy value, and that afterwards it should be available without a CAS if such a design is at all possible.

@gadmm (Contributor) commented Nov 25, 2021

Regarding the performance of CAS, digging deeper into the implementation made me realise that we mean two different things by "CAS". The current implementation does two CAS loops, each of which requires a C function call ("noalloc"). This is because the update of the tag races with updates to the GC flags. Ideally you should only need one plain CAS and no external function call. Incidentally, I realised my examples currently need this CAS loop when updating the tag, for the same reason. These external function calls + CAS loops can actually be removed, but the trade-offs must be benchmarked. (I have updated my example file to explain this.) So there might be performance gains, but for different reasons than described in the Haskell paper.

I am also a little bit worried about redesigning the lazy implementation from scratch at this stage, though my worry may stem from my lack of understanding of the various proposals. Having implemented Lazy in various broken ways previously, I would err on the side of forwards-compatible implementation with simple semantics.

This sounds very reasonable. My example suggests that it is beneficial to avoid having to expand the list of reserved tags with a new "forcing" tag, which would be the only thing that encourages looking into this option now rather than later.

I have added "minimal demos" at the end of my example file which would be the simplest way to test the lock-less approach. They describe the default behaviour without the decoupling. If this works then I think the rest works too, and it would give a comparison point in terms of performance.

I agree that it would help to have a design document (I keep my example file up-to-date regarding the lock-less proposition, so you can see it as a "design document" for now). The main question I see is which design better enables the user to implement their own synchronisation. It would help to have pseudo-code examples of how custom synchronisation is expected to be implemented by proponents of the try_force approach. (As far as I can see, this approach only supports spin-waiting, and it is not yet clear how to cleanly support custom threading libraries.)

@gasche (Contributor, author) commented Nov 25, 2021

OCaml maintainers had a discussion about this with the Multicore people today. The consensus that we reached (which was also my proposal) was to go for the minimal implementation for now, to (1) keep things as simple as possible, and (2) leave the door open to better proposals than what we currently have.

Minimal in the following sense:

  • Lazy.t is specified as not being thread-safe: concurrent usage must add its own synchronization
  • Undefined and RacyLazy are unified again
  • no forcer-identifying information in the Forcing payload (simplifies the implementation)
  • no try_force, I think

We can improve on all of these later by extending the minimal proposal, possibly along the lines of Leo's idea (user-provided token on Forcing) or Guillaume's idea.

@gadmm: would you be interested in elaborating a version of your proposal as a fork of the 5.00 branch? (This is something we could work on together if you like.) This would allow us to run benchmarks to evaluate performance differences but, more importantly (I think efficiency is not the main concern, at least for me), to actually demonstrate feasibility and expressivity.

@dbuenzli (Contributor):

  • Lazy.t is specified as not being thread-safe: concurrent usage must add its own synchronization

Is Mutex.t the only tool here? A usage pattern of Lazy is for libraries to trigger initialisation bits on first use. The slight problem is that unlike Atomic.t, Mutex.t doesn't live in the Stdlib.

@gasche (Contributor, author) commented Nov 25, 2021

Mutex is in the Multicore stdlib. We could consider moving it into the stdlib in 4.14 for compatibility, but this might be a bit tricky (do we want to provide different implementations depending on whether systhreads are used?). Would you open a tracking issue?

@kayceesrk (Contributor) commented Nov 26, 2021

no try_force, I think

That is correct.

There are a few things we should keep in mind while making Lazy safe for concurrency in the future.

Concurrency using fibers is fundamentally different from domains and systhreads. In particular, while fibers are used to build concurrency abstractions, they are also used for other purposes, such as encoding mutable state through the state effect, which is not concurrency per se. Additionally, the fundamental distinguishing feature is that fibers by themselves don't have a notion of a "scheduler", which domains and systhreads do. A fiber has no built-in notion of parking and unparking. For example, when a fiber blocks on a lazy, the notion of what should run next (if any) is provided by the handler in the context. I played around with concurrency structures that are parameterized over the schedulers in effects examples. There may be a few hints there.

The other point is that the OCaml 5.00 standard library will have other structures that don't work with fibers, such as Mutex, Condition and Semaphore. Hence, the distinction between fibers and (domains & systhreads) is already there.

@gadmm (Contributor) commented Nov 30, 2021

@gadmm: would you be interested in elaborating a version of your proposal as a fork of the 5.00 branch? (This is something we could work on together if you like.) This would allow us to run benchmarks to evaluate performance differences but, more importantly (I think efficiency is not the main concern, at least for me), to actually demonstrate feasibility and expressivity.

I would be happy to help, and to be helped with this. As a first step, I'd be happy with some feedback regarding my pseudo-code (@gasche, @lpw25?); especially, if someone familiar with the OCaml memory model could please have a look at the synchronisation part (today's discussion already clarified a lot). I have updated my document so that it is fairly detailed. The roadblocks are as follows:

  1. missing acquire/release fences in OCaml memory model for an ideal implementation (this was partially clarified today; we can do with an atomic load at the start since the current multicore implementation ends up having to do worse).
  2. the [@poll error] block is insufficiently expressive (same issues that I have already mentioned in its PR). (But this does not seem to be an obstacle in current OCaml and multicore where the same issue appears.)
  3. I welcome help to adapt lambda/{matching,translcore}.ml which is still hieroglyphs to me.
  4. I can deal with changes to the runtime.
  5. flambda support (same situation as current multicore implementation).
  6. I welcome help to benchmark implementations (some computations using a concurrently-explored persistent amortized data structure?).

Items 1, 2 and 5 we can ignore at the start. The "minimal demo" at the beginning of the document is already very close to the current implementation. Would anyone be willing to help with 3 and 6?

I think the benefits of getting rid of tag mutation are bigger than initially thought, so I would like to explore this option first.

I also realized that there is another dimension of customization that benefits from a flexible design, which is interrupt-safety: in case of an exception, should I free the thunk early and “poison” the lazy block, or should I put things back in order so that later consumers have a chance to succeed? This is important in the presence of cancellation/asynchronous exceptions, if I want to share a data structure across cancellation units (the poisoning semantics is only useful for sharing among people I want to interrupt together). The user should be able to pick the appropriate semantics.

I think efficiency is not the main concern, at least for me

If the purpose was not efficiency, then the Lazy module could be coded directly in OCaml. The special treatment from the GC, from Lambda, and from the memory representation, are a heavy investment whose only point is efficiency. Or am I missing something?

@gasche (Contributor, author) commented Nov 30, 2021

I'm happy to help with 3 (adapting the matching logic). Regarding 6, I think that the non-shared use-case remains the deciding criterion for Lazy, so I think that you could start by benchmarking non-concurrent usage: if you do noticeably worse than the current implementation, it may be a no-go (for the standard Lazy.t type), and if you do noticeably better, people will want it anyway.

For a sequential benchmark, I would consider something as follows:

let fib n =
  let rec fib i = Lazy.force (Lazy.force fibs).(i)
  and fibs = lazy (Array.init (n+1) (function
    | 0 -> lazy 0
    | 1 -> lazy 1
    | i -> lazy (fib (i - 1) + fib (i - 2))
  ))
  in fib n

let () =
  let usage () = prerr_endline "usage: ./fib <input> <iteration-nb>" in
  if Array.length Sys.argv < 3 then (usage (); exit 0)
  else match int_of_string Sys.argv.(1), int_of_string Sys.argv.(2)  with
    | exception _ -> usage (); exit 1
    | input, nb_iter ->
      Printf.printf "%d\n%!" (fib input);
      for i = 1 to nb_iter - 1 (* already iterated once above *) do
        ignore (fib input);
      done

If the purpose was not efficiency, then the Lazy module could be coded directly in OCaml. The special treatment from the GC, from Lambda, and from the memory representation, are a heavy investment whose only point is efficiency. Or am I missing something?

Personally I think that having a standard building block for "delayed and memoized computations" is the major service provided by Lazy, not necessarily its performance profile (I don't know of existing programs where Lazy plays a decisive role; it is rumoured that Lexifi has some, so maybe cc @alainfrisch @nojb). I find the "Forward shortcutting" optimization rather elegant, because it means that lazy thunks are basically free when reused a lot. But the other aspects may not be critical; whereas "behaving properly in concurrent programs" would be a net benefit to many prospective users of lazy (for delayed library initialization, etc.).

@xavierleroy (Contributor):

A long time ago, before the "forward shortcutting" optimization was implemented, Andrew Tolmach experimented with retargeting the GHC Haskell compiler to produce OCaml Lambda IR + lazy values. He concluded that performance was close to that of GHC provided "forward shortcutting" was implemented in the GC. (GHC's GC does it, of course.) Then @damiendoligez liked the idea and merged Andrew's implementation of "forward shortcutting" into OCaml.

The moral of the story is that OCaml's implementation of lazy evaluation is not far behind that of GHC in terms of performance. That's a nice achievement that I'd like to preserve as we move into OCaml 5.00.

@gadmm (Contributor) commented Nov 30, 2021

Thanks @gasche!

Well, maybe not many users have had a chance to explore concurrent lock-free persistent amortized functional data structures in OCaml yet (I am only half joking). I think it would be nice to also have a benchmark that compares various kinds of synchronisation, if we want to decide later which lazy implementations to introduce in the standard library (implementing a new lazy, even with the high-level primitives, is still not for the casual programmer). The Haskell people like to tout the advantages of purity and laziness; I wonder what their go-to benchmarks are that crucially rely on laziness and take advantage of parallelism.

@xavierleroy (Contributor):

Re: performance of lazy evaluation, I thought more about it and still believe that the forward shortcutting optimization is crucial.

However, representing lazy thunks by a block of size 2 words instead of 1 might be tolerable. (So that one of the words can be used to record the status of the thunk, instead of using the header tag for this purpose.)

If we had a good benchmark for lazy evaluation, we could even try to measure the overhead on a modified OCaml 4.x.

@alainfrisch (Contributor):

(I don't know of existing programs where Lazy plays a decisive role; it is rumoured that Lexifi has some, so maybe cc @alainfrisch @nojb)

It's true that we use Lazy perhaps a bit more than the typical OCaml user, but I wouldn't go as far as to say that the performance of Lazy necessarily plays a decisive role for us. It might be the case that we would suffer some observable degradation for some payloads if Lazy becomes much slower, but I don't know how to quantify that for now. If a branch with the intended change is made available for 4.xx already, we'll be able to provide some concrete feedback.

@gadmm (Contributor) commented Dec 2, 2021

@gasche (and anyone else interested in helping), the code for the "minimal demo" (two fields instead of tag mutation, no custom synchronisation yet) is there: https://github.com/gadmm/ocaml-multicore/tree/sync_lazy. There are a few remaining TODOs, mainly code in lambda/ as I said, for which I will be happy if @gasche can fill in the blanks as proposed.

More code is removed than added: it is no longer necessary to lower Obj.last_non_constant_constructor_tag from 245 to 243, and some synchronisation between lazy and the rest of the runtime is removed, which makes the implementation simpler.

This is against 5.0; doing the same against 4.14 would duplicate the work, and while it is interesting to see how performance of stock lazys would be affected by the change, this would not give a reliable comparison with multicore, since too many factors would affect the result, and the latter comparison is more important for this work.

@xavierleroy Short-cutting during marking appears to be removed in multicore (though there are still traces of it for ephemerons). Short-cutting during marking was already removed in 4.13 at ocaml/ocaml#10195 on the basis that we did not know many programs that would benefit from it (I did not object because I was not familiar with use-cases of Lazy). So lazys remain “not that far behind GHC”, but only if you are done forcing them before the next minor GC.

AFAIU nothing prevents reintroducing short-cutting in 4.13 (a naive implementation would not interact well with prefetching, but the price is only paid by lazy users). I am curious about the obstacles to short-cutting in multicore (I imagine that this is due to races). (cc @stedolan) I do not think that my approach makes any difference about it.

The story with Andrew Tolmach is interesting; are there by chance any written traces of it?

@xavierleroy (Contributor):

This is against 5.0; doing the same against 4.14 would duplicate the work

A 4.14 implementation would enable current power users of lazy values (Lexifi?) to test and measure on their own code.

Short-cutting during marking was already removed in 4.13

Right. Would have been nice to benchmark the actual impact of the change. But if the choice is faster marking overall vs. slightly slower lazy values, we all go for faster marking.

@gadmm (Contributor) commented Dec 3, 2021

But if the choice is faster marking overall vs. slightly slower lazy values, we all go for faster marking.

If I am not mistaken, it only affected marking performance for users of lazy values: you pay for a cache miss during marking to check whether a forward pointer can be short-circuited, i.e. is not pointing to a lazy or a double (something you already paid for before prefetching was implemented, so if figures used to show that it is beneficial then they still would). Stephen can correct me if I misunderstood.

@gadmm (Contributor) commented Dec 3, 2021

I read the code again and I now remember that it is more complicated than that. It is possible that it is free in terms of performance, but I would have to measure it.

@gadmm (Contributor) commented Dec 6, 2021

I have reimplemented the short-cutting optimisation (gadmm/ocaml@a7d84b8). Benchmark results suggest that the performance difference in marking is well below fluctuations from random factors. The optimisation requires an additional store instruction in the prefetch loop, and perf reports an IPC increase on average, which is enough to explain the good performance. (This is with an adaptation of Stephen's micro-benchmark. By way of indication, a back-of-the-envelope calculation for the OCaml and Coq benchmarks given at the end of the PR, considering the worst case, gives <90ms total difference, out of 3 min and 12 min respectively, which is at least an order of magnitude below any detection threshold and random fluctuations.)

So the optimisation is free for non-users of lazy.

Consequently I would like to know whether other factors are an obstacle to this optimisation in multicore OCaml.

(I had to make only one change compared to my first naive implementation. I initially used a buffer of pairs instead of a pair of buffers, for which the performance impact was bigger, but nevertheless already hard to distinguish from random factors. To explain why my second version is so cheap, my idea was that having two separate buffers would let the CPU treat the two sets of data differently in cache; but there may be other factors I am not aware of.)

@lpw25 (Contributor) commented Dec 6, 2021

Note that Stephen tried to benchmark the impact of his changes on lazy values but, despite asking pretty widely, could not find any programs whose performance actually depended on the performance of lazy values. Until someone actually has such a program available, I would suggest that changing the prefetching marking loop to support short-circuiting would be a case of premature optimization.

@gasche (Contributor, author) commented Dec 7, 2021

@gadmm I proposed (untested) code sketches for the missing parts in your sync_lazy branch in gadmm#1.

@gadmm (Contributor) commented Jan 30, 2022

This approach was encouraged by @lthls given the simplifications it brings for flambda, so I've been reminded to reply to this discussion.

The reason for discussing the short-cutting optimisation is that it is harder for me to justify, at a scientific level, working on an efficient thread-safe design of lazys if I believe that an unrelated, incoherent aspect of the implementation should discourage programmers from exploring uses where efficiency matters.

Note that Stephen tried to benchmark the impact of his changes on lazy values but, despite asking pretty widely, could not find any programs whose performance actually depended on the performance of lazy values. Until someone actually has such a program available, I would suggest that changing the prefetching marking loop to support short-circuiting would be a case of premature optimization.

I think that this was asking the wrong question for several reasons. But to focus on the main issue:

  • Short-cutting only young references is a recipe for the kind of situation where we get nice figures during tests and synthetic benchmarks without knowing how they reflect actual use (where the programs can do arbitrary things in-between that compete for the minor heap).
  • In particular, if you want to see the impact in a synthetic benchmark, a relevant strategy might simply be to set the minor heap to the minimum size.
  • The main point for removal, that it was going to cost performance for non-users of lazy, is moot.
  • A pure OCaml implementation of lazy with a mutable record, not aiming at efficiency, could remove many more complications from the runtime and the compiler than the couple of lines needed to support major short-cutting.

So, to me, the two reasonable positions are either to have short-cutting fully implemented or to remove the optimizations entirely; I do not see the rationale for the current intermediate state.

@lpw25 (Contributor) commented Jan 30, 2022

The main point for removal, that it was going to cost performance for non-users of lazy, is moot.

I think that may be a stronger statement than is warranted from your micro-benchmark. The pre-fetching patch produced large improvements in the performance of real programs (often > 10%) because the performance of the marking loop is very important. You need to be quite sure that you're not making it slower, and have a really good reason for doing so, if you are going to make changes there.
