Discussing the design of Lazy under Multicore #750
This comment collects some remarks/changes/suggestions I had while reviewing the Lazy implementation and bug history.

Repeated-computation bug in CamlinternalLazy.do_force_block?

CamlinternalLazy.do_force_block is called from the domain that "owns" the Forcing lock:

let do_force_block blk =
let b = Obj.repr blk in
let closure = (Obj.obj (Obj.field b 0) : unit -> 'arg) in
Obj.set_field b 0 (domain_unique_token ());
try
let result = closure () in
Obj.set_field b 0 (Obj.repr result);
update_to_forward b;
result
with e ->
begin match e with
| RacyLazy -> Obj.set_field b 0 (Obj.repr closure)
| _ -> Obj.set_field b 0 (Obj.repr (fun () -> raise e))
end;
reset_to_lazy b;
  raise e

The RacyLazy branch reinstates the original closure. I believe that this creates the duplicated-computation issue that the design is trying to avoid: concurrently forcing the same thunk can run its computation more than once. The rationale for this special treatment was given by KC in #338 (comment), but I suppose that it was not realized at the time that it broke an explicit design choice.

Suggestion: handling of failed thunks

When forcing a thunk fails with an exception, a possibly more direct logic would be to set the block arg to be the exception itself.

Suggestion: Completely merging
|
Wouldn't that prevent distinguishing between a lazy value that raises an exception and a lazy value that returns an exception? |
The representation I have in mind is a block of tag

Edit: I realize there was a typo in my previous message (block tag => block arg), now fixed, thanks! |
Allowing repeated computations on force-force races?

What if we decided that we are okay with redundant computations if two domains race to force the same thunk? For example, we could use the following implementation scheme:
(Returning the value-in-the-meantime instead of the value we computed ourselves guarantees that forcing the same thunk always returns the same value, although effects may differ.)

With this scheme, there is no detection of self-loops anymore in the implementation (that would previously raise Undefined). This approach would be more expressive, in that it would be compatible with non-thread-local usage (if we decided to allow users to do this): users would have to use a concurrency-control mechanism in their thunks, which could be aware of their desired concurrency abstraction to provide the semantics they want (the lazy thunk could block if entered concurrently, and/or include accurate self-loop-detection logic).

If the only form of concurrency-control that we want is "fail on concurrent forcing", then this proposal of having a lock in user code is probably less efficient than the current Multicore approach. (But then this proposal may be more efficient than the current Multicore approach for thread-local usage, as it only uses one atomic store instead of two.)

I think that this version of

(cc @gadmm who I suspect has thought about this stuff) |
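[Editor's note] The scheme described in this comment can be sketched in a few lines; this is an illustration only (the type and function names here are invented, and it is not the stdlib implementation), using OCaml's Atomic module:

```ocaml
(* Sketch of a lock-free lazy that tolerates duplicated computation on
   force-force races. The first CAS to publish a Value wins; a loser
   discards its own result and returns the published one, so forcing the
   same thunk always returns the same value. Exceptions are not handled. *)
type 'a state =
  | Thunk of (unit -> 'a)
  | Value of 'a

type 'a t = 'a state Atomic.t

let from_fun f : 'a t = Atomic.make (Thunk f)

let force (l : 'a t) : 'a =
  match Atomic.get l with
  | Value v -> v
  | Thunk f as old ->
    let v = f () in  (* may run concurrently in several domains *)
    if Atomic.compare_and_set l old (Value v) then v
    else begin
      (* Someone else published first: take their value. *)
      match Atomic.get l with
      | Value v' -> v'
      | Thunk _ -> assert false  (* the only transition is Thunk -> Value *)
    end
```

Note that the single CAS both publishes the result and releases any pretence of ownership; there is no intermediate Forcing state, which is what makes the scheme safe to duplicate.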
I would not worry too much about inefficiencies in this case. A lazy value that raises an exception is almost always a programming error. So, I'd rather keep the implementation as simple as possible. |
I can understand the rationale for the current design: |
But (For example: a logging resource is stored in a lazy thunk to be initialized only on demand; two fibers are trying to perform a logging call, the first gets control and forces the logging thunk, the logging initialization logic itself yields, and the second fiber gets control: Undefined.) The current design is in an awkward situation where:
Unless I am missing something, there is currently no robust way to manage a lazy thunk as a shared resource (it must be synchronized/protected). The only safe way to use a thunk is with exclusive ownership, and |
This seems like a very complex implementation for something that should never happen. Lazy shouldn't ever be used for synchronization. It seems like the only reason we're forced to use synchronization is because the abstraction is a relatively high-level one and we want to hide our magic from the user. Therefore, why can we not just loop when we encounter a half-forced lazy value? We could use a spinlock while making sure to also check for pending GC work once in a while. |
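[Editor's note] The spin-loop idea above could look like the following sketch (names invented; Domain.cpu_relax is the OCaml 5 hint for spin loops). Note that this naive version deadlocks on self-forcing rather than raising Undefined, and it omits the periodic GC-work check the comment asks for:

```ocaml
(* Sketch: spin on a half-forced lazy instead of raising. The Forcing
   state acts as a spinlock held by the forcing domain. *)
type 'a state =
  | Thunk of (unit -> 'a)
  | Forcing
  | Value of 'a

type 'a t = 'a state Atomic.t

let rec force (l : 'a t) : 'a =
  match Atomic.get l with
  | Value v -> v
  | Forcing ->
    Domain.cpu_relax ();  (* spin until the other domain publishes *)
    force l
  | Thunk f as old ->
    if Atomic.compare_and_set l old Forcing then begin
      let v = f () in
      Atomic.set l (Value v);
      v
    end
    else force l
```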
Thanks for the ping. I made some remarks about some of these issues recently at #713 and #520 and was waiting for feedback, it is probably good to centralize it with an issue. In order of your remarks:
On a personal note, I quickly found that this problem was much more involved than I thought (i.e. time I was willing to spend on it), and I am no longer actively looking at exploring better designs, though I volunteered to clean-up the interface a bit before upstreaming. |
I wonder how much of these various behaviours could be factored in one by storing inside the lazy block not the thunk that is going to be forced, but a closure made of a forcing function with suitable synchronisation/non-synchronisation together with the thunk as an argument, and expecting the lazy block as a second argument (e.g. so that the |
Not sure it helps with any of the other issues mentioned, but it seems like this could be solved by storing a user-provided token from an extensible variant in the lazy whilst it is being forced rather than storing the domain id. This token would also be returned to anyone who attempts to force the lazy concurrently. This would allow concurrency libraries to store whatever notion of identity for a unit of concurrency that they want to and thus distinguish between racy and ill-founded forcing. |
Repeated computation bug

Neither of the two available choices in the current design for handling nested lazy computations is satisfying. I reproduce here the example from #338 (comment):

let l1 = lazy ...
let l2 = lazy (Lazy.force l1)
let domain1 = Domain.spawn (fun _ -> ignore (Lazy.force l1))
let domain2 = Domain.spawn (fun _ -> ignore (Lazy.force l2))

The problem is that there could be a race on forcing l1.

Solution 1: Restore the computation, but suffer recomputation

As @gasche points out in #750 (comment), the

Solution 2: Update the lazy to raising
|
@gasche suggested to have a primitive Lazy without locks and with repeated forcing on races, the idea being that one can then implement the desired synchronisation behaviour inside the thunk itself. There are more ideas (and motivations) in the Haskell paper mentioned above for making it memory-safe without CAS along similar lines as his suggestion. One may need to pass the lazy block as an additional argument to the thunk depending on how one deals with updating it, but I believe this lets one implement all the behaviours one might want. This also covers the case where you can store an atomic value containing your custom thread id, or a mutex, etc. |
@lpw25 This sounds like a good idea. The only usability issue is that, if the value is made user-visible, then there is the possibility that

That said, this is no worse than using the same "unique" token for two different fibers. |
This is an issue with the current choice of implementation, but there are race-free implementations of Leo's idea (obviously: we could extend Lazy blocks with another argument that would be a single-writer multi-reader lock, done; but we are trying to avoid this for memory-efficiency reasons). Staying close to the current implementation: we could store in the

I thought along lines similar to Leo's proposal ("forcing witnesses"), but I didn't think of extensible variants, which are kind of his thing. In my mind a proper API with this would have

Note: if you propose this sort of nicety, then I think we could consider dropping

type ('a, 'tok) t2
val from_fun : (unit -> 'a) -> ('a, 'tok) t2
val try_force : ('a, 'tok) t2 -> ('a, 'tok) result
type 'a t = ('a, Domain.id) t2

(One advantage of the extensible-variant as opposed to this parametric API is that we can also return the extensible-variant token as an exception argument for |
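[Editor's note] A sketch of how Leo's extensible-variant token could combine with a try_force-style API; every name here is invented for illustration, not a proposed final interface:

```ocaml
(* Sketch: the forcer deposits a caller-chosen token (a constructor of an
   extensible variant) while forcing; a concurrent forcer gets that token
   back and can tell racy forcing apart from ill-founded (self) forcing. *)
type token = ..
type token += Domain_token of int

type 'a state =
  | Thunk of (unit -> 'a)
  | Forcing of token
  | Value of 'a

type 'a t = 'a state Atomic.t

let from_fun f : 'a t = Atomic.make (Thunk f)

let rec try_force ~token (l : 'a t) : ('a, token) result =
  match Atomic.get l with
  | Value v -> Ok v
  | Forcing tok -> Error tok  (* the caller inspects the token *)
  | Thunk f as old ->
    if Atomic.compare_and_set l old (Forcing token) then begin
      let v = f () in
      Atomic.set l (Value v);
      Ok v
    end
    else try_force ~token l  (* lost the race: report the winner's token *)
```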
If I understand correctly @gadmm's suggestion, it corresponds to a sort of object-oriented approach where thunks decide, at creation time, how they are going to be forced, which means that consumers need no extra information on the "forcing" side, which has the advantage of giving a more flexible semantics to lazy patterns (which otherwise default to

type 'a t = 'a lazy_state ref
type 'a lazy_state = Lazy of ('a t -> 'a) | Forcing of ('a t -> 'a) | Forward of 'a | Exn of exn
let force (thunk : 'a t) =
match !thunk with
| Lazy force -> force thunk
| Forcing race -> race thunk
| Forward v -> v
| Exn exn -> raise exn
let from_fun (f : unit -> 'a) =
  let closure (self : 'a t) =
    self := Forcing (fun _ -> raise Undefined);
    match f () with
    | exception exn -> self := Exn exn; raise exn
    | v -> self := Forward v; v
  in
  ref (Lazy closure)

I find this suggestion interesting, but also substantially more complex than the "relevant forcing" proposal (we have a lot more flexibility but it's not clear what to do with it, and what the performance cost is), so I would be tempted to explore "relevant forcing" first. |
Note on "relevant forcing": in addition to a magic value and the forcing token, the Forcing state could also store the original closure (and return it to |
I read @gadmm's reference, Haskell on a Shared-Memory Multiprocessor, Tim Harris, Simon Marlow, Simon Peyton Jones, 2005. It's interesting, the amount of sophistication is impressive (duplicate computations are detected "lazily" when a thread scans its call stack, and in that case the computation is aborted to block on the first forcer), but most aspects of it are very different from the current OCaml design -- and thankfully we don't have the same performance constraints. Some remarks:
|
We could simplify the logic a lot by having several block arguments per lazy thunk. For example
With this design we don't have the problem of being unable to interpret the block argument due to races. We can still observe a Forcing payload or a Lazy closure that changed since we started forcing, but this looks semantically correct. (It corresponds to the expected semantics for a different concurrent scheduling.) We could even remove that last confusion by storing Forward exceptions as a fourth block argument, so that we never go back from Forcing to Lazy. |
Watch for memory leaks. The closure must become unreachable from the lazy value as soon as we start evaluating it, otherwise leaks occur. In other words, blackholing is not just to detect self-referential lazy values, but to ensure timely garbage collection of the closure. |
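[Editor's note] A sketch combining the multi-field layout with the timely-release point above; the record, field, and state names are invented (the real representation would use block fields and tags, not a record), and the memory-model details are glossed over:

```ocaml
(* Sketch: one field per state, so a racy reader never misinterprets a
   field written for a different state. States: 0 = Lazy, 1 = Forcing,
   2 = Forward, 3 = failed. The thunk field is cleared as soon as forcing
   starts, so the closure can be collected promptly. *)
type 'a cell = {
  state : int Atomic.t;
  mutable thunk : (unit -> 'a) option;  (* meaningful in state 0 only *)
  mutable value : 'a option;            (* meaningful in state 2 only *)
  mutable error : exn option;           (* meaningful in state 3 only *)
}

exception RacyLazy

let from_fun f =
  { state = Atomic.make 0; thunk = Some f; value = None; error = None }

let force (l : 'a cell) : 'a =
  match Atomic.get l.state with
  | 2 -> (match l.value with Some v -> v | None -> assert false)
  | 3 -> (match l.error with Some e -> raise e | None -> assert false)
  | _ ->
    if not (Atomic.compare_and_set l.state 0 1) then raise RacyLazy
    else begin
      let f = match l.thunk with Some f -> f | None -> assert false in
      l.thunk <- None;  (* drop the closure for the GC before running it *)
      match f () with
      | v -> l.value <- Some v; Atomic.set l.state 2; v
      | exception e -> l.error <- Some e; Atomic.set l.state 3; raise e
    end
```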
Are there any methods that would allow us to ditch the atomic after we force the value? After all, we only want to limit access to the unforced value -- once it's forced, we don't care about synchronization. So we could have an additional, first boolean word that is always initialized to False, indicating whether the value should be accessed as an atomic. The first thread to force the value will first indicate Forcing atomically in the second word, and then write the full value of the second word after it is done. Once forcing is complete, the first boolean is set to true by the same thread. Any thread accessing the value will first check if it needs to read it atomically using the first boolean. Even if it misreads the boolean before it's been fully written and therefore reads the 2nd word atomically, it's not a big deal. Is this kind of thing doable? |
@bluddy if forcing a value completes successfully, the tag becomes Forward_tag or some other tag (not Lazy or Forcing), and no atomic reads are performed by forcing that value. I think that this implements something similar to your suggestion. @xavierleroy Good point! Note that the current Multicore implementation keeps the closure reachable, because it is reinstated if it fails with RacyLazy. Another argument against this reinstatement. |
Agreed. Restoring the lazy computation is a mistake for several reasons. |
I think there was a misunderstanding of my suggestion and/or this Haskell paper; I will try to clarify. The lock-less implementation of Lazy is actually much simpler than the current implementation of Lazy in multicore (if there is any doubt, I have included some pseudo-code in my examples below). The main idea is to write the thunk and the result in separate fields to avoid a race, it requires no CAS and only needs two states instead of three (no need for a forcing tag), this does not count as "super-complex" to me. Duplicate evaluation is a feature: this is what lets you decouple the synchronisation mechanism. (The Haskell paper only becomes complicated when it talks about stopping the duplicated evaluation before it ends, where the GC must be involved. Obviously you do not want or need this in OCaml, especially if what is interrupted is some synchronisation code.)

I claim the following: the default implementation (as well as its variant with a "custom" thread id, and other ideas I see floating) can be factored through this primitive lock-less lazy. For performance, I only see two differences: 1) I only need (I think) one CAS instead of two (as an additional benefit of storing the value and the thunk separately); 2) there can be extra allocations, in particular the atomic containing the thread id is allocated separately (but if this becomes important then you could let the creation function request extra fields for its own use when allocating the lazy, for instance). In addition, it supports any custom synchronisation mechanism you want.

The following pseudo-code fleshes out my suggestion better (I omitted exception-handling for simplicity though); I wrote down |
I added a sixth example (the simplest of all), a lock-less one that never fails on races but duplicates the computation if needed. The motivating use-case is somebody wanting to implement some of Okasaki's “Purely functional data structures” (1996) which use lazy evaluation (where the thunk is pure) to achieve some amortized complexity. Such persistent data structures are a good candidate for wanting to share them across threads, but the cost of synchronizing every forcing of thunks would probably be prohibitive. (Bonus: a library of such lock-less, persistent data structures would be universal for multicore OCaml, not dependent on whichever threading library one uses.) |
Anecdotally, CAS has gotten much cheaper over the years, especially uncontended ones. The numbers reported may not stand the test of time. FWIW we use a CAS on each object to promote them in the parallel minor GC. @ctk21 or @stedolan may be able to confirm whether this is true.

There are quite a few ideas flying around, and it has become hard for me to keep track of the various options since lazy is quite subtle due to its interaction with concurrent mutator and GC threads. It would be nice to summarise the various options in a design space document with the tradeoffs.

I am also a little bit worried about redesigning the lazy implementation from scratch at this stage, though my worry may stem from my lack of understanding of the various proposals. Having implemented Lazy in various broken ways previously, I would err on the side of a forwards-compatible implementation with simple semantics. Let's discuss this today in the WG4 meeting. |
I have to admit that I didn't even know about the existence of the I do also think we need to try and make sure the CAS is only required when forcing the lazy value, and that afterwards it should be available without a CAS if such a design is at all possible. |
Regarding performance of CAS, digging deeper into the implementation made me realise that we mean two different things with CAS. The current implementation does two CAS loops, each of which require a C function call ("noalloc"). This is because the update of the tag races with updates to the GC flags. Ideally you should only need one plain CAS and no external function call. Incidentally, I realised my examples currently need this CAS loop when updating the tag for the same reason. These external function calls + CAS loops can actually be removed but the trade-offs must be benchmarked. (I have updated my example file to explain this.) So there might be performance gains but for different reasons than described in the Haskell paper.
This sounds very reasonable. My example suggests that it is beneficial to avoid having to expand the list of reserved tags with a new "forcing" tag, which would be the only thing that encourages looking into this option now rather than later. I have added "minimal demos" at the end of my example file which would be the simplest way to test the lock-less approach. They describe the default behaviour without the decoupling. If this works then I think the rest works too, and it would give a comparison point in terms of performance. I agree that it would help to have a design document (I update my example file so that it remains up-to-date regarding the lock-less proposition, so you can see it as a "design document" for now). The main question I see is which design better enables the user to implement their own synchronisation. It would help to have pseudo-code examples of how custom synchronisation is expected to be implemented by proponents of the |
OCaml maintainers had a discussion about this with the Multicore people today. The consensus that we reached (which was also my proposal) was to go for the minimal implementation for now, to (1) keep things as simple as possible, and (2) leave the door open to better proposals than what we currently have. Minimal in the following sense:
We can improve on all of these later by extending the minimal proposal, possibly along the lines of Leo's idea (user-provided token on Forcing) or Guillaume's idea. @gadmm: would you be interested in elaborating a version of your proposal as a fork of the 5.00 branch? (This is something we could work on together if you like.) This would allow us to do benchmarks to evaluate performance differences but, more importantly (I think efficiency is not the main concern, at least for me), to actually demonstrate feasibility and expressivity. |
Is |
Mutex is in the Multicore stdlib. We could consider moving it to 4.14 for compatibility, but this might be a bit tricky (do we want to provide different implementations depending on whether systhreads are used?). Would you open a tracking issue? |
That is correct. There are a few things we should keep in mind while making Lazy safe for concurrency in the future. Concurrency using fibers is fundamentally different from domains and systhreads. In particular, while fibers are used to build concurrency abstractions, they are also used for other purposes, such as encoding mutable state through the state effect, which is not concurrency per se. Additionally, the fundamental distinguishing feature is that fibers by themselves don't have a notion of a "scheduler", which domains and systhreads do. Fibers have no built-in notion of parking and unparking. For example, when a fiber blocks on a lazy, the notion of what should run next (if any) is provided by the handler in the context. I played around with concurrency structures that are parameterized over the schedulers in effects examples. There may be a few hints there. The other point is that the OCaml 5.00 standard library will have other structures that don't work with fibers, such as Mutex, Condition and Semaphores. Hence, the distinction between fibers and (domains & systhreads) is already there. |
I would be happy to help, and to be helped for this. As a first step, I'd be happy with some feedback regarding my pseudo-code (@gasche,@lpw25?); especially if someone familiar with the OCaml memory model could please have a look at the synchronisation part (today's discussion already clarified a lot). I have updated my document so that it is fairly detailed. The roadblocks are as follows:
Items 1, 2 and 5 we can ignore at the start. The "minimal demo" at the beginning of the document is already very close to the current implementation. Would anyone be willing to help with 3 and 6? I think the benefits of getting rid of tag mutation are bigger than initially thought so I would like to explore this option first. I also realized that there is another dimension of customization that benefits from a flexible design, which is interrupt-safety: in case of an exception, should I free the thunk early and “poison” the lazy block, or should I put things back in order so that later consumers have a chance to succeed? This is important in the presence of cancellation/asynchronous exceptions, if I want to share a data structure across cancellation units (the poisoning semantics is only useful for sharing among people I want to interrupt together). The user should be able to pick the appropriate semantics.
If the purpose was not efficiency, then the Lazy module could be coded directly in OCaml. The special treatment from the GC, from Lambda, and from the memory representation, are a heavy investment whose only point is efficiency. Or am I missing something? |
I'm happy to help with 3 (adapting the matching logic). Regarding 6, I think that the non-shared use-case remains the determinant criterion for Lazy, so I think that you could start by benchmarking non-concurrent usage: if you do sensibly worse than the current implementation, it may be a no-go (for the standard,

For a sequential benchmark, I would consider something as follows:

let fib n =
let rec fib i = Lazy.force (Lazy.force fibs).(i)
and fibs = lazy (Array.init (n+1) (function
| 0 -> lazy 0
| 1 -> lazy 1
| i -> lazy (fib (i - 1) + fib (i - 2))
))
in fib n
let () =
let usage () = prerr_endline "usage: ./fib <input> <iteration-nb>" in
if Array.length Sys.argv < 3 then (usage (); exit 0)
else match int_of_string Sys.argv.(1), int_of_string Sys.argv.(2) with
| exception _ -> usage (); exit 1
| input, nb_iter ->
Printf.printf "%d\n%!" (fib input);
for i = 1 to nb_iter - 1 (* already iterated once above *) do
ignore (fib input);
done
Personally I think that having a standard building block for "delayed and memoized computations" is the major service provided by Lazy, not necessarily its performance profile (I don't know of existing programs where Lazy plays a determinant role; it is rumoured that Lexifi has some, so maybe cc @alainfrisch @nojb). I find the "Forward shortcutting" optimization rather elegant, because it means that lazy thunks are basically free when reused a lot. But the other aspects may not be critical; whereas "behaving properly in concurrent programs" would be a net benefit to many prospective users of |
A long time ago, before the "forward shortcutting" optimization was implemented, Andrew Tolmach experimented with retargeting the GHC Haskell compiler to produce OCaml Lambda IR + lazy values. He concluded that performance was close to that of GHC provided "forward shortcutting" was implemented in the GC. (GHC's GC does it, of course.) Then, @damiendoligez liked the idea and merged Andrew's implementation of "forward shortcutting" in OCaml. The moral of the story is that OCaml's implementation of lazy evaluation is not far behind that of GHC in terms of performance. That's a nice achievement that I'd like to preserve as we move into OCaml 5.00. |
Thanks @gasche! Well maybe not many users have had a chance to explore concurrent lock-free persistent amortized functional data structures in OCaml yet (I am only half joking). I think it would be nice to have also a benchmark that compares various kinds of synchronisation, if we want to decide later which lazy implementations to introduce in the standard library (implementing a new lazy even with the high-level primitives is still not for the casual programmer). The Haskell people like to tout the advantages of purity and laziness, I wonder what are their go-to benchmarks that crucially relies on laziness and takes advantage of parallelism. |
Re: performance of lazy evaluation, I thought more about it and still believe that the forward shortcutting optimization is crucial. However, representing lazy thunks by a block of size 2 words instead of 1 might be tolerable. (So that one of the words can be used to record the status of the thunk, instead of using the header tag for this purpose.) If we had a good benchmark for lazy evaluation, we could even try to measure the overhead on a modified OCaml 4.x. |
It's true that we use Lazy perhaps a bit more than the typical OCaml user, but I wouldn't go as far as to say that the performance of Lazy necessarily plays a determinant role for us. It might be the case that we would suffer some observable degradation for some payloads if Lazy becomes much slower, but I don't know how to quantify that for now. If a branch with the intended change is made available for 4.xx already, we'll be able to provide some concrete feedback. |
@gasche (and anyone else interested in helping), the code for the "minimal demo" (two fields instead of tag mutation, no custom synchronisation yet) is there: https://github.com/gadmm/ocaml-multicore/tree/sync_lazy. There are a few remaining TODOs, mainly code in

More code is removed than added, it is no longer necessary to lower

This is against 5.0; doing the same against 4.14 would duplicate the work, and while it is interesting to see how performance of stock lazys would be affected by the change, this would not give a reliable comparison with multicore, since too many factors would affect the result, and the latter comparison is more important for this work.

@xavierleroy Short-cutting during marking appears to be removed in multicore (though there are still traces of it for ephemerons). Short-cutting during marking was already removed in 4.13 at ocaml/ocaml#10195 on the basis that we did not know many programs that would benefit from it (I did not object because I was not familiar with use-cases of Lazy). So lazys remain “not that far behind GHC”, but only if you are done forcing them before the next minor GC. AFAIU nothing prevents reintroducing short-cutting in 4.13 (a naive implementation would not interact well with prefetching, but the price is only paid by lazy users). I am curious about the obstacles to short-cutting in multicore (I imagine that this is due to races). (cc @stedolan) I do not think that my approach makes any difference about it.

The story with Andrew Tolmach is interesting, are there by chance any written traces of it? |
A 4.14 implementation would enable current power users of lazy values (Lexifi?) to test and measure on their own code.
Right. Would have been nice to benchmark the actual impact of the change. But if the choice is faster marking overall vs. slightly slower lazy values, we all go for faster marking. |
If I am not mistaken, it only affected marking performance for users of lazy values: you pay for a cache miss during marking to check whether a forward pointer can be short-circuited, i.e. not pointing to a lazy or a double (something you already paid for before prefetching was implemented, so if figures used to show that it is beneficial then they would still do). Stephen can correct me if I misunderstood. |
I read the code again and I now remember that it is more complicated than that. It is possible that it is free in terms of performance, but I would have to measure it. |
I have reimplemented the short-cutting optimisation (gadmm/ocaml@a7d84b8). Benchmark results suggest that the performance difference of marking is well below fluctuations from random factors. The optimisation requires an additional store instruction in the prefetch loop, and

So the optimisation is free for non-users of lazy. Consequently I would like to know whether other factors are an obstacle to this optimisation in multicore OCaml. (I had to make only one change compared to my first naive implementation. I initially used a buffer of pairs instead of a pair of buffers, for which the performance impact was bigger, but nevertheless already hard to distinguish from random factors. For explaining why my second version is so cheap, my idea was that having two separate buffers would let the CPU treat the two sets of data differently in cache; but there may be other factors I am not aware of.) |
Note that Stephen tried to benchmark the impact of his changes on lazy values but, despite asking pretty widely, could not find any programs whose performance actually depended on the performance of lazy values. Until someone actually has such a program available, I would suggest that changing the prefetching marking loop to support short-circuiting would be a case of premature optimization. |
This approach was encouraged by @lthls given the simplifications it brings for flambda, so I've been reminded to reply to this discussion. The reason for discussing the short-cutting optimisation is that it is harder for me to justify at a scientific level working on an efficient thread-safe design of lazys if I believe that an unrelated incoherent aspect of the implementation should discourage programmers from exploring uses where efficiency matters.
I think that this was asking the wrong question for several reasons. But to focus on the main issue:
So, to me, the two reasonable positions are either to have short-cutting fully implemented or to remove the optimizations entirely; I do not see the rationale for the current intermediate state. |
I think that may be a stronger statement than is warranted from your micro-benchmark. The pre-fetching patch produced large improvements in the performance of real programs (often > 10%) because the performance of the marking loop is very important. You need to be quite sure that you're not making it slower, and have a really good reason for doing so, if you are going to make changes there. |
This issue is collecting my understanding of the design of Lazy under Multicore.
Sequential Lazy
With Sequential OCaml, trying to force a value that is currently being forced raises Undefined. This can be caused by a thunk trying to force itself recursively, or two systhreads racing to force a value.
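[Editor's note] The sequential behaviour described here can be modelled by the following sketch (a plain-OCaml model with an invented record type, not the real tagged-block representation):

```ocaml
exception Undefined

(* Model of sequential Lazy: the stored closure is swapped for a raising
   one before running (so a recursive self-force raises Undefined), then
   for a constant closure holding the result or the escaping exception. *)
type 'a t = { mutable thunk : unit -> 'a }

let from_fun f = { thunk = f }

let force (l : 'a t) : 'a =
  let f = l.thunk in
  l.thunk <- (fun () -> raise Undefined);  (* "blackhole" during forcing *)
  match f () with
  | v -> l.thunk <- (fun () -> v); v
  | exception e -> l.thunk <- (fun () -> raise e); raise e
```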
This is implemented by mutating the lazy block twice during forcing, once to replace the thunk closure with fun () -> raise Undefined before computation (the tag remains Lazy_tag), and once to set the result (the tag remains Lazy_tag in case of an exception, and becomes Forward_tag otherwise).

(Note: there is no allocation/yield between setting the thunk to fun () -> raise Undefined and invoking the thunk closure, so if two systhreads race to force a value, exactly one of them will get the real thunk closure.)

Concurrency problems
With Multicore OCaml, synchronization was introduced to avoid two potential issues:
(This is in contrast with other mutable-state libraries like Stack where lack of synchronization can break programs but not break memory-safety.)
Note: it is clear that "memory unsafety" is a blocking issue, and that solving it requires some synchronization. However, whether "duplicated computations" is a blocking issue or not is a design choice: we could have decided that it is okay to have two domains racing. This being said, I don't think that relaxing this design constraint would lead to a substantially simpler implementation, as we need synchronization anyway to avoid memory unsafety.
Current approach (November 2021)
The solution used in Multicore OCaml is basically to treat Lazy_tag values as atomic references, with the state indicated in the tag: Undefined. The block argument stores a function closure that contains the thunk computation, or is of the form fun () -> raise e in case (2).

Forcing a value performs two compare-and-swap instructions on the tag, moving once from Lazy_tag to Forcing_tag, then from Forcing_tag to Forward_tag (if the thunk returns a value) or Lazy_tag (if the thunk raises an exception).
The Forcing_tag has a lock invariant: only one domain can succeed in setting it concurrently, and this ownership is required to (1) call the thunk closure and (2) mutate the block argument. (Setting the tag back to Lazy_tag or Forward_tag releases the lock.)
In particular, this lock invariant avoids both problems of duplicated computations and memory unsafety: two domains cannot race in calling the thunk closure, or one in calling the closure and another in writing a result.
Note: the compare-and-swap parts had to be written in C to avoid atomic-breakage bugs caused by poll points. See #483 / #501 .
Some problems with racy forcing
If someone tries to force a thunk that is already in the Forcing_tag state, forcing fails with an exception. The Multicore implementation tries to distinguish between "access from the same domain", which fails with Undefined, and "access from another domain", which fails with RacyLazy.
Problem 1: fragility of the implementation
To make this distinction, the racy forcer looks at the block argument, which is a marker of the forcer's domain. This is however fairly fragile as this lookup is racy (when doing the lookup, we don't hold the Forcing lock). This fragility caused the bug #504, and the fix is sophisticated (using a "magic" value created by the GC that is supposed to be secret to the Lazy code in unspecified ways).
Problem 2: domains as an incomplete abstraction
Domains are not meant to be the final granularity of concurrency for
user-level OCaml code, but are instead meant to be populated by lightweight
abstractions like systhreads, fibers, etc. For a user of those
abstractions, "from this or another domain" is not the right
distinction: it does not suffice to distinguish concurrent
forcing from recursive forcing.
Problem 3: poor programmability
Problem 2 could be the job of users to fix: if they are using systhreads or fibers or tasks, etc., they can catch the forcing exceptions and perform their own concurrency logic after that.
The problem is that users cannot really tell, on receiving a RacyLazy exception, whether the Lazy.force call they made was racy (in which case they may want to try again later), or whether they correctly forced a thunk that itself raised RacyLazy as part of its computation (in which case they will forever get this exception back).
try_force

@gadmm identified "problem 3" above, and proposed (in #338) a try_force function that would return an option, with None unambiguously indicating that the thunk is being forced concurrently. All concurrent users of thunks should use try_force, and RacyLazy can be treated as a programming error that should never be caught.

This proposed implementation initially removed the distinction between Undefined and RacyLazy in force: both conditions (same-domain forcing and other-domain forcing) would raise Undefined, while try_force would return None on other-domain forcing and raise Undefined on same-domain forcing. This part was reverted, and the current code still raises RacyLazy.
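[Editor's note] Consumer-side logic for such an option-returning try_force might look like this sketch (the helper and its argument are invented for illustration; no such function exists in the stdlib):

```ocaml
(* Sketch: retry loop around an option-returning try_force, passed in as
   an argument. None means "being forced concurrently", so back off and
   try again; a real scheduler would yield to other fibers instead. *)
let rec force_retrying try_force l =
  match try_force l with
  | Some v -> v
  | None ->
    Domain.cpu_relax ();  (* or yield to the surrounding scheduler *)
    force_retrying try_force l
```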