[feature request] support for reading/writing OCaml values to shared memory #10272
Some operations that would be needed in the Gc module:
I’m not clear how “The SHM should not be under the control of any GC” leads to “Such a functionality should be designed and maintained by people who have Gc expertise”. What do you envisage this offering which multicore OCaml will not (other than there being no GC)?
In my opinion, this would be better left to an external library. That said, it could be interesting to see if we could make a variant of the generic marshaler/demarshaler which would be more suited to this use case, perhaps not supporting "custom blocks", but putting more emphasis on optimizing for speed. Could we for instance make it so that reading (from SHM) would just amount to blitting a contiguous block directly into the OCaml heap + a linear pass to relocate all pointers?
...is pretty much the sole point of the multicore OCaml effort that is reported on regularly here. Have you: (i) looked at it and decided it won't solve your use case? or (ii) need something on some particular time schedule? or (iii) ignored it?
FWIW, we (LexiFi) rely a lot on multi-process parallelism, and it is unlikely that we would replace this architecture with multi-core. The reasons to stick to a multi-process approach:
But then how do you synchronize the processes that touch this global state today? I would assume that you either already do synchronization (maybe Multicore could provide similar synchronization patterns with equal or better performance) or (weirder) that each process has its own hashcons pool, in which case you wouldn't need a lock.
Each process has its own hash-consing (we guarantee maximal sharing within each process, of course), and those terms are serialized "manually", to recreate sharing upon import. The analog solution with multicore, IIUC, would be to use a thread-local hash-consing table (if thread-local variables are supported) or pass a context around; but how would one guarantee that a term created in one thread is never passed to another thread? The process barrier guarantees that one cannot make this mistake. Will multi-core have some ability to restrict some objects from crossing thread boundaries? We also have plenty of memoization tables, which, for the same reason (they contain algebraic terms) would need to be made thread-local. Plus various things kept in global variables to avoid having to do heavy plumbing through many layers of code.
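As an illustration of what the "one pool per process" setup might look like with domains, here is a hypothetical sketch (the `term` type and all names are invented for this example) of a per-domain hash-consing pool using OCaml 5's `Domain.DLS`, so that each domain lazily gets its own table:

```ocaml
(* Hypothetical sketch: per-domain hash-consing via Domain.DLS (OCaml 5).
   The term type and names are invented for illustration. *)
type term = Var of string | App of term * term

module H = Hashtbl.Make (struct
  type t = term
  let equal = (=)
  let hash = Hashtbl.hash
end)

(* Each domain lazily creates its own pool on first use. *)
let pool_key : term H.t Domain.DLS.key =
  Domain.DLS.new_key (fun () -> H.create 1024)

let hashcons (t : term) : term =
  let pool = Domain.DLS.get pool_key in
  match H.find_opt pool t with
  | Some t' -> t'               (* reuse the canonical representative *)
  | None -> H.add pool t t; t
```

Note that nothing in this sketch prevents a hash-consed term from crossing domain boundaries; as discussed above, enforcing that discipline would need something extra (e.g. an actor/message-passing layer) on top.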
Not to my knowledge. (If I understand correctly, this is more-or-less possible in Rust, using a home-grown Copy trait for the type, but it requires a lot of type-level machinery we don't have.) But Multicore also does not dictate your concurrency abstraction, so you could use one (for example domain-sized actors with explicit message-passing) that enforces this discipline. If you wrote a program that mixed two different concurrency libraries/approaches, there may be a risk of breaking this discipline, but:
Yes, but with multi-process, you need to explicitly pass the data around. With multi-core, any global state could lead to subtle failures if not properly protected (through locks, or by turning it into thread-local variables). Anyway, rewriting the whole code base to completely change the model and avoid all global state would be a huge effort (+ other arguments I mentioned above), so I don't see it happening anytime soon. An incremental improvement to make it faster to serialize OCaml data structures across processes (when one knows those pieces of data are "safe") would be useful, though.
I think you get enough bang for the buck by just implementing this as a library for shared Bigarrays across processes, which is essentially what Python offers in its multiprocessing libraries. For more complex types, one can use serialization into shared Bigarrays. A fancier implementation would allow for primitive-type hashtables using reference counting. In fact, I would be delighted to find out whether such a library is already available.
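The shared-Bigarray approach can be sketched with the stdlib alone. Below is a hedged sketch (the path and sizes are invented) of a float Bigarray backed by a memory-mapped file via `Unix.map_file`; two processes opening the same path share the data, much like Python's multiprocessing shared memory. A real SHM segment would use `shm_open` through C bindings instead of an ordinary file:

```ocaml
(* Sketch: a shared float array backed by a memory-mapped file.
   Requires the unix library; OCaml >= 4.06 for Unix.map_file. *)
let with_shared_array path n f =
  let fd = Unix.openfile path [Unix.O_RDWR; Unix.O_CREAT] 0o600 in
  Fun.protect ~finally:(fun () -> Unix.close fd) (fun () ->
    let arr =
      (* shared = true: writes go to the file and are visible to
         every other process mapping the same file. *)
      Unix.map_file fd Bigarray.float64 Bigarray.c_layout true [| n |]
      |> Bigarray.array1_of_genarray
    in
    f arr)

let () =
  let path = Filename.temp_file "shm_demo" ".dat" in
  with_shared_array path 4 (fun a -> a.{0} <- 42.0);
  (* A second mapping (here in the same process; a forked child would
     work the same way) observes the write through the shared file. *)
  let v = with_shared_array path 4 (fun a -> a.{0}) in
  assert (v = 42.0);
  Sys.remove path
```

Compile against the unix library (e.g. `ocamlfind ocamlopt -package unix -linkpkg`).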
Right now, the core OCaml dev team is busy with merging Multicore OCaml into the main line. Multicore OCaml is the only approach to parallelism in OCaml that we are going to actively support in the near future, so there will be no work on alternate, process-based approaches as outlined here.

This said, and as exemplified by the work at LexiFi, process-based approaches to parallelism can certainly be developed as libraries and don't need special support in the core system. To me, it looks like the Ancient library does most of what @UnixJunkie describes. Maybe Ancient needs to be revived and extended, and I think the switch to the "no naked pointers" runtime will simplify Ancient, but again this can be done as an external library.

The one difficulty I see with what @UnixJunkie and @alainfrisch describe is how to ensure that the shared memory segment or shared file is mapped in memory at the same address in all the processes. Otherwise, there can be no pointers in the shared data. Of course, you can give an absolute address to

This said, I'm going to close this issue, as it is not an issue concerning the core OCaml system, just a wish to develop something that can be developed outside the core system. Feel free to continue this discussion on discuss.ocaml.org.
Ok, thanks to all for the feedback. @bluddy I am not aware of a "fancier" implementation. E.g. Parmap does Marshal to/from mmapped bigarrays.
I did not suggest making values in the shared memory directly accessible, but rather optimizing the generic marshaler for the use case where values are passed/shared across processes using shared memory, which might lead to different performance tradeoffs (e.g. less emphasis on the size, more on data alignment).

I did a very quick experiment with a different marshaling strategy, where the marshaler produces a contiguous piece of data that can be directly blitted back into the heap by the demarshaler, simply adjusting all non-immediate fields by a fixed offset (this blit + relocation can be done with a simple linear pass). Early experiments suggest a speedup of around 30%-40% for the demarshaling part (compared to the current demarshaler), but a similar slowdown for the marshaling part (and much more compared to the No_sharing mode, which is itself around 4x faster than the mode with sharing on that micro-benchmark); and a very significant increase in the size of the marshaled data (around 9x), which might itself become problematic in some use cases.

Nevertheless, there might be some uses for such a strategy, when a single piece of data is written once and read many times; and perhaps the marshaling part can be optimized (currently, it uses two passes) to avoid the slowdown. (Ref: trunk...alainfrisch:afrisch_obj_blit)
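To make the blit + relocation idea concrete, here is a toy model (not the actual runtime code, all types invented): a marshaled "image" is an array of words in which pointer fields hold offsets from the image base, so demarshaling is a copy plus one linear pass adding the destination base address to every pointer field:

```ocaml
(* Toy model of blit + relocation: Ptr carries an offset from the
   image base, not an absolute address. *)
type word = Imm of int | Ptr of int

let demarshal ~new_base (image : word array) : word array =
  (* A single linear pass over the copied image. *)
  Array.map (function
      | Imm n -> Imm n                     (* immediates: no relocation *)
      | Ptr off -> Ptr (new_base + off))   (* relocate by a fixed offset *)
    image
```

The real scheme works on raw heap words rather than a variant type, but the cost structure is the same: one blit and one pass that touches only pointer fields.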
That's how the first marshaler for Caml Light worked. Marshaled data was very big, causing a switch to a compact encoding similar to the one used in OCaml. I'm not surprised you're seeing no significant improvement w.r.t. compact marshaling, even when using shared memory (instead of a pipe or a socket) to transmit the marshaled data. There should be ways to reserve a given range of addresses in several processes so that they can share some memory at these addresses. But I'm afraid these ways are specific to the OS and the C library used.
Thanks @xavierleroy for the historical perspective! Speaking as a complete outsider to this topic, I wonder if prefetching hacks (such as in #10195) could somehow speed up the marshaling side. Also, my quick experiment allowed me to confirm that #9353 introduced a significant slowdown on synthetic benchmarks; that's not a surprise, but something to keep in mind for people relying heavily on marshaling for cross-process communication. This makes me wonder whether it could be useful to have an intermediate mode where only some nodes are checked for sharing (for instance, I have cases in mind where only keeping sharing on strings would be useful, and restricting to string objects should significantly reduce the overhead of sharing detection).
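The "sharing on strings only" idea could look like the following illustrative sketch (the tree type and traversal are invented; a real version would live inside the marshaler). Sharing is detected by physical identity, and only string nodes are looked up, so non-string nodes pay no table cost:

```ocaml
(* Illustrative sketch: detect sharing only for string nodes,
   by physical identity (==). Types invented for this example. *)
type t = Leaf of string | Node of t * t

(* Collect the distinct (physically equal) strings reachable from v;
   a string seen twice would be marshaled once and back-referenced. *)
let rec shared_strings acc = function
  | Leaf s ->
      if List.exists (fun s' -> s' == s) acc then acc else s :: acc
  | Node (a, b) -> shared_strings (shared_strings acc a) b
```

A production version would use a proper identity hash table rather than a list scan, but the point stands: the sharing table only ever holds strings.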
You're welcome. Actually, I remember stealing this idea from SML/NJ. An SML/NJ executable was really a dump of the heap (the code for functions was stored in the heap too), in a format that could be read back in memory and just relocated to recreate the heap.
I warned about that in #9353 with my own synthetic benchmark, so this is as expected. I managed to reduce the slowdown by at least one order of magnitude from the original proposal, but that's the best we can achieve, I'm afraid. The march of progress towards Multicore OCaml was deemed more important than those slowdowns.
For RPC systems that send lots of small messages, marshaling with the |
@xavierleroy wrote:
In "defence" of multicore OCaml taking priority here, I believe it'll eventually remove the need for marshalling as a mechanism for shared access to datastructures from multiple processes. Domains-only multicore, as proposed for OCaml 5.0, establishes a semantics for parallel access to the OCaml heap with a well-defined behaviour for what to do in the presence of data races. This parallel access does not have to be restricted to the same address space -- it should be reasonably straightforward to add support for a Domain interface that spawns a process, maps some shared memory to the main mutator process to coordinate GC activity, and subsequently shares a mapping for a portion of the OCaml heap that is suitable for parallel access (without the need for any marshalling across them). The synchronisation costs between processes will be higher than from same-process threads, but hopefully not unbearably so. The multicore team is laser focussed on getting to 5.0 right now so we don't have time to immediately conduct these experiments, but we'd be happy to advise on anyone who does wish to prototype this. If noone does, we'll definitely get around to it at some point post-OCaml 5.0. |
@avsm, I might be interested and know others who might be. Who should I contact? |
Feel free to email me. |
keywords: high performance computing, parallel programming, inter-process communication, Amdahl's law.
The goal of the proposed feature is to avoid the overhead of
marshaling/unmarshaling OCaml values so that they can be shared between
several processes running on the same computer.
The proposed feature should allow better scalability of OCaml parallel programs
on multi-core computers.
Here is a loose specification:

- The SHM should not be under the control of any GC (for performance reasons).
- It is a kind of region, with a maximum size decided at creation time.
- It is initialized once. It might be destroyed (all at once) and reinitialized in the future (SHM-allocated memory reuse).
- The writer can write into it as long as there is enough space left to hold more values.
- After initialization, a reader can read any value, anywhere in it (random access).
- The access policy is one writer, several (potentially simultaneous and overlapping) readers.
- Access to the SHM is not protected by any locking mechanism (the programmer must synchronize their processes correctly, in some other, external way).
- If a reader process reads a SHM which is not yet initialized or is being overwritten, it would be nice (but not mandatory, especially if performance would suffer from this property) if it crashed right away.
A possible road map:

- Basic: support reading/writing any "basic type" (bool, char, int, float, string) to a SHM.
- Intermediate: support any data structure or type the user might have defined (records, enumerations, etc.).
- Advanced: support functions which are declared as shared-memory-initializing, i.e. the return value of the function is an initialized SHM (no "deep copy" to a SHM is required).
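The "Basic" step could be prototyped today with a fixed layout on a char Bigarray (which could itself be backed by shared memory, as in Parmap). Here is a hypothetical sketch (all names invented) of a little-endian encoding for an int64, which also covers floats via `Int64.bits_of_float`:

```ocaml
(* Hypothetical sketch of the "Basic" step: fixed little-endian layout
   for 64-bit values in a char Bigarray (the kind used by mapped files). *)
type buf =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

let set_int64 (a : buf) off (x : int64) =
  for i = 0 to 7 do
    (* Extract byte i (little-endian) and store it. *)
    let byte = Int64.to_int (Int64.shift_right_logical x (8 * i)) land 0xFF in
    a.{off + i} <- Char.chr byte
  done

let get_int64 (a : buf) off : int64 =
  let r = ref 0L in
  for i = 7 downto 0 do
    (* Accumulate bytes from most to least significant. *)
    r := Int64.logor (Int64.shift_left !r 8) (Int64.of_int (Char.code a.{off + i}))
  done;
  !r
```

Bool and char fit in one byte; a string would need a length prefix plus its bytes. The "Intermediate" and "Advanced" steps are where GC expertise becomes necessary, as noted below.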
Related projects that could offer some of these features:
Such a functionality should be designed and maintained by people who have Gc
expertise.
Interested users are all the users of the opam packages:
parmap, parany, forkwork, hack_parallel, lwt-parallel (maybe),
rpc_parallel (maybe).
Regards,
Francois.
PS: I have really no idea of how this could/should be supported under windows.