
[feature request] support for reading/writing OCaml values to shared memory #10272

Closed

UnixJunkie opened this issue Mar 5, 2021 · 20 comments

@UnixJunkie
Contributor

UnixJunkie commented Mar 5, 2021

keywords: high-performance computing, parallel programming,
inter-process communication, Amdahl's law.

The goal of the proposed feature is to avoid the overhead of
marshaling/unmarshaling OCaml values so that they can be shared between
several processes running on the same computer.
The proposed feature should allow better scalability of OCaml parallel programs
on multi-core computers.

Here is a loose specification:

The SHM should not be under the control of any GC (for performance reasons).
It is a kind of region, with a maximum size decided at creation time.

It is initialized once. It might be destroyed (at once) and reinitialized
in the future (SHM-allocated memory reuse).

The writer can write to it as long as there is enough space left to hold
more values.

After initialization, a reader can read any value from it, anywhere (random access).

The access policy is one writer, several (potentially simultaneous
and overlapping) readers.

Access to the SHM is not protected by any locking mechanism (the programmer
must synchronize their processes correctly by some other, external means).

If a reader process reads an SHM which is not yet initialized or is being
overwritten, it would be nice (but not mandatory, especially if performance
would suffer from this property) if the process crashed right away.

A possible road map:

Basic: support reading/writing any "basic type" (bool, char, int, float,
string) to an SHM; see the interface sketch after this road map.

Intermediate: support any data structure or type the user might have defined
(records, enumerations, etc.).

Advanced: support functions which are declared as shared-memory-initializing,
i.e. the return value of the function is an initialized SHM (no "deep copy" into an SHM is required).
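
A purely illustrative sketch of what the "Basic" level could look like; the module and function names below are hypothetical, not a proposed API:

```ocaml
(* Hypothetical interface for the "Basic" road-map level above. *)
module type SHM = sig
  type t
  val create : name:string -> max_size:int -> t  (* size fixed at creation *)
  val destroy : t -> unit         (* destroy at once; region may be reused *)
  val write_bool : t -> bool -> unit     (* appends; fails when region full *)
  val write_int : t -> int -> unit
  val write_float : t -> float -> unit
  val write_string : t -> string -> unit
  val read_int : t -> pos:int -> int     (* random access, once initialized *)
  val read_float : t -> pos:int -> float
  val read_string : t -> pos:int -> string
end
```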

Related projects that could offer some of these features:

  • ocamlnet (its SHM is GC-controlled, though)
  • ancient (abandoned and unmaintained)

Such functionality should be designed and maintained by people with GC
expertise.

Interested users are all the users of the opam packages:
parmap, parany, forkwork, hack_parallel, lwt-parallel (maybe),
rpc_parallel (maybe).

Regards,
Francois.

PS: I have really no idea how this could/should be supported under Windows.

@UnixJunkie
Contributor Author

UnixJunkie commented Mar 5, 2021

Some operations that would be needed in the Gc module (signatures sketched below):

  • "deep_copy": append a value from the Gc-managed memory to an SHM (or clearly fail to do so)
  • "deep_read": read the value at a given index from the SHM into the process's own Gc-managed memory.
    At most, the overhead should be that of copying memory areas (the poor man's/reference implementation being marshaling to/from the SHM).
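
A hypothetical signature sketch for these two operations, as an .mli-style fragment; Shm.t is an assumed abstract region type, and everything here is illustrative:

```ocaml
(* Hypothetical additions to the Gc module; [Shm.t] is assumed. *)
val deep_copy : 'a -> Shm.t -> int
(** Append a deep copy of a Gc-managed value to the SHM and return the
    index at which it was written; fail clearly if the region is full. *)

val deep_read : Shm.t -> int -> 'a
(** Rebuild, in the process's own Gc-managed heap, the value stored at
    the given index (unsafe polymorphism, as with Marshal.from_bytes). *)
```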

@dra27
Member

dra27 commented Mar 5, 2021

I’m not clear how “The SHM should not be under the control of any GC” leads to “Such functionality should be designed and maintained by people with GC expertise”? In fact, the first statement leaves me unclear as to why this would need to live in the compiler.

What do you envisage this offering which multicore OCaml will not (other than there being no GC)?

@alainfrisch
Contributor

In my opinion, this would be better left to an external library.

That said, it could be interesting to see if we could make a variant of the generic marshaler/demarshaler more suited to this use case, perhaps not supporting "custom blocks" but putting more emphasis on speed. Could we, for instance, make it so that reading (from SHM) would just amount to blitting a contiguous block directly into the OCaml heap, plus a linear pass to relocate all pointers?

@avsm
Member

avsm commented Mar 5, 2021

The goal of the proposed feature is to avoid the overhead of marshaling/unmarshaling OCaml values so that they can be shared between several processes running on the same computer. The proposed feature should allow better scalability of OCaml parallel programs on multi-core computers.

...is pretty much the sole point of the multicore OCaml effort that is reported on regularly here. Have you: (i) looked at it and decided it won't solve your use case, (ii) need something on a particular time schedule, or (iii) ignored it?

@alainfrisch
Contributor

FWIW, we (LexiFi) rely a lot on multi-process parallelism, and it is unlikely that we would replace this architecture with multi-core. The reasons to stick to a multi-process approach:

  • Reliance on global state, in particular hash-consing of algebraic terms (through weak tables), which would require a global lock (and probably impact performance too much).
  • More protection between processes: being able to properly kill one task, monitor its memory usage, etc.
  • A computation model closer to distributed computation, where a shared heap is just not an option.

@gasche
Member

gasche commented Mar 5, 2021

Reliance on global state, in particular hash-consing of algebraic terms (through weak tables), which would require a global lock (and probably impact performance too much).

But then how do you synchronize the processes that touch this global state today? I would assume that you either already do synchronization (maybe Multicore could provide similar synchronization patterns with equal or better performance) or (weirder) that each process has its own hashcons pool, in which case you wouldn't need a lock.

@alainfrisch
Contributor

Each process has its own hash-consing (we guarantee maximal sharing within each process, of course), and those terms are serialized "manually", to recreate sharing upon import.
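
For context, a minimal per-process hash-consing sketch using a weak table (the term type and helper names are invented for illustration):

```ocaml
(* Hash-consing via a weak table: each process keeps its own table,
   so maximal sharing holds within a single process only. *)
type term = Var of string | App of term * term

module H = Weak.Make (struct
  type t = term
  let equal a b = match a, b with
    | Var x, Var y -> String.equal x y
    | App (f1, a1), App (f2, a2) -> f1 == f2 && a1 == a2  (* subterms already shared *)
    | _ -> false
  let hash = Hashtbl.hash
end)

let table = H.create 1024
let hashcons t = H.merge table t  (* return the canonical representative *)
let var x = hashcons (Var x)
let app f a = hashcons (App (f, a))
```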

The analogous solution with multicore, IIUC, would be to use a thread-local hash-consing table (if thread-local variables are supported) or to pass a context around; but how would one guarantee that a term created in one thread is never passed to another thread? The process barrier guarantees that one cannot make this mistake. Will multi-core have some ability to restrict some objects from crossing thread boundaries?

We also have plenty of memoization tables, which, for the same reason (they contain algebraic terms) would need to be made thread-local. Plus various things kept in global variables to avoid having to do heavy plumbing through many layers of code.

@gasche
Member

gasche commented Mar 5, 2021

Will multi-core have some ability to restrict some objects from crossing thread boundaries?

Not to my knowledge. (If I understand correctly, this is more-or-less possible in Rust, using a home-grown Copy trait for the type, but it requires a lot of type-level machinery we don't have.) But Multicore also does not dictate your concurrency abstraction, so you could use one (for example domain-sized actors with explicit message-passing) that enforces this discipline. If you wrote a program that mixed two different concurrency libraries/approaches, there might be a risk of breaking this discipline, but:

  1. all uses of mutable state have this problem (you need to either guarantee isolation or lock, and not mix up which is which), which suggests that this is a problem programmers would be aware of, and
  2. the problem already exists today with multi-processing: you could shoot yourself in the foot by mixing two IPC approaches, if one of them just passes marshalled data around (or uses shared memory to the same effect) and does not preserve the hash-consing invariant.

@alainfrisch
Contributor

alainfrisch commented Mar 5, 2021

Yes, but with multi-process, you need to explicitly pass the data around. With multi-core, any global state could lead to subtle failures if not properly protected (through locks, or by turning it into thread-local variables).

Anyway, rewriting the whole code base to completely change the model and avoid all global state would be a huge effort (plus the other arguments I mentioned above), so I don't see it happening anytime soon. An incremental improvement to make it faster to serialize OCaml data structures across processes (when one knows those pieces of data are "safe") would be useful, though.

@bluddy
Contributor

bluddy commented Mar 5, 2021

I think you get enough bang for the buck by just implementing this as a library for shared Bigarrays across processes, which is essentially what Python offers in its multiprocessing libraries. For more complex types, one can use serialization into shared Bigarrays. A fancier implementation would allow for primitive-type hashtables using reference counting. In fact, I would be delighted to find out whether such a library is already available.
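
As a minimal sketch of this approach, assuming a POSIX system (the file path is illustrative): the mapping is created before fork, so parent and child share the same pages.

```ocaml
(* Requires the unix library.  Map a file into a shared Bigarray, fork,
   and observe the child's writes from the parent. *)
let () =
  let path = "/dev/shm/ocaml_shared_demo" in
  let fd = Unix.openfile path [Unix.O_RDWR; Unix.O_CREAT] 0o600 in
  let n = 1024 in
  Unix.ftruncate fd (n * 8);                       (* room for n float64s *)
  let a =
    Bigarray.array1_of_genarray
      (Unix.map_file fd Bigarray.float64 Bigarray.c_layout true [| n |])
  in
  match Unix.fork () with
  | 0 ->
      (* Child: fill the shared array, then exit. *)
      for i = 0 to n - 1 do a.{i} <- float_of_int i done;
      exit 0
  | pid ->
      ignore (Unix.waitpid [] pid);
      Printf.printf "a.{42} = %g\n" a.{42};        (* prints 42 *)
      Unix.close fd
```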

@xavierleroy
Contributor

Right now, the core OCaml dev team is busy with merging Multicore OCaml into the main line. Multicore OCaml is the only approach to parallelism in OCaml that we are going to actively support in the near future, so there will be no work on alternate, process-based approaches as outlined here.

This said, and as exemplified by the work at LexiFi, process-based approaches to parallelism can certainly be developed as libraries and don't need special support in the core system. To me, it looks like the Ancient library does most of what @UnixJunkie describes. Maybe Ancient needs to be revived and extended, and I think the switch to the "no naked pointers" runtime will simplify Ancient, but again this can be done as an external library.

The one difficulty I see with what @UnixJunkie and @alainfrisch describe is how to ensure that the shared memory segment or shared file is mapped in memory at the same address in all the processes. Otherwise, there can be no pointers in the shared data. Of course, you can give an absolute address to shmat or to mmap, but what guarantees that this address is not in use already?

This said, I'm going to close this issue, as it is not an issue concerning the core OCaml system, just a wish to develop something that can be developed outside the core system. Feel free to continue this discussion on discuss.ocaml.org.

@UnixJunkie
Contributor Author

Ok, thanks to all for the feedback.

@bluddy I am not aware of a "fancier" implementation. E.g. Parmap does Marshal to/from mmapped Bigarrays.

@alainfrisch
Contributor

The one difficulty I see with what @UnixJunkie and @alainfrisch describe is how to ensure that the shared memory segment or shared file is mapped in memory at the same address in all the processes. Otherwise, there can be no pointers in the shared data.

I did not suggest making values in the shared memory directly accessible, but rather optimizing the generic marshaler for the use case where values are passed/shared across processes using shared memory, which might lead to different performance tradeoffs (e.g. less emphasis on size, more on data alignment).

I did a very quick experiment with a different marshaling strategy, where the marshaler produces a contiguous piece of data that can be directly blitted back into the heap by the demarshaler, simply adjusting all non-immediate fields by a fixed offset (this blit+relocation can be done with a simple linear pass). Early experiments suggest a speedup of around 30%-40% for the demarshaling part (compared to the current demarshaler), but a similar slowdown for the marshaling part (and much more compared to the No_sharing mode, which is itself around 4x faster than the mode with sharing on that micro-benchmark), and a very significant increase in the size of the marshaled data (around 9x), which might itself become problematic in some use cases. Nevertheless, there might be some uses for such a strategy, when a single piece of data is written once and read many times; and perhaps the marshaling part can be optimized (currently, it uses two passes) to avoid the slowdown. (Ref: trunk...alainfrisch:afrisch_obj_blit)
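
To make the relocation idea concrete, here is a hedged sketch of the linear pass (not the actual experiment's code), assuming the image is a contiguous sequence of heap blocks whose pointer fields are stored relative to the image base, with the usual OCaml word layout (low bit set for immediates, tag in the low 8 bits of a header, size in the bits above bit 10):

```ocaml
(* Illustrative relocation pass over a marshaled image held as an
   array of words: header word, then [wosize] field words, repeated. *)
let relocate (image : nativeint array) ~(base : nativeint) =
  let no_scan_tag = 251 in
  let i = ref 0 in
  while !i < Array.length image do
    let header = image.(!i) in
    let wosize = Nativeint.to_int (Nativeint.shift_right_logical header 10) in
    let tag = Nativeint.to_int (Nativeint.logand header 0xFFn) in
    incr i;
    if tag < no_scan_tag then
      (* Scan fields: words with the low bit clear are pointers; turn
         image-relative offsets into absolute addresses. *)
      for j = !i to !i + wosize - 1 do
        let w = image.(j) in
        if Nativeint.logand w 1n = 0n then
          image.(j) <- Nativeint.add w base
      done;
    (* String/float blocks (tag >= no_scan_tag) are skipped whole. *)
    i := !i + wosize
  done
```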

@xavierleroy
Contributor

I did a very quick experiment with a different marshaling strategy, where the marshaler produces a contiguous piece of data that can be directly blitted back into the heap by the demarshaler, simply adjusting all non-immediate fields by a fixed offset

That's how the first marshaler for Caml Light worked. Marshaled data was very big, causing a switch to a compact encoding similar to the one used in OCaml. I'm not surprised you're seeing no significant improvement w.r.t. compact marshaling, even when using shared memory (instead of a pipe or a socket) to transmit the marshaled data.

There should be ways to reserve a given range of addresses in several processes so that they can share some memory at these addresses. But I'm afraid these ways are specific to the OS and the C library used.

@alainfrisch
Contributor

Thanks @xavierleroy for the historical perspective!

Speaking as a complete outsider to this topic, I wonder if prefetching hacks (such as in #10195) could somehow speed up the marshaling side.

Also, my quick experiment allowed me to confirm that #9353 introduced a significant slowdown on synthetic benchmarks; that's not a surprise, but something to keep in mind for people relying heavily on marshaling for cross-process communication. This makes me wonder whether it could be useful to have an intermediate mode where only some nodes are checked for sharing (for instance, I have cases in mind where keeping sharing only on strings would be useful, and restricting the check to string objects should significantly reduce the overhead of sharing detection).

@xavierleroy
Contributor

Thanks @xavierleroy for the historical perspective!

You're welcome. Actually, I remember stealing this idea from SML/NJ. An SML/NJ executable was really a dump of the heap (the code for functions was stored in the heap too), in a format that could be read back in memory and just relocated to recreate the heap.

Also, my quick experiment allowed me to confirm that #9353 introduced a significant slowdown on synthetic benchmarks;

I warned about that in #9353 with my own synthetic benchmark, so this is as expected. I managed to reduce the slowdown by at least one order of magnitude from the original proposal, but that's the best we can achieve, I'm afraid. The march of progress towards Multicore OCaml was deemed more important than those slowdowns.

that's not a surprise, but something to keep in mind for people relying heavily on marshaling for cross-process communication.

For RPC systems that send lots of small messages, marshaling with the No_sharing flag is recommended, and should cause no slowdown. If you're exchanging big data structures where internal sharing matters, perhaps the slowdown is tolerable compared with the time it takes to build the data structure in the first place. If you're repeatedly marshaling the same big data structure with internal sharing, maybe you need to rethink your application.
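
For illustration, the No_sharing mode is selected with a flag at the call site (standard Marshal API):

```ocaml
(* No_sharing skips sharing/cycle detection entirely: suitable for small
   messages that contain no cycles and no sharing worth preserving. *)
let encode v = Marshal.to_bytes v [Marshal.No_sharing]
let decode b = Marshal.from_bytes b 0
```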

@avsm
Member

avsm commented Mar 16, 2021

@xavierleroy wrote:

The march of progress towards Multicore OCaml was deemed more important than those slowdowns.

In "defence" of multicore OCaml taking priority here, I believe it'll eventually remove the need for marshalling as a mechanism for shared access to datastructures from multiple processes. Domains-only multicore, as proposed for OCaml 5.0, establishes a semantics for parallel access to the OCaml heap with a well-defined behaviour for what to do in the presence of data races.

This parallel access does not have to be restricted to the same address space -- it should be reasonably straightforward to add support for a Domain interface that spawns a process, maps some shared memory to the main mutator process to coordinate GC activity, and subsequently shares a mapping for a portion of the OCaml heap that is suitable for parallel access (without the need for any marshalling across them). The synchronisation costs between processes will be higher than from same-process threads, but hopefully not unbearably so.

The multicore team is laser-focussed on getting to 5.0 right now, so we don't have time to immediately conduct these experiments, but we'd be happy to advise anyone who does wish to prototype this. If no one does, we'll definitely get around to it at some point post-OCaml 5.0.

@gadmm
Contributor

gadmm commented Mar 16, 2021

@avsm, I might be interested and know others who might be. Who should I contact?

@avsm
Member

avsm commented Mar 17, 2021

Feel free to email me.
