
Memory allocation alignment, and garbage collector compatibility #623

Open
KTSnowy opened this issue Jul 19, 2023 · 6 comments

KTSnowy commented Jul 19, 2023

Hi, we've been looking for a memory allocator for our COBOL runtime's garbage collector, and had a few questions.

What is the memory alignment for the pointers returned by snmalloc?

Would it be possible to configure snmalloc in a way that a thread can request memory from another thread?

I found the message passing architecture really interesting, and thought I'd try using it in a concurrent reference counting garbage collector (running in a separate thread). I see that remote deallocations are possible with snmalloc, so it would be interesting if remote allocations through message passing were also possible.

That could allow some interesting scenarios where the garbage collector runs in its own thread, and user threads request memory from it and then return it.


mjp41 commented Jul 19, 2023

What is the memory alignment for the pointers returned by snmalloc?

So the allocator's design currently provides very strong alignment. Any allocation is guaranteed to be aligned to the largest power of two that divides its size. That is,

size | alignment
  16 |  16
  32 |  32
  48 |  16
  64 |  64
  80 |  16
  96 |  32
 112 |  16
 128 | 128
 160 |  32
 ... | ...

Would it be possible to configure snmalloc in a way that a thread can request memory from another thread?

There is no way to do this. We also do not intend to add this. Freeing memory is something that can be done asynchronously, but allocating is not. If a call to free is delayed, then the program will use more memory. If a call to malloc is delayed, then the allocating thread will have to wait for the result. This means you would have one thread waiting on another to complete a task. This would not be great for performance.

Why would you want this?

The initial design was built around supporting garbage collection, where GC threads processed some garbage and the original allocating threads could then use the memory again. Hence, the mutator threads did all the allocation and processed the remote frees, while the GC threads found garbage and returned it to the original mutator thread.


nwf-msr commented Jul 19, 2023

To expand a bit on what @mjp41 says...

While there's no explicit cross-thread allocation request (which, again, might have to block the requester until the requestee wakes up and handles the message), memory isn't necessarily doomed to be pinned to any one (thread's) allocator "frontend" forever. If an allocator determines that it has significant amounts of free address space, in the shape of completely empty slabs (~16KiB spans), it will pass that space back to a shared global pool managed by the "backend".

One purpose of the remote messages is to handle inter-thread producer/consumer workloads, where, say, I allocate objects and hand them to you to process and free. If your workload doesn't otherwise allocate and mine doesn't otherwise free, then in a traditional thread-cache allocator, my cache is then constantly being refilled and yours is constantly being evicted, with neither really buying us much. The "return to home" nature of snmalloc's queues basically means that pipelines (or even DAGlines) have steady states that don't need to acquire global locks within the allocator. However, this does mean that the allocator frontend must periodically pause to process its message queue, and maybe that's sometimes not ideal.

It should be easily possible, in terms of LoC added to or around snmalloc, to change the notion of "home" for a slab of objects: it's just swinging a pointer in the PageMap. (We already support, I think, slabs not otherwise indexed by a frontend -- though that may have changed recently with the introduction of "laden" slabs(?). In any case, I don't think it'd be a blocker.) What this could enable, if there were compelling reason, is for an allocator frontend to relinquish control of a slab -- probably a mostly-full or mostly-empty one -- by transferring it to another thread, such as one of your GC threads, for example. This could let an allocator frontend make use of slabs while expedient, probably only occasionally needing to pay attention to its message queue, and pass less expeditiously useful slabs off to the GC, allowing the GC threads to perform local frees, and then have the GC pass the slab back to the backend pool for redistribution once clean. @mjp41 has thought a fair bit about making the backend support lockless concurrency, which might make this story even more attractive.

There's also, I think, a fair bit of design space around the return of slabs to the backend. One could imagine tuning the current policy for particular workloads, enabling per-frontend policies, adding global information, &c. There's been some call for this in very address-space-constrained environments (#615), where disabling frontend caching and returning slabs to the shared pool immediately had some modest performance cost but, much more importantly, kept the system from OOMing.


KTSnowy commented Jul 19, 2023

Why would you want this?

COBOL has a different concurrency model from the traditional multithreaded one. The language instead uses multiple separate processes running concurrently, exchanging data through message passing. This essentially guarantees at the language level that there will only be a single user thread per process.

In this case it could be useful to implement a system where allocations are done by the single user thread and deallocations are done by the GC thread, with the user thread borrowing memory from the GC and returning it to a deallocation queue monitored by the GC thread.

This might improve the performance of an RC-based GC, by deferring deallocations to another thread.

snmalloc was the only allocator I found with a similar idea to this.

Any allocation is guaranteed to be the aligned by the largest power of two that can divide the size.

Is there a way to guarantee it to always be 16-byte aligned?


mjp41 commented Jul 19, 2023

Why would you want this?

COBOL has a different concurrency model from the traditional multithreaded one. The language instead uses multiple separate processes running concurrently, exchanging data through message passing. This essentially guarantees at the language level that there will only be a single user thread per process.

In this case it could be useful to implement a system where allocations are done by the single user thread and deallocations are done by the GC thread, with the user thread borrowing memory from the GC and returning it to a deallocation queue monitored by the GC thread.

This might improve the performance of an RC-based GC, by deferring deallocations to another thread.

snmalloc was the only allocator I found with a similar idea to this.

Interesting you want to separate the "frontend" allocator into two parts that additionally communicate via message passing. This would be a reasonable amount of work, but possible. I'm concerned we might incur an atomic instruction on the fast path of free to make this work.

Any allocation is guaranteed to be the aligned by the largest power of two that can divide the size.

Is there a way to guarantee it to always be 16-byte aligned?

The minimum allocation size is two pointer-sized words, so on any 64-bit platform allocations will be 16-byte aligned.

static constexpr size_t MIN_ALLOC_SIZE = 2 * sizeof(void*);

Nothing will be aligned to less than the minimum allocation size, as allocation sizes are rounded up to a multiple of it.


nwf-msr commented Jul 19, 2023

Is there a way to guarantee it to always be 16-byte aligned?

If you mean "at least 16-byte aligned", then that's already the case on most platforms, because the smallest supported size class has to be at least 2 * sizeof(void*), and the smallest size classes are denormalized: while the usual cadence is four divisions between powers of two, the first four classes are multiples of the smallest size. So if sizeof(void*) is 8, the classes are 16, 32, 48, 64, then 80, 96, 112, 128, 160, 192, ...; see

constexpr size_t from_exp_mant(size_t m_e)

for the gory details. In principle, larger minimums are possible by tweaking

static constexpr size_t MIN_ALLOC_SIZE = 2 * sizeof(void*);

so you could make 32-bit platforms also impose 16-byte alignment.

If you mean that the allocator could source, say, a 64-byte object that was 16- but not 32-byte aligned, then "probably not". The design of snmalloc's heap is that large chunks of address space (~16KiB) are given a single size class, so object alignment is directly correlated with object size.

@davidchisnall

COBOL has a different concurrency model from the traditional multithreaded one. The language instead uses multiple separate processes running concurrently, exchanging data through message passing. This essentially guarantees at the language level that there will only be a single user thread per process.

You might like to take a look at the use of snmalloc in https://github.com/Microsoft/Verona-sandbox. This uses a model of a single parent process with many children, and a shared heap between the parent and children, with fast allocation and deallocation on the shared heap from both (including allocating in one and freeing in the other). We use this to build a higher-level message-passing model, where objects are allocated in the shared heaps and we just pass pointers, so we get one-copy I/O between parent and child (or zero-copy if we're not doing defensive copies).
