Eager Decommits in Medium Sized Slabs #484

Open
SchrodingerZhu opened this issue Mar 20, 2022 · 4 comments

Comments

@SchrodingerZhu
Contributor

This was reported by one of my friends (@misakikasumi). In his application, he uses snmalloc to manage 16-bit 1080p frame buffers, each of which takes 1/4 of a medium-sized slab. Based on his observation, this layout causes a very frequent de-committing pattern and introduces a slowdown on Windows compared with the system allocator or mimalloc.

By design, LocalCache only stores free lists for small size classes. Would it be possible to introduce some additional bounded free lists for relatively large objects, so that we can avoid eager de-committing?


@misakikasumi would you like to share more details?

@ghost

ghost commented Mar 20, 2022

@SchrodingerZhu Thank you for opening this issue. I would like to mention some details about my situation. I'm doing multi-threaded video processing, and the typical allocations are 16-bit 1080p frame buffers (or more specifically, buffers of frame planes; 4,147,200 bytes each). There are also frame buffers of other byte lengths. In my application, these frame buffers are allocated in arbitrary worker threads and deallocated in one GC thread. Considering the performance of returning freed space from the GC thread back to the worker threads on deallocation, I adopted snmalloc as the allocator for frame buffers.

I first tried snmalloc1. Because snmalloc's default chunk size is not big enough for frame buffers, I configured snmalloc to use 16 MB large chunks. Under this configuration, one mediumslab can hold four 16-bit 1080p frame buffers; it holds fewer frame buffers of larger sizes. In the current implementation of snmalloc, a mediumslab is deallocated once it becomes empty and goes onto the global large object stack with synchronization. Given that in GC the frame buffers are deallocated in batches, and one mediumslab can only hold four of them, mediumslabs will frequently become empty and trigger large deallocations. This scenario is not unique to my application: compared with superslabs, mediumslabs contain fewer objects because those objects are larger, so mediumslabs naturally become empty more easily when medium objects are allocated and deallocated frequently.
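For reference, a quick arithmetic check of the "four frame buffers per 16 MB chunk" figure (assuming a 1920x1080 plane at 2 bytes per pixel, and treating 16 MB as 16 MiB):

#include <cstddef>

// Back-of-envelope check of the sizes quoted above.
constexpr std::size_t plane_bytes = 1920 * 1080 * 2;  // 4,147,200
constexpr std::size_t chunk_bytes = 16 * 1024 * 1024; // 16,777,216
static_assert(plane_bytes == 4'147'200, "matches the size quoted above");
static_assert(4 * plane_bytes <= chunk_bytes, "four planes fit in one chunk");
static_assert(5 * plane_bytes > chunk_bytes, "but a fifth does not");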

I also tried snmalloc2. I configured snmalloc to use a 16 MB large buddy. It decommits a 16 MB block once it becomes empty, resulting in even worse performance.

My idea is that maybe we could add a freelist of mediumslabs (or large buddy blocks) with limited capacity. When a mediumslab becomes empty, it would first go into this local freelist instead of triggering an immediate global large deallocation/decommit.
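To make the idea concrete, here is a minimal sketch of what such a bounded local freelist could look like. The names (SlabCache, push_empty_slab, pop_slab) and the capacity are hypothetical, not snmalloc APIs; this only illustrates the shape of the mechanism:

#include <array>
#include <cstddef>

// Hypothetical bounded cache of empty mediumslabs. An empty slab is parked
// here instead of being pushed to the global large object stack (and
// decommitted); only when the cache is full does the caller release a slab.
template<std::size_t Capacity = 4>
class SlabCache
{
  std::array<void*, Capacity> slabs{};
  std::size_t count = 0;

public:
  // Returns nullptr if the slab was cached; otherwise returns the slab
  // that the caller should now release/decommit globally.
  void* push_empty_slab(void* slab)
  {
    if (count < Capacity)
    {
      slabs[count++] = slab;
      return nullptr;
    }
    return slab; // cache full: fall back to the existing global path
  }

  // Reuse a cached slab for a new medium allocation, if one is available.
  void* pop_slab()
  {
    return (count > 0) ? slabs[--count] : nullptr;
  }
};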

@nwf
Contributor

nwf commented Mar 20, 2022

I think (but @mjp41 is actually the authority here) that, for snmalloc2, there are several things conspiring to hurt performance here: not only does it notify_not_using everything exiting the per-thread large buddy allocator and entering the backend global (synchronized) large buddy allocator, it also disassociates the per-thread allocator from that address space at the same time, which will require (here) updating 16M/MIN_CHUNK_SIZE = 16M/16K = 1024 pagemap entries. That's a fairly sizeable critical section, so it is likely to increase contention on the GlobalRange's flag lock, which, being just a spinlock, probably isn't coping particularly well.

I can think of a few options that might be worth pursuing:

  1. Introduce a LargeMagazine range type that just holds on to a few (16M) regions before forwarding them to its parent range, and use this in the frontend allocators, between their SmallBuddyRange and LargeBuddyRange (a rough sketch follows this list).
  2. Introduce a notion of "slop" into the LargeBuddyRange where it can (optionally) hold on to ranges larger than what it would usually push up to its parent.
  3. Revisit the decision to put the RemoteAllocator* directly in the pagemap and instead have it inside the associated Metaslab structure. This would serve to reduce the O(chunks) work done under the GlobalRange flag lock to O(1), but it comes at the cost of an additional pointer indirection on the dealloc fast path.

This is not meant to be an exclusive list, just food for thought.
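For option 1, a rough sketch of what a LargeMagazine range might look like. The alloc_range/dealloc_range shape only loosely mirrors snmalloc's Range concept; the slot count, region size, and raw void* interface are simplifying assumptions, not the actual API:

#include <array>
#include <cstddef>

// Illustrative magazine range: holds up to Slots recently freed regions of
// RegionSize bytes and hands them back out before asking the parent range,
// so they stay committed and never reach the global (synchronized) range.
template<
  typename ParentRange,
  std::size_t Slots = 4,
  std::size_t RegionSize = 16 * 1024 * 1024>
class LargeMagazine
{
  ParentRange parent{};
  std::array<void*, Slots> magazine{};
  std::size_t held = 0;

public:
  void* alloc_range(std::size_t size)
  {
    if (size == RegionSize && held > 0)
      return magazine[--held]; // reuse locally, no parent involvement
    return parent.alloc_range(size);
  }

  void dealloc_range(void* base, std::size_t size)
  {
    if (size == RegionSize && held < Slots)
    {
      magazine[held++] = base; // keep committed, skip notify_not_using
      return;
    }
    parent.dealloc_range(base, size); // overflow: forward to the parent
  }
};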

@mjp41
Member

mjp41 commented Mar 20, 2022

@nwf the pagemap is updated via the BackendAllocator, which does not hold the global lock.

My plan was to use the code in ChunkAllocator. This currently performs time-based scrubbing of chunks. That code was not refactored in the move to Ranges, but it needs to be for precisely this case. It would keep a local cache of the larger sizes, and only push them centrally once a certain amount of time has passed. I think this would fit this use case perfectly.
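As a rough illustration of the time-based approach (the names and structure here are assumptions for exposition, not the actual ChunkAllocator code):

#include <chrono>
#include <cstddef>
#include <vector>

// Empty chunks are held locally with a timestamp and only pushed to the
// central/global pool (where they get decommitted) once they have sat
// unused for longer than a decay period.
struct DecayingChunkCache
{
  using Clock = std::chrono::steady_clock;
  struct Entry { void* chunk; Clock::time_point freed; };

  std::vector<Entry> local;
  std::chrono::milliseconds decay{500};

  void on_chunk_empty(void* chunk)
  {
    local.push_back({chunk, Clock::now()});
  }

  // Called periodically (e.g. on an allocation slow path): returns chunks
  // old enough to be handed back centrally.
  std::vector<void*> collect_expired()
  {
    std::vector<void*> expired;
    auto now = Clock::now();
    auto it = local.begin();
    while (it != local.end())
    {
      if (now - it->freed >= decay)
      {
        expired.push_back(it->chunk);
        it = local.erase(it);
      }
      else
        ++it;
    }
    return expired;
  }
};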

@ghost

ghost commented Mar 22, 2022

BTW, I'm measuring the cost of decommits on Windows and trying to find alternatives. I tested snmalloc2 with 4 different PALs.

Decommit

static void notify_not_using(void* p, size_t size) noexcept
{
    SNMALLOC_ASSERT(is_aligned_block<page_size>(p, size));

    BOOL ok = VirtualFree(p, size, MEM_DECOMMIT);

    if (!ok)
      error("VirtualFree failed");
}

No-op

static void notify_not_using(void* p, size_t size) noexcept {}

Reset

static void notify_not_using(void* p, size_t size) noexcept
{
    SNMALLOC_ASSERT(is_aligned_block<page_size>(p, size));

    void* r = VirtualAlloc(p, size, MEM_RESET, PAGE_READWRITE);

    if (r == nullptr)
      error("VirtualAlloc failed");
}

Offer & Reclaim

static void notify_not_using(void* p, size_t size) noexcept
{
    SNMALLOC_ASSERT(is_aligned_block<page_size>(p, size));

    auto rv = OfferVirtualMemory(p, size, VmOfferPriorityNormal);

    if (rv != ERROR_SUCCESS)
    error("OfferVirtualMemory failed");
}

template<ZeroMem zero_mem>
static void notify_using(void* p, size_t size) noexcept
{
    SNMALLOC_ASSERT(
      is_aligned_block<page_size>(p, size) || (zero_mem == NoZero));
    
    auto rv = ReclaimVirtualMemory(p, size);
    if (rv == ERROR_SUCCESS || rv == ERROR_BUSY)
      return;

    void* r = VirtualAlloc(p, size, MEM_COMMIT, PAGE_READWRITE);

    if (r == nullptr)
      report_fatal_error(
        "out of memory: {} ({}) could not be committed", p, size);
}

I measured performance (in my application, frames processed per second) and page faults (over 10,000 frames). The results are interesting:

PAL              FPS     Page Faults
Decommit         340.21  12,557,692
No-op            382.05  141,524
Reset            365.97  151,490
Offer & Reclaim  246.59  17,417,188

I'm surprised that offer & reclaim has very bad performance.
