Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GC heap hard limit for 32 bit #101024

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

gbalykov
Copy link
Member

This change enables heap hard limit on 32 bit.

Approach is similar to DOTNET_GCSegmentSize (GCConfig::GetSegmentSize), which allows to set size of segment for SOH/LOH/POH, and guarantees that there's no oveflow during computations (for example, during size_t initial_heap_size = soh_segment_size + loh_segment_size + poh_segment_size;). When DOTNET_GCSegmentSize is set on 32bit, it's rounded down to power of 2, so largest possible value of provided segment (SOH) is 2 Gb (4Mb<=soh_segment_size<=2Gb). For LOH/POH same value is divided by 2 and then also rounded down to power of 2, so largest possible value of LOH/POH segment is 1 Gb (4Mb<=loh_segment_size<=1Gb, 0<=poh_segment_size<=1Gb). So, segment size for SOH/LOH/POH never overflows, as well as initial_heap_size (except for oveflow of initial_heap_size to 0, which will lead to failed allocation later anyway).

Similar thing happens when DOTNET_GCHeapHardLimit or DOTNET_GCHeapHardLimitSOH/DOTNET_GCHeapHardLimitLOH/DOTNET_GCHeapHardLimitPOH are set on 32bit. There're limits on these values:

  1. for heap-specific limits:
    0 <= (heap_hard_limit = heap_hard_limit_oh[soh] + heap_hard_limit_oh[loh] + heap_hard_limit_oh[poh]) < 4Gb
    a) 0 <= heap_hard_limit_oh[soh] < 2Gb, 0 <= heap_hard_limit_oh[loh] <= 1Gb, 0 <= heap_hard_limit_oh[poh] <= 1Gb
    b) 0 <= heap_hard_limit_oh[soh] <= 1Gb, 0 <= heap_hard_limit_oh[loh] < 2Gb, 0 <= heap_hard_limit_oh[poh] <= 1Gb
    c) 0 <= heap_hard_limit_oh[soh] <= 1Gb, 0 <= heap_hard_limit_oh[loh] <= 1Gb, 0 <= heap_hard_limit_oh[poh] < 2Gb
  2. for same limit for all heaps:
    0 <= heap_hard_limit <= 1Gb

These ranges guarantee that calculation of soh_segment_size, loh_segment_size and poh_segment_size with alignment and round up won't overflow, as well as calculation of sum of them for allocation (overflow to 0 is allowed, same as for DOTNET_GCSegmentSize). When values specified by user with env variables don't meet the requirements above, runtime exits with CLR_E_GC_BAD_HARD_LIMIT. When allocation (with mmap on Linux) fails, runtime exits with same error as for large segment size specified with DOTNET_GCSegmentSize.

This patch doesn't enable heap hard limit on 32bit in containers (is_restricted_physical_mem), because current heap hard limit approach is to reserve one large GC heap segment with size equal to specified heap hard limit, and no new segments are reserved in future. On 64 bit in containers by default heap hard limit is set to 75% of total physical memory available, and this is fine, because virtual address space on 64 bit is much larger than actual physical size on devices. In contrast, on 32 bit virtual address space size might be the same as available physical size on device (e.g. 4 Gb for both). This means that reserving 75% of total physical mem will reserve 75% of whole virtual address space, which might be both undesirable (e.g. process later expects more available memory) and mmap on Linux will probably fail anyway.

I've ran release CLR tests on armel Linux with this patch on top of f84d33 with different limits, there seems to be no issues with 4Mb/8Mb/16Mb and 512Mb heap hard limits (some tests fail with "Out of memory"/"OOM" and "System.OutOfMemoryException", as expected, or "System.InvalidOperationException: NoGCRegion mode must be set" when NoGC region is larger than limit). Same for GCStresss 0xc and 0x3 with 4 Mb heap hard limit, and same for debug CLR tests on armel Linux with 4Mb heap hard limit.

@Maoni0 @cshung please share what you think. Thank you.

@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Apr 14, 2024
Copy link
Contributor

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@gbalykov gbalykov marked this pull request as ready for review April 14, 2024 14:32
@cshung
Copy link
Member

cshung commented Apr 14, 2024

@gbalykov, I am wondering why you would want a heap_hard_limit for 32 bit platforms?

I am working on #100380 (forgive the misleading title) that enabled the computation of committed_by* value at all times independent of heap_hard_limit.

To make heap_hard_limit work on 32 bits, we need to have an accurate way of maintaining how many bytes are currently committed, I don't think I get the calculation right there yet, and that is perhaps why the CI is failing on 32 bit platforms for now.

The failure is an assertion - so you would need to run it under CHECK or DEBUG to reproduce it. The assertion to prevent an underflow in subtraction. In release, it will run fine (and fail later because of wrong values).

I was thinking about just limiting my work to just 64 bits, but if there is a reason to, we can figure how to get the calculation right on 32 bit platforms as well.

@gbalykov
Copy link
Member Author

I am wondering why you would want a heap_hard_limit for 32 bit platforms?

It's the same as for 64bit: to be able to limit memory consumption of process, and, specifically, GC heap size. Setting the limit specifically on managed heap size allows for much better granularity than using limits for whole process with cgroups, etc. Even though that this change doesn't set default value for containerized environments (is_restricted_physical_mem), it allows to set heap limit manually, which is also very useful.

Also here's related discussion about heap limits for 32 bit: #78128.

To make heap_hard_limit work on 32 bits, we need to have an accurate way of maintaining how many bytes are currently committed

Can you share more details about this? During my anylysis of code (f84d33c) I didn't find any other related places that depend on bitness, and it seems that with this change behavior on 64 and 32 bit is pretty much the same. For example, committed_by_oh_per_heap are used only under _DEBUG && MULTIPLE_HEAPS on 64 bit even without this change, and committed_by_oh is updated the same way both on 64 and 32 bit with this change. Please correct me if I'm wrong here.

I was also able to test this change with more complex apps, and physical size of heap (Private_Dirty of GC vmas) always remained in allowed limit as expected (of course, assuming that heap is able to fit in that limit with more frequent GC, for example).

@cshung
Copy link
Member

cshung commented Apr 15, 2024

It does make sense to use heap_hard_limit to control memory usage under 32 bit platforms, and I can work with you to get this done.

To start with, the key difference between 32 bits and 64 bits is USE_REGIONS.

Here is where USE_REGIONS is defined in gcpriv.h

#if defined (HOST_64BIT) && !defined (BUILD_AS_STANDALONE) && !defined(__APPLE__)
#define USE_REGIONS
#endif //HOST_64BIT && BUILD_AS_STANDALONE

Forgive the inaccurate comments please, sorry.

USE_REGIONS is not enabled in 32 bits, and therefore the behavior on x86 or arm32 is significantly different from the 64 bit versions.

Historical Context (Good for knowledge, but feel free to skip on first read)

With respect to commit accounting, let's go back in time and see why we are doing it.

By the time we introduced heap_hard_limit, we needed to know how much memory was committed, and therefore we had a simple counter current_total_committed that is incremented during commit and decremented during virtual_decommit. That was pretty simple.

Then I introduced per object heap hard limit (#36731) by checking against committed_by_oh, it was meant to solve the problem related to initial commit for large pages support. In that change, I proposed a testing scheme by validating the number against the actual heap. The idea is that, we can count the committed bytes during commit/decommit operations, but we can also count the committed bytes by walking the heap data structures, and they should be identical. With that, we can both validate the numbers are correct and that we didn't mess up with the data structures.

By that time, the verification logic was unnecessarily complicated, and therefore we didn't merge those in.

That was before USE_REGIONS. At some point, we enabled USE_REGIONS, the verification work can be done make simpler under USE_REGIONS, so we did it. As a side-effect of that change, I introduced committed_by_oh_per_heap, which make per heap validation possible without taking locks.

Much of the verification logic was written with only USE_REGIONS in mind. I wouldn't be surprised that it does not work with segments (i.e. the code without USE_REGIONS, 32 bits in particular).

Keeping track of these number is non-trivial, at the very least, we need to take the check_commit_cs lock. Therefore, to make sure we don't regress for performance, we chose to keep those values only if heap_hard_limit is enabled.

Then it comes to RefreshMemoryLimit, a feature to allow user to specify or ask the system to redetect the system memory limits. This allow user updating container limits and also let the GC knows.

A technical challenge at that point of time is that if heap_hard_limit was not specified before the RefreshMemoryLimit call, and we establish a new one, we will need to have the various commit counter values. In USE_REGIONS, we know how to do it, but in the non-USE_REGIONS case, we don't have a good way to do it, so we just don't support it at the moment.

One simple way to get around it is to always keep the counters, which is what I am trying to do right now, we have to be careful with respect to performance (which I haven't checked yet). I am hoping it is okay.

The known problem - overlapped commit

While it appears that it is easy to just add when we commit and subtract when we decommit, it is not that simple. The problem is overlapped commits.

Under USE_REGIONS, we never call virtual_commit with a memory range that contains committed memory. Therefore the add and subtract always works.

Thanks to the forgiving OS memory management, we can call virtual_commit with a memory range that contains committed memory, and the OS will just ignore the committed part.

Under non-USE_REGIONS, however, it is known that we sometimes call virtual_commit with a memory range that contains committed memory. The simple add will double count.

The double count on it own is wrong, but is not fatal. With that happened repeatedly, however, we will leak those bytes (while memory is fine), and so we ran out of bytes when you try to commit (but in fact you do have memory).

One of the known case of overlapping commit is for mark arrays. We don't keep track of the exact bound where the mark array was committed, so any time we need to ensure mark array is available and it isn't fully committed, we commit the whole mark array range every time we need it.

And there could be other unknown cases that we just don't know in the moment.

The unknown problem - overflow?

Beyond just overlapped commit, I am seeing other problems as well.

In case where a heap hard limit is not provided, I suspect the counter overflowed. When I decommit, I am hitting an assert that we are decommitted more memory than we have, and the counter is suscipiously low at that point.

Of course, that's just a guess. There could be other things going on as well. The only thing that I am sure is the number must be wrong and we can't decommit memory that wasn't committed.

Plan of attack

Fundamentally, it is a calculation error. Unfortunately, we don't have any symptoms when the wrong happens, and failing at the end is too late. To make this easier, I think we can

1.) Track mark array usage, and make sure we don't do overlapped commit.
2.) Check the counter against the heap data structures, and make sure they are consistent, and
3.) Use logging to figure out what went wrong.

Notice that dprintf(3, ("commit-accounting ... lines. They are designed so that we know exactly what range of memory is committed/decommitted/transferred between buckets. We can use that to figure out what went wrong.

Final words

As you can see, it is not a trivial problem to solve. With efforts, I think we can nail it. It will worth the effort to make 32 bits app work better.

@gbalykov
Copy link
Member Author

Thanks for such detailed description!

Under non-USE_REGIONS, however, it is known that we sometimes call virtual_commit with a memory range that contains committed memory. The simple add will double count.

The double count on it own is wrong, but is not fatal. With that happened repeatedly, however, we will leak those bytes (while memory is fine), and so we ran out of bytes when you try to commit (but in fact you do have memory).

Does this mean that heap hard limit doesn't work correctly with segments even on 64bit on current main? Original PR to add heap hard limit support (dotnet/coreclr#22180) was merged back in 2019, and it seems that GC regions support (#59283) was enabled only in 2022, so during my work on this patch I thought that segments on 64 bit fully support heap hard limit.

If heap hard limit with segment is not fully correct on 64 bit on main now, is it safe to assume that it was correct before GC regions support was enabled by default (e.g. in .net 6)? Or is this possible leak in bytes accounting (e.g. for mark array) a known bug of heap hard limit with segments that was present since 2019?

The idea is that, we can count the committed bytes during commit/decommit operations, but we can also count the committed bytes by walking the heap data structures, and they should be identical.

Much of the verification logic was written with only USE_REGIONS in mind. I wouldn't be surprised that it does not work with segments (i.e. the code without USE_REGIONS, 32 bits in particular).

If I understand correctly, this logic for counting committed bytes by walking the heap data structures is needed for two things:

  • verification of GC counters and data structures (which is checked under DEBUG && (COMMITTED_BYTES_SHADOW || heap_hard_limit), supported only for 64bit with GC regions)
  • RefreshMemoryLimit (because there might be no limit before it's called, supported only for 64bit with GC regions)

So, if both are disabled for 64 bit, it seems it should be the same as it was before GC regions were enabled by default. Is it correct?
If yes, then it seems that limited 32bit support (same level as 64bit support before GC regions were enabled by default, even though there's a possible leak in accounting) can be enabled with this PR without verification logic and RefreshMemoryLimit, and then we can further work on full support of accounting with segments.

One simple way to get around it is to always keep the counters, which is what I am trying to do right now

Can you share why is it needed to always keep counters on 64bit with regions if there's already a way to update them by walking the heap data structures? Is it needed for better validation of counters and heap structures?

If there's no way to update GC counters by walking the heap data structures with segments, then always keeping the counters seems the only way to fully support limits on 32 bit (including verification of counters and RefreshMemoryLimit, which I'm also interested in). I'm very interested in your patch from #100380, let me try it with this PR to check the behavior. Can you share the C# program that you use for testing?

Thanks again for your interest in this!

@cshung
Copy link
Member

cshung commented Apr 22, 2024

Sorry for the late reply, I took the last few days trying to figure out these.

> Does this mean that heap hard limit doesn't work correctly with segments even on 64bit on current main?

I confirm this is the case, to prove that, I have found a couple of issues, I plan to get these fixed.

Bug 1

When we get a new segment by committing memory for large object heap or pinned object heap, it is incorrectly accounted for as Gen 2, this will impact the committed_by_oh values, but should be fine with the overall current_total_committed value.

gc_heap::get_segment (size_t size, gc_oh_num oh)
{
...
result = make_heap_segment ((uint8_t*)mem, size, __this, (uoh_p ? max_generation : 0));
...
}

Bug 2

When we release a segment, we used virtual_free, note that it does not decrease the current_total_committed vaiue.

void gc_heap::virtual_free (void* add, size_t allocated_size, heap_segment* sg)
{
    bool release_succeeded_p = GCToOSInterface::VirtualRelease (add, allocated_size);
    if (release_succeeded_p)
    {
        reserved_memory -= allocated_size;
        dprintf (2, ("Virtual Free size %zd: [%zx, %zx[",
                    allocated_size, (size_t)add, (size_t)((uint8_t*)add + allocated_size)));
    }
}

But when we call it in release_segment, the segment contained committed memory. It is okay from a memory perspective because virtual_free will decommit, but the bytes are wrong.

This will impact both committed_by_oh and current_total_committed, sometimes the error can be quite large there.

> If heap hard limit with segment is not fully correct on 64 bit on main now, is it safe to assume that it was correct before GC regions support was enabled by default (e.g. in .net 6)?

With the bugs I just shown above, It is pretty clear the answer is no. The bugs I have found above has nothing to do with USE_REGIONS.

> Can you share why is it needed to always keep counters on 64bit with regions if there's already a way to update them by walking the heap data structures? Is it needed for better validation of counters and heap structures?

When I built it, it was meant to be a stepping stone. As you probably know, the logic is quite complicated and easy to make mistakes. Building this logic helped me a lot in validating that I am doing the right thing.

With the PR, my goal was to get refresh_memory_limit to work on segments, always keeping the counter is really meant to be a shortcut so that I don't have to make sure the current_total_committed is good. With the same reasoning as yours, I assumed it was good enough (i.e. I knew there were bugs, but if it was enabled back in 2019, the bug can't be too bad). And so if I keep it, it should be as good as it was.

Except my investigations above proved otherwise, the bug is actually quite bad, the current_total_committed value can be off by a few MB per GC, it is not difficult to see the counter will overflow way faster than memory does.

> Can you share the C# program that you use for testing?

It is this failing test case in #100380.

With that test case, I can:

  1. Reproduce the assertion pretty consistently on x86, check, workstation, on top of 100380 bits, on a DevBox.
  2. Added a validation to check the committed_by_oh values for only the heap memory, and
  3. With a tentative fix for the two issues found above, I am able to get the test to pass on x86, check, workstation, but and server.
  4. The test case is still failing on server GC to validate the committed_by_oh values , something else is still wrong, even without the mark array issue.
  5. The CI is still reporting wks failure on osx and 32 bits, workstation. That means something else is still wrong with the numbers in segments. (we use segments for OSX as well)
  6. The CI is passing with the committed counter validated for the heap segments!

Unfortunately, getting these committed counters correct for segments isn't a priority right now. It would be great if you could help with the debugging. Essentially, you want to find where in the code that is probably MULTIPLE_HEAPS specific that does not consistently update the counter and the heap. You will probably want to keep a history of what happened using logging or otherwise, discard the history before the last successful verify_committed_bytes_per_heap check because you knew everything before it was correct, and track the changes to see which one is wrong.

Now you probably know why building up these verification routines are very useful stepping stone to find bugs.

I know, it is easier said than done, it took me quite a few days to find the couple of issues above. Let me know if there are things I could help with. I am more than happy to help you with this.

@cshung
Copy link
Member

cshung commented Apr 28, 2024

@gbalykov

With another week of debugging, I figured the bug related to server GC. It isn't related to threading, it is just that I missed one call site to virtual_free that I need to change to make sure the committed byte counters are decremented. With that fix, I am able to get the CI passing (See #100380) with the heap_segment related counters verified for every blocking GC.

With that, we can focus on the recorded_committed_bookkeeping_bucket next. The recorded_committed_free_bucket should always be empty in segments, so if we complete the bookkeepping bucket, we are done with the numbers.

@gbalykov
Copy link
Member Author

gbalykov commented May 2, 2024

@cshung Great work! I've been on vacation and busy with other task right now, but plan to switch back to this starting from the next week.

@cshung
Copy link
Member

cshung commented May 9, 2024

@gbalykov, welcome back from your vacation, hope you had a good time.

Another update, I have got the bookkeeping numbers as well (sadly I gave up the mark array bytes for simplicity). With that, I have gone through a round of review with @Maoni0 with my change. I think the change is more or less good to go and I will merge it whenever I can. It would be great if you can rebase this change on top of that, test it on some 32 bit platforms and see if that still works?

To stress test the calculation, you can turn on the COMMITTED_BYTES_SHADOW, that will stress the size computation logic a lot more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-GC-coreclr community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants