[Question/Bug] Endless stream of GC events occupying a CPU core: What could be happening ? #101563

Closed
hexawyz opened this issue Apr 25, 2024 · 6 comments
Labels: area-GC-coreclr, tenet-performance

Comments

@hexawyz

hexawyz commented Apr 25, 2024

Description

This is a scenario that I've observed a few times with the service I've been developing here: https://github.com/hexawyz/Exo/tree/master/Exo.Service

After some run time, the service, which is mostly idle, will start consuming the equivalent of a whole CPU core.

NB: This service uses Workstation GC instead of Server GC in order to reduce the memory footprint.
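
For reference, a minimal check (not part of the service itself, just a sketch) that confirms at startup which GC flavor the process is actually running with:

    // Minimal sketch: workstation GC is the default unless ServerGarbageCollection
    // is enabled in the project file or runtime config, but it doesn't hurt to log it.
    using System;
    using System.Runtime;

    Console.WriteLine($"Server GC: {GCSettings.IsServerGC}");        // expected: False for this service
    Console.WriteLine($"Latency mode: {GCSettings.LatencyMode}");    // typically Interactive with workstation GC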

If the process is attached to Visual Studio, before or after this happens, Diagnostic Tools will show the following:

  • GC events are generated continuously
  • GC events will be triggered even when there have been exactly zero changes on the managed heap (see screenshot below)
  • The service does not seem to be otherwise affected much (the most visible symptom is the increased CPU usage)
  • The VS Tasks/Parallel Stacks views also look completely normal, unless I missed something obvious

Screenshot of the service

This behavior generally occurs after quite a long run time (~2 days, for example, which is what led to the screenshot above), but I've also observed it in a shorter session started in Debug mode from within Visual Studio. In total, I believe I've observed this behavior about 4 times, so while the trigger seems somewhat random, the behavior itself is not.

The service is relatively complex, and I don't exclude the possibility that this quirk in the GC is triggered by some erroneous code in the service (I hope not). However, because I have observed this specific behavior multiple times, it does not seem entirely random, so I believe it is worth asking for help here.

What I'm mainly asking here is whether you are aware of a situation that could trigger this behavior in the GC.
And also, what could I do to investigate this in more depth?

Reproduction Steps

This is how I reproduce the problem, but I'm not asking you to run it, as it is not deterministic and the service might not behave exactly the same on all computers.

  1. Run the service
  2. Wait / Run some requests on the service using associated UI

Expected behavior

The service stays close to 0% CPU usage all the time, and there is no continuous stream of GC events.

Actual behavior

At some point, the service starts consuming a lot of CPU, and this shows up as a continuous stream of GC events in the VS Diagnostic Tools.

Regression?

No response

Known Workarounds

(Restart my service, but that's not very helpful)

Configuration

.NET SDK: 8.0.104
OS: Windows 11 23H2
Architecture: x64
GC: Workstation

Other information

The service makes use of a lot of low-level or advanced features such as native function pointers, GCHandle, WeakReference, DependentHandle, AssemblyLoadContext, etc. I don't believe any of those should lead to the problem I observed, but it might be worth knowing.
Also, I strongly suspect that the instability is triggered when the associated UI is interacting with the service (through named pipes), as the service is otherwise doing nothing most of the time (or at least as little as possible).
However, even if that turns out to be the case, the trigger remains non-deterministic.

dotnet-issue-labeler bot added the needs-area-label label Apr 25, 2024
dotnet-policy-service bot added the untriaged label Apr 25, 2024
@hexawyz changed the title from "[Question/Bug] Endless stream of GC events occupying a CPU core: What coule be happening ?" to "[Question/Bug] Endless stream of GC events occupying a CPU core: What could be happening ?" Apr 25, 2024
@vcsjones added the area-GC-coreclr label and removed the needs-area-label label Apr 25, 2024
Contributor

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@mangod9
Member

mangod9 commented Apr 25, 2024

Would you be able to collect a GCCollectOnly trace or a dump when the issue repros? Have you noticed this only after upgrading to .NET 8?
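
For example, something along these lines should work while the issue is occurring (the process id is a placeholder; this assumes PerfView and the dotnet-dump global tool are available):

    PerfView.exe /GCCollectOnly /nogui collect gccollect.etl
    dotnet-dump collect --process-id <pid>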

@hexawyz
Author

hexawyz commented Apr 25, 2024

Oh yes, sure, I'll try to do that the next time I observe this behavior.

Regarding .NET versions, I have no real point of comparison to tell whether this behavior would have happened earlier. (While the code predates .NET 8.0, the app was not in a state to be really useful until a few months ago.)

@Maoni0 added the tenet-performance label and removed the area-GC-coreclr and untriaged labels May 3, 2024
@Maoni0
Member

Maoni0 commented May 3, 2024

I was going to suggest that you get a GCCollectOnly trace with GCTriggeredStacks so you can actually see the call stacks that triggered the GCs, but I just found that it wasn't collecting the rundown events (which means you can't actually decode the managed call frames), so I just submitted a fix: 2031. It's obviously not available in the PerfView downloads yet, so I shared a private build at https://github.com/Maoni0/clrgc/tree/main/PerfView-bins if you'd like to try it before the official release build is available.

PerfView.exe /GCCollectOnly /GCTriggeredStacks /nogui collect gctriggers.etl

@hexawyz
Author

hexawyz commented May 7, 2024

Thanks for your suggestion of using PerfView. I managed to reproduce the problem today and found that all GC events were triggered by AllocSmall. After digging further by reattaching to VS, I'm pretty confident I found the problem, which was, of course, in my own code.

So, what can cause this pattern is apparently an infinite loop that generates small allocations. (The infinite loop was one bug; the allocations were another.)
The allocations were quite small and extremely short-lived, so the GC was simply being extremely efficient at doing its work. I guess the mystery is solved 😅
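
For the record, here is a hypothetical reconstruction (not the actual service code, just the shape of it) of the kind of pattern that produces this GC signature:

    // Sketch of the two combined bugs: a loop that never exits and allocates a small,
    // short-lived object on every iteration. Gen0 fills up continuously, every
    // collection is cheap because the temporaries are already dead, so the profiler
    // shows back-to-back GCs triggered by AllocSmall while the heap size barely
    // changes and one core stays pinned.
    var done = false;                   // bug 1: nothing ever sets this to true
    while (!done)
    {
        var message = new byte[64];     // bug 2: a tiny temporary allocated on every spin
        _ = message;                    // dropped immediately, so it dies in gen0
    }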

Sorry for taking up some of your time, but thanks a lot for the help! 😊

@hexawyz closed this as completed May 7, 2024