
Deadlock in balancing implementation #2380

Open
kalduzov opened this issue Mar 7, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@kalduzov

kalduzov commented Mar 7, 2024

Hi @JamesNK

We previously filed issues related to balancing. You fixed them and everything now works as it should. We successfully wrote our own balancers and migrated about 100 services to them, and everything worked perfectly for more than two months.

But yesterday we encountered a deadlock in the balancer code. We can't reproduce it right now, but its presence really worries us.

The following two lines of code, each executed under a lock, can in certain situations block each other:

https://github.com/grpc/grpc-dotnet/blob/master/src/Grpc.Net.Client/Balancer/Internal/ConnectionManager.cs#L291

https://github.com/grpc/grpc-dotnet/blob/master/src/Grpc.Net.Client/Balancer/Internal/ConnectionManager.cs#L368

We use standard classes for our balancers and pickers.

Our balancer works on a push model: it receives new endpoint data and then passes it to the standard Listener().

Our Picker implementations are active at the same time and are constantly being replaced: every time new endpoints arrive, we recreate the pickers with the new endpoints (just like in the examples in the documentation).
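
For context, here is a simplified sketch of that pattern (illustrative names only, based on the documentation examples rather than our production code, and assuming the standard Grpc.Net.Client.Balancer types):

using System;
using System.Collections.Generic;
using Grpc.Net.Client.Balancer;
using Microsoft.Extensions.Logging;

// Illustrative balancer: a new picker is created every time the set of
// ready subchannels changes (i.e. whenever new endpoints are pushed in).
public class ExampleBalancer : SubchannelsLoadBalancer
{
    public ExampleBalancer(IChannelControlHelper controller, ILoggerFactory loggerFactory)
        : base(controller, loggerFactory)
    {
    }

    protected override SubchannelPicker CreatePicker(IReadOnlyList<Subchannel> readySubchannels)
    {
        // Recreate the picker with the new set of ready subchannels.
        return new ExamplePicker(readySubchannels);
    }

    private class ExamplePicker : SubchannelPicker
    {
        private readonly IReadOnlyList<Subchannel> _subchannels;

        public ExamplePicker(IReadOnlyList<Subchannel> subchannels)
        {
            _subchannels = subchannels;
        }

        public override PickResult Pick(PickContext context)
        {
            // Trivial strategy for the sketch: pick a random ready subchannel.
            return PickResult.ForSubchannel(_subchannels[Random.Shared.Next(0, _subchannels.Count)]);
        }
    }
}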

From what we can see, the PickAsync method simply gets stuck in WaitAsync immediately after new endpoints arrive. The service keeps working, but each new request that goes through balancing ties up a thread from the ThreadPool. As a result, the picture looks like this:

  1. Calls through the affected client no longer complete (in our case they are eventually rejected by a CancellationToken or deadline).
  2. The lock hangs indefinitely and keeps a thread-pool thread blocked.
  3. The thread pool starts to grow because callers keep making requests through the client.
  4. Because our services run in pods with memory limits, and each extra thread consumes a little memory, an OOM kill eventually restarts the pod.
  5. After the restart, balancing works stably again.

This is a really rare situation. We discovered it completely by chance: out of about 500 pods, it occurred just 3 times over the past month.

We would really like to find a way to solve this problem. It worries us a lot, because we don't fully understand how it happens or whether it could end up blocking our external API at some point.

Attached is a screenshot of the trace captured from the pod: [image]

Library version: 2.60.0

@kalduzov kalduzov added the bug Something isn't working label Mar 7, 2024
@krasin-ga

The code here assumes that ResolveNowAsync will execute asynchronously, allowing the method to exit and freeing all the locks taken on the way up here. However, this may not be the case if ResolveAsync is implemented synchronously. It will not cause problems on its own because Monitor locks are reentrant, but there are two potentially problematic cases:

  1. If, for some reason, the PollingResolver implementation is holding locks itself when resolving and calling base.Listener() (this may be the case for @kalduzov) - a sketch of this case follows the quoted code below.
  2. If the locks inside ConnectionManager are ever refactored to non-reentrant alternatives.

lock (_lock)
{
    Log.ResolverRefreshRequested(_logger, GetType());

    if (_resolveTask.IsCompleted)
    {
        _resolveTask = ResolveNowAsync(_cts.Token);
        _resolveTask.ContinueWith(static (t, state) =>
        {
            var pollingResolver = (PollingResolver)state!;
            Log.ResolveTaskCompleted(pollingResolver._logger, pollingResolver.GetType());
        }, this);
    }
    // ...
}
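
To make case 1 concrete, here is a minimal hypothetical sketch (not code from this issue; exact signatures are illustrative) of a push-style PollingResolver whose ResolveAsync completes synchronously and calls Listener() while holding its own lock. If ResolveNowAsync runs synchronously inside the lock quoted above, Listener() ends up being invoked while both locks are held, which is the kind of lock nesting that can interact badly with the locks in ConnectionManager:

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Grpc.Net.Client.Balancer;
using Microsoft.Extensions.Logging;

// Hypothetical push-style resolver, used only to illustrate case 1.
public class PushResolver : PollingResolver
{
    private readonly object _sync = new object();
    private IReadOnlyList<BalancerAddress> _latest = Array.Empty<BalancerAddress>();

    public PushResolver(ILoggerFactory loggerFactory) : base(loggerFactory)
    {
    }

    // Called by the application whenever new endpoint data is pushed in.
    public void OnEndpointsChanged(IReadOnlyList<BalancerAddress> addresses)
    {
        lock (_sync)
        {
            _latest = addresses;
        }
        Refresh();
    }

    protected override Task ResolveAsync(CancellationToken cancellationToken)
    {
        // Completes synchronously: there is no await before Listener() runs,
        // so the lock taken in the PollingResolver code quoted above is still held.
        lock (_sync)
        {
            // Listener() is invoked while the resolver's own lock is also held.
            Listener(ResolverResult.ForResult(_latest));
        }
        return Task.CompletedTask;
    }
}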

Maybe it's worth nesting the call to ResolveNowAsync inside Task.Run or adding await Task.Yield() at the beginning of the ResolveNowAsync method to ensure that control will return to the caller and the method can exit.
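
For illustration, the Task.Yield() variant of that suggestion could look roughly like this inside ResolveNowAsync (a sketch of the idea only, not the actual change; the rest of the real method is omitted):

private async Task ResolveNowAsync(CancellationToken cancellationToken)
{
    // Return control to the caller first, so the lock taken in the quoted
    // Refresh code is released before a synchronous ResolveAsync override
    // can run and call back into Listener().
    await Task.Yield();

    await ResolveAsync(cancellationToken).ConfigureAwait(false);
    // ... (error handling and the rest of the real implementation omitted)
}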

@JamesNK
Member

JamesNK commented Mar 11, 2024

@krasin-ga Good find. I have a PR with a test for your scenario: #2385. It would be great if you could take a look.

@JamesNK
Member

JamesNK commented Mar 16, 2024

@kalduzov The change is merged. It will take a while for it to be released, but you can try out a prerelease version on the NuGet feed - https://github.com/grpc/grpc-dotnet#grpc-nuget-feed
