Deadlock in balancing implementation #2380
The code here assumes that:
grpc-dotnet/src/Grpc.Net.Client/Balancer/PollingResolver.cs, lines 132 to 143 in f29d927
Maybe it's worth nesting the call to …
@krasin-ga Good find. I have a PR with a test for your scenario: #2385. Would be great if you could take a look.
@kalduzov The change is merged. It will take a while for it to be released, but you can try out a prerelease version from the NuGet feed: https://github.com/grpc/grpc-dotnet#grpc-nuget-feed
Hi @JamesNK
We have previously opened issues related to balancing; you fixed them and everything then worked as it should. We successfully wrote our own balancers and even migrated about 100 services to them. Everything worked perfectly for more than two months.
But yesterday we encountered a deadlock in the balancer code. We can't reproduce it right now, but its presence really worries us.
Given the locking involved, the following two lines of code can end up blocking each other in certain situations:
https://github.com/grpc/grpc-dotnet/blob/master/src/Grpc.Net.Client/Balancer/Internal/ConnectionManager.cs#L291
https://github.com/grpc/grpc-dotnet/blob/master/src/Grpc.Net.Client/Balancer/Internal/ConnectionManager.cs#L368
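To illustrate the class of problem we are worried about, here is a minimal, self-contained C# sketch of the hazard. This is not the actual ConnectionManager code; the class and member names below are invented for the example. It only shows the general shape: one path blocks while holding a lock, waiting for work that can only be completed under that same lock.

```csharp
using System.Threading.Tasks;

// Purely illustrative: one code path blocks while holding a lock, waiting for
// work that can only complete under that same lock. This is NOT the real
// ConnectionManager code; the names are invented for the example.
public class PickerHolder
{
    private readonly object _lock = new();
    private TaskCompletionSource<string> _nextPicker = new();

    // Path A: takes the lock and then waits for the next picker.
    public string GetPickerBlocking()
    {
        lock (_lock)
        {
            // Deadlock: _lock is held while we block on a task that can only
            // be completed by PublishPicker, which also needs _lock.
            return _nextPicker.Task.GetAwaiter().GetResult();
        }
    }

    // Path B: needs the same lock to publish the picker and complete the task.
    public void PublishPicker(string picker)
    {
        lock (_lock)
        {
            _nextPicker.TrySetResult(picker);
        }
    }
}
```

The real interaction between the two lines above is more involved, but the symptom we see (calls piling up while waiting for a picker) matches this shape.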
We use standard classes for our balancers and pickers.
Our balancer follows a push model: it receives new endpoint data and then passes it to the standard Listener().
At the same time, our Picker implementations are in constant use and are constantly being replaced: every time new endpoints appear, we recreate the pickers with the new endpoints (just like in the examples in the documentation).
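For context, our setup roughly follows the sketch below. It is simplified, and the names (PushResolver, OnEndpointsChanged, FirstReadyBalancer, FirstReadyPicker) are ours rather than from the library; it assumes the experimental Grpc.Net.Client.Balancer types (PollingResolver, BalancerAddress, ResolverResult, SubchannelsLoadBalancer, SubchannelPicker) as described in the client-side load balancing docs, and the exact ResolverResult.ForResult overloads may differ by version.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Grpc.Net.Client.Balancer;
using Microsoft.Extensions.Logging;

// Simplified push-style resolver: our discovery code calls OnEndpointsChanged
// whenever it learns about new endpoints, and we forward them via Listener().
public class PushResolver : PollingResolver
{
    public PushResolver(ILoggerFactory loggerFactory) : base(loggerFactory)
    {
    }

    protected override Task ResolveAsync(CancellationToken cancellationToken)
    {
        // Nothing to poll here: updates arrive through OnEndpointsChanged.
        return Task.CompletedTask;
    }

    public void OnEndpointsChanged(IReadOnlyList<BalancerAddress> addresses)
    {
        // Push the new endpoints to the channel. The channel then updates the
        // load balancer, which recreates its picker with the new subchannels.
        Listener(ResolverResult.ForResult(addresses));
    }
}

// Simplified balancer: the picker is recreated every time the set of ready
// subchannels changes.
public class FirstReadyBalancer : SubchannelsLoadBalancer
{
    public FirstReadyBalancer(IChannelControlHelper controller, ILoggerFactory loggerFactory)
        : base(controller, loggerFactory)
    {
    }

    protected override SubchannelPicker CreatePicker(IReadOnlyList<Subchannel> readySubchannels)
    {
        return new FirstReadyPicker(readySubchannels);
    }

    private sealed class FirstReadyPicker : SubchannelPicker
    {
        private readonly IReadOnlyList<Subchannel> _subchannels;

        public FirstReadyPicker(IReadOnlyList<Subchannel> subchannels)
        {
            _subchannels = subchannels;
        }

        public override PickResult Pick(PickContext context)
        {
            // Always pick the first ready subchannel; our real pickers do more.
            return PickResult.ForSubchannel(_subchannels[0]);
        }
    }
}
```

Our real pickers carry more state, but the recreate-on-update pattern is the same.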
From the captured trace, it looks like the PickAsync method simply gets stuck in WaitAsync immediately after new endpoints arrive. The service keeps running, but every new request that goes through balancing consumes another thread from the ThreadPool. As a result, the picture looks like the one in the attached screenshot.
This is a really rare situation. We discovered it completely by chance: out of roughly 500 pods, it has occurred only about 3 times, all within the past month.
We would really like to solve this problem somehow. It worries us a great deal, because we don't understand how it happens at all, or whether at some point it will end up blocking our API.
Attached is a screenshot of the trace captured from the pod.
Library version: 2.60.0