
"14 UNAVAILABLE: failed to connect to all addresses" exception is thrown by language worker #482

Open
alrod opened this issue Oct 18, 2021 · 8 comments

Comments

@alrod
Member

alrod commented Oct 18, 2021

When restarting a language worker channel (after a worker crash or timeout), we need to check the health of the gRPC server and shut down the host itself if the gRPC server is unhealthy.

CRI1
CRI2
https://stackoverflow.com/questions/59823424/grpc-14-unavailable-failed-to-connect-to-all-addresses
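For illustration only, the proposed behavior could be sketched roughly like this. The sketch is in TypeScript and is not the host's actual code; `RestartDeps`, `onWorkerChannelFailure`, and every member name are hypothetical, and the health check itself is left as an injected function because the issue does not say how it would be implemented.

```typescript
type HostAction = "restart-worker" | "shutdown-host";

// Hypothetical hooks the host would supply; none of these names come from the host codebase.
interface RestartDeps {
  isGrpcServerHealthy(): boolean; // assumed health probe against the host's own gRPC server
  restartWorkerChannel(): void;
  shutdownHost(): void;
  log(message: string): void;
}

// Called when a worker channel has to be restarted after a crash or a timeout.
function onWorkerChannelFailure(
  reason: "crash" | "timeout",
  deps: RestartDeps
): HostAction {
  if (!deps.isGrpcServerHealthy()) {
    // A new worker could not connect to an unhealthy server, so restarting
    // only the worker would fail again. Shut down the whole host instead.
    deps.log(`worker ${reason}: gRPC server unhealthy, shutting down host`);
    deps.shutdownHost();
    return "shutdown-host";
  }
  deps.log(`worker ${reason}: gRPC server healthy, restarting worker channel`);
  deps.restartWorkerChannel();
  return "restart-worker";
}
```

The point of the split is that a worker restart is only attempted when the server side is known to be reachable.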

@pragnagopa
Member

Documentation on Channel State API: https://github.com/grpc/grpc/blob/master/doc/connectivity-semantics-and-api.md#channel-state-api

@fabiocav - please find an owner.

@TeplrGuy

@pragnagopa @fabiocav can we please get an ETA on this? Even an estimation will suffice. Thanks a lot team.

@fabiocav
Member

@TeplrGuy this has been assigned to sprint 114. We'll continue to update the issue as we make progress.

@kshyju
Member

kshyju commented Nov 29, 2021

@alrod Looked into the functions logs for the error mentioned in the attached CRIs (14 UNAVAILABLE: failed to connect to all addresses) and I can see that this error is reported from the node.js language worker. Queried logs for the last 3 days in CUS and all the entries are coming from node.js worker.

A channel is the client-side abstraction of a connection. The client needs a channel instance to establish a connection to a gRPC host/server, so that a gRPC client/stub instance can be created for further communication with the server. On a node.js client, the channel state check should be done using the getConnectivityState or watchConnectivityState APIs. (Link to docs)
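As a rough sketch of what such a check could look like (not taken from the worker repo): the `ChannelLike` interface and the `ConnectivityState` values below are local stand-ins modeled on the Channel State API doc linked above, so the snippet runs without dependencies. `whenChannelReady` and `FakeChannel` are hypothetical names, and the exact signatures should be verified against the gRPC library the worker actually uses.

```typescript
// Local stand-ins for the five channel states described in the gRPC connectivity doc.
const ConnectivityState = {
  IDLE: 0,
  CONNECTING: 1,
  READY: 2,
  TRANSIENT_FAILURE: 3,
  SHUTDOWN: 4,
} as const;
type ConnectivityState = (typeof ConnectivityState)[keyof typeof ConnectivityState];

// Assumed shape of the two channel methods mentioned above.
interface ChannelLike {
  getConnectivityState(tryToConnect: boolean): ConnectivityState;
  watchConnectivityState(
    currentState: ConnectivityState,
    deadline: Date | number,
    callback: (error?: Error) => void
  ): void;
}

// Calls done(true) once the channel reports READY, or done(false) if the
// channel is shut down or the deadline passes without reaching READY.
function whenChannelReady(
  channel: ChannelLike,
  deadline: Date | number,
  done: (ready: boolean) => void
): void {
  const state = channel.getConnectivityState(true); // true: ask an idle channel to start connecting
  if (state === ConnectivityState.READY) {
    done(true);
    return;
  }
  if (state === ConnectivityState.SHUTDOWN) {
    done(false);
    return;
  }
  // Wait until the state moves away from `state`, then check again.
  channel.watchConnectivityState(state, deadline, (error) => {
    if (error) {
      done(false); // no state change before the deadline
      return;
    }
    whenChannelReady(channel, deadline, done);
  });
}

// Hypothetical in-memory channel that steps through a scripted list of states.
class FakeChannel implements ChannelLike {
  private current: ConnectivityState;
  private pending: ConnectivityState[];
  constructor(first: ConnectivityState, ...rest: ConnectivityState[]) {
    this.current = first;
    this.pending = rest;
  }
  getConnectivityState(_tryToConnect: boolean): ConnectivityState {
    return this.current;
  }
  watchConnectivityState(
    _currentState: ConnectivityState,
    _deadline: Date | number,
    callback: (error?: Error) => void
  ): void {
    const next = this.pending.shift();
    if (next === undefined) {
      callback(new Error("deadline exceeded"));
      return;
    }
    this.current = next;
    callback();
  }
}
```

With a real channel, the worker could run a check like this before creating the stub and log the last observed state when it fails, which would give more detail than the bare "14 UNAVAILABLE" error.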

I think the next action item here is to investigate the node.js language worker implementation to see why it is getting the connectivity error. I did a quick scan of the node.js worker repo and I do not see the above-mentioned APIs being used. I tried to repro this error locally with a node.js language worker (v14.16.0), but was unsuccessful (this could be a race condition issue).

Transferring this to node.js worker repo for next steps.

@kshyju kshyju removed their assignment Nov 29, 2021
@kshyju kshyju transferred this issue from Azure/azure-functions-host Nov 29, 2021
@alrod alrod closed this as completed Dec 13, 2021
@alrod alrod changed the title from "Check grpc server healthiness on channel restart." to "14 UNAVAILABLE: failed to connect to all addresses" exception is thrown by language worker Dec 13, 2021
@alrod
Member Author

alrod commented Dec 13, 2021

Reopening the issue. This fix will help the function host recover from the "failed to connect to all addresses" gRPC error:
Azure/azure-functions-host#7979

We still cannot reproduce the "14 UNAVAILABLE: failed to connect to all addresses" error, but the fix mentioned above will improve automatic recovery after it occurs.

@alrod alrod reopened this Dec 13, 2021
@alrod
Member Author

alrod commented Dec 15, 2021

Fixing a race during language worker start:
Azure/azure-functions-host@5fe7711

@ejizba ejizba added this to the December 2021 milestone Jan 28, 2022
@ejizba
Contributor

ejizba commented Jan 28, 2022

@alrod was this fixed in the linked PR/commit? Or is there still remaining work?

@ejizba ejizba modified the milestones: December 2021, February 2022 Feb 14, 2022
@ejizba ejizba modified the milestones: February 2022, March 2022 Mar 2, 2022
@alrod
Member Author

alrod commented Mar 4, 2022

@ejizba, we did some work in the function host to ensure a worker is recovered after "14 UNAVAILABLE: failed to connect to all addresses".

We still want to keep this issue open to collect more details or repro steps, as it's not clear what leads the worker to the error.


6 participants