Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible race in grpclb causes first call(s) to fail with "all SubConns are in TransientFailure" #4029

Closed
technicianted opened this issue Nov 11, 2020 · 2 comments · Fixed by #4031

Comments

@technicianted
Copy link

Please see the FAQ in our main README.md, then answer the questions below before
submitting your issue.

What version of gRPC are you using?

1.33.2

What version of Go are you using (go version)?

1.14.2

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

Use a DNS server to enable grpclb by sending both SRV and TXT records. Run a grpclb server to respond to LB calls. Then using a client to connect to a localhost hello service.

What did you expect to see?

Call succeeds with no special dial/call options.

What did you see instead?

Call fails with "all SubConns are in TransientFailure". The only way to make it work is to either dial with grpc.WithBlock() or call with grpc.WaitForReady().

This is certainly a regression.

Looking at the code, function (lb *lbBalancer) aggregateSubConnStates() connectivity.State goes through the failure path. Tracing the call, it seems that initially the subchannel state is set to Idle and state is placed in the cache before calling Connect(). However, aggregateSubConnStates() seems to use the cached state, which is still Idle causing the channel to go in TransientFailure. I've confirmed the discrepancy using a debugger (cached state Idle, subchannel state Connecting).

When setting either grpc.WithBlock() or grpc.WaitForReady() it forces the call to wait.

The following logs see the incorrect sequence: Channel seems to go into TransientFailure before [grpclb] lbBalancer: handle SubConn state change: 0xc0001ba3b0, CONNECTING.

Logs:

INFO: 2020/11/11 14:46:33 [core] parsed scheme: "dns"
INFO: 2020/11/11 14:46:33 [core] ccResolverWrapper: sending update to cc: {[] 0xc0001ec7e0 0xc00070c5e0}
INFO: 2020/11/11 14:46:33 [core] ClientConn switching balancer to "grpclb"
INFO: 2020/11/11 14:46:33 [core] Channel switches to new LB policy "grpclb"
INFO: 2020/11/11 14:46:33 [grpclb] lbBalancer: UpdateClientConnState: {ResolverState:{Addresses:[] ServiceConfig:0xc0001ec7e0 Attributes:0xc00070c5e0} BalancerConfig:0xc0001ec660}
INFO: 2020/11/11 14:46:33 [core] parsed scheme: "grpclb-internal"
INFO: 2020/11/11 14:46:33 [core] ccResolverWrapper: sending update to cc: {[{127.0.0.1:1234 localhost. <nil> 0 <nil>}] <nil> <nil>}
INFO: 2020/11/11 14:46:33 [core] ClientConn switching balancer to "pick_first"
INFO: 2020/11/11 14:46:33 [core] Channel switches to new LB policy "pick_first"
INFO: 2020/11/11 14:46:33 [core] Subchannel Connectivity change to CONNECTING
INFO: 2020/11/11 14:46:33 [core] blockingPicker: the picked transport is not ready, loop back to repick
INFO: 2020/11/11 14:46:33 [core] pickfirstBalancer: UpdateSubConnState: 0xc000402450, {CONNECTING <nil>}
INFO: 2020/11/11 14:46:33 [core] Channel Connectivity change to CONNECTING
INFO: 2020/11/11 14:46:33 [core] Subchannel picks a new address "127.0.0.1:1234" to connect
INFO: 2020/11/11 14:46:33 [core] Subchannel Connectivity change to READY
INFO: 2020/11/11 14:46:33 [core] pickfirstBalancer: UpdateSubConnState: 0xc000402450, {READY <nil>}
INFO: 2020/11/11 14:46:33 [core] Channel Connectivity change to READY
INFO: 2020/11/11 14:46:33 [grpclb] lbBalancer: processing server list: servers:{ip_address:"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xff\x00\x00\x01" port:12340 load_balance_token:"hereisatoken"}
INFO: 2020/11/11 14:46:33 [grpclb] lbBalancer: server list entry[0]: ipStr:|127.0.0.1|, port:|12340|, load balancer token:|token|
INFO: 2020/11/11 14:46:33 [core] Subchannel Connectivity change to CONNECTING
INFO: 2020/11/11 14:46:33 [core] Channel Connectivity change to TRANSIENT_FAILURE
INFO: 2020/11/11 14:46:33 [core] Subchannel picks a new address "127.0.0.1:12340" to connect
INFO: 2020/11/11 14:46:33 [grpclb] lbBalancer: handle SubConn state change: 0xc0001ba3b0, CONNECTING
INFO: 2020/11/11 14:46:33 [core] Subchannel Connectivity change to READY
INFO: 2020/11/11 14:46:33 [core] Channel Connectivity change to CONNECTING
INFO: 2020/11/11 14:46:33 [grpclb] lbBalancer: handle SubConn state change: 0xc0001ba3b0, READY
INFO: 2020/11/11 14:46:33 [core] Channel Connectivity change to READY
@menghanl
Copy link
Contributor

Thanks for the report. Please take a look at the fix in #4031. Thanks.

@technicianted
Copy link
Author

Thanks. Manually applying the fix works.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 25, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants