
panic: interface conversion: interface is nil, not balancer.SubConn goroutine 326 [running] #6453

Closed
chenye2017 opened this issue Jul 15, 2023 · 34 comments


@chenye2017

NOTE: if you are reporting a potential security vulnerability or a crash, please follow our CVE process at https://github.com/grpc/proposal/blob/master/P4-grpc-cve-process.md instead of filing an issue here.

Please see the FAQ in our main README.md, then answer the questions below
before submitting your issue.

What version of gRPC are you using?

1.46.2

What version of Go are you using (go version)?

1.19

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

A panic occasionally occurs in our production environment.

What did you expect to see?

The type assertion should be safe and not panic.

What did you see instead?

In the regeneratePicker function in /balancer/base/balancer.go:

for _, addr := range b.subConns.Keys() {
		sci, _ := b.subConns.Get(addr)
		sc := sci.(balancer.SubConn)
		if st, ok := b.scStates[sc]; ok && st == connectivity.Ready {
			readySCs[sc] = SubConnInfo{Address: addr}
		}
	}

If b.subConns.Get(addr) returns nil, the following type assertion panics. The loop could instead check the assertion result:

for _, addr := range b.subConns.Keys() {
		sci, _ := b.subConns.Get(addr)
		sc, ok := sci.(balancer.SubConn)
		if !ok {
			continue
		}
		if st, ok := b.scStates[sc]; ok && st == connectivity.Ready {
			readySCs[sc] = SubConnInfo{Address: addr}
		}
	}
@dfawley
Member

dfawley commented Jul 17, 2023

If b.subConns.Get(addr) returns nil, the following type assertion panics

subConns.Get can't be nil, since it was returned as a key from subConns.Keys, or else there is a different bug. Can you provide more information about your usage, since it sounds like you've hit this in production?

@github-actions

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@shenqidebaozi

We also encountered this issue, using gRPC version 1.45:

panic: interface conversion: interface is nil, not balancer.SubConn

goroutine 734 [running]:
google.golang.org/grpc/balancer/base.(*baseBalancer).UpdateClientConnState(0xc0090e7a00, {{{0xc016e5c360, 0x4, 0x4}, 0x0, 0x0}, {0x0, 0x0}})
	/go/pkg/mod/google.golang.org/grpc@v1.45.0/balancer/base/balancer.go:120 +0x4d7
google.golang.org/grpc.(*ccBalancerWrapper).updateClientConnState(0xc0097523f8?, 0xc016e5c360?)
	/go/pkg/mod/google.golang.org/grpc@v1.45.0/balancer_conn_wrappers.go:167 +0x11d
google.golang.org/grpc.(*ClientConn).updateResolverState(0xc009752000, {{0xc016e5c360, 0x4, 0x4}, 0x0, 0x0}, {0x0, 0x0})
	/go/pkg/mod/google.golang.org/grpc@v1.45.0/clientconn.go:691 +0x7d9
google.golang.org/grpc.(*ccResolverWrapper).UpdateState(0xc0093d6a80, {{0xc016e5c360, 0x4, 0x4}, 0x0, 0x0})
	/go/pkg/mod/google.golang.org/grpc@v1.45.0/resolver_conn_wrapper.go:105 +0x299
github.com/go-kratos/kratos/v2/transport/grpc/resolver/discovery.(*discoveryResolver).update(0xc00946a7e0, {0xc00552a760, 0x4, 0x4})
	/go/pkg/mod/github.com/go-kratos/kratos/v2@v2.5.5-0.20230227074321-834b781ee211/transport/grpc/resolver/discovery/resolver.go:88 +0x751
github.com/go-kratos/kratos/v2/transport/grpc/resolver/discovery.(*discoveryResolver).watch(0xc00946a7e0)
	/go/pkg/mod/github.com/go-kratos/kratos/v2@v2.5.5-0.20230227074321-834b781ee211/transport/grpc/resolver/discovery/resolver.go:47 +0x72
created by github.com/go-kratos/kratos/v2/transport/grpc/resolver/discovery.(*builder).Build
	/go/pkg/mod/github.com/go-kratos/kratos/v2@v2.5.5-0.20230227074321-834b781ee211/transport/grpc/resolver/discovery/builder.go:108 +0x4df

@shenqidebaozi

shenqidebaozi commented Sep 25, 2023

b.subConns is implemented on top of a map, but no mutex is used when accessing it. From what we observed, an address obtained from b.subConns.Keys() could not subsequently be retrieved through b.subConns.Get. Because the ok result of Get is never checked, the nil value flows into the following type assertion and panics. The relevant pieces are the map-backed resolver.AddressMap type (type AddressMap struct { ... }) and the for _, a := range b.subConns.Keys() loop in the base balancer.

@shenqidebaozi

@dfawley PTAL.

I can provide more information

@dfawley
Member

dfawley commented Sep 25, 2023

@shenqidebaozi,

b.subConns shouldn't need a mutex, as it is only accessed in methods that are guaranteed to be called mutually exclusively.

Can you provide a minimal reproduction for this?

@dfawley dfawley reopened this Sep 25, 2023
@dfawley dfawley assigned shenqidebaozi and unassigned chenye2017 Sep 25, 2023
@dfawley dfawley added P2 and removed stale labels Sep 25, 2023
@github-actions

github-actions bot commented Oct 2, 2023

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label Oct 2, 2023
@dylan-bourque

We're seeing a similar panic coming from baseBalancer.regeneratePicker() using 1.58.1. I'm not able to share our code as it's completely internal, but please don't close this issue. Multiple teams have had to explicitly downgrade gRPC to work around these crashes.

panic: interface conversion: interface is nil, not balancer.SubConn

goroutine 166 [running]:
google.golang.org/grpc/balancer/base.(*baseBalancer).regeneratePicker(0xc001e0eb80)
	google.golang.org/grpc@v1.58.1/balancer/base/balancer.go:177 +0x305
google.golang.org/grpc/balancer/base.(*baseBalancer).updateSubConnState(0xc001e0eb80, {0x40d1b00, 0xc0027b5110}, {0xc001d50458?, {0x0?, 0x0?}})
	google.golang.org/grpc@v1.58.1/balancer/base/balancer.go:233 +0x3fe
google.golang.org/grpc/balancer/base.(*baseBalancer).UpdateClientConnState.func1({0x1d2afa0?, {0x0?, 0x0?}})
	google.golang.org/grpc@v1.58.1/balancer/base/balancer.go:111 +0x3e
google.golang.org/grpc/internal/balancer/gracefulswitch.(*Balancer).updateSubConnState(0xc001e1e6c0, {0x40d1b00, 0xc0027b5110}, {0xc0017cd380?, {0x0?, 0x0?}}, 0xc0027b04b0)
	google.golang.org/grpc@v1.58.1/internal/balancer/gracefulswitch/gracefulswitch.go:228 +0x2c7
google.golang.org/grpc/internal/balancer/gracefulswitch.(*balancerWrapper).NewSubConn.func1({0x1c40760?, {0x0?, 0x0?}})
	google.golang.org/grpc@v1.58.1/internal/balancer/gracefulswitch/gracefulswitch.go:342 +0x46
google.golang.org/grpc.(*ccBalancerWrapper).updateSubConnState.func1({0xc0027b45d0?, 0xc002893f40?})
	google.golang.org/grpc@v1.58.1/balancer_conn_wrappers.go:130 +0x37
google.golang.org/grpc/internal/grpcsync.(*CallbackSerializer).run(0xc000348280, {0x40d1908, 0xc0027b21e0})
	google.golang.org/grpc@v1.58.1/internal/grpcsync/callback_serializer.go:92 +0x179
created by google.golang.org/grpc/internal/grpcsync.NewCallbackSerializer in goroutine 178
	google.golang.org/grpc@v1.58.1/internal/grpcsync/callback_serializer.go:55 +0x129

@github-actions github-actions bot removed the stale label Oct 2, 2023
@dfawley
Member

dfawley commented Oct 2, 2023

I looked at the stack trace from @dylan-bourque, but I still can't see how it's possible. The only possibilities I can see are that NewSubConn returns nil for the SubConn, or that the state listener callback is invoked before NewSubConn returns. Unfortunately, that happens in another goroutine and isn't covered by that stack trace.

If there is a way I can reproduce this, then I can figure out what is going on and fix it. Otherwise, I just don't see what the problem might be.

@dfawley
Member

dfawley commented Oct 2, 2023

Actually, looking at this closer, I do see how 1.58.* can be affected with a very small race between creating a new connection and shutting down or entering idle. I'll send a fix for that soon. I don't know about the reports from 1.45 or 1.46 however.

@dylan-bourque

Thanks @dfawley. I can have our teams test/validate a fix whenever you have one ready.

@dfawley
Member

dfawley commented Oct 3, 2023

@dylan-bourque - actually, now that I've looked even further, the race that I saw isn't actually possible. What I thought was happening was that the StateListener callback was being invoked before NewSubConn returned. There are two ways I thought that could happen - Close and enterIdleMode happening concurrently. However, in both of those cases, we actually close the balancer wrapper first, which prevents those callbacks from happening.

So, back to square one. We will definitely need a repro case to debug this any further, sorry.

@shenqidebaozi

@dfawley sorry, it was China's National Day holiday and I was on vacation. We have only encountered this issue once while using the gRPC balancer, and we are currently unsure how to reproduce it. But in practice, simply checking the result of the type assertion would avoid this problem. What do you think?

for _, addr := range b.subConns.Keys() {
		sci, _ := b.subConns.Get(addr)
		sc, ok := sci.(balancer.SubConn)
		if !ok {
			continue
		}
		if st, ok := b.scStates[sc]; ok && st == connectivity.Ready {
			readySCs[sc] = SubConnInfo{Address: addr}
		}
	}

@shenqidebaozi

I searched our logging system and did not find any similar error messages in either the testing or production environments. The problem has occurred only once, in production, so I am currently unsure how to reproduce it. But it caused the program to panic, which made some of our requests fail.

@dfawley
Member

dfawley commented Oct 8, 2023

Adding defensive programming would prevent the panic, but it would also cover up an issue that should be impossible unless there is some other problem, and that other problem could lead to outcomes even worse than a panic (e.g. a dead client not connected to any addresses). I'd rather leave in the potential panic to help us uncover and fix any bugs like that, but from code inspection, I don't see how anything like that is possible here.

@shenqidebaozi

shenqidebaozi commented Oct 9, 2023

Yes, but I searched past logs and did not find any similar issues, and I'm not sure how to reproduce this problem at the moment. I have only encountered this issue once since we started using gRPC, and no other colleagues have reported it. If it were an obvious code issue we would have found it quickly; given how rarely it happens, it may take a fairly extreme environment to trigger. 😞

If we don't have some way to avoid it, the resulting process exit will cause some requests to fail, which is a situation we don't want to see.

@shenqidebaozi

Due to the lack of relevant logs, it is difficult for us to troubleshoot the issue from the panic's stack trace alone. Can we consider adding defensive programming and increasing log printing in abnormal situations, such as printing the current subConns contents, to provide assistance in troubleshooting this issue?

@dylan-bourque

We have another team that's reported the same panic in baseBalancer.regeneratePicker() this morning with the same stack trace as I posted earlier.

@dylan-bourque

Possibly relevant detail: we do have a custom, client-side LB implementation that has broken several times over the last few releases as the internals of the LB code have changed. I see quite a few changes inside of the balancer/ sub-package, including some deprecation notices.

@dfawley
Member

dfawley commented Oct 11, 2023

Possibly relevant detail: we do have a custom, client-side LB implementation that has broken several times over the last few releases as the internals of the LB code have changed. I see quite a few changes inside of the balancer/ sub-package, including some deprecation notices.

Is it possible it's your custom LB code that is causing the panics here? Note that our custom LB support is currently still marked as experimental, meaning breaking changes and deprecation are not unexpected.

Can we consider adding defensive programming and increasing log printing

The problem is the failure mode can be even worse if we do this. If a server crashes, it's usually just restarted automatically. If a bug happens that makes it not have any open connections, then it will be stuck.

@dylan-bourque

dylan-bourque commented Oct 11, 2023

Is it possible it's your custom LB code

I've thought about that but our code is not in any stack trace for these crashes.

breaking changes and deprecation are not unexpected.

I'm well aware. We've been bitten by this 3 or 4 times already, including one where we had to do an almost complete rewrite.

To be fair, there's very little documentation around the LB stuff and nothing that says "the behavior of SubConn is changing from X to Y" so I can only take guesses about what might be going on based on digging through the gRPC code. For now we have service owners using replace or exclude directives in go.mod to forcibly downgrade to an older release that doesn't panic.

@dfawley
Member

dfawley commented Oct 11, 2023

If you have use cases that aren't served without using the experimental APIs then feel free to file another issue to discuss, and we'll see what we can do.

@shenqidebaozi

shenqidebaozi commented Oct 12, 2023

The problem is the failure mode can be even worse if we do this. If a server crashes, it's usually just restarted automatically. If a bug happens that makes it not have any open connections, then it will be stuck.

@dfawley I have no objection to exiting the program to preserve correct behavior, but could you provide users with some hooks to run before exiting, or print detailed debug information before exiting, to help with troubleshooting?

@dfawley
Member

dfawley commented Oct 16, 2023

some hooks to run before exiting

I'm not sure what you have in mind here. Can you give an example?

Are you using a custom LB policy? I do see https://github.com/go-kratos/kratos/blob/main/transport/grpc/balancer.go near the code from your stack trace, and it's possible it's the source of the problem.

@dylan-bourque

dylan-bourque commented Oct 17, 2023

@dfawley we got a slightly different error today which may help pin this down (at least for us).

fatal error: concurrent map read and map write

goroutine 259 [running]:
google.golang.org/grpc/balancer/base.(*baseBalancer).regeneratePicker(0xc000199e80)
	google.golang.org/grpc@v1.58.1/balancer/base/balancer.go:178 +0x357
google.golang.org/grpc/balancer/base.(*baseBalancer).UpdateClientConnState(0xc000199e80, {{{0xc004f12500, 0xa, 0xa}, {0xc001284dc0, 0x1, 0x1}, 0x0, 0x0}, {0x0, ...}})
	google.golang.org/grpc@v1.58.1/balancer/base/balancer.go:144 +0x905
<redacted internal code>

Our custom LB is calling UpdateClientConnState() explicitly after a change to its state. Do you know if the semantics of that method have changed?

The read is here and there are two writes to that map here and here. My guess, based on this latest panic, is that there's a concurrent call to UpdateClientConnState() happening between our LB and the gRPC runtime.

@dfawley
Member

dfawley commented Oct 17, 2023

My guess, based on this latest panic, is that there's a concurrent call to UpdateClientConnState() happening between our LB and the gRPC runtime.

If that's true then your custom LB policy is violating the requirements of the API:

https://pkg.go.dev/google.golang.org/grpc/balancer#Balancer

UpdateClientConnState, ResolverError, UpdateSubConnState, and Close are guaranteed to be called synchronously from the same goroutine. There's no guarantee on picker.Pick, it may be called anytime.

You should only call into this LB policy from your LB policy in response to the above calls (since they are guaranteed to be synchronous), or you must ensure you have other synchronization in place to guarantee the above requirement is met. Note that UpdateSubConnState and NewSubConnOptions.StateListener (which will replace UpdateSubConnState) must have the same semantics; this comment is slightly out of date in that it doesn't mention StateListener.

I suspect this is also what's happening in the kratos LB policy, though I don't have the time to fully understand that system to determine whether that's true for sure.

@dylan-bourque

dylan-bourque commented Oct 17, 2023

Thanks for the info. I'll see what updates we can make to our code.

I have to mention, though, that our code has been running as-is in dozens of services and 100s of nodes without issues for years, including after I personally did a major overhaul in 2020 to make it adhere to the "new" client-side LB framework. The panics are new since we've upgraded to grpc@v1.58.x. It's certainly possible that we've just never run into this crash before by sheer luck, but I doubt it. 😞

Before I start digging, what is the "proper" way for my LB to tell gRPC to release a connection sub-conn when our policy determines it needs to be discarded?

@dylan-bourque

dylan-bourque commented Oct 17, 2023

and NewSubConnOptions.StateListener (which will replace UpdateSubConnState)

Looks like the only way to use NewSubConnOptions is by calling ClientConn.NewSubConn() directly. We're not doing that today (we're adjusting the list of addresses in UpdateClientConnState() then delegating to the base balancer that we're wrapping). That NewSubConn() method also has a deprecation notice on it.

// NewSubConn is called by balancer to create a new SubConn.
// It doesn't block and wait for the connections to be established.
// Behaviors of the SubConn can be controlled by options.
//
// Deprecated: please be aware that in a future version, SubConns will only
// support one address per SubConn.
NewSubConn([]resolver.Address, NewSubConnOptions) (SubConn, error)

Not sure what the right path forward should be.

@dfawley
Member

dfawley commented Oct 17, 2023

Before I start digging, what is the "proper" way for my LB to tell gRPC to release a connection sub-conn when our policy determines it needs to be discarded?

<SubConn>.Shutdown() is the method to use. However, if you're delegating connection management to the base balancer, then you probably want to remove the address behind that subchannel (grpc terminology for grpc-go's SubConn) from the list of addresses passed to UpdateClientConnState when calling that method on the base balancer. If you are shutting down subchannels created by another LB policy out from underneath it, then bad things will likely result.
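
To illustrate, here is a minimal sketch of that filtering approach, assuming a wrapper balancer that embeds the child built from base.NewBalancerBuilder; the shouldUse predicate is hypothetical and not part of any gRPC API:

package example

import (
	"google.golang.org/grpc/balancer"
	"google.golang.org/grpc/resolver"
)

// filteringBalancer wraps the child built from
// base.NewBalancerBuilder(...).Build(cc, opts) and drops unwanted addresses
// before delegating to it.
type filteringBalancer struct {
	balancer.Balancer // the wrapped base balancer
}

func (b *filteringBalancer) UpdateClientConnState(s balancer.ClientConnState) error {
	kept := make([]resolver.Address, 0, len(s.ResolverState.Addresses))
	for _, a := range s.ResolverState.Addresses {
		if shouldUse(a) { // hypothetical predicate supplied by the policy
			kept = append(kept, a)
		}
	}
	s.ResolverState.Addresses = kept
	return b.Balancer.UpdateClientConnState(s)
}

// shouldUse is a hypothetical placeholder for the policy's own filtering.
func shouldUse(resolver.Address) bool { return true }

Because the base balancer diffs the address list on each UpdateClientConnState call, dropping an address from the list is enough for it to shut down the corresponding subchannel.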

That NewSubConn() method also has a deprecation notice on it.

The deprecation is support for multiple addresses per SubConn, not the method itself.

Looks like the only way to use NewSubConnOptions is by calling ClientConn.NewSubConn() directly. We're not doing that today

If you don't care about the state updates of that subchannel, then you can ignore all of that. If you do need the updates, then you'll need to wrap the ClientConn you pass to the base balancer when building it, and override the NewSubConn method to wrap the StateListener set by the base balancer to intercept it and then forward it along to the base balancer's listener.

For an example of this, you can look at where we do something similar here:

oldListener := opts.StateListener
opts.StateListener = func(state balancer.SubConnState) { bw.gsb.updateSubConnState(sc, state, oldListener) }
sc, err := bw.gsb.cc.NewSubConn(addrs, opts)
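
Expanding on that snippet, a rough sketch of such a wrapper, assuming it is the balancer.ClientConn handed to the base balancer at Build time; ccWrapper and onStateChange are hypothetical names:

package example

import (
	"google.golang.org/grpc/balancer"
	"google.golang.org/grpc/resolver"
)

// ccWrapper is passed to the base balancer in place of the real ClientConn
// so subchannel state changes can be observed before being forwarded.
type ccWrapper struct {
	balancer.ClientConn // the real ClientConn from the channel
	onStateChange func(sc balancer.SubConn, state balancer.SubConnState) // hypothetical hook
}

func (w *ccWrapper) NewSubConn(addrs []resolver.Address, opts balancer.NewSubConnOptions) (balancer.SubConn, error) {
	baseListener := opts.StateListener // listener installed by the base balancer
	var sc balancer.SubConn
	opts.StateListener = func(state balancer.SubConnState) {
		// The listener is only invoked after NewSubConn returns, so sc is set.
		if w.onStateChange != nil {
			w.onStateChange(sc, state) // observe the update
		}
		if baseListener != nil {
			baseListener(state) // then forward it to the base balancer
		}
	}
	var err error
	sc, err = w.ClientConn.NewSubConn(addrs, opts)
	return sc, err
}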

@dylan-bourque

@dfawley thanks for the additional info and pointers. I'm making some progress but there's still a gap. I can't figure out how to poke the base balancer to create a new sub-conn after I call Shutdown() on the one I want to discard.

The end result I need is for a sub-conn to be closed/discarded and a new one created. Is that not possible using the current API?

@dfawley
Member

dfawley commented Oct 18, 2023

how to poke the base balancer to create a new sub-conn after I call Shutdown() on the one I want to discard

You should not be calling Shutdown() if you're delegating subchannel management to the base balancer. I suspect you just want to filter addresses you pass to it instead.

The end result I need is for a sub-conn to be closed/discarded and a new one created. Is that not possible using the current API?

The current API allows you to do just about anything, but the base balancer is very limited. If you need more control than what the base balancer allows, you'll want to just do everything yourself from scratch and not use the base balancer. Basically, what the base balancer does is connect to every address it is given, and then it calls your picker builder to allow you to write the logic to determine which connection to use for which RPCs. If you don't want to connect to all addresses, you could filter out the ones you don't want to connect to, but at some point you aren't gaining much by using the base balancer, and would be better off doing everything from scratch.
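
As an illustration of that split, here is a minimal sketch of a policy built on the base balancer with a simple round-robin picker; the policy name "example_round_robin" is made up:

package example

import (
	"sync/atomic"

	"google.golang.org/grpc/balancer"
	"google.golang.org/grpc/balancer/base"
)

type pickerBuilder struct{}

// Build is called by the base balancer whenever the set of READY
// subchannels changes; it only has to choose among those subchannels.
func (pickerBuilder) Build(info base.PickerBuildInfo) balancer.Picker {
	if len(info.ReadySCs) == 0 {
		return base.NewErrPicker(balancer.ErrNoSubConnAvailable)
	}
	scs := make([]balancer.SubConn, 0, len(info.ReadySCs))
	for sc := range info.ReadySCs {
		scs = append(scs, sc)
	}
	return &rrPicker{scs: scs}
}

type rrPicker struct {
	scs  []balancer.SubConn
	next uint32
}

func (p *rrPicker) Pick(balancer.PickInfo) (balancer.PickResult, error) {
	i := atomic.AddUint32(&p.next, 1)
	return balancer.PickResult{SubConn: p.scs[int(i)%len(p.scs)]}, nil
}

func init() {
	balancer.Register(base.NewBalancerBuilder("example_round_robin", pickerBuilder{}, base.Config{HealthCheck: true}))
}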

I still would like to hear about your use cases and learn why the LB policies we provide out of the box are insufficient, as using these APIs at all is not recommended due to their experimental status and planned upcoming breakages.

@dylan-bourque

I suspect you just want to filter addresses you pass to it instead.

This would work perfectly if there was a way for me to pass a new list of addresses at runtime, but that doesn't seem to be possible.

I still would like to hear about your use cases

The specific use case we have is redistributing load when the back-end is scaled up or down behind a load balancer (NLB or ALB). Client connections are "sticky" so that once a connection is established all requests from that client go to the same node on the back-end. That becomes a problem for us when the clients are themselves long-running services.

Assume six clients, A, B, C, D, E, and F, connecting to service S that's running on 10 nodes behind an LB. If we scale S up to 20 nodes, none of the traffic from those existing clients gets spread to the new nodes unless a connection to one of the original nodes is broken.

What we've done is implement a client-side LB policy that creates N connections by copying the address passed into UpdateClientConnState and adding an attribute to the resolver.Address instances so they are unique, then passing that list of N addresses to the base balancer. This is paired with a round-robin picker. We also kick off a goroutine that periodically discards one of those N addresses and generates a new one by again adding an attribute to the "original" address. Today that goroutine is calling UpdateClientConnState() on the base balancer, which is what's leading to the concurrent map read/write.
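
For reference, a minimal sketch of that address-duplication step, assuming connCount copies per address; connIndexKey and expand are hypothetical names, not our actual implementation:

package example

import (
	"google.golang.org/grpc/attributes"
	"google.golang.org/grpc/resolver"
)

// connIndexKey is a hypothetical private attribute key; distinct values make
// otherwise-identical addresses unique, so the base balancer opens one
// subchannel per copy.
type connIndexKey struct{}

func expand(addrs []resolver.Address, connCount int) []resolver.Address {
	out := make([]resolver.Address, 0, len(addrs)*connCount)
	for _, a := range addrs {
		for i := 0; i < connCount; i++ {
			c := a
			if c.Attributes == nil {
				c.Attributes = attributes.New(connIndexKey{}, i)
			} else {
				c.Attributes = c.Attributes.WithValue(connIndexKey{}, i)
			}
			out = append(out, c)
		}
	}
	return out
}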

Like I mentioned before, this code has been running without issue for a very long time. The original implementation pre-dates me and I refactored it to work with the V2 client-side balancer APIs in 2020. Other than that, though, this logic has worked for us since it was put in place 5+ years ago.

It seems I'll have to build a from-scratch custom balancer unless you can suggest an out-of-the-box policy that will solve for redistributing load across new nodes after a scale-up behind a NLB/ALB.

@dfawley
Member

dfawley commented Oct 18, 2023

This would work perfectly if there was a way for me to pass a new list of addresses at runtime, but that doesn't seem to be possible.

I'm not sure what you mean by "at runtime", but if you mean "asynchronously with the channel's (ClientConn's) updates", i.e. outside of calls to UpdateClientConnState or the subchannel's state listeners, then the easiest way to do that today would be to hold a lock while calling into the base balancer (including when the channel would directly call into its methods; those would need to be intercepted).
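
A rough sketch of that locking approach, assuming a wrapper around the base balancer; serializedBalancer, refresh, and the replayed last state are hypothetical, and subchannel state updates (StateListener/UpdateSubConnState) would need to take the same lock:

package example

import (
	"sync"

	"google.golang.org/grpc/balancer"
)

// serializedBalancer funnels every call into the child (base) balancer
// through one mutex, so an out-of-band goroutine and the channel's own
// callbacks never run into the child concurrently.
type serializedBalancer struct {
	mu    sync.Mutex
	child balancer.Balancer // e.g. built from base.NewBalancerBuilder(...).Build(cc, opts)
	last  balancer.ClientConnState
}

func (b *serializedBalancer) UpdateClientConnState(s balancer.ClientConnState) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.last = s
	return b.child.UpdateClientConnState(s)
}

func (b *serializedBalancer) ResolverError(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.child.ResolverError(err)
}

func (b *serializedBalancer) Close() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.child.Close()
}

// refresh is what a periodic goroutine would call instead of invoking the
// child directly; it replays the last known state under the same lock.
func (b *serializedBalancer) refresh() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.child.UpdateClientConnState(b.last)
}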

As for your use case..

What name resolver are you using? Just DNS or something custom? It sounds like you have just one address that you actually connect to (the LB), and that creates connections to the backend servers?

If I'm understanding you correctly, we have a very similar architecture for Google's public cloud gRPC services. The solution they are using today is to use a connection pool of gRPC channels instead of doing anything inside the channel itself, and doing round robin across the channels. You can find their implementation here (and potentially use it for yourself directly). The only thing this is missing is the "reconnect periodically" functionality you have; we typically recommend implementing that by setting a max connection age limit on the gRPC server instead: https://pkg.go.dev/google.golang.org/grpc/keepalive#ServerParameters.
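
For reference, a minimal sketch of that server-side setting; the durations and listen address are arbitrary, and service registration is omitted:

package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// MaxConnectionAge makes the server gracefully close client connections
	// after a while, so long-lived clients reconnect and spread load across
	// newly added backends.
	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge:      5 * time.Minute,
		MaxConnectionAgeGrace: 30 * time.Second, // let in-flight RPCs finish
	}))
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}
	// Register services on srv here before serving.
	if err := srv.Serve(lis); err != nil {
		log.Fatalf("failed to serve: %v", err)
	}
}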

@github-actions

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label Oct 24, 2023
@github-actions github-actions bot closed this as completed Nov 1, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 29, 2024