Update grpc-core, grpc-interop-testing, ... to 1.29.0 #933

Merged
merged 10 commits into akka:master on May 11, 2020

Conversation

scala-steward (Contributor)

Updates io.grpc dependencies grpc-core, grpc-interop-testing, ... from 1.28.1 to 1.29.0.
GitHub Release Notes - Version Diff

I'll automatically update this PR to resolve conflicts as long as you don't change it yourself.

If you'd like to skip this version, you can just close this PR. If you have any feedback, just mention me in the comments below.

Configure Scala Steward for your repository with a .scala-steward.conf file.

Have a fantastic day writing Scala!

Ignore future updates

Add this to your .scala-steward.conf file to ignore future updates of this dependency:

updates.ignore = [ { groupId = "io.grpc" } ]

labels: library-update, semver-minor

@ignasi35 (Member)

I think the LoadBalancingIntegrationSpec is consistently failing in this PR because of changes in grpc-java regarding the ConnectivityState handling, affecting the logic in:

@InternalApi
private[akka] def monitorChannel(
    done: Promise[Done],
    channel: ManagedChannel,
    maxConnectionAttempts: Option[Int],
    log: LoggingAdapter): Unit = {
  def monitor(currentState: ConnectivityState, connectionAttempts: Int): Unit = {
    log.debug(s"monitoring with state $currentState and connectionAttempts $connectionAttempts")
    val newAttemptOpt = currentState match {
      case ConnectivityState.TRANSIENT_FAILURE =>
        if (maxConnectionAttempts.contains(connectionAttempts + 1)) {
          // shutdown is idempotent in ManagedChannelImpl
          channel.shutdown()
          done.tryFailure(
            new ClientConnectionException(s"Unable to establish connection after [$maxConnectionAttempts]"))
          None
        } else Some(connectionAttempts + 1)
      case ConnectivityState.READY =>
        Some(0)
      case ConnectivityState.SHUTDOWN =>
        done.trySuccess(Done)
        None
      case ConnectivityState.IDLE | ConnectivityState.CONNECTING =>
        Some(connectionAttempts)
    }
    newAttemptOpt.foreach { attempts =>
      channel.notifyWhenStateChanged(currentState, () => monitor(channel.getState(false), attempts))
    }
  }
  monitor(channel.getState(false), 0)
}

Reading the release notes for 1.29.0 I suspect the issue is related to grpc/grpc-java#6657.

Before the upgrade the test would output:

monitoring with state IDLE and connectionAttempts 0
monitoring with state CONNECTING and connectionAttempts 0
monitoring with state TRANSIENT_FAILURE and connectionAttempts 0
monitoring with state CONNECTING and connectionAttempts 1
monitoring with state TRANSIENT_FAILURE and connectionAttempts 1

But according to grpc/grpc-java#6657 the subchannel now remains in TRANSIENT_FAILURE (instead of bouncing back to CONNECTING between attempts, as in the log above), so the counter logic in ChannelUtils, which only advances on state transitions, doesn't make sense in the TRANSIENT_FAILURE case anymore.

I think we'd need a scheduler with a timeout or similar... 🤔

val failure =
  client.sayHello(HelloRequest(s"Hello friend")).failed.futureValue.asInstanceOf[StatusRuntimeException]
failure.getStatus.getCode should be(Code.UNAVAILABLE)
client.closed.failed.futureValue shouldBe a[ClientConnectionException]
Member

Removing this line fixes the test.
grpc-java 1.29.0 changed the way subchannel events are notified (in load-balanced situations). With that change, the ChannelUtils listener for channel state changes no longer works when using load-balanced channels because it can no longer count reconnection attempts. As a consequence, a scenario like the one in this test no longer fails but retries indefinitely.

Once I realized the client would keep reconnecting indefinitely I decided the test could be removed.

And, if we remove the test and then start pulling the thread, we should probably deprecate withConnectionAttempts(2) (or maybe re-implement the underlying logic, if we consider that a connectionAttempts counter still makes sense now that we are using the internal NameResolver engine provided by grpc-java).

Before pushing forward and deleting all usages of connectionAttempts I preferred raising this comment for discussion.

Member

I guess it makes sense to follow the lead of the reference implementation and just keep retrying when loadbalancing, then.

if we remove the test, and then start pulling the thread, we should probably deprecate withConnectionAttempts(2)

Isn't it still used in the non-loadbalanced scenario?

Member

Isn't it still used in the non-loadbalanced scenario?

Yeah, I had the same thought over breakfast earlier today. I don't think we have a test for that and, even if we had one, we should improve the documentation or API around withConnectionAttempts.

My main concern is whether we should improve the load-balanced use-cases. For example:

  • in situations where none of the records returned by the service discovery are valid, I'd like to see the aggregated channel fail eventually (like it does in non-balanced usages). Or, at least, I'd like to see failure traces in the logs (I think we can have a look at the logging settings for that).

One of the reasons this took me a while to diagnose was that the test code (aka user code) had zero visibility into the underlying failures. Also, I (the developer) didn't see any apparent failures or retries happening even with DEBUG logs.

Member

in situations where none of the records returned by the service discovery are valid, I'd like to see the aggregated channel fail eventually

I agree. If we don't have any insight into the progress of reconnecting / the number of reconnects, I guess the best we could do is wait for a timeout: if a channel remains in TRANSIENT_FAILURE for X amount of time, assume it won't ever successfully connect and fail.

Does the reference implementation have any such timeout?
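
For illustration, a minimal sketch of that timeout idea (the monitorWithDeadline name and the maxFailureDuration parameter are hypothetical, not existing ChannelUtils API; it assumes a classic ActorSystem scheduler and plain io.grpc channel state callbacks):

import scala.concurrent.{ ExecutionContext, Promise }
import scala.concurrent.duration.FiniteDuration
import akka.Done
import akka.actor.ActorSystem
import io.grpc.{ ConnectivityState, ManagedChannel }

// Hypothetical variant of ChannelUtils.monitorChannel: instead of counting
// reconnection attempts, fail `done` if the channel stays in TRANSIENT_FAILURE
// for longer than `maxFailureDuration`.
def monitorWithDeadline(
    done: Promise[Done],
    channel: ManagedChannel,
    maxFailureDuration: FiniteDuration)(implicit system: ActorSystem): Unit = {
  implicit val ec: ExecutionContext = system.dispatcher

  def watch(currentState: ConnectivityState): Unit =
    currentState match {
      case ConnectivityState.SHUTDOWN =>
        done.trySuccess(Done)
      case ConnectivityState.TRANSIENT_FAILURE =>
        // Arm a timer; if the state is unchanged when it fires, give up.
        val deadline = system.scheduler.scheduleOnce(maxFailureDuration) {
          if (channel.getState(false) == ConnectivityState.TRANSIENT_FAILURE) {
            channel.shutdown()
            // the real code would use akka-grpc's ClientConnectionException here
            done.tryFailure(new RuntimeException(
              s"Channel remained in TRANSIENT_FAILURE for more than $maxFailureDuration"))
          }
        }
        channel.notifyWhenStateChanged(currentState, () => {
          deadline.cancel()
          watch(channel.getState(false))
        })
      case _ =>
        channel.notifyWhenStateChanged(currentState, () => watch(channel.getState(false)))
    }

  watch(channel.getState(false))
}

This trades the attempt counter for a wall-clock budget, which should keep working regardless of whether the subchannels re-enter CONNECTING between attempts.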

Member

Isn't it still used in the non-loadbalanced scenario?

Yeah, I had the same thought over breakfast earlier today. I don't think we have a test for that and, even if we had one, we should improve the documentation or API around withConnectionAttempts.


I had the answer above drafted and then I realized that I don't know how to control whether the Channel provided is a single Channel or a load-balanced one with multiple sub-channels.

Maybe we're no longer using a non-load-balanced client at all. The client factory methods all create a ServiceDiscovery instance (maybe akka-dns, maybe config, maybe other) and, since we always inject a NameResolver based on that ServiceDiscovery, it's possible we're always running a Channel with sub-channel(s).

I need to dig deeper into that.

@ignasi35 (Member) Apr 24, 2020

(I had multiple GH windows and posted 2 variants of the same comment)

Ok, so with grpc-java 1.29.0 we should keep withConnectionAttempts around but it'll only be used when withGrpcLoadBalancingType is not invoked. I've just tested that and it works.

I think we have to improve the documentation on several points:

  1. when withGrpcLoadBalancingType is used, withConnectionAttempts is ignored, the underlying (~~akka-grpc~~ grpc-java) implementation of backoff reconnecting is used indefinitely, and client.closed only completes when client.close() is directly invoked. Note: TIL the load balancing in the akka-grpc implementation is per call.
  2. when withGrpcLoadBalancingType is not used but a ServiceDiscovery is provided, all calls are sent to a single server instance, but users can map over client.closed to reconnect, which will trigger a new ServiceDiscovery lookup (a settings sketch contrasting both modes follows below).
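
A small settings sketch for the docs, contrasting the two modes (the host and port are hypothetical placeholders; it assumes the existing GrpcClientSettings.connectToServiceAt, withGrpcLoadBalancingType and withConnectionAttempts methods):

import akka.actor.ActorSystem
import akka.grpc.GrpcClientSettings

implicit val system: ActorSystem = ActorSystem("client")

// Mode 1: load-balanced. withConnectionAttempts is effectively ignored;
// grpc-java keeps reconnecting with its own backoff and client.closed
// only completes after an explicit client.close().
val balanced =
  GrpcClientSettings
    .connectToServiceAt("example.local", 8080) // hypothetical endpoint
    .withGrpcLoadBalancingType("round_robin")

// Mode 2: no load balancing. Calls go to a single resolved instance; after
// the configured number of failed connection attempts the client fails and
// client.closed completes, so user code can decide to reconnect (triggering
// a fresh ServiceDiscovery lookup).
val singleTarget =
  GrpcClientSettings
    .connectToServiceAt("example.local", 8080) // hypothetical endpoint
    .withConnectionAttempts(2)

The values shown (round_robin, 2 attempts) just mirror what the test and the discussion above use.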

Member

Does the reference implementation have any such timeout?

I'm reading docs here and there trying to see what other choices we have.

The upstream default is pick_first (no loadbalancing...

Again! 🤦🏼‍♂️ I should refresh the page before answering...

Member

the underlying (akka-grpc) implementation for backoff reconnecting will be used

Do you mean grpc-java?

TIL the load balancing in the akka-grpc implementation is per call

What do you mean?

all calls are sent to a single server instance but users can map over client.closed to reconnect again which will trigger a new ServiceDiscovery lookup.

I would expect that when that single server instance fails, grpc-java will automatically fail over to another instance (triggering a new ServiceDiscovery lookup) as needed. Do you think we need to do something extra there?

In any case it would be good to add tests for those scenarios.

Member

(🤦🏼‍♂️ Today is not my day)

In both statements above I meant grpc-java (not akka.grpc).

TIL the load balancing in the ~~akka-grpc~~ grpc-java implementation is per call

🤦🏼‍♂️ It is round-robin! What did I expect?

Member

I would expect that when that single server instance fails, grpc-java will automatically fail over to another instance (triggering a new ServiceDiscovery lookup) as needed. Do you think we need to do something extra there?

I think our code listening on the number of re-connection attempts is the issue here and we may want to remove it once and for all. I mean, having moved to an implementation based on NameResolver, even when using pick_first you are probably correct that grpc-java is doing a new name resolution. So it doesn't make sense to close after a certain number of reconnection attempts. I mean, we could keep the code around for logging purposes (even though that code is useless in round-robin), but we can probably drop the withConnectionAttempts setting and the closed future.
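
A rough sketch of what a logging-only monitor could look like (the logChannelStateChanges helper is hypothetical, not existing ChannelUtils code):

import akka.event.LoggingAdapter
import io.grpc.{ ConnectivityState, ManagedChannel }

// Hypothetical logging-only replacement for the counter logic: log the
// current state and every subsequent transition, never shut the channel down.
def logChannelStateChanges(channel: ManagedChannel, log: LoggingAdapter): Unit = {
  def watch(currentState: ConnectivityState): Unit = {
    log.debug("channel state is now {}", currentState)
    if (currentState != ConnectivityState.SHUTDOWN)
      channel.notifyWhenStateChanged(currentState, () => watch(channel.getState(false)))
  }
  watch(channel.getState(false))
}

That keeps the DEBUG visibility mentioned above without tying client.closed to a reconnection count.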

In any case it would be good to add tests for those scenarios.

+1

raboof (Member) commented May 7, 2020

Solid analysis, I think we have a good grasp of what's going on here.

I think 'just retrying indefinitely' does not provide enough observability. I think we can and should give the option to fail when connecting fails. As a future improvement we could expose the Future that is resolved when the channel becomes 'ready' and/or allow configuring a timeout. Perhaps that isn't needed yet though.
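
A rough sketch of how such a readiness Future could be exposed (the channelReady helper is hypothetical, not an existing akka-grpc API):

import scala.concurrent.{ Future, Promise }
import akka.Done
import io.grpc.{ ConnectivityState, ManagedChannel }

// Hypothetical helper: a Future that completes the first time the channel
// reaches READY (it never completes if the channel never gets there, so a
// caller would typically combine it with a timeout).
def channelReady(channel: ManagedChannel): Future[Done] = {
  val ready = Promise[Done]()
  def watch(currentState: ConnectivityState): Unit =
    if (currentState == ConnectivityState.READY) ready.trySuccess(Done)
    else channel.notifyWhenStateChanged(currentState, () => watch(channel.getState(false)))
  watch(channel.getState(false))
  ready.future
}

A caller could then combine that with a configurable timeout, as suggested above.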

raboof requested a review from ignasi35 May 7, 2020
raboof (Member) commented May 7, 2020

(this does suggest it'd be good to have a different setting for 'retry before becoming ready' and 'retry on failures after becoming ready', but master doesn't have that either, so that can be a future improvement)

raboof merged commit 73c91b2 into akka:master May 11, 2020