Update grpc-core, grpc-interop-testing, ... to 1.29.0 #933
Conversation
Reading the release notes for 1.29.0 I suspect the issue is related to grpc/grpc-java#6657. Before the upgrade the test would output:
But according to grpc/grpc-java#6657, the subchannel now remains in TRANSIENT_FAILURE. I think we'd need a scheduler with a timeout or similar... 🤔
val failure =
  client.sayHello(HelloRequest(s"Hello friend")).failed.futureValue.asInstanceOf[StatusRuntimeException]
failure.getStatus.getCode should be(Code.UNAVAILABLE)
client.closed.failed.futureValue shouldBe a[ClientConnectionException]
Removing this line fixes the test.
grpc-java 1.29.0 changed the way subchannel events are notified (in load-balanced situations). With that change, the ChannelUtils listener for channel changes no longer works when using load-balanced channels because it can no longer count reconnection attempts. As a consequence, a scenario like the one in the test will no longer fail but retry indefinitely.
Once I realized the reconnection would retry indefinitely, I decided the test could be removed.
And, if we remove the test and then start pulling on the thread, we should probably deprecate withConnectionAttempts(2) (or maybe re-implement the underlying logic, if we consider that a connectionAttempts counter still makes sense now that we are using the internal NameResolver engine provided by grpc-java).
Before pushing forward and deleting all usages of connectionAttempts I preferred raising this comment for discussion.
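For reference, the kind of attempt counting being discussed could be sketched against the plain grpc-java channel-state API roughly as below. This is only an illustration of the idea, not the actual ChannelUtils code; the endpoint and the attempt limit are placeholders.

```scala
import io.grpc.{ ConnectivityState, ManagedChannel, ManagedChannelBuilder }

// Illustration only, not the actual ChannelUtils code: count how often the channel
// enters TRANSIENT_FAILURE and give up once a maximum number of attempts is reached.
def watchReconnects(channel: ManagedChannel, maxAttempts: Int)(onGiveUp: () => Unit): Unit = {
  def loop(failures: Int): Unit = {
    val current = channel.getState(false)
    val seenFailures =
      if (current == ConnectivityState.TRANSIENT_FAILURE) failures + 1 else failures
    if (seenFailures >= maxAttempts) onGiveUp()
    else channel.notifyWhenStateChanged(current, () => loop(seenFailures))
  }
  loop(0)
}

// Hypothetical usage against a placeholder endpoint:
val channel = ManagedChannelBuilder.forAddress("127.0.0.1", 8080).usePlaintext().build()
watchReconnects(channel, maxAttempts = 2)(() => channel.shutdown())
```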
I guess it makes sense to follow the lead of the reference implementation and just keep retrying when load balancing, then.
if we remove the test and then start pulling on the thread, we should probably deprecate withConnectionAttempts(2)
Isn't it also used in the non-loadbalanced scenario?
Isn't it also used in the non-loadbalanced scenario?
Yeah, I had the same thought over breakfast earlier today. I think we don't have a test for that and, even if we had such a test, we should improve the documentation or API around withConnectionAttempts.
My main concern is whether we should improve the load-balanced use cases. For example:
- in situations where none of the records returned by the service discovery are valid, I'd like to see the aggregated channel fail eventually (like it does in non-balanced usages). Or, at least, I'd like to see failure traces in the logs (I think we can have a look at logging settings for that).

One of the reasons this took me a while to diagnose was that the test code (aka user code) had zero visibility into the underlying failures. Also, I (the developer) didn't see any apparent failure or retries happening even with DEBUG logs.
in situations where none of the records returned by the service discovery are valid, I'd like to see the aggregated channel fail eventually
I agree. If we don't have any insight into the progress of reconnecting / number of reconnects, I guess the best we could do is wait for a timeout: if a channel remains in TRANSIENT_FAILURE for X amount of time, assume it won't ever successfully connect and fail.
Does the reference implementation have any such timeout?
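One possible shape for such a timeout, sketched against grpc-java's channel-state API plus a plain scheduler. This is purely illustrative; the function name, the duration, and the failure handling are placeholder assumptions, not a proposal for the final API.

```scala
import java.util.concurrent.{ Executors, ScheduledFuture, TimeUnit }
import io.grpc.{ ConnectivityState, ManagedChannel }

// Sketch only: if the channel stays in TRANSIENT_FAILURE for longer than `timeoutSeconds`,
// assume it will never connect and invoke `onFailure`. Names and values are placeholders.
def failAfterTransientFailure(channel: ManagedChannel, timeoutSeconds: Long)(onFailure: () => Unit): Unit = {
  val scheduler = Executors.newSingleThreadScheduledExecutor()
  var pending: Option[ScheduledFuture[_]] = None

  def watch(): Unit = {
    val current = channel.getState(false)
    current match {
      case ConnectivityState.TRANSIENT_FAILURE if pending.isEmpty =>
        // Start the clock the first time we observe TRANSIENT_FAILURE.
        pending = Some(scheduler.schedule(new Runnable { def run(): Unit = onFailure() }, timeoutSeconds, TimeUnit.SECONDS))
      case ConnectivityState.READY =>
        // The channel recovered, so cancel the pending failure.
        pending.foreach(_.cancel(false))
        pending = None
      case _ => ()
    }
    channel.notifyWhenStateChanged(current, () => watch())
  }

  watch()
}
```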
Isn't it also used in the non-loadbalanced scenario?
Yeah, I had the same thought over breakfast earlier today. I think we don't have a test for that and, even if we had such a test, we should improve the documentation or API around withConnectionAttempts.
I had the answer above drafted and then I realized that I don't know how to control whether the Channel provided is a single Channel or a load-balanced one with multiple sub-channels.
Maybe we're no longer using a non-load-balanced client. The client factory methods all create a ServiceDiscovery instance (maybe akka-dns, maybe config, maybe other) and, since we always inject a NameResolver based on that ServiceDiscovery, it's possible we're always running a Channel with sub-channel(s).
I need to dig deeper into that.
(I had multiple GH windows and posted 2 variants of the same comment)
Ok, so with grpc-java 1.29.0 we should keep withConnectionAttempts around, but it'll only be used when withGrpcLoadBalancingType is not invoked. I've just tested that and it works.
I think we have to improve the documentation with multiple things (see the settings sketch after this list):
- when withGrpcLoadBalancingType is used, withConnectionAttempts is ignored, the underlying (~~akka-grpc~~ grpc-java) implementation for backoff reconnecting will be used indefinitely, and client.closed only completes when client.close() is directly invoked. Note: TIL the load balancing in the akka-grpc implementation is per call.
- when withGrpcLoadBalancingType is not used, but a ServiceDiscovery is provided, all calls are sent to a single server instance but users can map over client.closed to reconnect again, which will trigger a new ServiceDiscovery lookup.
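To make those two modes concrete, here is a rough sketch of what the client settings could look like on the user side; the service name, host, port, and load-balancing policy string are placeholder assumptions.

```scala
import akka.actor.ActorSystem
import akka.grpc.GrpcClientSettings

implicit val system: ActorSystem = ActorSystem("client")

// Mode 1: load balancing enabled. Per the discussion above, withConnectionAttempts
// is ignored here; grpc-java keeps retrying with backoff and client.closed only
// completes after an explicit client.close().
val balancedSettings =
  GrpcClientSettings
    .usingServiceDiscovery("echo-service") // placeholder service name
    .withGrpcLoadBalancingType("round_robin")

// Mode 2: no load balancing. All calls go to a single resolved instance and
// withConnectionAttempts still applies; users can map over client.closed to
// rebuild the client, triggering a new ServiceDiscovery lookup.
val singleInstanceSettings =
  GrpcClientSettings
    .connectToServiceAt("127.0.0.1", 8080) // placeholder endpoint
    .withTls(false)
    .withConnectionAttempts(2)
```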
Does the reference implementation have any such timeout?
I'm reading docs here and there trying to see what other choices we have.
The upstream default is pick_first (no load balancing)...
Again! 🤦🏼♂️ I should refresh the page before answering...
the underlying (akka-grpc) implementation for backoff reconnecting will be used
Do you mean grpc-java?
TIL the load balancing in the akka-grpc implementation is per call
What do you mean?
all calls are sent to a single server instance but users can map over client.closed to reconnect again which will trigger a new ServiceDiscovery lookup.
I would expect that when that single server instance fails, grpc-java will automatically fail over to another instance (triggering a new ServiceDiscovery lookup) as needed. Do you think we need to do something extra there?
In any case it would be good to add tests for those scenarios.
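As a rough illustration of the "map over client.closed to reconnect" idea in the non-load-balanced mode: EchoServiceClient stands in for any generated akka-grpc client, and the restart-on-failure policy is just an assumption for the sketch.

```scala
import scala.concurrent.ExecutionContext
import akka.actor.ActorSystem
import akka.grpc.GrpcClientSettings

implicit val system: ActorSystem = ActorSystem("client")
implicit val ec: ExecutionContext = system.dispatcher

// Sketch: when the client gives up (client.closed fails after the configured
// connection attempts), build a fresh client, which triggers a new ServiceDiscovery lookup.
def connect(settings: GrpcClientSettings): EchoServiceClient = {
  val client = EchoServiceClient(settings) // placeholder generated client
  client.closed.failed.foreach { _ =>
    // Naive restart: a real version would add backoff, a stop condition, and publish
    // the new client reference to callers (e.g. via an AtomicReference).
    connect(settings)
  }
  client
}
```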
(🤦🏼‍♂️ Today is not my day)
In both statements above I meant grpc-java (not akka-grpc).
TIL the load balancing in the ~~akka-grpc~~ grpc-java implementation is per call
🤦🏼‍♂️ It is round-robin! What did I expect?
I would expect that when that single server instance fails, grpc-java will automatically fail over to another instance (triggering a new ServiceDiscovery lookup) as needed. Do you think we need to do something extra there?
I think our code listening for the number of re-connection attempts is the issue here, and we may want to remove it once and for all. I mean, having moved to an implementation based on NameResolver, even when using pick-first you are probably correct and grpc-java is doing a new name resolution. So it doesn't make sense to close after a certain number of reconnection attempts. We could keep the code around for logging purposes (even though that code is useless in round-robin), but we can probably drop the withConnectionAttempts setting and the closed future.
In any case would be good to add tests for those scenario's.
+1
Solid analysis, I think we have a good grasp of what's going on here. I think 'just retrying indefinitely' does not provide enough observability. I think we can and should give the option to fail when connecting fails. As a future improvement we could expose the Future that is resolved when the channel becomes 'ready' and/or allow configuring a timeout. Perhaps that isn't needed yet though.
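A sketch of how such a 'ready' Future could be derived from grpc-java's channel-state API; this is purely illustrative and not a committed API shape.

```scala
import scala.concurrent.{ Future, Promise }
import io.grpc.{ ConnectivityState, ManagedChannel }

// Sketch: complete a Future the first time the channel reports READY.
def readyFuture(channel: ManagedChannel): Future[Unit] = {
  val promise = Promise[Unit]()
  def watch(): Unit = {
    val current = channel.getState(false)
    if (current == ConnectivityState.READY) promise.trySuccess(())
    else channel.notifyWhenStateChanged(current, () => watch())
  }
  watch()
  promise.future
}
```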
(this does suggest it'd be good to have a different setting for 'retry before becoming ready' and 'retry on failures after becoming ready', but master doesn't have that either, so that can be a future improvement)
Updates grpc-core, grpc-interop-testing, ... from 1.28.1 to 1.29.0.
GitHub Release Notes - Version Diff
I'll automatically update this PR to resolve conflicts as long as you don't change it yourself.
If you'd like to skip this version, you can just close this PR. If you have any feedback, just mention me in the comments below.
Configure Scala Steward for your repository with a .scala-steward.conf file.
Have a fantastic day writing Scala!
Ignore future updates
Add this to your .scala-steward.conf file to ignore future updates of this dependency:
labels: library-update, semver-minor