hubble/relay: improve peer connections handling #12556
Conversation
test-me-please
🎉 This is an awesome change! The resulting code looks much cleaner and the fact that we can now easily mock connectivity issues and faulty client responses is simply amazing.
The implementation itself looks solid. I do have a few (mostly minor) higher-level questions and comments that popped up during the review that I wanted to discuss before approving.
@@ -42,7 +42,8 @@ var DefaultOptions = Options{
	Max:    12 * time.Hour,
	Factor: 2.0,
},
RetryTimeout:      30 * time.Second,
ConnCheckInterval: 2 * time.Minute,
How does this interact with the backoff? I think if we only trigger reconnects every `ConnCheckInterval`, then the minimal backoff will essentially always be at least `ConnCheckInterval`, in this case 2 minutes. The first retry, which is supposed to happen after 10 seconds according to exponential backoff, will actually happen after 2 minutes.
In addition, we will start all re-connection attempts at the same time for all offline peers whose backoff has expired in the last 2 minutes. There is no jitter.
I'm not sure what the best approach is and we can do it in a follow-up PR. But it seems to me that only reconnecting all peers every `ConnCheckInterval` will lead to very bursty reconnection behavior.
Ideally, we'd have an interval for checking the connection status, and a separate piece of logic scheduling the reconnection (i.e. a timer which expires at the smallest `nextConnAttempt`).
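To make the suggestion concrete, here is a rough sketch (hypothetical names, not the actual Relay code) of a scheduler that wakes up at the smallest `nextConnAttempt` among offline peers and adds jitter, instead of relying on a fixed check interval:

```go
package main

import (
	"math/rand"
	"time"
)

type offlinePeer struct {
	name            string
	nextConnAttempt time.Time
}

// nextWakeUp returns when the reconnection loop should wake up next: the
// smallest nextConnAttempt across offline peers, plus random jitter so that
// peers whose backoff expired around the same time are not all retried at once.
func nextWakeUp(peers []offlinePeer, maxJitter time.Duration) time.Duration {
	if len(peers) == 0 {
		return time.Minute // fall back to a periodic check when nothing is offline
	}
	earliest := peers[0].nextConnAttempt
	for _, p := range peers[1:] {
		if p.nextConnAttempt.Before(earliest) {
			earliest = p.nextConnAttempt
		}
	}
	d := time.Until(earliest) + time.Duration(rand.Int63n(int64(maxJitter)))
	if d < 0 {
		d = 0
	}
	return d
}

func main() {
	peers := []offlinePeer{
		{"foo", time.Now().Add(10 * time.Second)},
		{"bar", time.Now().Add(3 * time.Minute)},
	}
	timer := time.NewTimer(nextWakeUp(peers, 5*time.Second))
	defer timer.Stop()
	<-timer.C
	// ... attempt reconnections whose nextConnAttempt has passed, then re-arm the timer.
}
```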
> I think if we only trigger reconnects every `ConnCheckInterval`

Reconnects are also triggered by client requests that use `ReportOffline`.
> The first retry which is supposed to happen after 10 seconds according to exponential backoff will actually happen after 2 minutes.

Unless there's a client request, yes, this is true. I'm not sure it's worth trying to reconnect more aggressively than this.

> In addition, we will start all re-connection attempts always at the same time for all offline peers whose backoff has expired in the last 2 minutes. There is no jitter.

Very good point. I thought about this as well, but as this PR was already getting pretty big, I thought this could be addressed in a follow-up PR. Note that even before this PR there's also another burst when Hubble Relay is started and is notified of all the peers in the cluster.
I fully agree with your point here and I'm happy to discuss a solution to implement in a follow-up PR.
> Unless there's a client request, yes, this is true. I'm not sure it's worth trying to reconnect more aggressively than this.

If there is a client reconnect, then the backoff is ignored anyway. Currently, the actual back-off sequence will look like this:
2m0s (ConnCheckInterval)
2m0s (ConnCheckInterval)
2m0s (ConnCheckInterval)
2m0s (ConnCheckInterval)
2m40s
5m20s
10m40s
21m20s
42m40s
1h25m20s
2h50m40s
5h41m20s
11h22m40s
12h0m0s
I agree that we do not have to fix this in this PR, but it seems like this is not intentional.
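For reference, a quick sketch that reproduces this sequence, assuming the first retry is meant to happen after 10 seconds (as stated above), Factor and Max come from `DefaultOptions`, and retries are effectively floored at `ConnCheckInterval` because reconnects only happen when the checker wakes up:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		min               = 10 * time.Second
		max               = 12 * time.Hour
		factor            = 2.0
		connCheckInterval = 2 * time.Minute
	)
	backoff := min
	for i := 0; i < 14; i++ {
		effective := backoff
		if effective < connCheckInterval {
			// reconnect attempts only happen when the connectivity checker runs
			effective = connCheckInterval
		}
		fmt.Println(effective)
		backoff = time.Duration(float64(backoff) * factor)
		if backoff > max {
			backoff = max
		}
	}
}
```

Running this prints the fourteen durations listed above: four waits of 2m0s (the floor), then the exponential tail from 2m40s up to the 12h cap.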
pkg/hubble/relay/pool/manager.go
Outdated
@@ -178,7 +178,8 @@ func (m *Manager) manageConnections() {
	m.mu.Lock()
	p := m.peers[name]
	m.mu.Unlock()
	m.connect(p)
	// a connection request has been made, make sure to attempt a connection
	m.connect(p, true)
Note to reviewers, since it's not clear from context: `m.connect()` will not reconnect if there is already a re-connection attempt in progress. So it's fine if the same peer is reported offline multiple times; we will still only do one actual re-connection attempt in `m.connect`.
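A simplified illustration of that guard (hypothetical code, not the actual `m.connect` implementation): a per-peer flag makes concurrent offline reports collapse into a single attempt.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type peer struct {
	mu         sync.Mutex
	connecting bool
}

func (p *peer) connect() {
	p.mu.Lock()
	if p.connecting {
		p.mu.Unlock()
		return // a re-connection attempt is already in progress, do nothing
	}
	p.connecting = true
	p.mu.Unlock()

	defer func() {
		p.mu.Lock()
		p.connecting = false
		p.mu.Unlock()
	}()

	// ... dial the peer here ...
	time.Sleep(10 * time.Millisecond)
	fmt.Println("connection attempt done")
}

func main() {
	var p peer
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ { // the same peer reported offline three times
		wg.Add(1)
		go func() { defer wg.Done(); p.connect() }()
	}
	wg.Wait()
}
```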
Note: rebased against master to pick up #12572
pkg/hubble/relay/pool/manager.go
Outdated
conn            ClientConn
connAttempts    int
nextConnAttempt time.Time
mu              lock.Mutex
what's this mutex protecting?
All `peer` struct members.
When declaring a struct where the mutex must protect access to one or more fields, place the mutex above the fields that it will protect as a best practice.
Where does this quote come from? In any case, I've reordered the struct members.
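For reference, a small illustration of the convention (hypothetical type, not the actual Relay struct): the mutex is declared directly above the fields it protects, so the protected set is obvious to readers.

```go
package pool

import (
	"sync"
	"time"
)

// ClientConn stands in for the real connection type in this sketch.
type ClientConn interface{}

type peer struct {
	// name is set at creation time and never mutated, so it needs no locking.
	name string

	// mu protects the fields below it.
	mu              sync.Mutex
	conn            ClientConn
	connAttempts    int
	nextConnAttempt time.Time
}
```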
return &Manager{
	peers:   make(map[string]*peer),
	offline: make(chan string, 100),
	stop:    make(chan struct{}),
Couldn't this `stop` be a context? [1]
It could be replaced by a context, but I try to follow the best practice documented in the `context` package documentation:

> Programs that use Contexts should follow these rules to keep interfaces consistent across packages and enable static analysis tools to check context propagation:
> Do not store Contexts inside a struct type; ...
}

func (m *Manager) watchNotifications() {
	ctx, cancel := context.WithCancel(context.Background())
[1] then it could be used as a parent of this context.
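A minimal sketch of the pattern being discussed (simplified, hypothetical names): the struct keeps a `stop` channel rather than storing a context, following the rule quoted above, while a context is still derived locally and cancelled when `stop` closes, so it can serve as the parent mentioned in [1].

```go
package pool

import "context"

type Manager struct {
	stop chan struct{}
}

func (m *Manager) watch() {
	// Derive a context locally instead of storing one in the struct.
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	go func() {
		<-m.stop // when Stop() closes the channel...
		cancel() // ...cancel any in-flight work bound to ctx
	}()

	<-ctx.Done() // placeholder for work that respects ctx cancellation
}

func (m *Manager) Stop() {
	close(m.stop)
}
```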
pkg/hubble/relay/pool/manager.go
Outdated
conn            ClientConn
connAttempts    int
nextConnAttempt time.Time
mu              lock.Mutex
I might be wrong, but this mutex could be replaced with a `lock.RWMutex`.
Why would we do this? I think it would be detrimental to performance in this case.
I don't think we would benefit from using a `RWMutex` in this specific case.
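For illustration, a rough sketch (hypothetical code, not the Relay implementation) of where a `RWMutex` does pay off: a read-heavy structure such as the peers map, which is consulted by `List()` on every request but rarely written (this is what the later commit in this PR ends up doing for the Manager's map).

```go
package pool

import "sync"

type Manager struct {
	mu    sync.RWMutex
	peers map[string]string // peer name -> address
}

// List takes a read lock, so many concurrent requests can read the map at once.
func (m *Manager) List() []string {
	m.mu.RLock()
	defer m.mu.RUnlock()
	names := make([]string, 0, len(m.peers))
	for name := range m.peers {
		names = append(names, name)
	}
	return names
}

// add takes the write lock, which is expected to be rare compared to List.
func (m *Manager) add(name, addr string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.peers == nil {
		m.peers = make(map[string]string)
	}
	m.peers[name] = addr
}
```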
test-me-please
The new commits look fine to me in terms of approach! Thanks for addressing my nits and adding some additional checks in the unit tests.
I don't understand the point of the `defer`s though. They seem to try to achieve something, but I don't understand what.
Indeed, not really useful :) I removed all the `defer`s.
test-me-please
Great patch! I only have some minor comments and questions.
On peer change notifications or calls to `pool.Manager.ReportOffline(name string)`, the backoff interval should be ignored to re-attempt a connection. This change typically addresses the following scenario:
1. Node Foo is offline.
2. The connectivity checker attempts to reconnect several times; each time the backoff is increased.
3. Node Foo becomes available; a peer change notification is received.
4. Oops, no connection attempt is made to Foo because of a large backoff duration.
Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This is not necessary as the connect function closes pre-existing connections. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This is necessary to avoid import cycles when implementing fakes for the peer interfaces in the `testutils` package. Note that the `types` sub-package is commonly found throughout Cilium's codebase so this change follows the convention. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This will allow mocking gRPC clients which is useful for unit testing. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
These unit tests are not exhaustive, although they still provide decent coverage. However, they provide a good structure for implementing more tests in the future, such as complex peer connectivity issue scenarios or simply regression tests for bugs that are fixed. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This commit removes `Target()` from the peer `ClientBuilder` interface and, instead, adds the target as a parameter to `Client()`. As a consequence, Hubble Relay code is adapted to the change. The peer pool manager now has a new option to provide the address of the peer gRPC service. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
The debug option was only used to set the log level of the logger. As the caller now has control over the logger to be used, this option is no longer necessary. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This change makes tests more reliable and is cleaner at the same time as once the function returns, the pool manager is really shut down. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This commit tests the pool manager logger output for certain tests. This change revealed some flakiness in the tests, which has been addressed. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
The peers map is expected to be more frequently read than written to, especially with the `List()` method being exposed and used by Hubble Relay for every request it handles. Therefore, a RWLock is more appropriate than a regular lock. Suggested-by: André Martins <andre@cilium.io> Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
The name `pool` clashes with the package `pool`. It is not a problem per se but creates a bit of confusion so rename it to `pm`. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
test-me-please
// ClientConn is an interface that defines the functions clients need to
// perform unary and streaming RPCs. It is implemented by *grpc.ClientConn.
type ClientConn interface {
	// GetState returns the connectivity.State of ClientConn.
Even in grpc v1.30.0 (the latest at the time of writing), `GetState()` is marked as experimental.
Since this PR makes use of it in several places, would it make sense to keep track of this somewhere?
I think it's important to pay attention to this when upgrading the go-grpc dep but we will anyway have to face breaking changes when upgrading.
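As an aside, a minimal sketch (hypothetical fake, not the actual `testutils` code) of why hiding `*grpc.ClientConn` behind a small interface makes `GetState()`-based logic easy to test: a fake can report arbitrary connectivity states without a real connection.

```go
package testutils

import (
	"google.golang.org/grpc/connectivity"
)

// clientConn is the minimal surface exercised in this sketch; *grpc.ClientConn
// provides GetState with this exact signature.
type clientConn interface {
	GetState() connectivity.State
}

// fakeClientConn reports whatever state the test configures.
type fakeClientConn struct {
	state connectivity.State
}

func (f fakeClientConn) GetState() connectivity.State {
	return f.state
}

// isUnavailable mimics the kind of check a pool manager might perform
// before deciding to trigger a re-connection attempt.
func isUnavailable(c clientConn) bool {
	switch c.GetState() {
	case connectivity.TransientFailure, connectivity.Shutdown:
		return true
	}
	return false
}
```

In a test, `isUnavailable(fakeClientConn{state: connectivity.TransientFailure})` exercises the offline path without dialing anything.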
This PR improves peer connections handling and introduces unit tests. In order to implement unit tests, a major refactoring was necessary. This refactoring also has the side effect of improving separation of concerns, as all the peer connection handling logic is now in its own package (`pkg/hubble/relay/pool`). For more details, please go through the list of commits.
Rel: #11425