
hubble/relay: improve peer connections handling #12556

Merged
merged 24 commits into master from pr/rolinh/relay-conn-mgmt-improvments on Jul 22, 2020

Conversation

@rolinh (Member) commented Jul 16, 2020

This PR improves peer connections handling and introduces unit tests. In order to implement unit tests, a major refactoring was necessary. This refactoring also has the side effect of improving separation of concerns, as all the peer connection handling logic now lives in its own package (pkg/hubble/relay/pool).

For more details, please go through the list of commits.

Rel: #11425

@rolinh rolinh added kind/enhancement This would improve or streamline existing functionality. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/hubble Impacts hubble server or relay needs-backport/1.8 labels Jul 16, 2020
@rolinh rolinh requested a review from a team July 16, 2020 17:21
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.8.2 Jul 16, 2020
pkg/hubble/relay/peer.go (resolved, outdated)
pkg/hubble/peer/client.go (resolved, outdated)
@coveralls commented Jul 16, 2020


Coverage increased (+0.1%) to 37.102% when pulling f95e76a on pr/rolinh/relay-conn-mgmt-improvments into 1173f18 on master.

@rolinh (Member Author) commented Jul 17, 2020

test-me-please

@gandro (Member) left a comment

🎉 This is an awesome change! The resulting code looks much cleaner and the fact that we can now easily mock connectivity issues and faulty client responses is simply amazing.

The implementation itself looks solid. I do have a few (mostly minor) higher-level questions and comments that popped up during the review that I wanted to discuss before approving.

pkg/hubble/peer/client.go (resolved, outdated)
pkg/hubble/relay/pool/option.go (resolved, outdated)
@@ -42,7 +42,8 @@ var DefaultOptions = Options{
Max: 12 * time.Hour,
Factor: 2.0,
},
-	RetryTimeout: 30 * time.Second,
+	ConnCheckInterval: 2 * time.Minute,
@gandro (Member) commented Jul 17, 2020

How does this interact with the backoff? I think if we only trigger reconnects every ConnCheckInterval, then the minimal backoff will essentially always be at least ConnCheckInterval, in this case 2 minutes. The first retry which is supposed to happen after 10 seconds according to exponential backoff will actually happen after 2 minutes.

In addition, we will always start re-connection attempts at the same time for all offline peers whose backoff has expired in the last 2 minutes. There is no jitter.

I'm not sure what the best approach is and we can do it in a follow-up PR. But it seems to me that only reconnecting all peers every ConnCheckInterval will lead to very bursty reconnection behavior.

Ideally, we'd have an interval for checking the connection status, and a separate mechanism scheduling the reconnection (i.e. a timer which expires at the smallest nextConnAttempt).

@rolinh (Member Author) replied:

> I think if we only trigger reconnects every ConnCheckInterval

Reconnects are also triggered by a client request that uses ReportOffline.

> The first retry which is supposed to happen after 10 seconds according to exponential backoff will actually happen after 2 minutes.

Unless there's a client request, yes, this is true. I'm not sure it's worth trying to reconnect more aggressively than this.

> In addition, we will always start re-connection attempts at the same time for all offline peers whose backoff has expired in the last 2 minutes. There is no jitter.

Very good point. I thought about this as well but as this PR was already getting pretty big, I thought this could be addressed in a follow-up PR. Note that even before this PR there's also another burst when Hubble Relay is started and is notified of all the peers in the cluster.

I fully agree with your point here and I'm happy to discuss a solution to implement in a follow-up PR.

Member replied:

> Unless there's a client request, yes, this is true. I'm not sure it's worth trying to reconnect more aggressively than this.

If there is a client reconnect, then the backoff is ignored anyway. Currently, the actual back-off schedule will look like this:

2m0s (ConnCheckInterval)
2m0s (ConnCheckInterval)
2m0s (ConnCheckInterval)
2m0s (ConnCheckInterval)
2m40s
5m20s
10m40s
21m20s
42m40s
1h25m20s
2h50m40s
5h41m20s
11h22m40s
12h0m0s

I agree that we do not have to fix this in this PR, but it seems like this is not intentional.
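
To make the suggested follow-up concrete, here is a minimal Go sketch of the scheduling idea from the comment above: fire the reconnect timer at the smallest nextConnAttempt among offline peers and add jitter so that peers whose backoff expired together do not reconnect in lockstep. All names here (peerState, nextReconnectDelay) are illustrative, not the actual Cilium code.

```go
package pool

import (
	"math/rand"
	"time"
)

// peerState is an illustrative stand-in for the pool's per-peer state.
type peerState struct {
	connected       bool
	nextConnAttempt time.Time
}

// nextReconnectDelay computes when the reconnect timer should next fire:
// the earliest nextConnAttempt among offline peers, plus up to 10% jitter
// to spread out simultaneous expirations. The boolean is false when every
// peer is connected and nothing needs to be scheduled.
func nextReconnectDelay(peers []peerState, now time.Time) (time.Duration, bool) {
	var next time.Time
	for _, p := range peers {
		if p.connected {
			continue
		}
		if next.IsZero() || p.nextConnAttempt.Before(next) {
			next = p.nextConnAttempt
		}
	}
	if next.IsZero() {
		return 0, false
	}
	delay := next.Sub(now)
	if delay < 0 {
		delay = 0
	}
	jitter := time.Duration(rand.Int63n(int64(delay)/10 + 1))
	return delay + jitter, true
}
```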

@@ -178,7 +178,8 @@ func (m *Manager) manageConnections() {
m.mu.Lock()
p := m.peers[name]
m.mu.Unlock()
-	m.connect(p)
+	// a connection request has been made, make sure to attempt a connection
+	m.connect(p, true)
Member commented:

Note to reviewers, since it's not clear from context: m.connect() will not reconnect if there is already a re-connection attempt in progress. So it's fine if the same peer is reported offline multiple times; we will still only make one actual re-connection attempt in m.connect.
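
For readers without the surrounding code, a minimal sketch of such a guard, under the assumption of an illustrative connecting flag (the actual manager internals may differ):

```go
package pool

import (
	"sync"
	"time"
)

type peer struct {
	mu              sync.Mutex
	connecting      bool
	nextConnAttempt time.Time
}

// tryConnect starts at most one connection attempt per peer, so reporting
// a peer offline multiple times results in a single dial.
func (p *peer) tryConnect(ignoreBackoff bool) {
	p.mu.Lock()
	if p.connecting || (!ignoreBackoff && time.Now().Before(p.nextConnAttempt)) {
		p.mu.Unlock()
		return // attempt already in flight, or backoff not yet expired
	}
	p.connecting = true
	p.mu.Unlock()

	go func() {
		defer func() {
			p.mu.Lock()
			p.connecting = false
			p.mu.Unlock()
		}()
		// ... dial the peer and update connection state here ...
	}()
}
```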

pkg/hubble/relay/pool/manager_test.go (resolved, outdated)
pkg/hubble/relay/pool/manager_test.go (resolved)
@rolinh rolinh force-pushed the pr/rolinh/relay-conn-mgmt-improvments branch from 378c675 to deaa98a Compare July 20, 2020 08:21
@rolinh (Member Author) commented Jul 20, 2020

Note: rebased against master to pick up #12572

@rolinh rolinh force-pushed the pr/rolinh/relay-conn-mgmt-improvments branch from deaa98a to 24cbc7c Compare July 20, 2020 08:31
pkg/hubble/relay/observer.go (resolved)
conn ClientConn
connAttempts int
nextConnAttempt time.Time
mu lock.Mutex
Member commented:

What's this mutex protecting?

@rolinh (Member Author) replied:

All peer struct members.

Member replied:

When declaring a struct where the mutex must protect access to one or more fields, place the mutex above the fields that it will protect as a best practice.

@rolinh (Member Author) replied:

Where does this quote come from? In any case, I've reordered the struct members.
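
For illustration, the reordering follows the convention from the comment above. This is a sketch based on the fields in the snippet earlier in this thread (lock.Mutex is Cilium's mutex wrapper from pkg/lock), not necessarily the exact final code:

```go
// Sketch: the mutex sits directly above the fields it guards.
type peer struct {
	// mu protects conn, connAttempts and nextConnAttempt below.
	mu              lock.Mutex
	conn            ClientConn
	connAttempts    int
	nextConnAttempt time.Time
}
```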

pkg/hubble/relay/pool/manager.go (resolved, outdated)
return &Manager{
peers: make(map[string]*peer),
offline: make(chan string, 100),
stop: make(chan struct{}),
Member commented:

Couldn't this stop be a context? [1]

@rolinh (Member Author) commented Jul 20, 2020

It could be replaced by a context but I try to follow the best practice documented in the context package documentation:

> Programs that use Contexts should follow these rules to keep interfaces consistent across packages and enable static analysis tools to check context propagation:
>
> Do not store Contexts inside a struct type; ...

}

func (m *Manager) watchNotifications() {
ctx, cancel := context.WithCancel(context.Background())
Member replied:

[1] then it could be used as a parent of this context.
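
Putting the two snippets together, a hedged sketch of the pattern under discussion (illustrative body, not the actual relay code): the Manager keeps a stop channel rather than a stored context, and watchNotifications derives a cancellable context that could just as well be parented on a caller-provided one.

```go
package pool

import "context"

// Manager is pared down to the field relevant here.
type Manager struct {
	stop chan struct{}
}

func (m *Manager) watchNotifications() {
	// Derive the context locally instead of storing one in the struct,
	// per the context package guidance quoted above. A caller-supplied
	// context could replace context.Background() as the parent.
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	go func() {
		<-m.stop // closing the stop channel cancels the context
		cancel()
	}()
	// ... run the peer notification loop until ctx is cancelled ...
	<-ctx.Done()
}
```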

pkg/hubble/relay/pool/manager.go (resolved)
pkg/hubble/relay/pool/manager.go (resolved)
conn ClientConn
connAttempts int
nextConnAttempt time.Time
mu lock.Mutex
Member commented:

I might be wrong, but this mutex could be replaced with a lock.RWMutex.

@rolinh (Member Author) replied:

Why would we do this? I think it would be detrimental to performance in this case.

@rolinh (Member Author) added:

I don't think we would benefit from using a RWMutex in this specific case.
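
For context on the trade-off: an RWMutex only pays off when reads heavily outnumber writes and readers can actually overlap, which is why a later commit in this PR (see the commit list below) switches the manager's peers map, read on every List() call, to an RWMutex while the per-peer state keeps a plain mutex. A minimal sketch of such a read path, with illustrative types:

```go
package pool

import "sync"

// Manager is a pared-down illustration; the real type holds richer state.
type Manager struct {
	mu    sync.RWMutex
	peers map[string]struct{}
}

// List returns the names of all known peers. With an RWMutex, concurrent
// List calls proceed in parallel; writers still take the exclusive lock.
func (m *Manager) List() []string {
	m.mu.RLock()
	defer m.mu.RUnlock()
	names := make([]string, 0, len(m.peers))
	for name := range m.peers {
		names = append(names, name)
	}
	return names
}
```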

pkg/hubble/relay/pool/manager.go (resolved, outdated)
pkg/hubble/relay/pool/option.go (resolved)
@rolinh rolinh force-pushed the pr/rolinh/relay-conn-mgmt-improvments branch from eff36f5 to 793c473 Compare July 20, 2020 15:27
@rolinh rolinh requested a review from aanm July 20, 2020 15:31
@rolinh rolinh force-pushed the pr/rolinh/relay-conn-mgmt-improvments branch from 793c473 to e504a29 Compare July 20, 2020 15:39
@rolinh (Member Author) commented Jul 20, 2020

test-me-please

@gandro (Member) left a comment

The new commits look fine to me in terms of approach! Thanks for addressing my nits and adding some additional checks in the unit tests.

I don't understand the point of the defers though. They seem to try to achieve something, but I don't understand what.

pkg/hubble/relay/pool/manager_test.go (7 resolved, outdated threads)
@rolinh rolinh force-pushed the pr/rolinh/relay-conn-mgmt-improvments branch from e504a29 to 5ee6f8e Compare July 20, 2020 19:19
@rolinh rolinh requested a review from gandro July 20, 2020 19:20
@rolinh (Member Author) commented Jul 20, 2020

> I don't understand the point of the defers though. They seem to try to achieve something, but I don't understand what.

Indeed, not really useful :) I removed all the defer statements in the tests.

@rolinh (Member Author) commented Jul 21, 2020

test-me-please

pkg/hubble/relay/observer.go (resolved)
pkg/hubble/relay/observer.go (resolved)

pkg/hubble/relay/pool/manager.go (resolved, outdated)
pkg/hubble/relay/pool/manager.go (resolved, outdated)
@rolinh rolinh force-pushed the pr/rolinh/relay-conn-mgmt-improvments branch from 5ee6f8e to f4ea3c3 Compare July 21, 2020 11:58
@rolinh rolinh requested a review from aanm July 21, 2020 12:05
@kaworu (Member) left a comment

Great patch! I only have some minor comments and questions.

pkg/hubble/relay/pool/manager.go (resolved)
pkg/hubble/relay/server.go (resolved, outdated)
pkg/hubble/relay/pool/manager.go (resolved, outdated)
pkg/hubble/relay/pool/manager.go (resolved)
rolinh added 14 commits July 21, 2020 16:44
On peer change notifications or call to
`pool.Manager.ReportOffline(name string)`, the backoff interval should
be ignored to re-attempt a connection. This change typically addresses
the following scenario:

  1. Node Foo is offline
  2. The connectivity checker attempts to reconnect several times; each
    time the backoff is increased.
  3. Node Foo becomes available; a peer change notification is received
  4. Oops, no connection attempt is made to Foo because of a large
     backoff duration

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This is not necessary as the connect function closes pre-existing
connections.

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This is necessary to avoid import cycles when implementing fakes for the
peer interfaces in the `testutils` package.

Note that the `types` sub-package is commonly found throughout Cilium's
codebase so this change follows the convention.

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This will allow mocking gRPC clients, which is useful for unit testing.

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
These unit tests are not exhaustive, although they still provide decent
coverage. However, they provide a good structure for implementing more tests
in the future, such as complex peer connectivity issue scenarios or simply
regression tests for bugs that are fixed.

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This commit removes `Target()` from the peer `ClientBuilder` interface
and, instead, adds the target as a parameter to `Client()`.

As a consequence, Hubble Relay code is adapted to the change. The peer
pool manager now has a new option to provide the address of the peer
gRPC service.

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
The debug option was only used to set the log level of the logger. As
the caller now has control over the logger to be used, this option is no
longer necessary.

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This change makes tests more reliable and is cleaner at the same time:
once the function returns, the pool manager is really shut down.

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This commit checks the pool manager logger output in certain tests. This
change revealed some flakiness in the tests, which has been addressed.

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
The peers map is expected to be more frequently read than written to,
especially with the `List()` method being exposed and used by Hubble
Relay for every request it handles. Therefore, an RWLock is more
appropriate than a regular lock.

Suggested-by: André Martins <andre@cilium.io>
Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
The name `pool` clashes with the package `pool`. It is not a problem per
se but creates a bit of confusion, so rename it to `pm`.

Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
@rolinh rolinh force-pushed the pr/rolinh/relay-conn-mgmt-improvments branch from f4ea3c3 to f95e76a Compare July 21, 2020 14:45
@rolinh rolinh requested a review from kaworu July 21, 2020 14:46
@rolinh (Member Author) commented Jul 22, 2020

test-me-please

// ClientConn is an interface that defines the functions clients need to
// perform unary and streaming RPCs. It is implemented by *grpc.ClientConn.
type ClientConn interface {
// GetState returns the connectivity.State of ClientConn.
Member commented:

Even in grpc v1.30.0 (the latest at the time of writing) GetState() is marked as experimental.

Since this PR makes use of it in several places, would it make sense to keep track of this somewhere?

@rolinh (Member Author) replied:

I think it's important to pay attention to this when upgrading the grpc-go dependency, but we will have to face breaking changes when upgrading anyway.
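
For reference, GetState() and the connectivity package are real grpc-go APIs (experimental at the time of this PR). A connectivity check along these lines, sketched here independently of the actual relay code:

```go
package pool

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// needsReconnect reports whether the connection should be re-established.
// *grpc.ClientConn satisfies the ClientConn interface shown above.
func needsReconnect(conn *grpc.ClientConn) bool {
	switch conn.GetState() {
	case connectivity.TransientFailure, connectivity.Shutdown:
		return true
	default:
		return false
	}
}
```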

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 22, 2020
@rolinh rolinh merged commit 9fdcca2 into master Jul 22, 2020
@rolinh rolinh deleted the pr/rolinh/relay-conn-mgmt-improvments branch July 22, 2020 13:34
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.2 Jul 23, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.2 Jul 23, 2020
Labels
kind/enhancement This would improve or streamline existing functionality.
ready-to-merge This PR has passed all tests and received consensus from code owners to merge.
release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
sig/hubble Impacts hubble server or relay
Projects
No open projects
1.8.2
Backport done to v1.8
Development

Successfully merging this pull request may close these issues.

None yet

8 participants