Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clustermesh: allow remote configuration to be changed quickly #24046

Merged
merged 2 commits into from
Mar 15, 2023

Conversation

oblazek
Copy link
Contributor

@oblazek oblazek commented Feb 27, 2023

In case there was a typo in remote cluster configuration which caused etcd connection to fail it was not possible to change the remote etcd configuration without waiting for initialConnectionTimeout.

See commit messages for detailed desc.

Signed-off-by: Ondrej Blazek ondrej.blazek@firma.seznam.cz

Fix a bug where users are unable to change a wrong remote etcd configuration

@oblazek oblazek requested review from a team as code owners February 27, 2023 13:03
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Feb 27, 2023
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Feb 27, 2023
@oblazek oblazek force-pushed the ob-remote-clusters-context-cancel branch from e2b368c to 5f07fd9 Compare February 27, 2023 13:07
Copy link
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @oblazek for your PR.

I have a few comments inline. I also wonder how you tested this, since there is apparently an issue with the fs watcher, which causes updates to already existing configuration files not to be correctly detected. I'm working to fix that.

pkg/clustermesh/remote_cluster.go Outdated Show resolved Hide resolved
pkg/clustermesh/remote_cluster.go Outdated Show resolved Hide resolved
pkg/kvstore/etcd.go Outdated Show resolved Hide resolved
pkg/clustermesh/remote_cluster.go Outdated Show resolved Hide resolved
@oblazek
Copy link
Contributor Author

oblazek commented Mar 1, 2023

I have a few comments inline. I also wonder how you tested this, since there is apparently an issue with the fs watcher, which causes updates to already existing configuration files not to be correctly detected. I'm working to fix that.

can you be more specific? I always replaced the file that fsnotify is watching with mv command to trigger the event.

@giorio94
Copy link
Member

giorio94 commented Mar 1, 2023

I have a few comments inline. I also wonder how you tested this, since there is apparently an issue with the fs watcher, which causes updates to already existing configuration files not to be correctly detected. I'm working to fix that.

can you be more specific? I always replaced the file that fsnotify is watching with mv command to trigger the event.

The issue I mentioned occurs only if the config folder is mounted in the container from a secret. Are you mounting the configuration files differently? I assume so, since you mentioned using mv to change the config.

@oblazek
Copy link
Contributor Author

oblazek commented Mar 2, 2023

So mostly I have tested this when running agent in a docker container with the clustermesh config folder mounted as a volume, as it's easier to change the config (e.g. the mv command). But in k8s we are mounting as a secret and had some issues there as well.

@giorio94
Copy link
Member

giorio94 commented Mar 4, 2023

But in k8s we are mounting as a secret and had some issues there as well.

FYI, #24163 addresses the above issue

@oblazek oblazek force-pushed the ob-remote-clusters-context-cancel branch from 5f07fd9 to e52ee27 Compare March 8, 2023 08:29
@oblazek oblazek requested a review from a team as a code owner March 8, 2023 08:29
@oblazek oblazek force-pushed the ob-remote-clusters-context-cancel branch from e52ee27 to deb9769 Compare March 8, 2023 08:30
@oblazek
Copy link
Contributor Author

oblazek commented Mar 8, 2023

hey 👋 @giorio94 ptal.. looks better now imo :)

Copy link
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @oblazek!

The controller change LGTM. I left you a comment/question inline about the watcher termination handling.

pkg/kvstore/etcd.go Outdated Show resolved Hide resolved
Previously when etcd client context was not set and client
created with client.New(*config) it would use context.TODO,
without any way to cancel the ctx.

In some cases (e.g. wrong endpoints configuration) it would
 be necessary to cancel the ctx, otherwise the client might
be stucked in a retry loop and in case of remoteConnectionController
there wouldn't be any way stop that.

This commit passes a parent context to the client itself.

This commit also fixes the closing of stopWatch channel and rest
of the resources when ctx is done/cancelled.

If ctx was closed and select statement choosed handling of that
case (<-ctx.Done()) the code would quit the for loop and handling
of <-w.stopWatch would never be possible which means calling
w.Stop() from outside (e.g. releaseOldConnection()) would hang
forever as the wg counter wouldn't be decremented.
Now when the context gets closed, it correctly closes all channels
and decrements the wg counter, just like when handled by
releaseOldConnection func which releases everything.

Signed-off-by: Ondrej Blazek <ondrej.blazek@firma.seznam.cz>
This commit adds a new ControllerParam CancelDoFuncOnUpdate which
allows cancelling the controller context passed to DoFunc.

This way we can terminate the controller (if stuck in e.g. etcd
client retry loop) and still preserve the existing logic.

Signed-off-by: Ondrej Blazek <ondrej.blazek@firma.seznam.cz>
@oblazek oblazek force-pushed the ob-remote-clusters-context-cancel branch from deb9769 to 5f3c745 Compare March 8, 2023 12:22
@giorio94 giorio94 added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. area/clustermesh Relates to multi-cluster routing functionality in Cilium. sig/kvstore Impacts the KVStore package interactions. needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels Mar 9, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Mar 9, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.13.1 Mar 9, 2023
@giorio94
Copy link
Member

giorio94 commented Mar 9, 2023

/test

Job 'Cilium-PR-K8s-1.16-kernel-4.19' failed:

Click to show.

Test Name

K8sDatapathConfig Host firewall With VXLAN

Failure Output

FAIL: Failed to reach 192.168.56.11:80 from testserver-host-kvpfz

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.16-kernel-4.19 so I can create one.

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:

Click to show.

Test Name

K8sAgentPolicyTest Multi-node policy test with L7 policy using connectivity-check to check datapath

Failure Output

FAIL: cannot install connectivity-check

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next so I can create one.

@giorio94
Copy link
Member

giorio94 commented Mar 10, 2023

/test-1.16-4.19
Hit unrelated flake: #15455

@giorio94
Copy link
Member

/test-1.16-4.19

@giorio94
Copy link
Member

giorio94 commented Mar 10, 2023

/test-1.26-net-next
Hit unrelated flake: #13071

@oblazek
Copy link
Contributor Author

oblazek commented Mar 10, 2023

thanks @giorio94 for retriggerring

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Mar 15, 2023
@tklauser tklauser merged commit 86aa873 into cilium:master Mar 15, 2023
@nebril nebril added this to Needs backport from master in 1.13.2 Mar 15, 2023
@nebril nebril removed this from Needs backport from master in 1.13.1 Mar 15, 2023
@NikAleksandrov NikAleksandrov mentioned this pull request Mar 23, 2023
29 tasks
@NikAleksandrov NikAleksandrov added backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. and removed needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels Mar 23, 2023
@jibi jibi added backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. and removed backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. labels Apr 7, 2023
@gentoo-root gentoo-root moved this from Needs backport from master to Backport done to v1.13 in 1.13.2 Apr 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/clustermesh Relates to multi-cluster routing functionality in Cilium. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. kind/bug This is a bug in the Cilium logic. kind/community-contribution This was a contribution made by a community member. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/kvstore Impacts the KVStore package interactions.
Projects
No open projects
1.13.2
Backport done to v1.13
Development

Successfully merging this pull request may close these issues.

None yet

5 participants