
[cinder-csi-plugin] Add finalizer to VolumeSnapshot when create PVC with the VolumeSnapshot #2369

Open
wants to merge 1 commit into base: master
Conversation

@jeffyjf (Contributor) commented Sep 26, 2023

What this PR does / why we need it:

This PR implements a PVC watcher. When the watcher receives an ADDED event for a PVC whose dataSource is of kind VolumeSnapshot, it adds a finalizer to that VolumeSnapshot. The finalizer is removed when the watcher receives the PVC's DELETED event. This protects the VolumeSnapshot from being deleted while any PersistentVolume still depends on it.
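The add/remove-finalizer behaviour described above mirrors controller-runtime's `controllerutil.AddFinalizer`/`RemoveFinalizer` helpers, which are idempotent and report whether the object changed. A minimal self-contained sketch of that semantics (plain string slices instead of a real VolumeSnapshot object; the finalizer names are illustrative, not the PR's actual constants):

```go
package main

import "fmt"

// addFinalizer appends f to finalizers if absent; the bool reports
// whether the slice changed, so the caller knows an Update is needed.
func addFinalizer(finalizers []string, f string) ([]string, bool) {
	for _, x := range finalizers {
		if x == f {
			return finalizers, false
		}
	}
	return append(finalizers, f), true
}

// removeFinalizer deletes f from finalizers if present.
func removeFinalizer(finalizers []string, f string) ([]string, bool) {
	for i, x := range finalizers {
		if x == f {
			return append(finalizers[:i:i], finalizers[i+1:]...), true
		}
	}
	return finalizers, false
}

func main() {
	// Illustrative finalizer names, not the plugin's real constants.
	fs := []string{"snapshot.storage.kubernetes.io/volumesnapshot-bound-protection"}
	fs, changed := addFinalizer(fs, "example.org/vs-in-use-protection")
	fmt.Println(changed, len(fs)) // true 2
	fs, changed = removeFinalizer(fs, "example.org/vs-in-use-protection")
	fmt.Println(changed, len(fs)) // true 1
}
```

In the real watcher, the returned bool is what drives the decision to write the VolumeSnapshot back to the API server.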

Which issue this PR fixes(if applicable):
fixes #2294

Special notes for reviewers:

Adding the finalizer asynchronously via a watcher seems unsafe and might cause some subtle issues. After PR [1] is merged and released, we should refactor this and add the finalizer synchronously.

[1] kubernetes-csi/external-provisioner#1070

Release note:

Add a command-line flag `--enable-vs-deletion-protection`. If set, a finalizer is added to a VolumeSnapshot whenever a PVC is created from it, ensuring the VolumeSnapshot cannot be deleted before the PVC is.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 26, 2023
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jichenjc for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jeffyjf jeffyjf marked this pull request as draft September 26, 2023 11:48
@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Sep 26, 2023
@k8s-ci-robot (Contributor)

Hi @jeffyjf. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 26, 2023
@mdbooth (Contributor) left a comment

As mentioned in the issue, I do like the concept of adding a finalizer to the snapshot for each volume which requires it. However, I can't think of another instance where finalizers are used this way, so it might be worth some effort to discover whether there are any problems with potentially very large lists of finalizers. Imagine, for example, some CI system which uses a 'golden image' extremely heavily and has 1,000 volumes using it. Would that be OK?

However, I don't think we can safely add the finalizer asynchronously. A finalizer would typically be added synchronously before resource creation. It's also worth noting that the resource in this case would be the PV rather than the PVC.

Could we add code which adds and persists the snapshot finalizer immediately before creating the PV?

We would also need to add code to remove the finalizer in the deletion flow of the PV strictly before removing the PV finalizer.
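The ordering this review proposes can be sketched as follows; every function here is a stub standing in for the plugin's real provisioning code, not its actual API. The invariant is that the snapshot finalizer is persisted strictly before the PV exists, and removed strictly before the PV's own finalizer goes away, so there is no window in which a PV references a snapshot that is free to be deleted:

```go
package main

import "fmt"

var trace []string

// Stubs recording the order of API operations; illustrative only.
func persistSnapshotFinalizer(snap string) { trace = append(trace, "finalizer+ "+snap) }
func createPV(pv string)                   { trace = append(trace, "create "+pv) }
func removeSnapshotFinalizer(snap string)  { trace = append(trace, "finalizer- "+snap) }
func removePVFinalizer(pv string)          { trace = append(trace, "pv-finalizer- "+pv) }

func provision(pv, snap string) {
	persistSnapshotFinalizer(snap) // 1. synchronous, before the PV exists
	createPV(pv)                   // 2. PV can now safely reference snap
}

func deprovision(pv, snap string) {
	removeSnapshotFinalizer(snap) // 3. strictly before the PV finalizer
	removePVFinalizer(pv)         // 4. PV may now disappear
}

func main() {
	provision("pv-1", "snap-1")
	deprovision("pv-1", "snap-1")
	fmt.Println(trace)
}
```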

@mdbooth (Contributor) commented Sep 27, 2023

Looks very WIP right now so probably not worth actually running tests, but

/ok-to-test

Note that tests won't run automatically while the PR is marked WIP, but you can run them manually on demand when you're ready with /test all.

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 27, 2023
pkg/csi/cinder/controllerserver.go: review thread (outdated, resolved)
pkg/csi/cinder/controllerserver.go: review thread (outdated, resolved)
@jeffyjf (Contributor, Author) commented Oct 2, 2023

Hi @mdbooth, thanks very much for your comments, but I have a question.

However, I don't think we can safely add the finalizer asynchronously. A finalizer would typically be added synchronously before resource creation. It's also worth noting that the resource in this case would be the PV rather than the PVC.

I am a newbie. In my opinion, this cannot be implemented with a watcher alone; we would need to add a webhook to do it. Is that right? If not, do you have any other suggestions?

@jeffyjf (Contributor, Author) commented Oct 9, 2023

As mentioned in the issue, I do like the concept of adding a finalizer to the snapshot for each volume which requires it. However, I can't think of another instance where finalizers are used this way, so it might be worth some effort to discover whether there are any problems with potentially very large lists of finalizers. Imagine, for example, some CI system which uses a 'golden image' extremely heavily and has 1,000 volumes using it. Would that be OK?

However, I don't think we can safely add the finalizer asynchronously. A finalizer would typically be added synchronously before resource creation. It's also worth noting that the resource in this case would be the PV rather than the PVC.

Could we add code which adds and persists the snapshot finalizer immediately before creating the PV?

We would also need to add code to remove the finalizer in the deletion flow of the PV strictly before removing the PV finalizer.

In the past few days, I tried a few approaches to adding the finalizer synchronously before the PV is created:

  1. I tried implementing this in the CreateVolume function, but I found there is no graceful way to get enough context. For example, I can't obtain the Kubernetes VolumeSnapshot's name and namespace, so I can't add the finalizer to it.
  2. I tried adding a webhook to intercept PV creation, which can add the finalizer before the PV is created. But this increases the difficulty of deployment and maintenance. Moreover, if a cluster has multiple webhooks for PVs and another webhook rejects the PV's creation after this one runs, I don't know how to resolve that situation. So I don't like this approach.

I think the first approach is more graceful and appropriate, but it requires the external-provisioner sidecar to provide more information. So I think we should submit a KEP against the CSI spec so that we can get enough information about the VolumeSnapshot in CreateVolume to resolve this issue properly.

@jeffyjf jeffyjf changed the title [cinder-csi-plugin] Add finalizer to volumeshot when create PVC with the volumeshot [cinder-csi-plugin] Add finalizer to VolumeSnapshot when create PVC with the VolumeSnapshot Oct 9, 2023
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 20, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 2, 2023
@jeffyjf (Contributor, Author) commented Nov 2, 2023

As mentioned in the issue, I do like the concept of adding a finalizer to the snapshot for each volume which requires it. However, I can't think of another instance where finalizers are used this way, so it might be worth some effort to discover whether there are any problems with potentially very large lists of finalizers. Imagine, for example, some CI system which uses a 'golden image' extremely heavily and has 1,000 volumes using it. Would that be OK?
However, I don't think we can safely add the finalizer asynchronously. A finalizer would typically be added synchronously before resource creation. It's also worth noting that the resource in this case would be the PV rather than the PVC.
Could we add code which adds and persists the snapshot finalizer immediately before creating the PV?
We would also need to add code to remove the finalizer in the deletion flow of the PV strictly before removing the PV finalizer.

In the past few days, I tried a few approaches to adding the finalizer synchronously before the PV is created:

  1. I tried implementing this in the CreateVolume function, but I found there is no graceful way to get enough context. For example, I can't obtain the Kubernetes VolumeSnapshot's name and namespace, so I can't add the finalizer to it.
  2. I tried adding a webhook to intercept PV creation, which can add the finalizer before the PV is created. But this increases the difficulty of deployment and maintenance. Moreover, if a cluster has multiple webhooks for PVs and another webhook rejects the PV's creation after this one runs, I don't know how to resolve that situation. So I don't like this approach.

I think the first approach is more graceful and appropriate, but it requires the external-provisioner sidecar to provide more information. So I think we should submit a KEP against the CSI spec so that we can get enough information about the VolumeSnapshot in CreateVolume to resolve this issue properly.

I've already submitted a PR to external-provisioner so that we can add the finalizer to the VolumeSnapshot synchronously in the future. For now, let's continue with this imperfect solution to fix issue #2294 and hold back the resource leak in production environments as soon as possible.

@dulek (Contributor) commented Nov 8, 2023

Hi @mdbooth, thanks very much for your comments, but I have a question.

However, I don't think we can safely add the finalizer asynchronously. A finalizer would typically be added synchronously before resource creation. It's also worth noting that the resource in this case would be the PV rather than the PVC.

I am a newbie. In my opinion, this cannot be implemented with a watcher alone; we would need to add a webhook to do it. Is that right? If not, do you have any other suggestions?

Hey, what is the problem with adding a finalizer asynchronously? We did that all the time in Kuryr and it worked perfectly well, as long as we made sure to do it by patching the resource properly.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Nov 9, 2023
@jeffyjf (Contributor, Author) commented Nov 9, 2023

/hold cancel

@jeffyjf (Contributor, Author) commented Nov 13, 2023

Hi CPO Co-Leads @dulek @jichenjc @kayrus @zetaab

This PR is ready for review. Please take a look at it in your spare time; I'm eager to get your advice.

@jichenjc (Contributor)

Adding the finalizer asynchronously via a watcher seems unsafe and might cause some subtle issues. After PR [1] is merged and released, we should refactor this and add the finalizer synchronously.

Thanks for the PR. I thought our PR relied on kubernetes-csi/external-provisioner#1070,
but reading again, it seems you are proposing a long-term solution in external-provisioner, and after that's done
we will do another round of fixes to the current one? So this PR is a temporary stopgap?

@jeffyjf (Contributor, Author) commented Nov 14, 2023

Adding the finalizer asynchronously via a watcher seems unsafe and might cause some subtle issues. After PR [1] is merged and released, we should refactor this and add the finalizer synchronously.

Thanks for the PR. I thought our PR relied on kubernetes-csi/external-provisioner#1070, but reading again, it seems you are proposing a long-term solution in external-provisioner, and after that's done we will do another round of fixes to the current one? So this PR is a temporary stopgap?

Yep. I am not sure when kubernetes-csi/external-provisioner#1070 will be merged and released. This PR can be used temporarily to hold back the resource leak.

@jeffyjf (Contributor, Author) commented Nov 14, 2023

Hi @mdbooth, thanks very much for your comments, but I have a question.

However, I don't think we can safely add the finalizer asynchronously. A finalizer would typically be added synchronously before resource creation. It's also worth noting that the resource in this case would be the PV rather than the PVC.

I am a newbie. In my opinion, this cannot be implemented with a watcher alone; we would need to add a webhook to do it. Is that right? If not, do you have any other suggestions?

Hey, what is the problem with adding a finalizer asynchronously? We did that all the time in Kuryr and it worked perfectly well, as long as we made sure to do it by patching the resource properly.

There may just be some potential problems; I have no concrete example. Maybe in an extremely busy environment, the resource could be deleted before the finalizer is successfully added.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 23, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 23, 2023
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 12, 2023
@k8s-ci-robot (Contributor)

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kayrus (Contributor) commented Dec 12, 2023

@jeffyjf can you rebase on latest master?

@jeffyjf (Contributor, Author) commented Dec 13, 2023

@jeffyjf can you rebase on latest master?

Of course, I will rebase it later on.

@k8s-ci-robot (Contributor) commented Dec 13, 2023

@jeffyjf: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • openstack-cloud-keystone-authentication-authorization-test (commit ee04af8, required): /test openstack-cloud-keystone-authentication-authorization-test
  • openstack-cloud-controller-manager-e2e-test (commit ee04af8, required): /test openstack-cloud-controller-manager-e2e-test
  • openstack-cloud-csi-manila-e2e-test (commit ee04af8, required): /test openstack-cloud-csi-manila-e2e-test
  • openstack-cloud-csi-manila-sanity-test (commit ee04af8, required): /test openstack-cloud-csi-manila-sanity-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 13, 2023
@kayrus (Contributor) commented Dec 13, 2023

@jeffyjf can you fix unit tests?

@jeffyjf (Contributor, Author) commented Dec 13, 2023

@jeffyjf can you fix unit tests?

Of course, I'll fix it asap.

This patch implements a PVC watcher. When the watcher receives an ADDED
event for a PVC whose dataSource is of kind VolumeSnapshot, a finalizer
is added to the VolumeSnapshot. The finalizer is removed when the watcher
receives the PVC's DELETED event.

NOTE: Adding the finalizer asynchronously via a watcher seems unsafe and
      might cause some subtle issues. After PR [1] is merged and released,
      we should refactor this and add the finalizer synchronously.

[1] kubernetes-csi/external-provisioner#1070
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 3, 2024
@k8s-ci-robot (Contributor)

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dulek (Contributor) left a comment

This feels like it should be a proper K8s controller and not just a simplified version of a watcher.

Comment on lines +22 to +30
snap "github.com/kubernetes-csi/external-snapshotter/client/v6/clientset/versioned"
"golang.org/x/net/context"
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/watch"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/rest"
"k8s.io/klog/v2"
"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

Looks like this is mixing K8s and non-K8s imports. No biggie.

Comment on lines +67 to +70
if event.Type == watch.Added {
klog.V(5).InfoS("Receive ADDED event try to add finalizer to VolumeSnapshot", "PersistentVolumeClaim", pvc.Name, "VolumeSnapshot", snapshotObj.Name)
update = controllerutil.AddFinalizer(snapshotObj, finalizer)
}

Uhm, what if we missed an ADDED event somehow, but then get the MODIFIED one? I believe we should rather look for ADDED/MODIFIED, check whether the finalizer is there, and add it if it's not.
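A sketch of this suggestion, treating ADDED and MODIFIED identically so that a missed ADDED event is repaired by any later MODIFIED one. This is self-contained: the event-type constants only mimic `k8s.io/apimachinery/pkg/watch`, and the finalizer name is illustrative:

```go
package main

import "fmt"

type eventType string

const (
	added    eventType = "ADDED"
	modified eventType = "MODIFIED"
	deleted  eventType = "DELETED"
)

func hasFinalizer(fs []string, f string) bool {
	for _, x := range fs {
		if x == f {
			return true
		}
	}
	return false
}

// desiredUpdate reports whether the snapshot's finalizer list needs a
// change for the given PVC event. ADDED and MODIFIED are handled the
// same way (add the finalizer only if it is missing), which makes the
// handler idempotent across missed or replayed events.
func desiredUpdate(ev eventType, finalizers []string, f string) (needAdd, needRemove bool) {
	switch ev {
	case added, modified:
		needAdd = !hasFinalizer(finalizers, f)
	case deleted:
		needRemove = hasFinalizer(finalizers, f)
	}
	return
}

func main() {
	const f = "example.org/vs-protection" // illustrative finalizer name
	add, _ := desiredUpdate(modified, nil, f)
	fmt.Println(add) // true: a MODIFIED event repairs a missed ADDED
}
```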

Comment on lines +71 to +74
if event.Type == watch.Deleted {
klog.V(5).InfoS("Receive DELETED event try to remove finalizer from VolumeSnapshot", "PersistentVolumeClaim", pvc.Name, "VolumeSnapshot", snapshotObj.Name)
update = controllerutil.RemoveFinalizer(snapshotObj, finalizer)
}

What if you've missed a deletion event here? Are we stuck with a snapshot?
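One common way to cover a missed DELETED event is a periodic List-based resync that strips the finalizer from any snapshot no longer referenced by a live PVC, so a snapshot cannot stay stuck forever. A minimal sketch with plain maps standing in for List results; the names and types are illustrative, not part of this PR:

```go
package main

import "fmt"

// staleSnapshots returns the snapshots that carry our finalizer but are
// no longer referenced by any live PVC. A periodic resync would remove
// the finalizer from these, repairing any missed DELETED events.
func staleSnapshots(protected map[string]bool, pvcToSnapshot map[string]string) []string {
	inUse := map[string]bool{}
	for _, snap := range pvcToSnapshot {
		inUse[snap] = true
	}
	var stale []string
	for snap := range protected {
		if !inUse[snap] {
			stale = append(stale, snap)
		}
	}
	return stale
}

func main() {
	protected := map[string]bool{"snap-a": true, "snap-b": true}
	// The PVC for snap-b was deleted while the watcher was down.
	pvcToSnapshot := map[string]string{"pvc-1": "snap-a"}
	fmt.Println(staleSnapshots(protected, pvcToSnapshot)) // [snap-b]
}
```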

update = controllerutil.RemoveFinalizer(snapshotObj, finalizer)
}
if update {
_, err := cw.snapClient.SnapshotV1().VolumeSnapshots(pvc.Namespace).Update(context.Background(), snapshotObj, metav1.UpdateOptions{})

This should be a Patch, otherwise we might end up replacing changes made by some other controller.
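A Patch only touches the fields it names, so concurrent changes by other controllers survive. Below is a sketch of building an RFC 6902 JSON Patch that appends a finalizer; this is an assumption about how the fix could look, not code from the PR. Note that the `add` op with path `/metadata/finalizers/-` fails if the list is absent, so a real implementation might first test for the list or fall back to a merge patch:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOp models one RFC 6902 JSON Patch operation.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value,omitempty"`
}

// addFinalizerPatch builds a patch body that appends f to
// metadata.finalizers; "-" means "append at the end of the array".
func addFinalizerPatch(f string) ([]byte, error) {
	return json.Marshal([]patchOp{{Op: "add", Path: "/metadata/finalizers/-", Value: f}})
}

func main() {
	b, _ := addFinalizerPatch("example.org/vs-protection") // illustrative name
	fmt.Println(string(b))
	// [{"op":"add","path":"/metadata/finalizers/-","value":"example.org/vs-protection"}]
}
```

The resulting bytes would then be sent with something like `VolumeSnapshots(ns).Patch(ctx, name, types.JSONPatchType, b, metav1.PatchOptions{})` instead of the full-object Update.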

@@ -65,6 +69,7 @@ func main() {

cmd.PersistentFlags().StringVar(&cluster, "cluster", "", "The identifier of the cluster that the plugin is running in.")
cmd.PersistentFlags().StringVar(&httpEndpoint, "http-endpoint", "", "The TCP network address where the HTTP server for providing metrics for diagnostics, will listen (example: `:8080`). The default is empty string, which means the server is disabled.")
cmd.PersistentFlags().BoolVar(&vsDeletionProtectionEnabled, "enable-vs-deletion-protection", false, "If set, The VS will be added a finalizer while a PVC created base on it, ensure the VS couldn't be deleted before the PVC be deleted.")

"VS" is not a widely used abbreviation; I think one might read it as "virtual server"?

watcher watch.Interface
}

func (cw *controllerWatcher) run() {

I am not sure, but this seems to me like a general feature rather than something CPO-specific? E.g. every cloud has this problem.
Should this be part of the general snapshot machinery, e.g. handled by the external snapshot controller?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 10, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 10, 2024
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

[cinder-csi-plugin:bug] Volume Snapshot deletion is not detecting errors
7 participants