Moving addons to higher priority layers doesn't prune from the lower layer #207

Open

avacaru opened this issue Jun 9, 2021 · 15 comments

avacaru commented Jun 9, 2021

Describe the bug
When moving an addon from one layer to another, the reconciliation doesn't complete successfully because the old resources are not pruned.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy two layers, one dependent on the other. E.g. app-layer depends on base-layer. Let's say app-layer has two addons, app-runtime and app-logging, and base-layer contains at least one addon, base-addon.
  2. Move app-logging from app-layer to base-layer, renaming it from app-logging to base-logging (see the sketch after this list). Update the layer versions, git-repository source, etc.
  3. Apply the new AddonsLayer manifest files.
  4. See the error: base-logging cannot be installed because its resources already exist (i.e. the deployment already exists) and they are not managed by this helm release.
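
For illustration, the move amounts to a HelmRelease rename along these lines. A sketch only: the apiVersion and chart name are assumed, while the release names, namespace, and source names are taken from this report.

Before the move, app-layer deploys:

apiVersion: helm.toolkit.fluxcd.io/v2beta1  # apiVersion assumed
kind: HelmRelease
metadata:
  name: app-logging
  namespace: my-namespace
spec:
  interval: 1m
  chart:
    spec:
      chart: logging  # chart name assumed
      sourceRef:
        kind: GitRepository
        name: app-layer-git-location

After the move, base-layer deploys the same chart under a new release name:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: base-logging
  namespace: my-namespace
spec:
  interval: 1m
  chart:
    spec:
      chart: logging
      sourceRef:
        kind: GitRepository
        name: base-layer-git-location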

Expected behavior
When I remove an addon from a layer, it should be pruned regardless of its dependency relationship to other layers.

Kraan Helm Chart Version = v0.2.8
Kubernetes Version = v1.19.9

nab-gha (Contributor) commented Jun 9, 2021

@avacaru The change of name of the helmrelease custom resource will prevent orphan/adoption processing. However, I'd expect the app-logging resources to be deleted when the app-logging helm release is deleted by prune processing, so that base-logging can be deployed. Initially, though, the app-layer pruning may still be in progress while the base-layer (which has nothing to prune) proceeds to the apply phase. The base-logging helm release will fail at that point but should be retried: Kraan just deploys the helm releases and relies on the helm release being retried. Check the retry settings?

avacaru (Author) commented Jun 10, 2021

@paulcarlton-ww This is what I have in that HR's manifest:

spec:
  install:
    remediation:
      retries: -1
  upgrade:
    remediation:
      retries: -1

I was expecting the same thing; even after the initial failure I was hoping it would be fixed in the next reconciliation cycle.

Also, I can see this error in the logs of the kraan-controller:

Operation cannot be fulfilled on addonslayers.kraan.io "app-layer": the object has been modified; please apply your changes to the latest version and try again

controllers.(*AddonsLayerReconciler).update - addons_controller.go(896) - failed to update
github.com/fidelity/kraan/controllers.(*AddonsLayerReconciler).update
	/workspace/controllers/addons_controller.go:896
github.com/fidelity/kraan/controllers.(*AddonsLayerReconciler).updateRequeue
	/workspace/controllers/addons_controller.go:702
github.com/fidelity/kraan/controllers.(*AddonsLayerReconciler).Reconcile
	/workspace/controllers/addons_controller.go:837
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:244
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.6.3/pkg/internal/controller/controller.go:197
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/apimachinery@v0.20.2/pkg/util/wait/wait.go:90
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1373
controllers.(*AddonsLayerReconciler).updateRequeue - addons_controller.go(703) - failed to update

Does that help?

nab-gha (Contributor) commented Jun 10, 2021

The kraan log error is normal: it is the controller runtime encountering a resource version mismatch due to concurrent reconciliation of a layer. The standard practice in Kubernetes is to redo the reconcile using the latest version of the layer, so Kraan schedules an immediate re-reconcile and it should be fine. I'd only be concerned if copious numbers of this error were being generated.
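
For context, that "redo using the latest version" practice is usually implemented with client-go's conflict-retry helper. A minimal sketch, not Kraan's actual code (the AddonsLayer import path and the Status.State field are assumptions):

package main

import (
	"context"

	kraanv1alpha1 "github.com/fidelity/kraan/api/v1alpha1" // import path assumed
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// updateLayerState retries the update whenever the API server rejects it
// with the "object has been modified" conflict seen in the log above.
func updateLayerState(ctx context.Context, c client.Client, key types.NamespacedName, state string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		layer := &kraanv1alpha1.AddonsLayer{}
		// Re-fetch so we mutate the latest resourceVersion, not a stale copy.
		if err := c.Get(ctx, key, layer); err != nil {
			return err
		}
		layer.Status.State = state
		return c.Status().Update(ctx, layer)
	})
}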

The retry spec is correct. I'm wondering if this is a helm/helm-controller issue. If you restart the helm-controller or delete the HelmRelease, does that fix it?
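
For reference, those two checks are along these lines (the helm-controller namespace is assumed, and here the stale app-logging release is the one deleted):

# Restart the helm-controller deployment (namespace assumed).
kubectl -n gitops-system rollout restart deployment/helm-controller

# Or delete the stale HelmRelease by hand.
kubectl -n my-namespace delete helmrelease app-logging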

avacaru (Author) commented Jun 10, 2021

There is indeed a high number of these errors, approximately one every minute (the AddonsLayer interval setting).

I've tried deleting the HR manually, and then all the layers end up successfully deployed.
I haven't tried restarting the helm-controller yet, but I'll give it a go now.

nab-gha (Contributor) commented Jun 10, 2021

One a minute suggests the periodic sync of all layers, i.e. the controller reconciles all layers every minute. I expect this clashes with the repeated attempts to process the layer whose base-logging helmrelease is in error.
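
That periodic sync is driven by each layer's interval field. A sketch, with the one-minute value taken from this thread (field name per the AddonsLayer CRD):

apiVersion: kraan.io/v1alpha1
kind: AddonsLayer
metadata:
  name: app-layer
spec:
  interval: 1m  # every layer is re-reconciled on this period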

I think this is a helm-controller/helm issue: all Kraan is responsible for is applying changes to HelmRelease objects, and it relies on the helm-controller applying that change. The fact that deleting the HR fixes the issue suggests that something is not working right in the helm-controller, possibly due to some nuance of the way helm works.

avacaru (Author) commented Jun 10, 2021

I've checked the helm-controller logs and I can't see any errors about the initial app-logging HelmRelease failing to be deleted; all I see is that the reconciliation has succeeded. I am wondering: does the kraan-controller ever delete the app-logging HR?

Any other logs I can look into in order to track this issue down?

nab-gha (Contributor) commented Jun 10, 2021

Check the kraan log; depending on the verbosity setting you should see the app-logging HR being deleted. But it's easier to just check using kubectl: the HR should be gone, and all the resources it created should have been deleted too. Check the deployment.apps object, which should have been deleted; if it is still there with an owner annotation indicating that it belongs to the app-logging HR, then that is the cause of the issue. I assumed from your original description that this deletion had occurred.
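
Concretely, those kubectl checks are something like this (the deployment name here is hypothetical):

# The old HelmRelease should be gone.
kubectl -n my-namespace get helmrelease app-logging

# If a left-over object is still present, its Helm ownership annotation
# will still name the old release.
kubectl -n my-namespace get deployment logging \
  -o jsonpath='{.metadata.annotations.meta\.helm\.sh/release-name}'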

nab-gha (Contributor) commented Jun 10, 2021

From a Helm-Controller maintainer...
The helm-controller does indeed not garbage collect if you change whatever release it is pointed at.
The other mention, about the "already modified" error, is something I have been working on across controllers. There is a patch helper pending in fluxcd/pkg#101 that knows how to deal with conflicts; integration of this is bundled with some other changes and will take some time before it is available.

@avacaru Please raise a Helm-Controller issue at https://github.com/fluxcd/helm-controller/issues

avacaru (Author) commented Jun 10, 2021

Check the kraan log; depending on the verbosity setting you should see the app-logging HR being deleted. But it's easier to just check using kubectl: the HR should be gone, and all the resources it created should have been deleted too. Check the deployment.apps object, which should have been deleted; if it is still there with an owner annotation indicating that it belongs to the app-logging HR, then that is the cause of the issue. I assumed from your original description that this deletion had occurred.

That is the actual problem: the old HR does not get removed. I can still see it when I run kubectl get hr. That's why the new HR, in the new layer, can't be deployed: resources with the same names already exist.

Helm install failed: rendered manifests contain a resource that already exists. Unable to continue with install: ServiceAccount "logging-sa" in namespace "my-namespace" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-name" must equal "base-logging": current value is "app-logging"

Also, if I remove the old HR manually, the reconciliation succeeds and all layers end up with status Deployed. Without that, they are stuck like this:

NAME              VERSION         SOURCE                          PATH                                  STATUS                MESSAGE
base-layer        0.6.0           base-layer-git-location         ./add-ons/helm-releases               Failed                AddonsLayer failed, HelmRelease: my-namespace/base-logging, not ready
app-layer         1.1.1           app-layer-git-location          ./add-ons/helm-releases               ApplyPending          Waiting for layer: base-layer, to apply source revision: 0.6.0/523d71d14a89cd461b63914055e5e1d593e9098d. Layer: base-layer, current state: Failed, deployed revision: 0.5.0/22b5c459bfd54a50e7d6b90e05c16c331b50040c.

Can I get someone to reproduce this issue before I open an issue in the helm-controller, please?

nab-gha (Contributor) commented Jun 10, 2021

@avacaru The app-logging HR should have been pruned by the Kraan controller; please post the kraan logs here. If you need to re-run to recreate it, set the log level to 4 for copious logging. Thanks.

avacaru (Author) commented Jun 10, 2021

@paulcarlton-ww I've set the log level to 4 and reproduced the bug. I can't see any errors in the logs; the only reference to pruning is in these lines:

{"level":"Level(-4)","ts":"2021-06-10T14:52:47.194Z","logger":"kraan.controller.reconciler","msg":"Entering function","function":"controllers.(*AddonsLayerReconciler).processPrune","source":"addons_controller.go","line":384}
{"level":"Level(-4)","ts":"2021-06-10T14:52:47.194Z","logger":"kraan.controller.reconciler","msg":"Entering function","layer":"app-layer","function":"apply.KubectlLayerApplier.PruneIsRequired","source":"layerApplier.go","line":802}

There is no other reference that would indicate the layer has been pruned of addons no longer present in the git source.

nab-gha (Contributor) commented Jun 10, 2021

Can you post the complete log?

avacaru (Author) commented Jun 10, 2021

Can you post the complete log?

Unfortunately not, but if you don't have an environment to reproduce this, what should I look for in the logs?
Is there any particular string I can search for that demonstrates the app-logging HR has been removed by the kraan-controller?

nab-gha (Contributor) commented Jun 10, 2021

@avacaru I will recreate this myself when I get time, but that might not be for a few days.

nab-gha (Contributor) commented Jun 11, 2021

@avacaru I've tested this using the head of master; it seems to work fine in both directions.
Can you share the HR definitions for base-logging and app-logging? Are the images being used public?
