Inconsistent terminationGracePeriodSeconds set in different versions of calico-node daemonset #8691

Open
BenjaminHuang opened this issue Apr 3, 2024 · 5 comments

Comments

@BenjaminHuang

BenjaminHuang commented Apr 3, 2024

The calico-node daemonset has terminationGracePeriodSeconds set.

In the manifest version, it's coded as 0:
terminationGracePeriodSeconds: 0

But in the version generated by tigera-operator, it's coded as 5:
terminationGracePeriodSeconds: 5

However, both versions have a preStop hook specified:

        lifecycle:
          preStop:
            exec:
              command:
              - /bin/calico-node
              - -shutdown
  • If terminationGracePeriodSeconds is set to 0,
    the preStop hook never gets a chance to run, which minimizes the impact of deleting calico-node.

  • If terminationGracePeriodSeconds is set to a non-zero value,
    the preStop hook will cause the NetworkUnavailable status to be set:

  conditions:
  - lastHeartbeatTime: "2024-04-03T08:18:42Z"
    lastTransitionTime: "2024-04-03T08:18:42Z"
    message: Calico is running on this node
    reason: CalicoIsUp
    status: "False"
    type: NetworkUnavailable

which eventually causes kube-controller-manager to add a no-schedule taint:

  taints:
  - effect: NoSchedule
    key: node.kubernetes.io/network-unavailable
    timeAdded: "2024-04-03T07:17:46Z"
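
For reference, the condition and taint above can be read directly with something like the following (the node name is a placeholder):

  kubectl get node <node> -o jsonpath='{.status.conditions[?(@.type=="NetworkUnavailable")]}'
  kubectl get node <node> -o jsonpath='{.spec.taints}'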

However, I'm not sure which is the desired behavior.

Expected Behavior

terminationGracePeriodSeconds should be consistent in the calico-node daemonset, both in the manifest and in the tigera-operator-generated version.

Current Behavior

terminationGracePeriodSeconds is inconsistent in the calico-node daemonset between the manifest and the tigera-operator-generated version.

Possible Solution

Set terminationGracePeriodSeconds to 0 in both versions of the calico-node daemonset.

Steps to Reproduce (for bugs)

  1. Go to the installation guide, e.g. https://docs.tigera.io/calico/latest/getting-started/kubernetes/self-managed-onprem/onpremises
  2. Choose an installation method, manifest or operator, in a k8s cluster (optionally using kind)
  3. Follow the instructions to complete the installation
  4. Try deleting a calico-node instance and inspect the corresponding node status/taints during deletion (see the sketch after this list)
  5. Compare the behavior between the two installations, focusing on the difference in terminationGracePeriodSeconds
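
A minimal sketch of steps 4–5, assuming the default namespaces (kube-system for the manifest install, calico-system for the operator install) and a placeholder <node> name:

  # compare the value set by each installation method
  kubectl -n kube-system get ds calico-node -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}'
  kubectl -n calico-system get ds calico-node -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}'

  # delete the calico-node pod on one node, then inspect that node's
  # NetworkUnavailable condition and taints as shown above
  kubectl -n calico-system delete pod -l k8s-app=calico-node --field-selector spec.nodeName=<node>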

Context

I want a Calico installation from the manifest and one from the tigera-operator to behave the same way during calico-node deletion.

Your Environment

  • Calico version 3.20.6
  • Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes 1.20.7
  • Operating System and version: Ubuntu 20.04
  • Link to your project (optional):
@caseydavenport
Member

I suspect the manifest value needs to be increased to match what the operator is setting, and to enable the preStop hook to run.

@cyclinder
Contributor

If terminationGracePeriodSeconds is set to a non-zero value,
the preStop hook will cause the NetworkUnavailable status to be set:

Does this look like it's expected? If so, we need to adjust the manifest value to 5.

@BenjaminHuang
Author

BenjaminHuang commented Apr 10, 2024

If terminationGracePeriodSeconds is set to a non-zero value,
the preStop hook will cause the NetworkUnavailable status to be set:

Does this look like it's expected? If so, we need to adjust the manifest value to 5.

That depends on your situation:

  • If all of your pods are on the overlay network and you don't want new pods scheduled onto the node while calico-node crashes/restarts, a positive value is better.
  • If all of your pods are on the host network and you just want to delete calico-node from the cluster, leaving it at zero is fine.

@BenjaminHuang
Author

BenjaminHuang commented Apr 10, 2024

I suspect the manifest value needs to be increased to match what the operator is setting, and to enable the preStop hook to run.

I'm not sure whether setting both to a positive value would be better.

I guess adding a comment above this parameter, describing the impact on node status, would be good enough.

Without that, it's hard to know what will happen when changing it; you have to dig the details out of the source code.

@caseydavenport
Member

If all of your pods are on the host network and you just want to delete calico-node from the cluster, leaving it at zero is fine.

Agreed, although I would classify this as an exceptional case and far from the expected scenario in 90% of Kubernetes clusters using Calico.

I think we should:

  • Adjust the manifest to use the same value as the operator.
  • Add a comment explaining why the value is set that way, to aid anyone who might want to adjust it (a rough sketch is below).
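
For illustration, the manifest entry could look something like this; the exact wording of the comment (and whether 5 remains the right number) is just a sketch:

  # Give calico-node time to run its preStop hook (/bin/calico-node -shutdown) before
  # the container is killed. Note that with a non-zero value the node may briefly be
  # reported NetworkUnavailable (and tainted) while calico-node shuts down.
  terminationGracePeriodSeconds: 5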
