Proposal - Pod safety and termination guarantees #34160
Conversation
This proposal describes how to evolve the Kubernetes cluster to provide at-most-one semantics for pod execution and to allow selective relaxation where necessary to heal partitions.
Linked a number of issues that relate to pet safety, pod termination guarantees, and storage safety.
/cc @adohe
entity will be spawned) but may become unavailable (cluster no longer has
a sufficient number of members). The Pet Set guarantee must be strong enough
for an administrator to reason about the state of the cluster by observing
the Kubrenetes API.
Typo: Kubernetes
Clients today may assume that force deletions are safe. We must appropriately
audit clients to identify this behavior and improve the messages. For instance,
`kubectl delete --grace-period=0` could print a warning and require `--confirm`:
If the pod has pending finalizers, --grace-period=0 will not delete the pod from the key-value store. Rather than overloading the `grace-period` flag, can we define a new flag, like `--forceful-deletion`, which will instruct the API server to also ignore pending finalizers?
I don't think of force deletion as bypassing finalizers (in the context the proposal is written). This would be delete 0 as it is today. If that bypasses finalizers then we may want to clarify what exactly we want the role of finalizers to be wrt deletion here so it's clear.
Gentle nag for more review from those this impacts - this is a p1 1.5 item for petsets
spelling/grammar nits
* If no grace period is provided, the default from the pod is leveraged
* When the kubelet observes the deletion, it starts a timer equal to the
grace period and performs the following actions:
* Executes the pre-stop hook, if specified, waiting up **grace period**
up -> up to
grace period and performs the following actions:
* Executes the pre-stop hook, if specified, waiting up **grace period**
before continuing
* Sends the termination signal to the container runtime (SIGTERM)
Does k8s support stopping a container via the STOPSIGNAL provided in a dockerfile?
Yes, we invoke docker stop which calls stopsignal on the image.
* Executes the pre-stop hook, if specified, waiting up **grace period**
before continuing
* Sends the termination signal to the container runtime (SIGTERM)
* Waits 2 seconds, or the remaining grace period, which ever is longer
which ever -> whichever
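For readers following the thread, the quoted steps amount to roughly the sequence below. This is a simplified illustrative sketch, not the actual kubelet code; the `Pod` type and `sendSignal` helper are stand-ins.

```go
package main

import "time"

// Pod is a minimal stand-in for the real API object; this whole block only
// illustrates the ordering of the quoted termination steps.
type Pod struct {
	Name        string
	PreStopHook func() // optional pre-stop hook
}

// sendSignal stands in for the call into the container runtime
// (e.g. docker stop / docker kill for the Docker runtime).
func sendSignal(pod *Pod, signal string) {}

func terminatePod(pod *Pod, gracePeriod time.Duration) {
	deadline := time.Now().Add(gracePeriod)

	// Execute the pre-stop hook, if specified, waiting up to the grace period.
	if pod.PreStopHook != nil {
		done := make(chan struct{})
		go func() { pod.PreStopHook(); close(done) }()
		select {
		case <-done:
		case <-time.After(time.Until(deadline)):
		}
	}

	// Send the termination signal (SIGTERM, or the image's STOPSIGNAL).
	sendSignal(pod, "SIGTERM")

	// Wait 2 seconds or the remaining grace period, whichever is longer.
	// A real implementation would stop waiting as soon as the runtime
	// reports the container has exited rather than sleeping the full interval.
	wait := time.Until(deadline)
	if wait < 2*time.Second {
		wait = 2 * time.Second
	}
	time.Sleep(wait)

	// Send the force termination signal.
	sendSignal(pod, "SIGKILL")
}

func main() {
	terminatePod(&Pod{Name: "web-0"}, 30*time.Second)
}
```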
to run for an arbitrary amount of time. If a higher level component like the
PetSet controller treats the existence of the pod API object as a strongly
consistent entity, deleting the pod in this fashion will violate the
at-most-one guarantees we wish to offer for pet sets.
guarantees -> guarantee ?
ReplicaSets and ReplicationControllers both attempt to preserve availability
of their constituent pods over ensuring **at most one** semantics. So a
replica set to scale 1 will immediately create a new pod when it observes an
old pod is delete, and as a result at many points in the lifetime of a replica
is delete -> has been deleted
is used by two pods on different nodes simultaneously, concurrent access may
result in corruption, even if the PV or PVC is identified as "read write one".
PVC consumers must ensure these volume types are *never* referenced from
mulitple pods without some external synchronization. As described above, it
mulitple -> multiple
### Avoid multiple instances of pods

To ensure that the Pet Set controller can safely use pods and ensure at most
ensure -> ensure that
* Application owners must be free to force delete pods, but they *must*
understand the implications of doing so, and all client UI must be able
to communicate those implications.
* All existing controllers in the system must be limited signaling pod
limited -> limited to
* Additional agents running on each host to force kill process or trigger reboots
* Agents integrated with or communicating with hypervisors running hosts to stop VMs
* Hardware IPMI interfaces to reboot a host
* Rack level power units to power cycle a blad
blad -> blade
5. The kubelet on node `A` observes the pod references a PVC that specifies RWO which
requires "attach" to be successful
6. The attach/detach controller observes that a pod has been bound with a PVC that
requires "attach", and attempts to execute a CAS update on the PVC/PV attaching
Curious, what is 'CAS'?
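(CAS here is presumably compare-and-swap: read the object, check its current state, and write back an update that the API server rejects if the object changed in the meantime, i.e. an optimistic-concurrency update keyed on the resource version.) A minimal sketch of the idea; the types and the `AttachedNode` field are hypothetical stand-ins for whatever field or annotation the proposal would actually use:

```go
package pvclaim

import "fmt"

// PV is a minimal stand-in; AttachedNode is a hypothetical field (or label)
// recording which node the volume is attached to.
type PV struct {
	Name            string
	ResourceVersion string
	AttachedNode    string
}

// Client is a stand-in for an API client whose Update fails with a conflict
// error when the stored ResourceVersion no longer matches the one read.
type Client interface {
	Get(name string) (*PV, error)
	Update(pv *PV) error
}

// claimForNode is the compare-and-swap: read, check, conditional write.
// If another controller attached the volume between Get and Update, the
// Update is rejected and the caller retries (or gives up).
func claimForNode(c Client, pvName, node string) error {
	pv, err := c.Get(pvName)
	if err != nil {
		return err
	}
	if pv.AttachedNode != "" && pv.AttachedNode != node {
		return fmt.Errorf("pv %s is already attached to node %s", pvName, pv.AttachedNode)
	}
	pv.AttachedNode = node
	return c.Update(pv)
}
```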
to communicate those implications.
* All existing controllers in the system must be limited signaling pod
termination (starting graceful deletion), and are not allowed to force
delete a pod.
why not just leave this up to forgiveness? pods get default forgiveness that does what we do today (tolerate partition for 5m). All controllers are modified to respect forgiveness, users can modify it, petset controller will modify it to favor consistency by default.
Force deletion is what causes split brain. The only actors that can guarantee the process has been terminated (and thus that it is safe to launch another) are the kubelet or a fencer. So only the kubelet, a fencer, or a human should ever force delete. The petset controller shouldn't have to apply any consistency guarantee.
Forgiveness is delay before delete. Should have no coupling to force deletion.
Yeah, and forgiveness is a delay before a delete that is triggered by placement of a taint so there are plenty of deletion scenarios that forgiveness wouldn't help with. (It would help with node unreachable (Unknown node condition), though.)
The fencer can't always assume my stateless nginx pods will suffer from split brain, right? Maybe we don't need forgiveness, but assuming all pods need this safety guarantee seems weird. Ideally I'd like the fencer deletion to be configurable.
If delay == forever then the only way to override is to --force. The taint in this case is something like node down. If a user deletes (pod or ns), the pod gets deletion grace and the kubelet has final say, as always. If a user --force deletes, it's on them.
I think creating a ReplicaSet has no expectation of split brain prevention (today, it doesn't try to guarantee that). Could we say:
- ReplicaSets have no expectation of identity, and therefore replica sets can be force deleted after grace period expires (or maybe they can opt in)
- PetSets always have expectation of identity and uniqueness, and therefore never want to be force deleted except by safety.
I think fencing typically is correlated with short forgiveness / toleration of downtime. I.e. I have a maximum disruption budget of 2 min of downtime. The node controller detection of 40s is pretty short, but if we had other fencing detection mechanisms in the future (there's a lot of peer-peer heartbeats out there, could be service proxy or distributed downtime detection) that are a lot shorter, admins may want to fence first, ask questions later. I.e. disruption budget is 2min, forgiveness is 30s, fast failure detector can detect link disruption in 2s and vote, then the fencing controller could make a fast decision. If forgiveness is 10 minutes, and disruption budget is 5 minutes, then fast fencing is not that critical.
I agree not all pods need the safety guarantee - but otoh force deletion is already unnecessary for replica sets to maintain availability.
I think we're agreeing (I forgot about deletion timestamp). You're suggesting that we use a combination of forgiveness + disruptionBudget to communicate when to delete to the fencer right?
Yes, will add a paragraph in fencing for this.
The changes above allow Pet Sets to ensure at-most-one pod, but provide no
recourse for the automatic resolution of cluster partitions during normal
operation. For that, we propose a **fencing controller** which exists above
I think there's a gap between the previous section (which breaks current AP replica-set behavior by preventing node controller delete) and this (which only fences). What deletes pods in this world? Are we saying it's always a human in the partition case?
A cloud provider can properly observe dead nodes and delete them, so we can preserve that aspect. Partition only prevents pods from being removed from etcd. I think the default behavior has to be to never induce split brain except under stronger guarantees than "I waited".
It's possible that we could extend forgiveness to be an upper bound of tolerance for failure, and so could be opt in to induce a split brain for pet sets. But that's not safe without either unique fenced storage or code inside the petset that can tolerate membership changes by removing the items with a certain clock window.
is there a forgiveness node variant we can create that allows both "preserve without deleting" and "preserve without force deleting"?
Forgiveness hasn't been fully implemented yet but I was assuming that once the forgiveness period expires, it would delete the same way node controller deletes today when there's a NotReady (bad node/kubelet) or Unknown (node unreachable), i.e. the code here
func deletePods(kubeClient clientset.Interface, recorder record.EventRecorder, nodeName, nodeUID string, daemonStore cache.StoreToDaemonSetLister) (bool, error) {
which IIUC does a regular delete, not a force delete? (Where is the client Delete() operation that it's using defined?)
isn't there going to be an `event: *, duration: infinite` flavor for forgiveness?
func forcefullyDeleteNode(kubeClient clientset.Interface, nodeName string, forcefulDeletePodFunc func(*api.Pod) error) error {
Once we start using taints/tolerations/forgiveness to control eviction due to an unreachable/not ready node (unfortunately it was decided this will not be enabled by default in 1.5, though it will be implemented) you will be able to say "stay bound to the node no matter what happens". I think forcefullyDeleteNode() which Clayton pointed to may still be OK since it only activates for a node that has disappeared from the cloud provider, and IIRC you said it's fine for nodes the cloud provider knows are gone to be deleted? There may be other places in the NC where we force delete, and presumably we would have to replace those with graceful delete?
Yes
update on the PVC/PV clearing its attach state.
14. The attach/detach controller observes the second pod has been scheduled and
attaches it to node `B` and pod 2
15. The kubelet on node `B` observes the attach and allows the pod to execute.
I think we already have this flow working on cloudprovider backed storage, we just need to:
- teach all storage plugins (even those that are not "attachable") to honor it
- communicate between scheduler and volume controller about where exactly to attach (that's the first half of [WIP] Sticky emptydir proposal #30044)
The existing flow uses 2 fields in node.Status:
- volumesInUse: To block volumecontroller from detaching a mounted volume
- volumesAttached: To block another node from attaching a pre-attached volume
I can think of 2 things we might need to fix off the top of my head:
- I think there's a gap with the storage plugin on the node, where some allow 2 pods to bind mount the same volume RWO. I think gce allows this currently.
- Today, some volume controllers time out and force detach after a "long" time (2h). I think this needs fixing, especially if we're going to use them as described.
@kubernetes/sig-storage
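For reference, a simplified sketch of the shape of the two node.Status fields mentioned above (field types reduced to plain strings here; the real definitions live in the core API types):

```go
package nodestatus

// AttachedVolume is a simplified stand-in for the real API type.
type AttachedVolume struct {
	Name       string // unique volume name
	DevicePath string // device path the volume was attached at
}

type NodeStatus struct {
	// VolumesInUse lists volumes the kubelet still has mounted for pods on
	// this node; the attach/detach controller will not detach a volume that
	// is listed here.
	VolumesInUse []string

	// VolumesAttached lists volumes the attach/detach controller considers
	// attached to this node; it is what blocks another node from attaching a
	// pre-attached RWO volume.
	VolumesAttached []AttachedVolume
}
```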
Node status isn't transactional to the pv, so it doesn't prevent races when claiming.
The scheduler will be the single writer entering the name of a node into a field on the pv (or applying labels), and the kubelet will still reject anything with a name that's not its hostname. Once assigned to a node, the pv/pvc can't be reassigned to any other. After that normal attach/detach logic kicks in.
Or am I misunderstanding?
I think the scheduler should always make the assignment, not the volume controller. Given that the scheduler needs to write the hostname somewhere, the volumecontroller just needs to trigger its normal attach flow on that. Maybe you meant an active/passive volume controller setup?
@bprashanth, actually the force detach timeout period is 6 minutes now. And I am thinking of making it longer or even removing it, to avoid the possibility of detaching a mounted volume, since that would cause file system corruption.
From my understanding, the attach_detach_controller will update its desired state of world when a node is added/deleted or a pod is added/deleted/updated. So the following sequence might happen:
- PodA refers to VolumeX which is attached to NodeA
- PodB refers to VolumeX which is assigned to NodeB, and the reconciler will try to attach VolumeX to NodeB (this could succeed if RWX is supported for the volume)
- PodA is deleted and VolumeX is detached from NodeA.
So it seems like you suggest a way to prevent step 2 from happening using an extra field on the PV? What if the pod is referencing the volume directly? To avoid step 2, I think it is possible to change the workflow in the reconciler to check whether the volume access mode is RWO and the volume is already attached to a node according to the actual state records; if so, it should not trigger the attach operation.
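A sketch of the guard suggested in that last paragraph, with hypothetical names; the real reconciler would consult its actual-state-of-world cache rather than a plain map:

```go
package reconciler

// attachedNodeFor stands in for a lookup against the controller's
// actual-state-of-world bookkeeping; everything here is illustrative,
// not the real reconciler code.
func attachedNodeFor(volume string, actualState map[string]string) string {
	return actualState[volume]
}

// shouldAttach returns false when attaching an RWO volume to targetNode would
// conflict with an existing attachment to a different node.
func shouldAttach(volume, targetNode string, readWriteOnce bool, actualState map[string]string) bool {
	current := attachedNodeFor(volume, actualState)
	if readWriteOnce && current != "" && current != targetNode {
		return false // skip the attach instead of creating a second attachment
	}
	return true
}
```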
I would argue that all the attach logic today observes scheduler decisions, and that except for maybe the storage allocation for local persistent volume, it's more natural for attach detach to continue to observe scheduling and guarantee locks. It seems to match the description of what attach_detach_controller owns (binding PVs to nodes transactionally by making API calls to either cloud or kube)
The fact that a PV is attached to a node isn't clearly denoted in a way that is helpful to this idea, I think. There's some real work needed to make sure this is safe. Given how tricky the existing binding and attach/detach code is...
Yeah, that's the same problem we need to solve for sticky emptydir. We discussed using a field or labels on that proposal.
## Backwards compatibility

On an upgrade, pet sets would not be "safe" until the above behavior is implemented.
We'd just use (i.e. document) finalizers as poor man's forgiveness, right?
I'm not even sure pet set controller needs a finalizer. I think no one should force delete, and at that point the finalizer is unnecessary.
I meant if you want to preserve safety across some rollback that doesn't have what this proposal details (but not so far that it also doesn't have finalizers), you can put in a finalizer for each pet to block deletion by old-nodecontroller.
Ah. Do you think we'd need it?
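For context on the mechanism being discussed: a finalizer is just an entry in the object's metadata that must be cleared before the API server removes the object from storage, so an entry owned by the pet set controller would block an old node controller from making the pod disappear. A sketch with simplified stand-in types:

```go
package finalizers

import "time"

// ObjectMeta is a simplified stand-in for the real metadata struct.
type ObjectMeta struct {
	DeletionTimestamp *time.Time
	Finalizers        []string
}

// removable mirrors the API server's rule: the object is only removed from
// storage once deletion has been requested and the finalizer list is empty.
func removable(m ObjectMeta) bool {
	return m.DeletionTimestamp != nil && len(m.Finalizers) == 0
}

// removeFinalizer is what the owning controller would do once it has
// confirmed the pod's processes are actually gone.
func removeFinalizer(m *ObjectMeta, name string) {
	kept := m.Finalizers[:0]
	for _, f := range m.Finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	m.Finalizers = kept
}
```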
termination (starting graceful deletion), and are not allowed to force
delete a pod.
* The node controller will no longer be allowed to force delete pods -
it may only signal deletion
Aren't force deletions from the node controller still needed for the guarantee that RS/RC provide, that at least N replicas are running even in the face of node partitions?
Oh I see, that responsibility would move to the pod gc controller?
RS/RC immediately spin up new pods as soon as the node controller signals termination on the pod (by starting graceful delete). They exclude any terminating pods from their count.
@Kargakis raised today that that behavior actually violates the implicit guarantee of the recreate deployment strategy - that there is a period where there are zero pods running of the old version. I think we need to resolve that independently and he was filing an issue.
If we treat node deletion as implicit unlock (it probably is) then both pod GC and node controller would be allowed to force delete pods. However, I'm not 100% sure since today there are cases where the node would just recreate its node object, and people may not be expecting that delete node means "hard node evacuate".
terminated.
* Application owners must be free to force delete pods, but they *must*
understand the implications of doing so, and all client UI must be able
to communicate those implications.
these 2 are already true today right? (assuming "normal" operation is not netsplit)
Yes
I think the "implications" are not explained.
Is the 1.5 MVP to delete the code that force-deletes pods in the node and other controllers (as described in #31762 (comment))? Out of the 3 sections, the first is the one that solves the core issue. Both the fencer and the RWO lock border on new features.
Yes, and to add --force for kubectl delete (0 will not be allowed by
Fencer and RWO can be solved in a future release, we would document how to
I'll sign up for all of the work here, I have time set aside for it anyway

On Oct 12, 2016, at 8:36 PM, Prashanth B notifications@github.com wrote: Is the 1.5 MVP to delete code that force-deletes pods in node and other —
SGTM, implementation should be straightforward. More importantly, doing this will enable us to test some basic split brain. Running this by @erictune (do you still think we need a "Lost" state?)
Automatic merge from submit-queue

Add --force to kubectl delete and explain force deletion

--force is required for --grace-period=0. --now is == --grace-period=1. Improve command help to explain what graceful deletion is and warn about force deletion.

Part of #34160 & #29033

```release-note
In order to bypass graceful deletion of pods (to immediately remove the pod from the API) the user must now provide the `--force` flag in addition to `--grace-period=0`. This prevents users from accidentally force deleting pods without being aware of the consequences of force deletion. Force deleting pods for resources like StatefulSets can result in multiple pods with the same name having running processes in the cluster, which may lead to data corruption or data inconsistency when using shared storage or common API endpoints.
```
Automatic merge from submit-queue

Describe graceful deletionTimestamp more accurately

Spawned from #34160
If there are no further comments I'd like to merge this and discuss finalizers in a follow-up to this issue. No one has articulated a blocker on this yet. This is still advisory for storage - I think we'll have to expand this with a storage related topic.
Hrm, I need to fix the pet set references and make a few wording changes.
* Give the Kubelet sole responsibility for normal deletion of pods -
only the Kubelet in the course of normal operation should ever remove a
pod from etcd (only the Kubelet should force delete)
* The kubelet must not delete the pod until all processes are confirmed
do you want volume clean-up to happen prior to deletion from the apiserver as well?
It depends - certainly having attach/detach controller able to observe that on pod deletion helps (once the pod is gone, detach controller has to start trying to detach anyway, so might as well make that consistent).
I'd prefer cleaning up of all resources belonging to a pod before the corresponding object gets deleted.
@dashpole ^^
Ack
to communicate those implications.
* Force deleting a pod may cause data loss (two instances of the same
pod process may be running at the same time)
* All existing controllers in the system must be limited to signaling pod
I am wondering if we can protect this in code somehow. For example, separate `ForceDelete` from `Delete`, so that `Delete` does not let you do a force delete and we can minimize reviewer burden in the future.
It's not terribly difficult to audit today in our code base, but I agree that some level of protection / review is important.
no longer exist if we treat node deletion as confirming permanent
partition. If we do not, the pod GC controller must not force delete
pods.
* It must be possible for an administrator to effectively resolve partitions
I wonder if we should bump this up to a condition on the namespace, or if an event would be sufficient.
observe partitions, we propose an additional responsibility to the node controller
or any future controller that attempts to detect partition. The node controller should
add an additional condition to pods that have been terminated due to a node failing
to heartbeat that indicates that the cause of the deletion was node partition.
+1 -- this is useful to ensure namespaces stuck terminating are for this reason.
container image's STOPSIGNAL on Docker)
* Waits 2 seconds, or the remaining grace period, whichever is longer
* Sends the force termination signal to the container runtime (SIGKILL)
* Once the kubelet observes the container is fully terminated, it issues
nit: "deleted" here refers to containers reaching a terminal state. The pod is still not technically "deleted"
period, but never a longer one.

Deleting a pod with grace period 0 is called **force deletion** and will
update the pod with a `deletionGracePeriodSeconds` of 0, and then immediately
Would it make sense to provide a synchronous delete option? That way, grace period can be separated from actual deletion of the object. In most cases, a user might want to gracefully delete and clean up a pod right away by specifying a grace period of `0`.
Is the use case that you want to "wait" for the deletion to finish? Right at the end of 1.5 we made the change to have kubelet wait for deletion, and there is a general issue tracking "wait for action" (which could include delete) to make clients better at scripting. I would expect those to cover the client side aspects of waiting for deletion within the bounds of our API design philosophy.
As of now, setting the grace period to `0` implicitly deletes the pod object from etcd. Instead of having an implicit API, why not make inline deletes explicit? A user can request a grace period of `0` and still have the pod object be deleted by the kubelet. If a user wants to delete a pod object right away from etcd, they can then set an explicit parameter to achieve that.
Would not be backwards compatible, so at best would be something we do in a v2.
A persistent volume that references a strongly consistent storage backend
like AWS EBS, GCE PD, OpenStack Cinder, or Ceph RBD can rely on the storage
API to prevent corruption of the data due to simultaneous access by multiple
clients. However, many commonly deployed storage technologies in the
Question: Are these storage technologies not suitable for clustered deployments then?
They are broadly deployed technologies that benefit from centralized control (because they offer no centralized guarantees themselves). Most production workloads in the world today run on iSCSI, FibreChannel, or NFS devices in clustered forms - but everyone builds their own solutions to handle clustering (Pacemaker, etc). Building the affordances in kube to allow them to be safely used is important, even if Kube itself may not contain the functionality to make them safe.
* All existing controllers in the system must be limited to signaling pod
termination (starting graceful deletion), and are not allowed to force
delete a pod.
* The node controller will no longer be allowed to force delete pods -
What will happen when the nodes go offline?
would be able to leverage a number of systems including but not limited to:

* Cloud control plane APIs such as machine force shutdown
* Additional agents running on each host to force kill process or trigger reboots
why is this necessary? If kubelet is robust enough, this should not be necessary, and we do want kubelet to be robust in general
Example would be network fencing. Kubelet may wedge (we cannot prevent wedges), but an independent process might be fine (like kube-proxy). That independent process could sever the network connection to storage.
I don't think this is required, but it's an example. There are lots of sophisticated fencing tools today - I'm moderately biased to leaving the door open for their use even if it's a "problem left to someone else to solve".
As long as it is not part of the default setup, I have no issues.
I would expect this to be at the level of cloud provider controller or higher, but higher level fencing would probably depend on your setup and not be a "turn on by default" for metal.
* Network routers, backplane switches, software defined networks, or system firewalls
* Storage server APIs to block client access

to appropriately limit the ability of the partitioned system to impact the cluster.
This seems to be a lot of work to get poorly designed software to work in clusters. What about the cost of maintaining such a complex controller? It seems to have a lot of extension points?
It should be possible to build this on top of Kube. Kube can't be a generalized cluster management tool if it prohibits affordances that deal with real world software. This proposal is less about saying Kube should do that work, and more about laying the groundwork for how someone could create it.
Concrete example - bare metal will need an equivalent of node controller. That bare metal node controller will be specialized to different environments, or might just offer simple core tools. The name for this controller today is "human operators". I think it should be a long term goal for the kube ecosystem to allow bare metal to be automated, and the system of tools that allow bare metal to be managed and maintained should be orthogonal to Kube but work well with it.
My concern is with third party applications making incorrect assumptions about kube node software. I hope these extensions are managing other infrastructure only.
So I have familiarity with some forms of fencers, but not all. In general, it's just working up the chain of increasingly large hammers until you hear the breaking glass and can confirm that it's broken. I.e. rack IPMI telling you that it powered down a node (assuming you trust the IPMI controller) or the TOR switch acknowledging it is now dropping all packets to and from a particular MAC / port / etc. I would agree that these hammers require a good understanding of their weaknesses before using them.
docs/proposals/pod-safety.md, line 187 at r3 (raw file): Previously, dashpole (David Ashpole) wrote…
The kubelet must not delete a pod until all resources assigned to it are cleaned up. This includes processes, containers, volumes, and network (and probably more).
Adding label:do-not-merge because PR changes docs prohibited to auto merge |
Moving to kubernetes/community#124, I will close this once I respond and update that PR with the latest comments from vish and ashpole
Agree that the kubelet should clean up the resources of pods before cleaning up the pod from the API. Will enshrine that in the proposal.
If the kubelet crashes during the termination process, it will restart the
termination process from the beginning (grace period is reset). This ensures
that a process is always given **at least** grace period to terminate cleanly.
"at least" may turn out to be problematic.
In other contexts cluster managers usually prefer "at most" as this allows them to continue to make reasonably accurate assumptions even if whole datacenters become inaccessible from one another (in which case the blessed side might reasonably want to assume that the non-blessed side will have killed the relevant pods after some interval).
Deciding which side gets blessed and how is a tangential conversation, for now it's probably only important whether this is a relevant usecase to consider.
The fencer would be the one who could provide "at most" guarantees. At the core, we can't get around consensus requirements, so an administrator who wanted to provide "at most" semantics would fence the other data center. In the current system, someone who wants to make an "at most" decision even without strong consensus (observed termination) could still force delete the pods.
Remember though, there will be two fencers in this scenario and they can't talk to each other across the network split.
One in the non-blessed portion of the cluster that will ask for existing pods to go away, and one in the blessed portion that will report that it is now safe to start the replacement pods.
Sure the non-blessed fencer could do a force delete, but I thought part of the point of this proposal was to avoid those and IIUC it still leaves the pods running for some period of time[1] and it appears to conflict with the "at least" design/statement from the PR.
Nothing that should hold up the PR though. Just something to consider.
[1] Is there a maximum for this interval?
Not sure what you mean by blessed and non-blessed part (terminology mismatch). The API server is providing serializable guarantees, so there is only progress or not-progress w.r.t. recorded changes. The node controller will not do anything if split (it can't make changes), but even then it simply requests termination. The kubelet will observe the request and try to reconcile it, but it will not handle behavior beyond that.
The fencer is allowed to force delete if it can:
- Reach the master (not partitioned)
- Guarantee total isolation of the process from the rest of the cluster (storage, network, quantum interference with other processes, whatever)
We definitely will have to consider sets of fencing (storage, network, and then power management), so 2 above is more nuanced than just "force delete" - it might be updating the PV record (as described elsewhere in this doc) as well as force deleting the pod.
If the fencer is wrong, or unsure, the fencer can't force delete the pod.
@smarterclayton LGTM.
Marking keep-open until I apply the last comments to the community PR to clarify fencing scope. |
All comments applied to the community doc, please give the second commit a once over and then we'll merge (since this was landed in 1.5). |
This is a continuation of pod graceful deletion and completes the safety guarantees begun in that change (#1535).