
Proposal - Pod safety and termination guarantees #34160

Closed
wants to merge 3 commits

Conversation

Contributor

@smarterclayton smarterclayton commented Oct 6, 2016

This proposal describes how to evolve the Kubernetes cluster to provide
at-most-one semantics for pod execution and to allow selective
relaxation where necessary to heal partitions.

This is a continuation of pod graceful deletion and completes the safety guarantees begun in that change (#1535)


@smarterclayton
Contributor Author

Linked a number of issues that relate to pet safety, pod termination guarantees, and storage safety.

@k8s-github-robot k8s-github-robot added kind/design Categorizes issue or PR as related to design. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Oct 6, 2016
Contributor

gmarek commented Oct 6, 2016

cc @wojtek-t @gmarek


adohe-zz commented Oct 6, 2016

/cc @adohe

entity will be spawned) but may become unavailable (cluster no longer has
a sufficient number of members). The Pet Set guarante must be strong enough
for an administrator to reason about the state of the cluster by observing
the Kubrenetes API.
Contributor

Typo: Kubernetes


Clients today may assume that force deletions are safe. We must appropriately
audit clients to identify this behavior and improve the messages. For instance,
`kubectl delete --grace-period=0` could print a warning and require `--confirm`:
Member

If the pod has pending finalizers, --grace-period=0 will not delete the pod from the key-value store. Rather than overloading the grace-period flag, can we define a new flag, like --forceful-deletion, which will instruct the API server to also ignore pending finalizers?

Contributor Author

I don't think of force deletion as bypassing finalizers (in the context the proposal is written in). This would be delete 0 as it is today. If that bypasses finalizers, then we may want to clarify what exactly we want the role of finalizers to be w.r.t. deletion here so it's clear.

@smarterclayton
Contributor Author

Gentle nag for more review from those this impacts - this is a P1 1.5 item for PetSets.

@smarterclayton
Contributor Author

@erictune

Contributor
@marun marun left a comment

spelling/grammar nits

* If no grace period is provided, the default from the pod is leveraged
* When the kubelet observes the deletion, it starts a timer equal to the
grace period and performs the following actions:
* Executes the pre-stop hook, if specified, waiting up **grace period**
Contributor

up -> up to

grace period and performs the following actions:
* Executes the pre-stop hook, if specified, waiting up **grace period**
before continuing
* Sends the termination signal to the container runtime (SIGTERM)
Contributor
@marun marun Oct 11, 2016

Does k8s support stopping a container via the STOPSIGNAL provided in a dockerfile?

Contributor Author

Yes, we invoke `docker stop`, which honors the STOPSIGNAL on the image.

* Executes the pre-stop hook, if specified, waiting up **grace period**
before continuing
* Sends the termination signal to the container runtime (SIGTERM)
* Waits 2 seconds, or the remaining grace period, which ever is longer
Contributor

which ever -> whichever
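
As context for the sequence being reviewed here, the grace period and the pre-stop hook are both expressed on the pod spec; a minimal sketch (names and values are illustrative, not taken from the proposal):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-0                          # illustrative name
spec:
  terminationGracePeriodSeconds: 30    # default grace period used when the pod is deleted
  containers:
  - name: app
    image: nginx                       # illustrative image
    lifecycle:
      preStop:
        exec:
          # pre-stop hook; the kubelet runs this before sending SIGTERM and bounds it
          # by the remaining grace period
          command: ["/bin/sh", "-c", "sleep 5"]
```

After the hook completes (or times out), the kubelet signals the runtime (SIGTERM, or the image's STOPSIGNAL on Docker) and only escalates to SIGKILL once the grace period has expired.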

to run for an arbitary amount of time. If a higher level component like the
PetSet controller treats the existence of the pod API object as a strongly
consistent entity, deleting the pod in this fashion will violate the
at-most-one guarantees we wish to offer for pet sets.
Contributor

guarantees -> guarantee ?

ReplicaSets and ReplicationControllers both attempt to preserve availability
of their constituent pods over ensuring **at most one** semantics. So a
replica set to scale 1 will immediately create a new pod when it observes an
old pod is delete, and as a result at many points in the lifetime of a replica
Contributor
@marun marun Oct 11, 2016

is delete -> has been deleted

is used by two pods on different nodes simultaneously, concurrent access may
result in corruption, even if the PV or PVC is identified as "read write one".
PVC consumers must ensure these volume types are *never* referenced from
mulitple pods without some external synchronization. As described above, it
Contributor

mulitple -> multiple


### Avoid multiple instances of pods

To ensure that the Pet Set controller can safely use pods and ensure at most
Contributor

ensure -> ensure that

* Application owners must be free to force delete pods, but they *must*
understand the implications of doing so, and all client UI must be able
to communicate those implications.
* All existing controllers in the system must be limited signaling pod
Contributor

limited -> limited to

* Additional agents running on each host to force kill process or trigger reboots
* Agents integrated with or communicating with hypervisors running hosts to stop VMs
* Hardware IPMI interfaces to reboot a host
* Rack level power units to power cycle a blad
Contributor

blad -> blade

5. The kubelet on node `A` observes the pod references a PVC that specifies RWO which
requires "attach" to be successful
6. The attach/detach controller observes that a pod has been bound with a PVC that
requires "attach", and attempts to execute a CAS update on the PVC/PV attaching
Contributor

Curious, what is 'CAS'?
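
"CAS" here is compare-and-swap: in API terms, an update submitted with the metadata.resourceVersion the controller last read, so the API server rejects the write with a conflict if anyone else modified the object in between. A rough sketch of a PV carrying such a claim (the annotation name is hypothetical and not part of this proposal; the attach marker could equally be a first-class field or label):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-0001
  resourceVersion: "184532"            # value observed before the update; a stale value
                                       # causes the API server to reject the write (409)
  annotations:
    example.com/attached-to: node-a    # hypothetical marker recording the exclusive attach
spec:
  capacity:
    storage: 10Gi
  accessModes: ["ReadWriteOnce"]
  gcePersistentDisk:
    pdName: my-disk                    # illustrative backing disk
```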

to communicate those implications.
* All existing controllers in the system must be limited signaling pod
termination (starting graceful deletion), and are not allowed to force
delete a pod.
Contributor

Why not just leave this up to forgiveness? Pods get default forgiveness that does what we do today (tolerate partition for 5m). All controllers are modified to respect forgiveness, users can modify it, and the PetSet controller will modify it to favor consistency by default.

Contributor Author

Force deletion is what causes split brain. The only guarantee that can ensure the process has been terminated (and thus is safe to launch another) is the kubelet or a fencer. So only the kubelet, a fencer, or a human should ever force delete. The petset controller shouldn't have to apply any consistency guarantee.

Contributor Author

Forgiveness is a delay before delete. It should have no coupling to force deletion.

Member
@davidopp davidopp Oct 12, 2016

Yeah, and forgiveness is a delay before a delete that is triggered by placement of a taint so there are plenty of deletion scenarios that forgiveness wouldn't help with. (It would help with node unreachable (Unknown node condition), though.)

Contributor

The fencer can't always assume my stateless nginx pods will suffer from split brain, right? Maybe we don't need forgiveness, but assuming all pods need this safety guarantee seems weird. Ideally I'd like the fencer deletion to be configurable.

If delay == forever, then the only way to override is to --force. The taint in this case is something like node down. If a user deletes (pod or ns), the pod gets deletion grace and the kubelet has final say, as always. If a user --force deletes, it's on them.

Contributor Author

I think creating a ReplicaSet has no expectation of split brain prevention (today, it doesn't try to guarantee that). Could we say:

  1. ReplicaSets have no expectation of identity, and therefore replica sets can be force deleted after the grace period expires (or maybe they can opt in)
  2. PetSets always have an expectation of identity and uniqueness, and therefore never want to be force deleted except by a safety mechanism.

I think fencing typically is correlated with short forgiveness / toleration of downtime. I.e. I have a maximum disruption budget of 2 min of downtime. The node controller detection of 40s is pretty short, but if we had other fencing detection mechanisms in the future (there's a lot of peer-peer heartbeats out there, could be service proxy or distributed downtime detection) that are a lot shorter, admins may want to fence first, ask questions later. I.e. disruption budget is 2min, forgiveness is 30s, fast failure detector can detect link disruption in 2s and vote, then the fencing controller could make a fast decision. If forgiveness is 10 minutes, and disruption budget is 5 minutes, then fast fencing is not that critical.

I agree not all pods need the safety guarantee - but on the other hand, force deletion is already unnecessary for replica sets to maintain availability.
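
For readers mapping "forgiveness" onto the API that eventually shipped: it surfaces as taint-based eviction with a bounded toleration. A rough sketch of a pod tolerating an unreachable node for a fixed window (the taint key shown is the current one; older releases used an alpha-prefixed key, and the duration is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mysql-0                           # illustrative pet name
spec:
  containers:
  - name: mysql
    image: mysql                          # illustrative image
  tolerations:
  - key: node.kubernetes.io/unreachable   # taint the node controller applies on partition
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 30                 # start graceful deletion after 30s of partition;
                                          # omit to tolerate the partition indefinitely
```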

Contributor

I think we're agreeing (I forgot about deletion timestamp). You're suggesting that we use a combination of forgiveness + disruptionBudget to communicate to the fencer when to delete, right?

Contributor Author

Yes, will add a paragraph in fencing for this.


The changes above allow Pet Sets to ensure at-most-one pod, but provide no
recourse for the automatic resolution of cluster partitions during normal
operation. For that, we propose a **fencing controller** which exists above
Contributor

I think there's a gap between the previous section (which breaks current AP replica-set behavior by preventing node controller deletes) and this (which only fences). What deletes pods in this world? Are we saying it's always a human in the partition case?

Contributor Author

A cloud provider can properly observe dead nodes and delete them, so we can preserve that aspect. Partition only prevents pods from being removed from etcd. I think the default behavior has to be to never induce split brain except under stronger guarantees than "I waited".

It's possible that we could extend forgiveness to be an upper bound of tolerance for failure, and so it could be an opt-in way to induce a split brain for pet sets. But that's not safe without either unique fenced storage or code inside the petset that can tolerate membership changes by removing the items within a certain clock window.

Contributor Author

Is there a forgiveness node variant we can create that allows both "preserve without deleting" and "preserve without force deleting"?

Member

Forgiveness hasn't been fully implemented yet but I was assuming that once the forgiveness period expires, it would delete the same way node controller deletes today when there's a NotReady (bad node/kubelet) or Unknown (node unreachable), i.e. the code here

func deletePods(kubeClient clientset.Interface, recorder record.EventRecorder, nodeName, nodeUID string, daemonStore cache.StoreToDaemonSetLister) (bool, error) {

which IIUC does a regular delete, not a force delete? (Where is the client Delete() operation that it's using defined?)

Contributor

Isn't there going to be an event: *, duration: infinite flavor for forgiveness?

Contributor Author

@davidopp

func forcefullyDeleteNode(kubeClient clientset.Interface, nodeName string, forcefulDeletePodFunc func(*api.Pod) error) error {
is invoked after the timeout period. The proposal calls for the forceful delete to be removed.

Member

Once we start using taints/tolerations/forgiveness to control eviction due to an unreachable/not-ready node (unfortunately it was decided this will not be enabled by default in 1.5, though it will be implemented), you will be able to say "stay bound to the node no matter what happens". I think forcefullyDeleteNode(), which Clayton pointed to, may still be OK since it only activates for a node that has disappeared from the cloud provider, and IIRC you said it's fine for nodes the cloud provider knows are gone to be deleted? There may be other places in the NC where we force delete, and presumably we would have to replace those with graceful delete?

Contributor Author

Yes

update on the PVC/PV clearing its attach state.
14. The attach/detach controller observes the second pod has been scheduled and
attaches it to node `B` and pod 2
15. The kubelet on node `B` observes the attach and allows the pod to execute.
Contributor

I think we already have this flow working on cloudprovider backed storage, we just need to:

  1. teach all storage plugins (even those that are not "attachable") to honor it
  2. communicate between scheduler and volume controller about where exactly to attach (that's the first half of [WIP] Sticky emptydir proposal #30044)

The existing flow uses 2 fields in node.Status:

  • volumesInUse: To block volumecontroller from detaching a mounted volume
  • volumesAttached: To block another node from attaching a pre-attached volume

I can think of 2 things we might need to fix off the top of my head:

  1. I think there's a gap with the storage plugin on the node, where some allow 2 pods to bind-mount the same volume RWO. I think GCE allows this currently.
  2. Today, some volume controllers time out and force detach after a "long" time (2h). I think this needs fixing, especially if we're going to use them as described.

@kubernetes/sig-storage
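
For reference, the two node.status fields mentioned above look roughly like this (volume name and device path are illustrative):

```yaml
status:
  volumesInUse:
  - kubernetes.io/gce-pd/my-disk                 # reported by the kubelet while the volume is
                                                 # still mounted; blocks the controller's detach
  volumesAttached:
  - name: kubernetes.io/gce-pd/my-disk
    devicePath: /dev/disk/by-id/google-my-disk   # recorded by the controller after attach
```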

Contributor Author

Node status isn't transactional with the PV, so it doesn't prevent races when claiming.

Contributor

The scheduler will be the single writer entering the name of a node into a field on the pv (or applying labels), and the kubelet will still reject anything with a name that's not its hostname. Once assigned to a node, the pv/pvc can't be reassigned to any other. After that normal attach/detach logic kicks in.
Or am I misunderstanding?

Contributor

I think the scheduler should always make the assignment, not the volume controller. Given that the scheduler needs to write the hostname somewhere, the volume controller just needs to trigger its normal attach flow on that. Maybe you meant an active/passive volume controller setup?

Contributor

@bprashanth, actually the force detach timeout period is 6 minutes now. And I am thinking of making it longer, or even removing it, to avoid the possibility of detaching a mounted volume, since that would cause file system corruption.

From my understanding, the attach_detach_controller will update the desired state of the world when a node is added/deleted or a pod is added/deleted/updated. So the following sequence might happen:

  1. PodA refers VolumeX which is attached to NodeA
  2. PodB refers VolumeX which is assigned to NodeB, reconciler will try to attach VolumeX to NodeB (this could succeed if RWX is supported for the volumes)
  3. PodA is deleted and VolumeX is detached from NodeA.

So it seems like you suggest a way to prevent step 2 from happening using an extra field in the PV? What if the pod is referencing the volume directly? To avoid step 2, I think it is possible to change the workflow in the reconciler to check whether the volume access mode is RWO and the volume is already attached to a node according to the actual state records; if so, it should not trigger the attach operation.

Contributor Author

I would argue that all the attach logic today observes scheduler decisions, and that except for maybe the storage allocation for local persistent volumes, it's more natural for attach/detach to continue to observe scheduling and guarantee locks. It seems to match the description of what the attach_detach_controller owns (binding PVs to nodes transactionally by making API calls to either cloud or kube).

Member

The fact that a PV is attached to a node isn't clearly denoted in a way that is helpful to this idea, I think. There's some real work needed to make sure this is safe. Given how tricky the existing binding and attach/detach code is...

Contributor

Yeah, that's the same problem we need to solve for sticky emptyDir. We discussed using a field or labels on that proposal.


## Backwards compatibility

On an upgrade, pet sets would not be "safe" until the above behavior is implemented.
Contributor

We'd just use (i.e., document) finalizers as poor man's forgiveness, right?

Contributor Author

I'm not even sure pet set controller needs a finalizer. I think no one should force delete, and at that point the finalizer is unnecessary.

Contributor

I meant if you want to preserve safety across some rollback that doesn't have what this proposal details (but not so far back that it also doesn't have finalizers), you can put in a finalizer for each pet to block deletion by the old node controller.

Contributor Author

Ah. Do you think we'd need it?
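
For context on the mechanism being discussed: a finalizer is just an entry in metadata.finalizers; once deletionTimestamp is set, the object stays in the API until every finalizer is removed. A sketch of what a per-pet finalizer could look like (the finalizer name is hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mysql-0                       # illustrative pet name
  finalizers:
  - example.com/pet-protection        # hypothetical; the controller would remove it only
                                      # after confirming the pet is safe to replace
spec:
  containers:
  - name: mysql
    image: mysql                      # illustrative image
```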

termination (starting graceful deletion), and are not allowed to force
delete a pod.
* The node controller will no longer be allowed to force delete pods -
it may only signal deletion
Contributor
@foxish foxish Oct 12, 2016

Aren't force deletions from the node controller still needed for the guarantee that RS/RC provide, that at least N replicas are running even in the face of node partitions?

Oh I see, that responsibility would move to the pod gc controller?

Contributor Author

RS/RC immediately spin up new pods as soon as the node controller signals termination on the pod (by starting graceful delete). They exclude any terminating pods from their count.

@Kargakis raised today that that behavior actually violates the implicit guarantee of the recreate deployment strategy - that there is a period where there are zero pods running of the old version. I think we need to resolve that independently and he was filing an issue.

If we treat node deletion as implicit unlock (it probably is) then both pod GC and node controller would be allowed to force delete pods. However, I'm not 100% sure since today there are cases where the node would just recreate its node object, and people may not be expecting that delete node means "hard node evacuate".

terminated.
* Application owners must be free to force delete pods, but they *must*
understand the implications of doing so, and all client UI must be able
to communicate those implications.
Contributor

These 2 are already true today, right? (assuming "normal" operation is not a netsplit)

Contributor Author

Yes

Contributor Author

I think the "implications" are not explained.


@bprashanth
Contributor

Is the 1.5 MVP to delete the code that force-deletes pods in the node and other controllers (as described: #31762 (comment))? Out of the 3 sections, the first is the one that solves the core issue. Both the fencer and the RWO lock border on new features.

@smarterclayton
Contributor Author

Yes, and to add --force for kubectl delete (0 will not be allowed by default or made to be 1 under the covers).

Fencer and RWO can be solved in a future release; we would document how to be safe and what admins have to do to be safe.

I'll sign up for all of the work here, I have time set aside for it anyway (or I can review if someone is already assigned).


@bprashanth
Contributor

SGTM, implementation should be straightforward. More importantly, doing this will enable us to test some basic split brain. Running this by @erictune (do you still think we need a "Lost" state?)

@smarterclayton smarterclayton added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels Nov 3, 2016
k8s-github-robot pushed a commit that referenced this pull request Nov 5, 2016
Automatic merge from submit-queue

Add --force to kubectl delete and explain force deletion

--force is required for --grace-period=0. --now is == --grace-period=1.
Improve command help to explain what graceful deletion is and warn about
force deletion.

Part of #34160 & #29033

```release-note
In order to bypass graceful deletion of pods (to immediately remove the pod from the API) the user must now provide the `--force` flag in addition to `--grace-period=0`.  This prevents users from accidentally force deleting pods without being aware of the consequences of force deletion.  Force deleting pods for resources like StatefulSets can result in multiple pods with the same name having running processes in the cluster, which may lead to data corruption or data inconsistency when using shared storage or common API endpoints.
```
k8s-github-robot pushed a commit that referenced this pull request Nov 6, 2016
Automatic merge from submit-queue

Describe graceful deletionTimestamp more accurately

Spawned from #34160
@smarterclayton
Contributor Author

If there are no further comments I'd like to merge this and discuss finalizers in a follow-up to this issue. No one has articulated a blocker on this yet. This is still advisory for storage - I think we'll have to expand this with a storage-related topic.

@smarterclayton
Contributor Author

Hrm, I need to fix the pet set references and make a few wording changes.

* Give the Kubelet sole responsibility for normal deletion of pods -
only the Kubelet in the course of normal operation should ever remove a
pod from etcd (only the Kubelet should force delete)
* The kubelet must not delete the pod until all processes are confirmed
Member

Do you want volume clean-up to happen prior to deletion from the apiserver as well?

Contributor Author

It depends - certainly having attach/detach controller able to observe that on pod deletion helps (once the pod is gone, detach controller has to start trying to detach anyway, so might as well make that consistent).

Contributor

I'd prefer cleaning up of all resources belonging to a pod before the corresponding object gets deleted.

Contributor

Ack

to communicate those implications.
* Force deleting a pod may cause data loss (two instances of the same
pod process may be running at the same time)
* All existing controllers in the system must be limited to signaling pod
Member

I am wondering if we can protect this in code somehow. For example, separate ForceDelete from Delete, so Delete does not let you do a force delete; that way we can minimize reviewer burden in the future.

Contributor Author

It's not terribly difficult to audit today in our code base, but I agree that some level of protection / review is important.

no longer exist if we treat node deletion as confirming permanent
partition. If we do not, the pod GC controller must not force delete
pods.
* It must be possible for an administrator to effectively resolve partitions
Member

I wonder if we should bump this up to a condition on the namespace or if an event would be sufficient.

observe partitions, we propose an additional responsibility to the node controller
or any future controller that attempts to detect partition. The node controller should
add an additional condition to pods that have been terminated due to a node failing
to heartbeat that indicates that the cause of the deletion was node partition.
Member

+1 -- this is useful to ensure that namespaces stuck terminating are stuck for this reason.
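
A sketch of what such a condition might look like on the terminated pod; the type and reason names are hypothetical, since the proposal does not pin them down:

```yaml
status:
  conditions:
  - type: NodeUnreachable             # hypothetical condition type
    status: "True"
    reason: NodePartitioned           # hypothetical reason: the node stopped heartbeating
    message: Pod terminated because the node failed to heartbeat
    lastTransitionTime: "2016-10-12T20:36:00Z"   # illustrative timestamp
```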

container image's STOPSIGNAL on Docker)
* Waits 2 seconds, or the remaining grace period, whichever is longer
* Sends the force termination signal to the container runtime (SIGKILL)
* Once the kubelet observes the container is fully terminated, it issues
Contributor

nit: "deleted" here refers to containers reaching a terminal state. The pod is still not technically "deleted"

period, but never a longer one.

Deleting a pod with grace period 0 is called **force deletion** and will
update the pod with a `deletionGracePeriodSeconds` of 0, and then immediately
Contributor

Would it make sense to provide a synchronous delete option? That way, grace period can be separated from the actual deletion of the object. In most cases, a user might want to gracefully delete and clean up a pod right away by specifying a grace period of 0.

Contributor Author

Is the use case that you want to "wait" for the deletion to finish? Right at the end of 1.5 we made the change to have kubelet wait for deletion, and there is a general issue tracking "wait for action" (which could include delete) to make clients better at scripting. I would expect those to cover the client side aspects of waiting for deletion within the bounds of our API design philosophy.

Contributor

As of now, setting the grace period to 0 implicitly deletes the pod object from etcd. Instead of having an implicit API, why not make inline deletes explicit? A user can request a grace period of 0 and still have the pod object be deleted by the kubelet. If a user wants to delete a pod object right away from etcd, they can then set an explicit parameter to achieve that.

Contributor Author

That would not be backwards compatible, so at best it would be something we do in a v2.
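
For context, a graceful delete leaves the pod visible in the API with the two fields below set until the kubelet (or a force delete) removes the object; the values are illustrative:

```yaml
metadata:
  name: web-0
  deletionTimestamp: "2016-10-12T20:36:00Z"    # set when graceful deletion is requested
  deletionGracePeriodSeconds: 30               # force deletion sets this to 0 and removes
                                               # the object from etcd immediately
```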

A persistent volume that references a strongly consistent storage backend
like AWS EBS, GCE PD, OpenStack Cinder, or Ceph RBD can rely on the storage
API to prevent corruption of the data due to simultaneous access by multiple
clients. However, many commonly deployed storage technologies in the
Contributor

Question: Are these storage technologies not suitable for clustered deployments then?

Contributor Author

They are broadly deployed technologies that benefit from centralized control (because they offer no centralized guarantees themselves). Most production workloads in the world today run on iSCSI, FibreChannel, or NFS devices in clustered forms - but everyone builds their own solutions to handle clustering (Pacemaker, etc). Building the affordances in kube to allow them to be safely used is important, even if Kube itself may not contain the functionality to make them safe.

* All existing controllers in the system must be limited to signaling pod
termination (starting graceful deletion), and are not allowed to force
delete a pod.
* The node controller will no longer be allowed to force delete pods -
Contributor

What will happen when the nodes go offline?

would be able to leverage a number of systems including but not limited to:

* Cloud control plane APIs such as machine force shutdown
* Additional agents running on each host to force kill process or trigger reboots
Contributor

Why is this necessary? If the kubelet is robust enough, this should not be necessary, and we do want the kubelet to be robust in general.

Contributor Author

Example would be network fencing. Kubelet may wedge (we cannot prevent wedges), but an independent process might be fine (like kube-proxy). That independent process could sever the network connection to storage.

I don't think this is required, but it's an example. There are lots of sophisticated fencing tools today - I'm moderately biased to leaving the door open for their use even if it's a "problem left to someone else to solve".

Contributor

As long as it is not part of the default setup, I have no issues.

Contributor Author

I would expect this to be at the level of cloud provider controller or higher, but higher level fencing would probably depend on your setup and not be a "turn on by default" for metal.

* Network routers, backplane switches, software defined networks, or system firewalls
* Storage server APIs to block client access

to appropriately limit the ability of the partitioned system to impact the cluster.
Contributor

This seems to be a lot of work to get poorly designed software to work in clusters. What about the cost of maintaining such a complex controller? It seems to have a lot of extension points?

Contributor Author

It should be possible to build this on top of Kube. Kube can't be a generalized cluster management tool if it prohibits affordances that deal with real world software. This proposal is less about saying Kube should do that work, and more about laying the groundwork for how someone could create it.

Concrete example - bare metal will need an equivalent of node controller. That bare metal node controller will be specialized to different environments, or might just offer simple core tools. The name for this controller today is "human operators". I think it should be a long term goal for the kube ecosystem to allow bare metal to be automated, and the system of tools that allow bare metal to be managed and maintained should be orthogonal to Kube but work well with it.

Contributor

My concern is with third party applications making incorrect assumptions about kube node software. I hope these extensions are managing other infrastructure only.

Contributor Author

So I have familiarity with some forms of fencers, but not all. In general, it's just working up the chain of increasingly large hammers until you hear the breaking glass and can confirm that it's broken. I.e. rack IPMI telling you that it powered down a node (assuming you trust the IPMI controller) or the TOR switch acknowledging it is now dropping all packets to and from a particular MAC / port / etc. I would agree that these hammers require a good understanding of their weaknesses before using them.

@dashpole
Contributor

docs/proposals/pod-safety.md, line 187 at r3 (raw file):

Previously, dashpole (David Ashpole) wrote…

Ack

The kubelet must not delete a pod until all resources assigned to it are cleaned up. This includes processes, containers, volumes, and network (and probably more).



@k8s-github-robot

Adding label:do-not-merge because PR changes docs prohibited to auto merge
See http://kubernetes.io/editdocs/ for information about editing docs

@k8s-github-robot k8s-github-robot added kind/design Categorizes issue or PR as related to design. kind/old-docs do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. labels Dec 1, 2016
@smarterclayton
Contributor Author

Moving to kubernetes/community#124, I will close this once I respond and update that PR with the latest comments from vish and ashpole

@smarterclayton
Contributor Author

Agree that kubelet should cleanup resources of pods before cleaning up the pod from the API. Will enshrine that in the proposal.


If the kubelet crashes during the termination process, it will restart the
termination process from the beginning (grace period is reset). This ensures
that a process is always given **at least** grace period to terminate cleanly.

"at least" may turn out to be problematic.

In other contexts cluster managers usually prefer "at most" as this allows them to continue to make reasonably accurate assumptions even if whole datacenters become inaccessible from one another (in which case the blessed side might reasonably want to assume that the non-blessed side will have killed the relevant pods after some interval).

Deciding which side gets blessed, and how, is a tangential conversation; for now it's probably only important whether this is a relevant use case to consider.

Contributor Author

The fencer would be the one who could provide "at most" guarantees. At the core, we can't get around consensus requirements, so an administrator who wanted to provide "at most" semantics would fence the other data center. In the current system, someone who wants to make an "at most" decision even without strong consensus (observed termination) could still force delete the pods.


Remember though, there will be two fencers in this scenario and they can't talk to each other across the network split.

One in the non-blessed portion of the cluster that will ask for existing pods to go away, and one in the blessed portion that will report that it is now safe to start the replacement pods.

Sure, the non-blessed fencer could do a force delete, but I thought part of the point of this proposal was to avoid those, and IIUC it still leaves the pods running for some period of time[1], which appears to conflict with the "at least" design/statement from the PR.

Nothing that should hold up the PR though. Just something to consider.

[1] Is there a maximum for this interval?

Contributor Author

Not sure what you mean by the blessed and non-blessed parts (terminology mismatch). The API server is providing serializable guarantees, so there is only progress or not-progress w.r.t. recorded changes. The node controller will not do anything if split (it can't make changes), but even then it simply requests termination. The kubelet will observe the request and try to reconcile it; if it cannot, it will not change behavior.

The fencer is allowed to force delete if it can:

  1. Reach the master (not partitioned)
  2. Guarantee total isolation of the process from the rest of the cluster (storage, network, quantum interference with other processes, whatever)

We definitely will have to consider sets of fencing (storage, network, and then power management), so 2 above is more nuanced than just "force delete" - it might be updating the PV record (as described elsewhere in this doc) as well as force deleting the pod.

If the fencer is wrong, or unsure, the fencer can't force delete the pod.

Contributor

vishh commented Dec 5, 2016

@smarterclayton LGTM.

@smarterclayton
Contributor Author

Marking keep-open until I apply the last comments to the community PR to clarify fencing scope.

@k8s-github-robot

[APPROVALNOTIFIER] Needs approval from an approver in each of these OWNERS Files:

We suggest the following people:
cc @brendandburns
You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 30, 2017
@smarterclayton
Contributor Author

All comments applied to the community doc; please give the second commit a once-over and then we'll merge (since this was landed in 1.5).

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
area/stateful-apps
do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed.
kind/design Categorizes issue or PR as related to design.
release-note-none Denotes a PR that doesn't merit a release note.
size/L Denotes a PR that changes 100-499 lines, ignoring generated files.