
Proposal - Pod safety and termination guarantees #34160

Closed
wants to merge 3 commits

Conversation

Contributor

@smarterclayton smarterclayton commented Oct 6, 2016

This proposal describes how to evolve the Kubernetes cluster to provide
at-most-one semantics for pod execution and to allow selective
relaxation where necessary to heal partitions.

This is a continuation of pod graceful deletion and completes the safety guarantees begun in that change (#1535)


@smarterclayton
Contributor Author

Linked a number of issues that relate to pet safety, pod termination guarantees, and storage safety.

@k8s-github-robot k8s-github-robot added kind/design Categorizes issue or PR as related to design. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Oct 6, 2016
Contributor

gmarek commented Oct 6, 2016

cc @wojtek-t @gmarek


adohe-zz commented Oct 6, 2016

/cc @adohe

entity will be spawned) but may become unavailable (cluster no longer has
a sufficient number of members). The Pet Set guarante must be strong enough
for an administrator to reason about the state of the cluster by observing
the Kubrenetes API.
Contributor

Typo: Kubernetes


Clients today may assume that force deletions are safe. We must appropriately
audit clients to identify this behavior and improve the messages. For instance,
`kubectl delete --grace-period=0` could print a warning and require `--confirm`:
Member

If the pod has pending finalizers, --grace-period=0 will not delete the pod from the key-value store. Rather than overloading the grace-period flag, can we define a new flag, like --forceful-deletion, which will instruct the API server to also ignore pending finalizers?

Contributor Author

I don't think of force deletion as bypassing finalizers (in the context the proposal is written in). This would be delete 0 as it is today. If that bypasses finalizers, then we may want to clarify what exactly we want the role of finalizers to be w.r.t. deletion here so it's clear.

@smarterclayton
Contributor Author

Gentle nag for more review from those this impacts - this is a P1 1.5 item for PetSets.

@smarterclayton
Contributor Author

@erictune

Contributor
@marun marun left a comment

spelling/grammar nits

* If no grace period is provided, the default from the pod is leveraged
* When the kubelet observes the deletion, it starts a timer equal to the
grace period and performs the following actions:
* Executes the pre-stop hook, if specified, waiting up **grace period**
Contributor

up -> up to

grace period and performs the following actions:
* Executes the pre-stop hook, if specified, waiting up **grace period**
before continuing
* Sends the termination signal to the container runtime (SIGTERM)
Contributor
@marun marun Oct 11, 2016

Does k8s support stopping a container via the STOPSIGNAL provided in a dockerfile?

Contributor Author

Yes, we invoke `docker stop`, which honors the STOPSIGNAL on the image.

* Executes the pre-stop hook, if specified, waiting up **grace period**
before continuing
* Sends the termination signal to the container runtime (SIGTERM)
* Waits 2 seconds, or the remaining grace period, which ever is longer
Contributor

which ever -> whichever
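
As context for the sequence being reviewed here, the grace period and the pre-stop hook are both expressed on the pod spec; a minimal sketch (names and values are illustrative, not taken from the proposal):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-0                          # illustrative name
spec:
  terminationGracePeriodSeconds: 30    # default grace period used when the pod is deleted
  containers:
  - name: app
    image: nginx                       # illustrative image
    lifecycle:
      preStop:
        exec:
          # pre-stop hook; the kubelet runs this before sending SIGTERM and bounds it
          # by the remaining grace period
          command: ["/bin/sh", "-c", "sleep 5"]
```

After the hook completes (or times out), the kubelet signals the runtime (SIGTERM, or the image's STOPSIGNAL on Docker) and only escalates to SIGKILL once the grace period has expired.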

to run for an arbitary amount of time. If a higher level component like the
PetSet controller treats the existence of the pod API object as a strongly
consistent entity, deleting the pod in this fashion will violate the
at-most-one guarantees we wish to offer for pet sets.
Contributor

guarantees -> guarantee ?

ReplicaSets and ReplicationControllers both attempt to preserve availability
of their constituent pods over ensuring **at most one** semantics. So a
replica set to scale 1 will immediately create a new pod when it observes an
old pod is delete, and as a result at many points in the lifetime of a replica
Contributor
@marun marun Oct 11, 2016

is delete -> has been deleted

is used by two pods on different nodes simultaneously, concurrent access may
result in corruption, even if the PV or PVC is identified as "read write one".
PVC consumers must ensure these volume types are *never* referenced from
mulitple pods without some external synchronization. As described above, it
Contributor

mulitple -> multiple


### Avoid multiple instances of pods

To ensure that the Pet Set controller can safely use pods and ensure at most
Contributor

ensure -> ensure that

* Application owners must be free to force delete pods, but they *must*
understand the implications of doing so, and all client UI must be able
to communicate those implications.
* All existing controllers in the system must be limited signaling pod
Contributor

limited -> limited to

* Additional agents running on each host to force kill process or trigger reboots
* Agents integrated with or communicating with hypervisors running hosts to stop VMs
* Hardware IPMI interfaces to reboot a host
* Rack level power units to power cycle a blad
Contributor

blad -> blade

5. The kubelet on node `A` observes the pod references a PVC that specifies RWO which
requires "attach" to be successful
6. The attach/detach controller observes that a pod has been bound with a PVC that
requires "attach", and attempts to execute a CAS update on the PVC/PV attaching
Contributor

Curious, what is 'CAS'?
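
"CAS" here is compare-and-swap: in API terms, an update submitted with the metadata.resourceVersion the controller last read, so the API server rejects the write with a conflict if anyone else modified the object in between. A rough sketch of a PV carrying such a claim (the annotation name is hypothetical and not part of this proposal; the attach marker could equally be a first-class field or label):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-0001
  resourceVersion: "184532"            # value observed before the update; a stale value
                                       # causes the API server to reject the write (409)
  annotations:
    example.com/attached-to: node-a    # hypothetical marker recording the exclusive attach
spec:
  capacity:
    storage: 10Gi
  accessModes: ["ReadWriteOnce"]
  gcePersistentDisk:
    pdName: my-disk                    # illustrative backing disk
```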

to communicate those implications.
* All existing controllers in the system must be limited signaling pod
termination (starting graceful deletion), and are not allowed to force
delete a pod.
Contributor

Why not just leave this up to forgiveness? Pods get default forgiveness that does what we do today (tolerate partition for 5m). All controllers are modified to respect forgiveness, users can modify it, and the PetSet controller will modify it to favor consistency by default.

Contributor Author

Force deletion is what causes split brain. The only guarantee that can ensure the process has been terminated (and thus is safe to launch another) is the kubelet or a fencer. So only the kubelet, a fencer, or a human should ever force delete. The petset controller shouldn't have to apply any consistency guarantee.

Contributor Author

Forgiveness is a delay before delete. It should have no coupling to force deletion.

Member
@davidopp davidopp Oct 12, 2016

Yeah, and forgiveness is a delay before a delete that is triggered by placement of a taint so there are plenty of deletion scenarios that forgiveness wouldn't help with. (It would help with node unreachable (Unknown node condition), though.)

Contributor

The fencer can't always assume my stateless nginx pods will suffer from split brain, right? Maybe we don't need forgiveness, but assuming all pods need this safety guarantee seems weird. Ideally I'd like the fencer deletion to be configurable.

If delay == forever, then the only way to override is to --force. The taint in this case is something like node down. If a user deletes (pod or ns), the pod gets deletion grace and the kubelet has final say, as always. If a user --force deletes, it's on them.

Contributor Author

I think creating a ReplicaSet has no expectation of split brain prevention (today, it doesn't try to guarantee that). Could we say:

  1. ReplicaSets have no expectation of identity, and therefore replica sets can be force deleted after the grace period expires (or maybe they can opt in)
  2. PetSets always have an expectation of identity and uniqueness, and therefore never want to be force deleted except by a safety mechanism.

I think fencing typically is correlated with short forgiveness / toleration of downtime. I.e. I have a maximum disruption budget of 2 min of downtime. The node controller detection of 40s is pretty short, but if we had other fencing detection mechanisms in the future (there's a lot of peer-peer heartbeats out there, could be service proxy or distributed downtime detection) that are a lot shorter, admins may want to fence first, ask questions later. I.e. disruption budget is 2min, forgiveness is 30s, fast failure detector can detect link disruption in 2s and vote, then the fencing controller could make a fast decision. If forgiveness is 10 minutes, and disruption budget is 5 minutes, then fast fencing is not that critical.

I agree not all pods need the safety guarantee - but on the other hand, force deletion is already unnecessary for replica sets to maintain availability.
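
For readers mapping "forgiveness" onto the API that eventually shipped: it surfaces as taint-based eviction with a bounded toleration. A rough sketch of a pod tolerating an unreachable node for a fixed window (the taint key shown is the current one; older releases used an alpha-prefixed key, and the duration is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mysql-0                           # illustrative pet name
spec:
  containers:
  - name: mysql
    image: mysql                          # illustrative image
  tolerations:
  - key: node.kubernetes.io/unreachable   # taint the node controller applies on partition
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 30                 # start graceful deletion after 30s of partition;
                                          # omit to tolerate the partition indefinitely
```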

Contributor

I think we're agreeing (I forgot about deletion timestamp). You're suggesting that we use a combination of forgiveness + disruptionBudget to communicate to the fencer when to delete, right?

Contributor Author

Yes, will add a paragraph in fencing for this.


The changes above allow Pet Sets to ensure at-most-one pod, but provide no
recourse for the automatic resolution of cluster partitions during normal
operation. For that, we propose a **fencing controller** which exists above
Contributor

I think there's a gap between the previous section (which breaks current AP replica-set behavior by preventing node controller deletes) and this (which only fences). What deletes pods in this world? Are we saying it's always a human in the partition case?

Contributor Author

A cloud provider can properly observe dead nodes and delete them, so we can preserve that aspect. Partition only prevents pods from being removed from etcd. I think the default behavior has to be to never induce split brain except under stronger guarantees than "I waited".

It's possible that we could extend forgiveness to be an upper bound of tolerance for failure, and so it could be an opt-in way to induce a split brain for pet sets. But that's not safe without either unique fenced storage or code inside the petset that can tolerate membership changes by removing the items within a certain clock window.

Contributor Author

Is there a forgiveness node variant we can create that allows both "preserve without deleting" and "preserve without force deleting"?

Member

Forgiveness hasn't been fully implemented yet but I was assuming that once the forgiveness period expires, it would delete the same way node controller deletes today when there's a NotReady (bad node/kubelet) or Unknown (node unreachable), i.e. the code here

func deletePods(kubeClient clientset.Interface, recorder record.EventRecorder, nodeName, nodeUID string, daemonStore cache.StoreToDaemonSetLister) (bool, error) {

which IIUC does a regular delete, not a force delete? (Where is the client Delete() operation that it's using defined?)

Contributor

Isn't there going to be an event: *, duration: infinite flavor for forgiveness?

Contributor Author

@davidopp

func forcefullyDeleteNode(kubeClient clientset.Interface, nodeName string, forcefulDeletePodFunc func(*api.Pod) error) error {
is invoked after the timeout period. The proposal calls for the forceful delete to be removed.

Member

Once we start using taints/tolerations/forgiveness to control eviction due to an unreachable/not-ready node (unfortunately it was decided this will not be enabled by default in 1.5, though it will be implemented), you will be able to say "stay bound to the node no matter what happens". I think forcefullyDeleteNode(), which Clayton pointed to, may still be OK since it only activates for a node that has disappeared from the cloud provider, and IIRC you said it's fine for nodes the cloud provider knows are gone to be deleted? There may be other places in the NC where we force delete, and presumably we would have to replace those with graceful delete?

Contributor Author

Yes

update on the PVC/PV clearing its attach state.
14. The attach/detach controller observes the second pod has been scheduled and
attaches it to node `B` and pod 2
15. The kubelet on node `B` observes the attach and allows the pod to execute.
Contributor

I think we already have this flow working on cloudprovider backed storage, we just need to:

  1. teach all storage plugins (even those that are not "attachable") to honor it
  2. communicate between scheduler and volume controller about where exactly to attach (that's the first half of [WIP] Sticky emptydir proposal #30044)

The existing flow uses 2 fields in node.Status:

  • volumesInUse: To block volumecontroller from detaching a mounted volume
  • volumesAttached: To block another node from attaching a pre-attached volume

I can think of 2 things we might need to fix off the top of my head:

  1. I think there's a gap with the storage plugin on the node, where some allow 2 pods to bind-mount the same volume RWO. I think GCE allows this currently.
  2. Today, some volume controllers time out and force detach after a "long" time (2h). I think this needs fixing, especially if we're going to use them as described.

@kubernetes/sig-storage
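
For reference, the two node.status fields mentioned above look roughly like this (volume name and device path are illustrative):

```yaml
status:
  volumesInUse:
  - kubernetes.io/gce-pd/my-disk                 # reported by the kubelet while the volume is
                                                 # still mounted; blocks the controller's detach
  volumesAttached:
  - name: kubernetes.io/gce-pd/my-disk
    devicePath: /dev/disk/by-id/google-my-disk   # recorded by the controller after attach
```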

Contributor Author

Node status isn't transactional with the PV, so it doesn't prevent races when claiming.

Contributor

The scheduler will be the single writer entering the name of a node into a field on the pv (or applying labels), and the kubelet will still reject anything with a name that's not its hostname. Once assigned to a node, the pv/pvc can't be reassigned to any other. After that normal attach/detach logic kicks in.
Or am I misunderstanding?

Contributor

I think the scheduler should always make the assignment, not the volume controller. Given that the scheduler needs to write the hostname somewhere, the volume controller just needs to trigger its normal attach flow on that. Maybe you meant an active/passive volume controller setup?

Contributor

@bprashanth, actually the force detach timeout period is 6 minutes now. And I am thinking of making it longer, or even removing it, to avoid the possibility of detaching a mounted volume, since that would cause file system corruption.

From my understanding, the attach_detach_controller will update the desired state of the world when a node is added/deleted or a pod is added/deleted/updated. So the following sequence might happen:

  1. PodA refers VolumeX which is attached to NodeA
  2. PodB refers VolumeX which is assigned to NodeB, reconciler will try to attach VolumeX to NodeB (this could succeed if RWX is supported for the volumes)
  3. PodA is deleted and VolumeX is detached from NodeA.

So it seems like you suggest a way to prevent step 2 from happening using an extra field in the PV? What if the pod is referencing the volume directly? To avoid step 2, I think it is possible to change the workflow in the reconciler to check whether the volume access mode is RWO and the volume is already attached to a node according to the actual state records; if so, it should not trigger the attach operation.

Contributor Author

I would argue that all the attach logic today observes scheduler decisions, and that except for maybe the storage allocation for local persistent volumes, it's more natural for attach/detach to continue to observe scheduling and guarantee locks. It seems to match the description of what the attach_detach_controller owns (binding PVs to nodes transactionally by making API calls to either cloud or kube).

Member

The fact that a PV is attached to a node isn't clearly denoted in a way that is helpful to this idea, I think. There's some real work needed to make sure this is safe. Given how tricky the existing binding and attach/detach code is...

Contributor

Yeah, that's the same problem we need to solve for sticky emptyDir. We discussed using a field or labels on that proposal.


## Backwards compatibility

On an upgrade, pet sets would not be "safe" until the above behavior is implemented.
Contributor

We'd just use (i.e., document) finalizers as poor man's forgiveness, right?

Contributor Author

I'm not even sure pet set controller needs a finalizer. I think no one should force delete, and at that point the finalizer is unnecessary.

Contributor

I meant if you want to preserve safety across some rollback that doesn't have what this proposal details (but not so far back that it also doesn't have finalizers), you can put in a finalizer for each pet to block deletion by the old node controller.

Contributor Author

Ah. Do you think we'd need it?
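
For context on the mechanism being discussed: a finalizer is just an entry in metadata.finalizers; once deletionTimestamp is set, the object stays in the API until every finalizer is removed. A sketch of what a per-pet finalizer could look like (the finalizer name is hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mysql-0                       # illustrative pet name
  finalizers:
  - example.com/pet-protection        # hypothetical; the controller would remove it only
                                      # after confirming the pet is safe to replace
spec:
  containers:
  - name: mysql
    image: mysql                      # illustrative image
```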

termination (starting graceful deletion), and are not allowed to force
delete a pod.
* The node controller will no longer be allowed to force delete pods -
it may only signal deletion
Contributor
@foxish foxish Oct 12, 2016

Aren't force deletions from the node controller still needed for the guarantee that RS/RC provide, that at least N replicas are running even in the face of node partitions?

Oh I see, that responsibility would move to the pod gc controller?

Contributor Author

RS/RC immediately spin up new pods as soon as the node controller signals termination on the pod (by starting graceful delete). They exclude any terminating pods from their count.

@Kargakis raised today that that behavior actually violates the implicit guarantee of the recreate deployment strategy - that there is a period where there are zero pods running of the old version. I think we need to resolve that independently and he was filing an issue.

If we treat node deletion as implicit unlock (it probably is) then both pod GC and node controller would be allowed to force delete pods. However, I'm not 100% sure since today there are cases where the node would just recreate its node object, and people may not be expecting that delete node means "hard node evacuate".

terminated.
* Application owners must be free to force delete pods, but they *must*
understand the implications of doing so, and all client UI must be able
to communicate those implications.
Contributor

These 2 are already true today, right? (assuming "normal" operation is not a netsplit)

Contributor Author

Yes

Contributor Author

I think the "implications" are not explained.


@bprashanth
Contributor

Is the 1.5 MVP to delete the code that force-deletes pods in the node and other controllers (as described: #31762 (comment))? Out of the 3 sections, the first is the one that solves the core issue. Both the fencer and the RWO lock border on new features.

@smarterclayton
Contributor Author

Yes, and to add --force for kubectl delete (0 will not be allowed by default or made to be 1 under the covers).

Fencer and RWO can be solved in a future release; we would document how to be safe and what admins have to do to be safe.

I'll sign up for all of the work here, I have time set aside for it anyway (or I can review if someone is already assigned).


@bprashanth
Contributor

SGTM, implementation should be straightforward. More importantly, doing this will enable us to test some basic split brain. Running this by @erictune (do you still think we need a "Lost" state?)

@smarterclayton smarterclayton added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels Nov 3, 2016
k8s-github-robot pushed a commit that referenced this pull request Nov 5, 2016
Automatic merge from submit-queue

Add --force to kubectl delete and explain force deletion

--force is required for --grace-period=0. --now is == --grace-period=1.
Improve command help to explain what graceful deletion is and warn about
force deletion.

Part of #34160 & #29033

```release-note
In order to bypass graceful deletion of pods (to immediately remove the pod from the API) the user must now provide the `--force` flag in addition to `--grace-period=0`.  This prevents users from accidentally force deleting pods without being aware of the consequences of force deletion.  Force deleting pods for resources like StatefulSets can result in multiple pods with the same name having running processes in the cluster, which may lead to data corruption or data inconsistency when using shared storage or common API endpoints.
```
k8s-github-robot pushed a commit that referenced this pull request Nov 6, 2016
Automatic merge from submit-queue

Describe graceful deletionTimestamp more accurately

Spawned from #34160
@smarterclayton
Contributor Author

If there are no further comments I'd like to merge this and discuss finalizers in a follow-up to this issue. No one has articulated a blocker on this yet. This is still advisory for storage - I think we'll have to expand this with a storage-related topic.

@smarterclayton
Contributor Author

Hrm, I need to fix the pet set references and make a few wording changes.

* Give the Kubelet sole responsibility for normal deletion of pods -
only the Kubelet in the course of normal operation should ever remove a
pod from etcd (only the Kubelet should force delete)
* The kubelet must not delete the pod until all processes are confirmed
Member

Do you want volume clean-up to happen prior to deletion from the apiserver as well?

Contributor Author

It depends - certainly having attach/detach controller able to observe that on pod deletion helps (once the pod is gone, detach controller has to start trying to detach anyway, so might as well make that consistent).

Contributor

I'd prefer cleaning up of all resources belonging to a pod before the corresponding object gets deleted.

Contributor

Ack

to communicate those implications.
* Force deleting a pod may cause data loss (two instances of the same
pod process may be running at the same time)
* All existing controllers in the system must be limited to signaling pod
Member

I am wondering if we can protect this in code somehow. For example, separate ForceDelete from Delete, so Delete does not let you do a force delete; that way we can minimize reviewer burden in the future.

Contributor Author

It's not terribly difficult to audit today in our code base, but I agree that some level of protection / review is important.

no longer exist if we treat node deletion as confirming permanent
partition. If we do not, the pod GC controller must not force delete
pods.
* It must be possible for an administrator to effectively resolve partitions
Member

I wonder if we should bump this up to a condition on the namespace or if an event would be sufficient.

observe partitions, we propose an additional responsibility to the node controller
or any future controller that attempts to detect partition. The node controller should
add an additional condition to pods that have been terminated due to a node failing
to heartbeat that indicates that the cause of the deletion was node partition.
Member

+1 -- this is useful to ensure that namespaces stuck terminating are stuck for this reason.
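
A sketch of what such a condition might look like on the terminated pod; the type and reason names are hypothetical, since the proposal does not pin them down:

```yaml
status:
  conditions:
  - type: NodeUnreachable             # hypothetical condition type
    status: "True"
    reason: NodePartitioned           # hypothetical reason: the node stopped heartbeating
    message: Pod terminated because the node failed to heartbeat
    lastTransitionTime: "2016-10-12T20:36:00Z"   # illustrative timestamp
```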

container image's STOPSIGNAL on Docker)
* Waits 2 seconds, or the remaining grace period, whichever is longer
* Sends the force termination signal to the container runtime (SIGKILL)
* Once the kubelet observes the container is fully terminated, it issues
Contributor

nit: "deleted" here refers to containers reaching a terminal state. The pod is still not technically "deleted"

period, but never a longer one.

Deleting a pod with grace period 0 is called **force deletion** and will
update the pod with a `deletionGracePeriodSeconds` of 0, and then immediately
Contributor

Would it make sense to provide a synchronous delete option? That way, grace period can be separated from the actual deletion of the object. In most cases, a user might want to gracefully delete and clean up a pod right away by specifying a grace period of 0.

Contributor Author

Is the use case that you want to "wait" for the deletion to finish? Right at the end of 1.5 we made the change to have kubelet wait for deletion, and there is a general issue tracking "wait for action" (which could include delete) to make clients better at scripting. I would expect those to cover the client side aspects of waiting for deletion within the bounds of our API design philosophy.

Contributor

As of now, setting the grace period to 0 implicitly deletes the pod object from etcd. Instead of having an implicit API, why not make inline deletes explicit? A user can request a grace period of 0 and still have the pod object be deleted by the kubelet. If a user wants to delete a pod object right away from etcd, they can then set an explicit parameter to achieve that.

Contributor Author

That would not be backwards compatible, so at best it would be something we do in a v2.
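
For context, a graceful delete leaves the pod visible in the API with the two fields below set until the kubelet (or a force delete) removes the object; the values are illustrative:

```yaml
metadata:
  name: web-0
  deletionTimestamp: "2016-10-12T20:36:00Z"    # set when graceful deletion is requested
  deletionGracePeriodSeconds: 30               # force deletion sets this to 0 and removes
                                               # the object from etcd immediately
```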

A persistent volume that references a strongly consistent storage backend
like AWS EBS, GCE PD, OpenStack Cinder, or Ceph RBD can rely on the storage
API to prevent corruption of the data due to simultaneous access by multiple
clients. However, many commonly deployed storage technologies in the
Contributor

Question: Are these storage technologies not suitable for clustered deployments then?

Contributor Author

They are broadly deployed technologies that benefit from centralized control (because they offer no centralized guarantees themselves). Most production workloads in the world today run on iSCSI, FibreChannel, or NFS devices in clustered forms - but everyone builds their own solutions to handle clustering (Pacemaker, etc). Building the affordances in kube to allow them to be safely used is important, even if Kube itself may not contain the functionality to make them safe.

* All existing controllers in the system must be limited to signaling pod
termination (starting graceful deletion), and are not allowed to force
delete a pod.
* The node controller will no longer be allowed to force delete pods -
Contributor

What will happen when the nodes go offline?

would be able to leverage a number of systems including but not limited to:

* Cloud control plane APIs such as machine force shutdown
* Additional agents running on each host to force kill process or trigger reboots
Contributor

Why is this necessary? If the kubelet is robust enough, this should not be necessary, and we do want the kubelet to be robust in general.

Contributor Author

Example would be network fencing. Kubelet may wedge (we cannot prevent wedges), but an independent process might be fine (like kube-proxy). That independent process could sever the network connection to storage.

I don't think this is required, but it's an example. There are lots of sophisticated fencing tools today - I'm moderately biased to leaving the door open for their use even if it's a "problem left to someone else to solve".

Contributor

As long as it is not part of the default setup, I have no issues.

Contributor Author

I would expect this to be at the level of cloud provider controller or higher, but higher level fencing would probably depend on your setup and not be a "turn on by default" for metal.

* Network routers, backplane switches, software defined networks, or system firewalls
* Storage server APIs to block client access

to appropriately limit the ability of the partitioned system to impact the cluster.
Contributor

This seems to be a lot of work to get poorly designed software to work in clusters. What about the cost of maintaining such a complex controller? It seems to have a lot of extension points?

Contributor Author

It should be possible to build this on top of Kube. Kube can't be a generalized cluster management tool if it prohibits affordances that deal with real world software. This proposal is less about saying Kube should do that work, and more about laying the groundwork for how someone could create it.

Concrete example - bare metal will need an equivalent of node controller. That bare metal node controller will be specialized to different environments, or might just offer simple core tools. The name for this controller today is "human operators". I think it should be a long term goal for the kube ecosystem to allow bare metal to be automated, and the system of tools that allow bare metal to be managed and maintained should be orthogonal to Kube but work well with it.

Contributor

My concern is with third party applications making incorrect assumptions about kube node software. I hope these extensions are managing other infrastructure only.

Contributor Author

So I have familiarity with some forms of fencers, but not all. In general, it's just working up the chain of increasingly large hammers until you hear the breaking glass and can confirm that it's broken. I.e. rack IPMI telling you that it powered down a node (assuming you trust the IPMI controller) or the TOR switch acknowledging it is now dropping all packets to and from a particular MAC / port / etc. I would agree that these hammers require a good understanding of their weaknesses before using them.

@dashpole
Contributor

docs/proposals/pod-safety.md, line 187 at r3 (raw file):

Previously, dashpole (David Ashpole) wrote…

Ack

The kubelet must not delete a pod until all resources assigned to it are cleaned up. This includes processes, containers, volumes, and network (and probably more).



@k8s-github-robot

Adding label:do-not-merge because PR changes docs prohibited to auto merge
See http://kubernetes.io/editdocs/ for information about editing docs

@k8s-github-robot k8s-github-robot added kind/design Categorizes issue or PR as related to design. kind/old-docs do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. labels Dec 1, 2016
@smarterclayton
Contributor Author

Moving to kubernetes/community#124, I will close this once I respond and update that PR with the latest comments from vish and ashpole

@smarterclayton
Contributor Author

Agree that kubelet should cleanup resources of pods before cleaning up the pod from the API. Will enshrine that in the proposal.


If the kubelet crashes during the termination process, it will restart the
termination process from the beginning (grace period is reset). This ensures
that a process is always given **at least** grace period to terminate cleanly.

"at least" may turn out to be problematic.

In other contexts cluster managers usually prefer "at most" as this allows them to continue to make reasonably accurate assumptions even if whole datacenters become inaccessible from one another (in which case the blessed side might reasonably want to assume that the non-blessed side will have killed the relevant pods after some interval).

Deciding which side gets blessed, and how, is a tangential conversation; for now it's probably only important whether this is a relevant use case to consider.

Contributor Author

The fencer would be the one who could provide "at most" guarantees. At the core, we can't get around consensus requirements, so an administrator who wanted to provide "at most" semantics would fence the other data center. In the current system, someone who wants to make an "at most" decision even without strong consensus (observed termination) could still force delete the pods.


Remember though, there will be two fencers in this scenario and they can't talk to each other across the network split.

One in the non-blessed portion of the cluster that will ask for existing pods to go away, and one in the blessed portion that will report that it is now safe to start the replacement pods.

Sure, the non-blessed fencer could do a force delete, but I thought part of the point of this proposal was to avoid those, and IIUC it still leaves the pods running for some period of time[1], which appears to conflict with the "at least" design/statement from the PR.

Nothing that should hold up the PR though. Just something to consider.

[1] Is there a maximum for this interval?

Contributor Author

Not sure what you mean by the blessed and non-blessed parts (terminology mismatch). The API server is providing serializable guarantees, so there is only progress or not-progress w.r.t. recorded changes. The node controller will not do anything if split (it can't make changes), but even then it simply requests termination. The kubelet will observe the request and try to reconcile it; if it cannot, it will not change behavior.

The fencer is allowed to force delete if it can:

  1. Reach the master (not partitioned)
  2. Guarantee total isolation of the process from the rest of the cluster (storage, network, quantum interference with other processes, whatever)

We definitely will have to consider sets of fencing (storage, network, and then power management), so 2 above is more nuanced than just "force delete" - it might be updating the PV record (as described elsewhere in this doc) as well as force deleting the pod.

If the fencer is wrong, or unsure, the fencer can't force delete the pod.

Contributor

vishh commented Dec 5, 2016

@smarterclayton LGTM.

@smarterclayton
Contributor Author

Marking keep-open until I apply the last comments to the community PR to clarify fencing scope.

@k8s-github-robot

[APPROVALNOTIFIER] Needs approval from an approver in each of these OWNERS Files:

We suggest the following people:
cc @brendandburns
You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 30, 2017
@smarterclayton
Contributor Author

All comments applied to the community doc; please give the second commit a once-over and then we'll merge (since this was landed in 1.5).

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
area/stateful-apps
do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed.
kind/design Categorizes issue or PR as related to design.
release-note-none Denotes a PR that doesn't merit a release note.
size/L Denotes a PR that changes 100-499 lines, ignoring generated files.