Upgrade kubelet from v1.11.3 to v1.12.3 without draining pods causes containers to restart continuously #72296
This is a serious problem. Once the kubelet is upgraded, every container in the cluster is restarted. An API change can legitimately trigger a restart, but continuous restarting is not expected. In my opinion this may be related to docker compatibility, similar to comment #64283 (comment). I think kubernetes-sigs/cri-tools#373 is needed.
/priority critical-urgent
/sig node
What was the problem? @wgliang
@vikaschoudhary16 The root cause is this. Normally, Kubernetes' restart behavior is to rebuild the container: when a Pod's container fails (whether the container itself crashes or the node restarts), the kubelet recreates a new container for the Pod. In practice, however, many of our users' applications write data or configuration into the container's root filesystem, so they expect the previous container to be reused. In our fork we therefore added a ReuseAlways strategy for Pods and made it the default restart behavior; the original Always (rebuild) strategy remains available as an option. With ReuseAlways, the kubelet first checks whether a matching container already exists and, if so, starts it directly instead of creating a new one.

That is where the problem lies. With the default rebuild strategy, the container object is rebuilt after an upgrade, so its hash matches when the kubelet checks it on restart. But because we reuse the original container, the hash stored on the old container never matches the hash the upgraded kubelet computes from the newly constructed container object. The kubelet concludes the container needs to be restarted and kills it, the reused container again fails the hash check, and so the container restarts continuously.
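The mismatch described above comes from the kubelet hashing the container spec and killing the container when the stored hash differs from the freshly computed one. A minimal sketch of the idea (the struct and field names here are illustrative stand-ins, not the real v1.Container type; the real kubelet hashes the full spec via an FNV-based helper):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// containerSpec is an illustrative stand-in for the kubelet's container
// object; the real kubelet hashes the full v1.Container spec.
type containerSpec struct {
	Name      string
	Image     string
	ProcMount string // stand-in for a field added in a newer API version
}

// hashSpec mimics the idea behind the kubelet's HashContainer: write a
// deterministic representation of the spec into an FNV hash.
func hashSpec(c containerSpec) uint32 {
	h := fnv.New32a()
	fmt.Fprintf(h, "%#v", c)
	return h.Sum32()
}

func main() {
	// Hash recorded when the old kubelet created the container; the new
	// field does not exist yet, so it is the zero value.
	oldHash := hashSpec(containerSpec{Name: "redis", Image: "redis:latest"})
	// Hash computed by the upgraded kubelet, where the new API field now
	// defaults to a non-empty value.
	newHash := hashSpec(containerSpec{Name: "redis", Image: "redis:latest", ProcMount: "Default"})
	if oldHash != newHash {
		// This is the "Container spec hash changed" path seen in the logs:
		// the kubelet kills and recreates the container.
		fmt.Println("spec hash changed: container will be killed and recreated")
	}
}
```

With the rebuild strategy the recreated container is stamped with the new hash, so the check passes on the next sync; with a reused container the old hash never changes, so the check fails every time.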
@kubernetes/sig-release
Draining pods from a node prior to upgrading minor versions is the tested upgrade procedure. Clarifying that in the documentation is on the list of follow-ups to kubernetes/website#11060. /sig node
@liggitt Thanks, that documentation and instruction work is very necessary.
Also, I don't know whether the kubelet could avoid restarting containers on upgrade even when the container object's API has changed, because services running in production environments must not exit unexpectedly.
Here is an issue tracking this: #63814.
@wgliang We met the same issue when upgrading from 1.11.3 to 1.12.3. Can you share how you solved it?
@haoqing0110 If you want to hack around this issue temporarily, you can check this PR: kpaas-io#53. We did this when upgrading from 1.7 to 1.9 in our private clusters, and the idea applies to upgrades between other versions as well. The key point is to keep the container hash the same after the upgrade (described in #63814).
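The workaround boils down to making the upgraded kubelet compute the container hash the same way the previous version did. A hypothetical sketch of that idea (field names are illustrative, not the real PR's code): ignore fields the old API version did not have before hashing, so the upgraded kubelet reproduces the old hash:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// spec stands in for the container object; ProcMount represents a
// hypothetical field introduced in the newer API version.
type spec struct {
	Name      string
	Image     string
	ProcMount string
}

// hashOf writes a deterministic representation of the spec into an FNV hash.
func hashOf(s spec) uint32 {
	h := fnv.New32a()
	fmt.Fprintf(h, "%#v", s)
	return h.Sum32()
}

// legacyHash zeroes fields the old kubelet did not know about before
// hashing, so the result matches what the old version would have computed.
func legacyHash(s spec) uint32 {
	s.ProcMount = "" // drop the new field; the old hash never saw it
	return hashOf(s)
}

func main() {
	oldView := spec{Name: "redis", Image: "redis:latest"}
	newView := spec{Name: "redis", Image: "redis:latest", ProcMount: "Default"}
	// The hashes now agree, so the kubelet no longer kills the container.
	fmt.Println(hashOf(oldView) == legacyHash(newView))
}
```

This is fragile by design: it must be updated for every API change between the two versions, which is why draining nodes before the upgrade remains the supported procedure.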
Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What happened:
Upgrading kubernetes from v1.11.3 to v1.12.3 causes containers to restart continuously.
What you expected to happen:
The upgrade from v1.11.3 to v1.12.3 completes successfully, without containers restarting.
How to reproduce it (as minimally and precisely as possible):
Upgrade kubernetes from v1.11.3 to v1.12.3 on a node without draining its pods.
Anything else we need to know?:
I1222 15:12:14.538192 197768 kuberuntime_container.go:887] Update pause id to 291ec4e9320f0460e703e405ea459e47525df9762d7fb74b2c53b5f42d4a3c6e of container d6b39588cb0b4e38f46cf0c135e52e7862f9af2d47d30662ae9550ec7e0c0316 for pod redis-k6p6z_default(0bbb8d6b-05a8-11e9-8012-0022b2fb1eda)
I1222 15:12:14.538819 197768 kuberuntime_container.go:893] Start exited container d6b39588cb0b4e38f46cf0c135e52e7862f9af2d47d30662ae9550ec7e0c0316 for pod redis-k6p6z_default(0bbb8d6b-05a8-11e9-8012-0022b2fb1eda)
I1222 15:12:14.549372 197768 secret.go:182] Setting up volume default-token-2ccdr for pod 0bbb8d6b-05a8-11e9-8012-0022b2fb1eda at /var/lib/kubelet/pods/0bbb8d6b-05a8-11e9-8012-0022b2fb1eda/volumes/kubernetes.io~secret/default-token-2ccdr
I1222 15:12:14.549431 197768 secret.go:206] Received secret default/default-token-2ccdr containing (3) pieces of data, 1948 total bytes
I1222 15:12:19.939984 197768 setters.go:803] update image disk to 972
I1222 15:12:19.954819 197768 setters.go:810] update test-cluster disk to 1024
I1222 15:12:21.359371 197768 eviction_manager.go:226] eviction manager: synchronize housekeeping
W1222 15:12:21.437428 197768 conversion.go:110] Could not get instant cpu stats: cumulative stats decrease
I1222 15:12:21.440392 197768 helpers.go:836] eviction manager: observations: signal=nodefs.available, available: 65338980Ki, capacity: 104841732Ki, time: 2018-12-22 15:12:21.360524069 +0800 CST m=+6910.379936037
I1222 15:12:21.440416 197768 helpers.go:836] eviction manager: observations: signal=nodefs.inodesFree, available: 103631774, capacity: 104854528, time: 2018-12-22 15:12:21.360524069 +0800 CST m=+6910.379936037
I1222 15:12:21.440424 197768 helpers.go:836] eviction manager: observations: signal=imagefs.available, available: 938628608Ki, capacity: 996140Mi, time: 2018-12-22 15:12:21.360524069 +0800 CST m=+6910.379936037
I1222 15:12:21.440430 197768 helpers.go:836] eviction manager: observations: signal=pid.available, available: 326140, capacity: 320Ki, time: 2018-12-22 15:12:21.439787982 +0800 CST m=+6910.459199938
I1222 15:12:21.440438 197768 helpers.go:836] eviction manager: observations: signal=memory.available, available: 127781348Ki, capacity: 131453720Ki, time: 2018-12-22 15:12:21.360524069 +0800 CST m=+6910.379936037
I1222 15:12:21.440444 197768 helpers.go:836] eviction manager: observations: signal=allocatableMemory.available, available: 120959560Ki, capacity: 120967960Ki, time: 2018-12-22 15:12:21.44031551 +0800 CST m=+6910.459727491
I1222 15:12:21.440471 197768 eviction_manager.go:317] eviction manager: no resources are starved
I1222 15:12:30.031825 197768 setters.go:803] update image disk to 972
I1222 15:12:30.045953 197768 setters.go:810] update test-cluster disk to 1024
I1222 15:12:31.440631 197768 eviction_manager.go:226] eviction manager: synchronize housekeeping
I1222 15:12:31.509827 197768 helpers.go:836] eviction manager: observations: signal=nodefs.inodesFree, available: 103631774, capacity: 104854528, time: 2018-12-22 15:12:31.441748963 +0800 CST m=+6920.461160991
I1222 15:12:31.509848 197768 helpers.go:836] eviction manager: observations: signal=imagefs.available, available: 938628608Ki, capacity: 996140Mi, time: 2018-12-22 15:12:31.441748963 +0800 CST m=+6920.461160991
I1222 15:12:31.509856 197768 helpers.go:836] eviction manager: observations: signal=pid.available, available: 326140, capacity: 320Ki, time: 2018-12-22 15:12:31.509236486 +0800 CST m=+6920.528648434
I1222 15:12:31.509861 197768 helpers.go:836] eviction manager: observations: signal=memory.available, available: 127781376Ki, capacity: 131453720Ki, time: 2018-12-22 15:12:31.441748963 +0800 CST m=+6920.461160991
I1222 15:12:31.509866 197768 helpers.go:836] eviction manager: observations: signal=allocatableMemory.available, available: 120959560Ki, capacity: 120967960Ki, time: 2018-12-22 15:12:31.509770428 +0800 CST m=+6920.529182420
I1222 15:12:31.509871 197768 helpers.go:836] eviction manager: observations: signal=nodefs.available, available: 65338980Ki, capacity: 104841732Ki, time: 2018-12-22 15:12:31.441748963 +0800 CST m=+6920.461160991
I1222 15:12:31.509894 197768 eviction_manager.go:317] eviction manager: no resources are starved
I1222 15:12:40.164200 197768 setters.go:803] update image disk to 972
I1222 15:12:40.175425 197768 setters.go:810] update test-cluster disk to 1024
I1222 15:12:41.510077 197768 eviction_manager.go:226] eviction manager: synchronize housekeeping
I1222 15:12:41.592297 197768 helpers.go:836] eviction manager: observations: signal=memory.available, available: 127781200Ki, capacity: 131453720Ki, time: 2018-12-22 15:12:41.511325515 +0800 CST m=+6930.530737539
I1222 15:12:41.592330 197768 helpers.go:836] eviction manager: observations: signal=allocatableMemory.available, available: 120959664Ki, capacity: 120967960Ki, time: 2018-12-22 15:12:41.592217114 +0800 CST m=+6930.611629061
I1222 15:12:41.592337 197768 helpers.go:836] eviction manager: observations: signal=nodefs.available, available: 65339040Ki, capacity: 104841732Ki, time: 2018-12-22 15:12:41.511325515 +0800 CST m=+6930.530737539
I1222 15:12:41.592343 197768 helpers.go:836] eviction manager: observations: signal=nodefs.inodesFree, available: 103631774, capacity: 104854528, time: 2018-12-22 15:12:41.511325515 +0800 CST m=+6930.530737539
I1222 15:12:41.592349 197768 helpers.go:836] eviction manager: observations: signal=imagefs.available, available: 938628608Ki, capacity: 996140Mi, time: 2018-12-22 15:12:41.511325515 +0800 CST m=+6930.530737539
I1222 15:12:41.592355 197768 helpers.go:836] eviction manager: observations: signal=pid.available, available: 326140, capacity: 320Ki, time: 2018-12-22 15:12:41.591449367 +0800 CST m=+6930.610861319
I1222 15:12:41.592377 197768 eviction_manager.go:317] eviction manager: no resources are starved
I1222 15:12:50.293269 197768 setters.go:803] update image disk to 972
I1222 15:12:50.304977 197768 setters.go:810] update test-cluster disk to 1024
I1222 15:12:51.592583 197768 eviction_manager.go:226] eviction manager: synchronize housekeeping
I1222 15:12:51.662632 197768 helpers.go:836] eviction manager: observations: signal=memory.available, available: 127778408Ki, capacity: 131453720Ki, time: 2018-12-22 15:12:51.59387317 +0800 CST m=+6940.613285106
I1222 15:12:51.662667 197768 helpers.go:836] eviction manager: observations: signal=allocatableMemory.available, available: 120960064Ki, capacity: 120967960Ki, time: 2018-12-22 15:12:51.662565942 +0800 CST m=+6940.681978056
I1222 15:12:51.662674 197768 helpers.go:836] eviction manager: observations: signal=nodefs.available, available: 65339040Ki, capacity: 104841732Ki, time: 2018-12-22 15:12:51.59387317 +0800 CST m=+6940.613285106
I1222 15:12:51.662696 197768 helpers.go:836] eviction manager: observations: signal=nodefs.inodesFree, available: 103631774, capacity: 104854528, time: 2018-12-22 15:12:51.59387317 +0800 CST m=+6940.613285106
I1222 15:12:51.662702 197768 helpers.go:836] eviction manager: observations: signal=imagefs.available, available: 938628608Ki, capacity: 996140Mi, time: 2018-12-22 15:12:51.59387317 +0800 CST m=+6940.613285106
I1222 15:12:51.662707 197768 helpers.go:836] eviction manager: observations: signal=pid.available, available: 326140, capacity: 320Ki, time: 2018-12-22 15:12:51.662038613 +0800 CST m=+6940.681450560
I1222 15:12:51.662730 197768 eviction_manager.go:317] eviction manager: no resources are starved
I1222 15:13:00.375986 197768 setters.go:803] update image disk to 972
I1222 15:13:00.385920 197768 setters.go:810] update test-cluster disk to 1024
I1222 15:13:01.662899 197768 eviction_manager.go:226] eviction manager: synchronize housekeeping
I1222 15:13:01.732555 197768 helpers.go:836] eviction manager: observations: signal=nodefs.available, available: 65339040Ki, capacity: 104841732Ki, time: 2018-12-22 15:13:01.664006156 +0800 CST m=+6950.683418147
I1222 15:13:01.732579 197768 helpers.go:836] eviction manager: observations: signal=nodefs.inodesFree, available: 103631774, capacity: 104854528, time: 2018-12-22 15:13:01.664006156 +0800 CST m=+6950.683418147
I1222 15:13:01.732587 197768 helpers.go:836] eviction manager: observations: signal=imagefs.available, available: 938628608Ki, capacity: 996140Mi, time: 2018-12-22 15:13:01.664006156 +0800 CST m=+6950.683418147
I1222 15:13:01.732592 197768 helpers.go:836] eviction manager: observations: signal=pid.available, available: 326140, capacity: 320Ki, time: 2018-12-22 15:13:01.731921755 +0800 CST m=+6950.751333690
I1222 15:13:01.732598 197768 helpers.go:836] eviction manager: observations: signal=memory.available, available: 127778172Ki, capacity: 131453720Ki, time: 2018-12-22 15:13:01.664006156 +0800 CST m=+6950.683418147
I1222 15:13:01.732603 197768 helpers.go:836] eviction manager: observations: signal=allocatableMemory.available, available: 120960280Ki, capacity: 120967960Ki, time: 2018-12-22 15:13:01.732451214 +0800 CST m=+6950.751863193
I1222 15:13:01.732627 197768 eviction_manager.go:317] eviction manager: no resources are starved
I1222 15:13:10.484452 197768 setters.go:803] update image disk to 972
I1222 15:13:10.496762 197768 setters.go:810] update test-cluster disk to 1024
I1222 15:13:11.732724 197768 eviction_manager.go:226] eviction manager: synchronize housekeeping
I1222 15:13:11.810767 197768 helpers.go:836] eviction manager: observations: signal=memory.available, available: 127778812Ki, capacity: 131453720Ki, time: 2018-12-22 15:13:11.733727057 +0800 CST m=+6960.753139014
I1222 15:13:11.810800 197768 helpers.go:836] eviction manager: observations: signal=allocatableMemory.available, available: 120960276Ki, capacity: 120967960Ki, time: 2018-12-22 15:13:11.810677586 +0800 CST m=+6960.830089588
I1222 15:13:11.810806 197768 helpers.go:836] eviction manager: observations: signal=nodefs.available, available: 65339040Ki, capacity: 104841732Ki, time: 2018-12-22 15:13:11.733727057 +0800 CST m=+6960.753139014
I1222 15:13:11.810812 197768 helpers.go:836] eviction manager: observations: signal=nodefs.inodesFree, available: 103631774, capacity: 104854528, time: 2018-12-22 15:13:11.733727057 +0800 CST m=+6960.753139014
I1222 15:13:11.810818 197768 helpers.go:836] eviction manager: observations: signal=imagefs.available, available: 938628608Ki, capacity: 996140Mi, time: 2018-12-22 15:13:11.733727057 +0800 CST m=+6960.753139014
I1222 15:13:11.810824 197768 helpers.go:836] eviction manager: observations: signal=pid.available, available: 326140, capacity: 320Ki, time: 2018-12-22 15:13:11.810162455 +0800 CST m=+6960.829574420
I1222 15:13:11.810851 197768 eviction_manager.go:317] eviction manager: no resources are starved
I1222 15:13:16.373261 197768 kubelet_pods.go:1327] Generating status for "redis-k6p6z_default(0bbb8d6b-05a8-11e9-8012-0022b2fb1eda)"
I1222 15:13:16.373364 197768 status_manager.go:364] Ignoring same status for pod "redis-k6p6z_default(0bbb8d6b-05a8-11e9-8012-0022b2fb1eda)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2018-12-22 13:11:32 +0800 CST Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2018-12-22 13:11:35 +0800 CST Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:0001-01-01 00:00:00 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2018-12-22 13:11:32 +0800 CST Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.21.32.4 PodIP:10.24.75.204 StartTime:2018-12-22 13:11:32 +0800 CST InitContainerStatuses:[] ContainerStatuses:[{Name:redis State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2018-12-22 14:36:42 +0800 CST,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:registry.test.com/tools/redis:latest ImageID:docker-pullable://registry.test.com/tools/redis@sha256:1bce822ce3c234b07c4036ead05c464e7972a565b63779c37d0efd25b69d188a ContainerID:docker://d6b39588cb0b4e38f46cf0c135e52e7862f9af2d47d30662ae9550ec7e0c0316}] QOSClass:Burstable}
I1222 15:13:16.373534 197768 volume_manager.go:350] Waiting for volumes to attach and mount for pod "redis-k6p6z_default(0bbb8d6b-05a8-11e9-8012-0022b2fb1eda)"
I1222 15:13:16.373547 197768 volume_manager.go:383] All volumes are attached and mounted for pod "redis-k6p6z_default(0bbb8d6b-05a8-11e9-8012-0022b2fb1eda)"
I1222 15:13:16.373555 197768 kuberuntime_manager.go:437] Container Status &{ID:{Type:docker ID:d6b39588cb0b4e38f46cf0c135e52e7862f9af2d47d30662ae9550ec7e0c0316} Name:redis State:running CreatedAt:2018-12-22 13:11:33.955691733 +0800 CST StartedAt:2018-12-22 14:36:42.845466943 +0800 CST FinishedAt:0001-01-01 00:00:00 +0000 UTC ExitCode:0 Image:registry.test.com/tools/redis:latest ImageID:docker-pullable://registry.test.com/tools/redis@sha256:1bce822ce3c234b07c4036ead05c464e7972a565b63779c37d0efd25b69d188a Hash:1795437233 RestartCount:0 Reason: Message:} and container &Container{Name:redis,Image:registry.test.com/tools/redis:latest,Command:[],Args:[],WorkingDir:,Ports:[],Env:[{test-cluster_K8S_WORKLOAD_NAME aa nil}],Resources:ResourceRequirements{Limits:ResourceList{cpu: {{4 0} {} 4 DecimalSI},memory: {{1073741824 0} {} 1Gi BinarySI},},Requests:ResourceList{cpu: {{400 -3} {} 400m DecimalSI},memory: {{1073741824 0} {} 1Gi BinarySI},pod.alpha.kubernetes.io/opaque-int-resource-test-cluster_DISK: {{10 0} {} 10 DecimalSI},pod.alpha.kubernetes.io/opaque-int-resource-IMAGE_DISK: {{10 0} {} 10 DecimalSI},},},VolumeMounts:[{test-clusterlvm false /docker } {default-token-2ccdr true /var/run/secrets/kubernetes.io/serviceaccount }],LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:Always,SecurityContext:&SecurityContext{Capabilities:&Capabilities{Add:[SYS_ADMIN SYS_PTRACE],Drop:[],},Privileged:*false,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,ProcMount:*Default,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[],TerminationMessagePolicy:File,VolumeDevices:[],}
I1222 15:13:16.373759 197768 kuberuntime_manager.go:564] Container "redis" ({"docker" "d6b39588cb0b4e38f46cf0c135e52e7862f9af2d47d30662ae9550ec7e0c0316"}) of pod redis-k6p6z_default(0bbb8d6b-05a8-11e9-8012-0022b2fb1eda): Container spec hash changed (1795437233 vs 2346900602).. Container will be killed and recreated.
I1222 15:13:16.373767 197768 kuberuntime_manager.go:585] computePodActions got {KillPod:false CreateSandbox:false SandboxID:291ec4e9320f0460e703e405ea459e47525df9762d7fb74b2c53b5f42d4a3c6e Attempt:0 NextInitContainerToStart:nil ContainersToStart:[0] ContainersToKill:map[{Type:docker ID:d6b39588cb0b4e38f46cf0c135e52e7862f9af2d47d30662ae9550ec7e0c0316}:{container:0xc420b57400 name:redis message:Container spec hash changed (1795437233 vs 2346900602).. Container will be killed and recreated.}]} for pod "redis-k6p6z_default(0bbb8d6b-05a8-11e9-8012-0022b2fb1eda)"
I1222 15:13:16.373787 197768 kuberuntime_manager.go:619] Killing unwanted container "redis"(id={"docker" "d6b39588cb0b4e38f46cf0c135e52e7862f9af2d47d30662ae9550ec7e0c0316"}) for pod "redis-k6p6z_default(0bbb8d6b-05a8-11e9-8012-0022b2fb1eda)"
I1222 15:13:16.373796 197768 kuberuntime_container.go:559] Killing container "docker://d6b39588cb0b4e38f46cf0c135e52e7862f9af2d47d30662ae9550ec7e0c0316" with 30 second grace period
W1222 15:13:16.447863 197768 plugin-defaults.go:32] flexVolume driver test-cluster/lvm: using default GetVolumeName for volume test-clusterlvm
I1222 15:13:16.447970 197768 secret.go:182] Setting up volume default-token-2ccdr for pod 0bbb8d6b-05a8-11e9-8012-0022b2fb1eda at /var/lib/kubelet/pods/0bbb8d6b-05a8-11e9-8012-0022b2fb1eda/volumes/kubernetes.io~secret/default-token-2ccdr
I1222 15:13:16.447988 197768 secret.go:206] Received secret default/default-token-2ccdr containing (3) pieces of data, 1948 total bytes
I1222 15:13:16.505560 197768 kuberuntime_container.go:583] Container "docker://d6b39588cb0b4e38f46cf0c135e52e7862f9af2d47d30662ae9550ec7e0c0316" exited normally
I1222 15:13:16.505668 197768 server.go:460] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"redis-k6p6z", UID:"0bbb8d6b-05a8-11e9-8012-0022b2fb1eda", APIVersion:"v1", ResourceVersion:"832313", FieldPath:"spec.containers{redis}"}): type: 'Normal' reason: 'Killing' Killing container with id docker://redis:Container spec hash changed (1795437233 vs 2346900602).. Container will be killed and recreated.
I1222 15:13:16.505736 197768 server.go:460] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"gh-test-cluster-k8s-ep02.gh.test.com", UID:"gh-test-cluster-k8s-ep02.gh.test.com", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MissingClusterDNS' kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
I1222 15:13:16.505753 197768 server.go:460] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"redis-k6p6z", UID:"0bbb8d6b-05a8-11e9-8012-0022b2fb1eda", APIVersion:"v1", ResourceVersion:"832313", FieldPath:""}): type: 'Warning' reason: 'MissingClusterDNS' pod: "redis-k6p6z_default(0bbb8d6b-05a8-11e9-8012-0022b2fb1eda)". kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
Environment:
kubernetes master v1.12.3
kubernetes kubelet v1.12.3
docker version
Client:
Version: 1.13.1
API version: 1.26
kubectl version
Kernel (e.g. uname -a):