
Upgrade kubelet from v1.11.3 to v1.12.3 without draining pods causes containers to restart continuously #72296

Closed
wgliang opened this issue Dec 22, 2018 · 16 comments
Labels
area/kubelet · area/test · kind/documentation · lifecycle/rotten · priority/important-soon · sig/node

Comments

@wgliang
Contributor

wgliang commented Dec 22, 2018

What happened:
Upgrading Kubernetes from v1.11.3 to v1.12.3 causes containers to restart continuously.

What you expected to happen:
Upgrading Kubernetes from v1.11.3 to v1.12.3 succeeds without containers restarting.

How to reproduce it (as minimally and precisely as possible):
Upgrade Kubernetes from v1.11.3 to v1.12.3 without draining pods from the node.

Anything else we need to know?:

I1222 15:12:14.538192 197768 kuberuntime_container.go:887] Update pause id to 291ec4e9320f0460e703e405ea459e47525df9762d7fb74b2c53b5f42d4a3c6e of container d6b39588cb0b4e38f46cf0c135e52e7862f9af2d47d30662ae9550ec7e0c0316 for pod redis-k6p6z_default(0bbb8d6b-05a8-11e9-8012-0022b2fb1eda)
I1222 15:12:14.538819 197768 kuberuntime_container.go:893] Start exited container d6b39588cb0b4e38f46cf0c135e52e7862f9af2d47d30662ae9550ec7e0c0316 for pod redis-k6p6z_default(0bbb8d6b-05a8-11e9-8012-0022b2fb1eda)
I1222 15:12:14.549372 197768 secret.go:182] Setting up volume default-token-2ccdr for pod 0bbb8d6b-05a8-11e9-8012-0022b2fb1eda at /var/lib/kubelet/pods/0bbb8d6b-05a8-11e9-8012-0022b2fb1eda/volumes/kubernetes.io~secret/default-token-2ccdr
I1222 15:12:14.549431 197768 secret.go:206] Received secret default/default-token-2ccdr containing (3) pieces of data, 1948 total bytes
I1222 15:12:19.939984 197768 setters.go:803] update image disk to 972
I1222 15:12:19.954819 197768 setters.go:810] update test-cluster disk to 1024
I1222 15:12:21.359371 197768 eviction_manager.go:226] eviction manager: synchronize housekeeping
W1222 15:12:21.437428 197768 conversion.go:110] Could not get instant cpu stats: cumulative stats decrease
I1222 15:12:21.440392 197768 helpers.go:836] eviction manager: observations: signal=nodefs.available, available: 65338980Ki, capacity: 104841732Ki, time: 2018-12-22 15:12:21.360524069 +0800 CST m=+6910.379936037
I1222 15:12:21.440416 197768 helpers.go:836] eviction manager: observations: signal=nodefs.inodesFree, available: 103631774, capacity: 104854528, time: 2018-12-22 15:12:21.360524069 +0800 CST m=+6910.379936037
I1222 15:12:21.440424 197768 helpers.go:836] eviction manager: observations: signal=imagefs.available, available: 938628608Ki, capacity: 996140Mi, time: 2018-12-22 15:12:21.360524069 +0800 CST m=+6910.379936037
I1222 15:12:21.440430 197768 helpers.go:836] eviction manager: observations: signal=pid.available, available: 326140, capacity: 320Ki, time: 2018-12-22 15:12:21.439787982 +0800 CST m=+6910.459199938
I1222 15:12:21.440438 197768 helpers.go:836] eviction manager: observations: signal=memory.available, available: 127781348Ki, capacity: 131453720Ki, time: 2018-12-22 15:12:21.360524069 +0800 CST m=+6910.379936037
I1222 15:12:21.440444 197768 helpers.go:836] eviction manager: observations: signal=allocatableMemory.available, available: 120959560Ki, capacity: 120967960Ki, time: 2018-12-22 15:12:21.44031551 +0800 CST m=+6910.459727491
I1222 15:12:21.440471 197768 eviction_manager.go:317] eviction manager: no resources are starved
I1222 15:12:30.031825 197768 setters.go:803] update image disk to 972
I1222 15:12:30.045953 197768 setters.go:810] update test-cluster disk to 1024
I1222 15:12:31.440631 197768 eviction_manager.go:226] eviction manager: synchronize housekeeping
I1222 15:12:31.509827 197768 helpers.go:836] eviction manager: observations: signal=nodefs.inodesFree, available: 103631774, capacity: 104854528, time: 2018-12-22 15:12:31.441748963 +0800 CST m=+6920.461160991
I1222 15:12:31.509848 197768 helpers.go:836] eviction manager: observations: signal=imagefs.available, available: 938628608Ki, capacity: 996140Mi, time: 2018-12-22 15:12:31.441748963 +0800 CST m=+6920.461160991
I1222 15:12:31.509856 197768 helpers.go:836] eviction manager: observations: signal=pid.available, available: 326140, capacity: 320Ki, time: 2018-12-22 15:12:31.509236486 +0800 CST m=+6920.528648434
I1222 15:12:31.509861 197768 helpers.go:836] eviction manager: observations: signal=memory.available, available: 127781376Ki, capacity: 131453720Ki, time: 2018-12-22 15:12:31.441748963 +0800 CST m=+6920.461160991
I1222 15:12:31.509866 197768 helpers.go:836] eviction manager: observations: signal=allocatableMemory.available, available: 120959560Ki, capacity: 120967960Ki, time: 2018-12-22 15:12:31.509770428 +0800 CST m=+6920.529182420
I1222 15:12:31.509871 197768 helpers.go:836] eviction manager: observations: signal=nodefs.available, available: 65338980Ki, capacity: 104841732Ki, time: 2018-12-22 15:12:31.441748963 +0800 CST m=+6920.461160991
I1222 15:12:31.509894 197768 eviction_manager.go:317] eviction manager: no resources are starved
I1222 15:12:40.164200 197768 setters.go:803] update image disk to 972
I1222 15:12:40.175425 197768 setters.go:810] update test-cluster disk to 1024
I1222 15:12:41.510077 197768 eviction_manager.go:226] eviction manager: synchronize housekeeping
I1222 15:12:41.592297 197768 helpers.go:836] eviction manager: observations: signal=memory.available, available: 127781200Ki, capacity: 131453720Ki, time: 2018-12-22 15:12:41.511325515 +0800 CST m=+6930.530737539
I1222 15:12:41.592330 197768 helpers.go:836] eviction manager: observations: signal=allocatableMemory.available, available: 120959664Ki, capacity: 120967960Ki, time: 2018-12-22 15:12:41.592217114 +0800 CST m=+6930.611629061
I1222 15:12:41.592337 197768 helpers.go:836] eviction manager: observations: signal=nodefs.available, available: 65339040Ki, capacity: 104841732Ki, time: 2018-12-22 15:12:41.511325515 +0800 CST m=+6930.530737539
I1222 15:12:41.592343 197768 helpers.go:836] eviction manager: observations: signal=nodefs.inodesFree, available: 103631774, capacity: 104854528, time: 2018-12-22 15:12:41.511325515 +0800 CST m=+6930.530737539
I1222 15:12:41.592349 197768 helpers.go:836] eviction manager: observations: signal=imagefs.available, available: 938628608Ki, capacity: 996140Mi, time: 2018-12-22 15:12:41.511325515 +0800 CST m=+6930.530737539
I1222 15:12:41.592355 197768 helpers.go:836] eviction manager: observations: signal=pid.available, available: 326140, capacity: 320Ki, time: 2018-12-22 15:12:41.591449367 +0800 CST m=+6930.610861319
I1222 15:12:41.592377 197768 eviction_manager.go:317] eviction manager: no resources are starved
I1222 15:12:50.293269 197768 setters.go:803] update image disk to 972
I1222 15:12:50.304977 197768 setters.go:810] update test-cluster disk to 1024
I1222 15:12:51.592583 197768 eviction_manager.go:226] eviction manager: synchronize housekeeping
I1222 15:12:51.662632 197768 helpers.go:836] eviction manager: observations: signal=memory.available, available: 127778408Ki, capacity: 131453720Ki, time: 2018-12-22 15:12:51.59387317 +0800 CST m=+6940.613285106
I1222 15:12:51.662667 197768 helpers.go:836] eviction manager: observations: signal=allocatableMemory.available, available: 120960064Ki, capacity: 120967960Ki, time: 2018-12-22 15:12:51.662565942 +0800 CST m=+6940.681978056
I1222 15:12:51.662674 197768 helpers.go:836] eviction manager: observations: signal=nodefs.available, available: 65339040Ki, capacity: 104841732Ki, time: 2018-12-22 15:12:51.59387317 +0800 CST m=+6940.613285106
I1222 15:12:51.662696 197768 helpers.go:836] eviction manager: observations: signal=nodefs.inodesFree, available: 103631774, capacity: 104854528, time: 2018-12-22 15:12:51.59387317 +0800 CST m=+6940.613285106
I1222 15:12:51.662702 197768 helpers.go:836] eviction manager: observations: signal=imagefs.available, available: 938628608Ki, capacity: 996140Mi, time: 2018-12-22 15:12:51.59387317 +0800 CST m=+6940.613285106
I1222 15:12:51.662707 197768 helpers.go:836] eviction manager: observations: signal=pid.available, available: 326140, capacity: 320Ki, time: 2018-12-22 15:12:51.662038613 +0800 CST m=+6940.681450560
I1222 15:12:51.662730 197768 eviction_manager.go:317] eviction manager: no resources are starved
I1222 15:13:00.375986 197768 setters.go:803] update image disk to 972
I1222 15:13:00.385920 197768 setters.go:810] update test-cluster disk to 1024
I1222 15:13:01.662899 197768 eviction_manager.go:226] eviction manager: synchronize housekeeping
I1222 15:13:01.732555 197768 helpers.go:836] eviction manager: observations: signal=nodefs.available, available: 65339040Ki, capacity: 104841732Ki, time: 2018-12-22 15:13:01.664006156 +0800 CST m=+6950.683418147
I1222 15:13:01.732579 197768 helpers.go:836] eviction manager: observations: signal=nodefs.inodesFree, available: 103631774, capacity: 104854528, time: 2018-12-22 15:13:01.664006156 +0800 CST m=+6950.683418147
I1222 15:13:01.732587 197768 helpers.go:836] eviction manager: observations: signal=imagefs.available, available: 938628608Ki, capacity: 996140Mi, time: 2018-12-22 15:13:01.664006156 +0800 CST m=+6950.683418147
I1222 15:13:01.732592 197768 helpers.go:836] eviction manager: observations: signal=pid.available, available: 326140, capacity: 320Ki, time: 2018-12-22 15:13:01.731921755 +0800 CST m=+6950.751333690
I1222 15:13:01.732598 197768 helpers.go:836] eviction manager: observations: signal=memory.available, available: 127778172Ki, capacity: 131453720Ki, time: 2018-12-22 15:13:01.664006156 +0800 CST m=+6950.683418147
I1222 15:13:01.732603 197768 helpers.go:836] eviction manager: observations: signal=allocatableMemory.available, available: 120960280Ki, capacity: 120967960Ki, time: 2018-12-22 15:13:01.732451214 +0800 CST m=+6950.751863193
I1222 15:13:01.732627 197768 eviction_manager.go:317] eviction manager: no resources are starved
I1222 15:13:10.484452 197768 setters.go:803] update image disk to 972
I1222 15:13:10.496762 197768 setters.go:810] update test-cluster disk to 1024
I1222 15:13:11.732724 197768 eviction_manager.go:226] eviction manager: synchronize housekeeping
I1222 15:13:11.810767 197768 helpers.go:836] eviction manager: observations: signal=memory.available, available: 127778812Ki, capacity: 131453720Ki, time: 2018-12-22 15:13:11.733727057 +0800 CST m=+6960.753139014
I1222 15:13:11.810800 197768 helpers.go:836] eviction manager: observations: signal=allocatableMemory.available, available: 120960276Ki, capacity: 120967960Ki, time: 2018-12-22 15:13:11.810677586 +0800 CST m=+6960.830089588
I1222 15:13:11.810806 197768 helpers.go:836] eviction manager: observations: signal=nodefs.available, available: 65339040Ki, capacity: 104841732Ki, time: 2018-12-22 15:13:11.733727057 +0800 CST m=+6960.753139014
I1222 15:13:11.810812 197768 helpers.go:836] eviction manager: observations: signal=nodefs.inodesFree, available: 103631774, capacity: 104854528, time: 2018-12-22 15:13:11.733727057 +0800 CST m=+6960.753139014
I1222 15:13:11.810818 197768 helpers.go:836] eviction manager: observations: signal=imagefs.available, available: 938628608Ki, capacity: 996140Mi, time: 2018-12-22 15:13:11.733727057 +0800 CST m=+6960.753139014
I1222 15:13:11.810824 197768 helpers.go:836] eviction manager: observations: signal=pid.available, available: 326140, capacity: 320Ki, time: 2018-12-22 15:13:11.810162455 +0800 CST m=+6960.829574420
I1222 15:13:11.810851 197768 eviction_manager.go:317] eviction manager: no resources are starved
I1222 15:13:16.373261 197768 kubelet_pods.go:1327] Generating status for "redis-k6p6z_default(0bbb8d6b-05a8-11e9-8012-0022b2fb1eda)"
I1222 15:13:16.373364 197768 status_manager.go:364] Ignoring same status for pod "redis-k6p6z_default(0bbb8d6b-05a8-11e9-8012-0022b2fb1eda)", status: {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2018-12-22 13:11:32 +0800 CST Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2018-12-22 13:11:35 +0800 CST Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:0001-01-01 00:00:00 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2018-12-22 13:11:32 +0800 CST Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.21.32.4 PodIP:10.24.75.204 StartTime:2018-12-22 13:11:32 +0800 CST InitContainerStatuses:[] ContainerStatuses:[{Name:redis State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2018-12-22 14:36:42 +0800 CST,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:registry.test.com/tools/redis:latest ImageID:docker-pullable://registry.test.com/tools/redis@sha256:1bce822ce3c234b07c4036ead05c464e7972a565b63779c37d0efd25b69d188a ContainerID:docker://d6b39588cb0b4e38f46cf0c135e52e7862f9af2d47d30662ae9550ec7e0c0316}] QOSClass:Burstable}
I1222 15:13:16.373534 197768 volume_manager.go:350] Waiting for volumes to attach and mount for pod "redis-k6p6z_default(0bbb8d6b-05a8-11e9-8012-0022b2fb1eda)"
I1222 15:13:16.373547 197768 volume_manager.go:383] All volumes are attached and mounted for pod "redis-k6p6z_default(0bbb8d6b-05a8-11e9-8012-0022b2fb1eda)"
I1222 15:13:16.373555 197768 kuberuntime_manager.go:437] Container Status &{ID:{Type:docker ID:d6b39588cb0b4e38f46cf0c135e52e7862f9af2d47d30662ae9550ec7e0c0316} Name:redis State:running CreatedAt:2018-12-22 13:11:33.955691733 +0800 CST StartedAt:2018-12-22 14:36:42.845466943 +0800 CST FinishedAt:0001-01-01 00:00:00 +0000 UTC ExitCode:0 Image:registry.test.com/tools/redis:latest ImageID:docker-pullable://registry.test.com/tools/redis@sha256:1bce822ce3c234b07c4036ead05c464e7972a565b63779c37d0efd25b69d188a Hash:1795437233 RestartCount:0 Reason: Message:} and container &Container{Name:redis,Image:registry.test.com/tools/redis:latest,Command:[],Args:[],WorkingDir:,Ports:[],Env:[{test-cluster_K8S_WORKLOAD_NAME aa nil}],Resources:ResourceRequirements{Limits:ResourceList{cpu: {{4 0} {} 4 DecimalSI},memory: {{1073741824 0} {} 1Gi BinarySI},},Requests:ResourceList{cpu: {{400 -3} {} 400m DecimalSI},memory: {{1073741824 0} {} 1Gi BinarySI},pod.alpha.kubernetes.io/opaque-int-resource-test-cluster_DISK: {{10 0} {} 10 DecimalSI},pod.alpha.kubernetes.io/opaque-int-resource-IMAGE_DISK: {{10 0} {} 10 DecimalSI},},},VolumeMounts:[{test-clusterlvm false /docker } {default-token-2ccdr true /var/run/secrets/kubernetes.io/serviceaccount }],LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:Always,SecurityContext:&SecurityContext{Capabilities:&Capabilities{Add:[SYS_ADMIN SYS_PTRACE],Drop:[],},Privileged:*false,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,ProcMount:*Default,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[],TerminationMessagePolicy:File,VolumeDevices:[],}
I1222 15:13:16.373759 197768 kuberuntime_manager.go:564] Container "redis" ({"docker" "d6b39588cb0b4e38f46cf0c135e52e7862f9af2d47d30662ae9550ec7e0c0316"}) of pod redis-k6p6z_default(0bbb8d6b-05a8-11e9-8012-0022b2fb1eda): Container spec hash changed (1795437233 vs 2346900602).. Container will be killed and recreated.
I1222 15:13:16.373767 197768 kuberuntime_manager.go:585] computePodActions got {KillPod:false CreateSandbox:false SandboxID:291ec4e9320f0460e703e405ea459e47525df9762d7fb74b2c53b5f42d4a3c6e Attempt:0 NextInitContainerToStart:nil ContainersToStart:[0] ContainersToKill:map[{Type:docker ID:d6b39588cb0b4e38f46cf0c135e52e7862f9af2d47d30662ae9550ec7e0c0316}:{container:0xc420b57400 name:redis message:Container spec hash changed (1795437233 vs 2346900602).. Container will be killed and recreated.}]} for pod "redis-k6p6z_default(0bbb8d6b-05a8-11e9-8012-0022b2fb1eda)"
I1222 15:13:16.373787 197768 kuberuntime_manager.go:619] Killing unwanted container "redis"(id={"docker" "d6b39588cb0b4e38f46cf0c135e52e7862f9af2d47d30662ae9550ec7e0c0316"}) for pod "redis-k6p6z_default(0bbb8d6b-05a8-11e9-8012-0022b2fb1eda)"
I1222 15:13:16.373796 197768 kuberuntime_container.go:559] Killing container "docker://d6b39588cb0b4e38f46cf0c135e52e7862f9af2d47d30662ae9550ec7e0c0316" with 30 second grace period
W1222 15:13:16.447863 197768 plugin-defaults.go:32] flexVolume driver test-cluster/lvm: using default GetVolumeName for volume test-clusterlvm
I1222 15:13:16.447970 197768 secret.go:182] Setting up volume default-token-2ccdr for pod 0bbb8d6b-05a8-11e9-8012-0022b2fb1eda at /var/lib/kubelet/pods/0bbb8d6b-05a8-11e9-8012-0022b2fb1eda/volumes/kubernetes.io~secret/default-token-2ccdr
I1222 15:13:16.447988 197768 secret.go:206] Received secret default/default-token-2ccdr containing (3) pieces of data, 1948 total bytes
I1222 15:13:16.505560 197768 kuberuntime_container.go:583] Container "docker://d6b39588cb0b4e38f46cf0c135e52e7862f9af2d47d30662ae9550ec7e0c0316" exited normally
I1222 15:13:16.505668 197768 server.go:460] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"redis-k6p6z", UID:"0bbb8d6b-05a8-11e9-8012-0022b2fb1eda", APIVersion:"v1", ResourceVersion:"832313", FieldPath:"spec.containers{redis}"}): type: 'Normal' reason: 'Killing' Killing container with id docker://redis:Container spec hash changed (1795437233 vs 2346900602).. Container will be killed and recreated.
I1222 15:13:16.505736 197768 server.go:460] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"gh-test-cluster-k8s-ep02.gh.test.com", UID:"gh-test-cluster-k8s-ep02.gh.test.com", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MissingClusterDNS' kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.
I1222 15:13:16.505753 197768 server.go:460] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"redis-k6p6z", UID:"0bbb8d6b-05a8-11e9-8012-0022b2fb1eda", APIVersion:"v1", ResourceVersion:"832313", FieldPath:""}): type: 'Warning' reason: 'MissingClusterDNS' pod: "redis-k6p6z_default(0bbb8d6b-05a8-11e9-8012-0022b2fb1eda)". kubelet does not have ClusterDNS IP configured and cannot create Pod using "ClusterFirst" policy. Falling back to "Default" policy.

Environment:
Kubernetes master: v1.12.3
Kubernetes kubelet: v1.12.3
Docker version:
Client:
Version: 1.13.1
API version: 1.26

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 22, 2018
@wgliang
Contributor Author

wgliang commented Dec 22, 2018

This is a serious problem: once you upgrade the kubelet, every container in the cluster is restarted. An API change causing a restart is understandable, but the continuous restarting is not expected.

In my opinion, this may be related to Docker compatibility, similar to comment #64283 (comment).

I think kubernetes-sigs/cri-tools#373 is needed.

@wgliang
Contributor Author

wgliang commented Dec 22, 2018

/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Dec 22, 2018
@wgliang
Contributor Author

wgliang commented Dec 22, 2018

/sig node
/area kubelet
/area test

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. area/kubelet area/test and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 22, 2018
@wgliang wgliang closed this as completed Dec 22, 2018
@vikaschoudhary16
Contributor

What was the problem? @wgliang

@wgliang
Contributor Author

wgliang commented Dec 22, 2018

@vikaschoudhary16
The problem has been solved.

What I want to point out is that the official version-upgrade instructions did not include these details, which caused a very large failure in my cluster and cost us a lot, including lost users and traffic.

The cause of the problem is this: normally, Kubernetes' container restart strategy is RebuildAlways, that is, when a Pod fails (whether the container itself or the physical machine restarts), the kubelet recreates a new container for the Pod. In practice, however, many users' applications write data or configuration into the container's root filesystem, so they expect the previous container to be reused. We therefore added a ReuseAlways strategy for Pods and made it the default restart strategy; the original Always strategy, which rebuilds the container, remains one of the optional strategies. With the ReuseAlways policy, the kubelet first checks whether a corresponding container already exists and, if so, starts that container directly instead of creating a new one.

Here is where the problem lies. With the default rebuild strategy, the container object is recreated, so the hash check after a restart is consistent. But because we reuse the original container, the hash recorded on the old container never matches the hash newly computed by the upgraded kubelet. The kubelet therefore concludes the container spec has changed and kills the container, and since the container is reused again with the same old hash, the mismatch persists and the container we saw kept restarting.
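To make the hash check concrete, here is a minimal, self-contained Go sketch of the decision behind the "Container spec hash changed (1795437233 vs 2346900602)" lines in the log above. The type and function names (containerSpec, specHash, needsRecreate) are illustrative, not the real kubelet identifiers, and the hashing is only a stand-in for the kubelet's spec hashing; the point is just that any change to the serialized container spec, such as a newly added field that the upgraded kubelet defaults (ProcMount shows up as *Default in the 1.12 spec logged above), changes the hash and triggers a kill-and-recreate.

```go
// Illustrative sketch only; not the kubelet's actual implementation.
package main

import (
	"fmt"
	"hash/fnv"
)

// containerSpec stands in for the v1.Container fields that feed the hash.
type containerSpec struct {
	Name      string
	Image     string
	ProcMount string // field a newer kubelet may add and default
}

// specHash hashes the serialized spec: any change to any field changes it.
func specHash(c containerSpec) uint64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%#v", c)
	return h.Sum64()
}

// needsRecreate is the check that produces the "spec hash changed" log line:
// the hash stored on the running container no longer matches the recomputed one.
func needsRecreate(storedHash uint64, current containerSpec) bool {
	return storedHash != specHash(current)
}

func main() {
	// Hash recorded when the container was created by the old kubelet.
	oldSpec := containerSpec{Name: "redis", Image: "registry.test.com/tools/redis:latest"}
	stored := specHash(oldSpec)

	// After the upgrade the spec carries a defaulted field, so the hash differs.
	upgraded := oldSpec
	upgraded.ProcMount = "Default"
	fmt.Println("restart needed:", needsRecreate(stored, upgraded)) // prints: restart needed: true
}
```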

@wgliang wgliang reopened this Dec 22, 2018
@wgliang
Contributor Author

wgliang commented Dec 22, 2018

@kubernetes/sig-release

@liggitt
Member

liggitt commented Dec 22, 2018

Draining pods from a node prior to upgrading minor versions is the tested upgrade procedure. Clarifying that in the documentation is on the list of follow-ups to kubernetes/website#11060.

/sig node
/kind documentation

@k8s-ci-robot k8s-ci-robot added the kind/documentation Categorizes issue or PR as related to documentation. label Dec 22, 2018
@liggitt liggitt added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. labels Dec 22, 2018
@liggitt liggitt changed the title from "Upgrade kubernetes from v1.11.3 to v1.12.3 causes containers to restart continuously" to "Upgrade kubelet from v1.11.3 to v1.12.3 without draining pods causes containers to restart continuously" Dec 22, 2018
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 22, 2018
@wgliang
Contributor Author

wgliang commented Dec 22, 2018

@liggitt Thanks, this documentation and instruction work is very necessary.

@liggitt liggitt removed the kind/bug Categorizes issue or PR as related to a bug. label Dec 23, 2018
@wgliang
Contributor Author

wgliang commented Dec 23, 2018

Also, I don't know whether we can avoid restarting the container when upgrading the kubelet, even if the container object's API has changed, because services running in production environments cannot be allowed to exit unexpectedly.

@cofyc
Member

cofyc commented Dec 24, 2018

Also, I don't know whether we can avoid restarting the container when upgrading the kubelet, even if the container object's API has changed, because services running in production environments cannot be allowed to exit unexpectedly.

Here is an issue tracking this: #63814.

@haoqing0110

@wgliang we met the same issue when upgrading from 1.11.3 to 1.12.3, can you share how you solved it?

@cofyc
Member

cofyc commented Jan 16, 2019

@haoqing0110 If you want to hack around this issue temporarily, you can check this PR: kpaas-io#53. We did this when upgrading from 1.7 to 1.9 in our private clusters, and the idea applies to upgrades between other versions as well. The key point is to keep the container hash the same after the upgrade (described in #63814).
This requires patching Kubernetes and is not officially supported. I suggest draining pods from the node prior to upgrading minor versions if possible.
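For anyone curious what such a hack looks like in spirit, below is a rough Go sketch of the idea, under the assumption that the patch teaches the kubelet to also accept the hash as the previous version would have computed it, so containers started before the upgrade are left running. This is not the code from kpaas-io#53; the names legacySpecHash and hashUnchangedAcrossUpgrade are hypothetical.

```go
// Hypothetical sketch of the "keep the container hash the same across the
// upgrade" idea; not the referenced patch and not officially supported.
package main

import (
	"fmt"
	"hash/fnv"
)

type containerSpec struct {
	Name      string
	Image     string
	ProcMount string // field the newer API version added
}

// currentSpecHash hashes the spec as the new kubelet sees it.
func currentSpecHash(c containerSpec) uint64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%#v", c)
	return h.Sum64()
}

// legacySpecHash zeroes the newly added field before hashing, approximating
// the hash the pre-upgrade kubelet would have stored on the container.
func legacySpecHash(c containerSpec) uint64 {
	c.ProcMount = ""
	return currentSpecHash(c)
}

// hashUnchangedAcrossUpgrade accepts either hash, so a container created
// before the upgrade is not killed just because the hash input changed.
func hashUnchangedAcrossUpgrade(storedHash uint64, c containerSpec) bool {
	return storedHash == currentSpecHash(c) || storedHash == legacySpecHash(c)
}

func main() {
	spec := containerSpec{Name: "redis", Image: "registry.test.com/tools/redis:latest", ProcMount: "Default"}
	stored := legacySpecHash(spec) // hash the old kubelet would have stored
	fmt.Println("keep container:", hashUnchangedAcrossUpgrade(stored, spec)) // prints: keep container: true
}
```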

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 17, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 17, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
