
Virt-handler heartbeat causes the vm to be restarted when kubelet is restarted. #11843

Open
Homura222 opened this issue May 3, 2024 · 14 comments

@Homura222

Homura222 commented May 3, 2024

What happened:
Virt-handler heartbeat causes the vm to be restarted when kubelet is restarted.

The virt-handler device_controller watches the kubelet.sock file, and when kubelet restarts, dpi.initialized is set to false. If virt-handler runs a heartbeat at that moment, the node's kubevirt.io/schedulable label is set to "false". After the restart, kubelet kills pods that do not match the node labels (the virt-launcher pod's nodeSelector is kubevirt.io/schedulable: "true"), so the virt-launcher pod is killed by kubelet and the VM is restarted.

related:
kubernetes/kubernetes#123980
kubernetes/kubernetes#124586
#11662

How to reproduce it (as minimally and precisely as possible):
Restart kubelet every 1s until the kubevirt.io/schedulable label on the node becomes "false". Alternatively, delete the virt-handler pod, set the node's kubevirt.io/schedulable label to "false", and then restart kubelet.

Environment:

  • KubeVirt version (use virtctl version): 0.42.1
  • Kubernetes version (use kubectl version): 1.22
  • VM or VMI specifications: N/A
  • Cloud provider or hardware configuration: N/A
  • OS (e.g. from /etc/os-release): N/A
  • Kernel (e.g. uname -a): N/A
  • Install tools: N/A
  • Others: N/A

How I believe this could be solved:
Fix kubelet so that it does not kill pods that no longer match the node labels when it restarts (kubernetes/kubernetes#124367), or improve the virt-handler heartbeat.
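
For the second option, one possible direction is sketched below. This is purely a hypothetical illustration, not KubeVirt's actual code; schedulableDebouncer, initializedChecker, and the threshold are made up for this example. The idea is to flip kubevirt.io/schedulable to "false" only after the device plugins have been uninitialized for several consecutive heartbeats, so a brief kubelet restart does not change the label.

// heartbeat_debounce_sketch.go (hypothetical, for illustration only)
package main

import "fmt"

type initializedChecker interface {
	Initialized() bool
}

// schedulableDebouncer only reports "false" after the device plugins have been
// uninitialized for `threshold` consecutive heartbeats.
type schedulableDebouncer struct {
	checker             initializedChecker
	uninitializedStreak int
	threshold           int
}

func (d *schedulableDebouncer) value() string {
	if d.checker.Initialized() {
		d.uninitializedStreak = 0
		return "true"
	}
	d.uninitializedStreak++
	if d.uninitializedStreak < d.threshold {
		// Device plugins are (re)starting, e.g. because kubelet restarted;
		// keep the node schedulable for now.
		return "true"
	}
	return "false"
}

// fakeChecker simulates the device manager being uninitialized for two heartbeats.
type fakeChecker struct{ calls int }

func (f *fakeChecker) Initialized() bool { f.calls++; return f.calls > 2 }

func main() {
	d := &schedulableDebouncer{checker: &fakeChecker{}, threshold: 3}
	for i := 0; i < 4; i++ {
		fmt.Printf("heartbeat %d: kubevirt.io/schedulable=%s\n", i, d.value())
	}
}

Whether a fixed threshold, a grace period, or explicit kubelet-restart detection is the right policy is exactly the design question raised here.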

@victortoso
Member

I think it makes sense for virt-handler to be aware that kubelet has been restarted and to avoid disruptions if possible.
I'm not sure, for instance, what the consequences of a kubelet restart are for the device plugins and kubelet's device manager; do you know?
Thanks for the report!

@victortoso
Member

Also, @Homura222, do you see a case not covered by Fabian's #11662 (comment)? That is, a kubelet restart during which we should keep the VM running.

@Homura222
Author

Homura222 commented May 3, 2024

I think it makes sense for virt-handler to be aware that kubelet has been restarted and to avoid disruptions if possible. I'm not sure, for instance, what the consequences of a kubelet restart are for the device plugins and kubelet's device manager; do you know? Thanks for the report!

Take a look at the Start and healthCheck functions in pkg/virt-handler/device-manager/generic_device.go. I believe dpi.initialized is set to false when kubelet restarts.

First, kubelet rebuilds the /var/lib/kubelet/device-plugins directory when it is restarted. Next, the healthCheck function, which watches for changes to the socket file (socketPath is /var/lib/kubelet/device-plugins/kubevirt-${deviceName}.sock), sees the socket removed and returns nil. Then the Start function runs the deferred dpi.Stop(). Finally, dpi.initialized is set to false.

// pkg/virt-handler/device-manager/generic_device.go
func (dpi *GenericDevicePlugin) healthCheck() error {
	...
	dirName = filepath.Dir(dpi.socketPath)
	err = watcher.Add(dirName)

	if err != nil {
		return fmt.Errorf("failed to add the device-plugin kubelet path to the watcher: %v", err)
	}
	// socketPath is /var/lib/kubelet/device-plugins/kubevirt-${deviceName}.sock
	_, err = os.Stat(dpi.socketPath)
	if err != nil {
		return fmt.Errorf("failed to stat the device-plugin socket: %v", err)
	}

	for {
		select {
		case <-dpi.stop:
			return nil
		case err := <-watcher.Errors:
			logger.Reason(err).Errorf("error watching devices and device plugin directory")
		case event := <-watcher.Events:
			logger.V(4).Infof("health Event: %v", event)
			if event.Name == devicePath {
				// Health in this case is if the device path actually exists
				if event.Op == fsnotify.Create {
					logger.Infof("monitored device %s appeared", dpi.deviceName)
					dpi.health <- pluginapi.Healthy
				} else if (event.Op == fsnotify.Remove) || (event.Op == fsnotify.Rename) {
					logger.Infof("monitored device %s disappeared", dpi.deviceName)
					dpi.health <- pluginapi.Unhealthy
				}
			} else if event.Name == dpi.socketPath && event.Op == fsnotify.Remove {
				logger.Infof("device socket file for device %s was removed, kubelet probably restarted.", dpi.deviceName)
				return nil
			}
		}
	}
}
// Start starts the device plugin
func (dpi *GenericDevicePlugin) Start(stop chan struct{}) (err error) {
	...
	defer dpi.Stop()
	...
	go func() {
		errChan <- dpi.healthCheck()
	}()

	dpi.setInitialized(true)
	logger.Infof("%s device plugin started", dpi.deviceName)
	err = <-errChan

	return err
}
// Stop stops the gRPC server
func (dpi *GenericDevicePlugin) Stop() error {
	defer func() {
		if !IsChanClosed(dpi.done) {
			close(dpi.done)
		}
	}()
	dpi.server.Stop()
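	// From this point until the plugin is started again and Start() calls
	// setInitialized(true), GetInitialized() returns false, which is what makes
	// DeviceController.Initialized() below report false to the heartbeat.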
	dpi.setInitialized(false)
	return dpi.cleanup()
}

If virt-handler executes a heartbeat at this time, the node's kubevirt.io/schedulable label will be set to "false".

// pkg/virt-handler/device-manager/device_controller.go
func (c *DeviceController) Initialized() bool {
	c.startedPluginsMutex.Lock()
	defer c.startedPluginsMutex.Unlock()
	for _, dev := range c.startedPlugins {
		if !dev.devicePlugin.GetInitialized() {
			return false
		}
	}

	return true
}

// pkg/virt-handler/heartbeat/heartbeat.go
func (h *HeartBeat) do() {
	...
	kubevirtSchedulable := "true"
	if !h.deviceManagerController.Initialized() {
		kubevirtSchedulable = "false"
	}

	var data []byte
	// Label the node if cpu manager is running on it
	// This is a temporary workaround until k8s bug #66525 is resolved
	cpuManagerEnabled := false
	if h.clusterConfig.CPUManagerEnabled() {
		cpuManagerEnabled = h.isCPUManagerEnabled(h.cpuManagerPaths)
	}
	data = []byte(fmt.Sprintf(`{"metadata": { "labels": {"%s": "%s", "%s": "%t"}, "annotations": {"%s": %s}}}`,
		v1.NodeSchedulable, kubevirtSchedulable,
		v1.CPUManager, cpuManagerEnabled,
		v1.VirtHandlerHeartbeat, string(now),
	))
	_, err = h.clientset.Nodes().Patch(context.Background(), h.host, types.StrategicMergePatchType, data, metav1.PatchOptions{})
	if err != nil {
		log.DefaultLogger().Reason(err).Errorf("Can't patch node %s", h.host)
		return
	}
	...
}

After kubelet restarts, it kills pods that do not match the node labels (the virt-launcher pod's nodeSelector is kubevirt.io/schedulable: "true"). At this point the node's kubevirt.io/schedulable label is "false", so the virt-launcher pod is killed by kubelet and the VM is restarted.

Some issues about kubelet behavior:
kubernetes/kubernetes#123980
kubernetes/kubernetes#124586
/cc @victortoso

@Homura222
Author

Also, @Homura222, do you see a case not covered by Fabian's #11662 (comment)? That is, a kubelet restart during which we should keep the VM running.

I also think the VM should keep running across a kubelet restart.

@Homura222
Author


/cc @victortoso @rmohr

@rmohr
Member

rmohr commented May 6, 2024

I think we should ensure that we are using node affinity with IgnoredDuringExecution and closely follow kubernetes/kubernetes#124586. Once that is fixed and we use node affinity properly, we should be good. Thanks for raising this.

I am not very keen on changing the labeling behaviour as such. It solved quite a few kubelet restart issues for us where pods got rescheduled to the node before the device plugins were ready again, kubelet's admission control rejected them all, and they got rescheduled again (or not, because of run strategy Once).

@rmohr
Member

rmohr commented May 6, 2024

OK, after checking in on kubernetes/kubernetes#124586, I think our path forward would be to request the heartbeat label through an affinity with IgnoredDuringExecution on the pod; that should already give us exactly what we want.
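
For illustration only, here is a sketch of what that could look like if the virt-launcher pod requested the label via node affinity instead of a nodeSelector, using the k8s.io/api/core/v1 types. This is an assumption about the wiring, not the actual virt-controller code; withSchedulableAffinity is a made-up helper.

// affinity_sketch.go (illustration only, not the actual virt-controller code)
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// withSchedulableAffinity requests kubevirt.io/schedulable=true through
// requiredDuringSchedulingIgnoredDuringExecution node affinity rather than a
// nodeSelector.
func withSchedulableAffinity(spec *corev1.PodSpec) {
	spec.Affinity = &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchExpressions: []corev1.NodeSelectorRequirement{{
						Key:      "kubevirt.io/schedulable",
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{"true"},
					}},
				}},
			},
		},
	}
}

func main() {
	spec := &corev1.PodSpec{}
	withSchedulableAffinity(spec)
	fmt.Printf("%+v\n", spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution)
}

Whether IgnoredDuringExecution actually protects an already running virt-launcher pod across a kubelet restart is exactly what the following comments try to pin down.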

@Homura222
Author

Homura222 commented May 7, 2024


OK, after checking in on kubernetes/kubernetes#124586, I think our path forward would be to request the heartbeat label through an affinity with IgnoredDuringExecution on the pod; that should already give us exactly what we want.

@rmohr Do you mean moving kubevirt.io/schedulable: "true" from the pod's nodeSelector to a pod affinity? But after kubelet restarts, kubelet will kill pods whose affinity or nodeSelector does not match the node labels, even if it is an IgnoredDuringExecution affinity. kubernetes/kubernetes#124586 (comment)

@rmohr
Member

rmohr commented May 7, 2024

@rmohr Do you mean moving kubevirt.io/schedulable: "true" from the pod's nodeSelector to a pod affinity? But after kubelet restarts, kubelet will kill pods whose affinity or nodeSelector does not match the node labels, even if it is an IgnoredDuringExecution affinity. kubernetes/kubernetes#124586 (comment)

Well, I did not try it myself, but I asked for clarification, and according to kubernetes/kubernetes#124586 (comment), in the reported case the pod was not yet running and was therefore pushed out after the restart. If it had been running, it would not have been stopped.

@Homura222
Author

Homura222 commented May 7, 2024

@rmohr Do you mean moving kubevirt.io/schedulable: "true" from the pod's nodeSelector to a pod affinity? But after kubelet restarts, kubelet will kill pods whose affinity or nodeSelector does not match the node labels, even if it is an IgnoredDuringExecution affinity. kubernetes/kubernetes#124586 (comment)

Well, I did not try it myself, but I asked for clarification, and according to kubernetes/kubernetes#124586 (comment), in the reported case the pod was not yet running and was therefore pushed out after the restart. If it had been running, it would not have been stopped.

In the reported case, kubelet also kills running pods when it is restarted. kubernetes/kubernetes#123980

kubernetes/kubernetes#124367 could fix this case in kubelet, but it hasn't been approved yet.

@rmohr
Member

rmohr commented May 7, 2024

In the reported case, kubelet also kills running pods when it is restarted. kubernetes/kubernetes#123980

This is definitely happening with nodeSelector. Not sure though about affinity, considering kubernetes/kubernetes#124586 (comment), where it was said that IgnoredDuringExecution works if the pods are already running (and not just scheduled at the point where the kubelet is restarted). If we don't have any new information, it looks to me like someone would have to try it out.

Let me also point out that you did not seem to add affinity-related tests, so I am not sure whether you checked this on your PR.

@Homura222
Author

Homura222 commented May 7, 2024

In the reported case, kubelet also kills running pods when it is restarted. kubernetes/kubernetes#123980

This is definitely happening with nodeSelector. Not sure though about affinity, considering kubernetes/kubernetes#124586 (comment), where it was said that IgnoredDuringExecution works if the pods are already running (and not just scheduled at the point where the kubelet is restarted). If we don't have any new information, it looks to me like someone would have to try it out.

Let me also point out that you did not seem to add affinity-related tests, so I am not sure whether you checked this on your PR.

  1. kubectl label node ${ip} a=b
  2. Create a pod with requiredDuringSchedulingIgnoredDuringExecution nodeAffinity:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  labels:
    app: my-app
spec:
  containers:
  - command:
    - bash
    - -c
    - sleep 10000
    name: my-container
    image: toolkits-centos:1.0.0
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: a
            operator: In
            values:
            - b
  3. Wait for the pod to be Running.
  4. kubectl label node ${ip} a-
  5. systemctl restart kubelet
  6. Watch the pod go into an error state.
[screenshot: pod goes into an error state]

pod event:
[screenshot: pod events]

@rmohr you can try it

@Homura222
Author

Homura222 commented May 9, 2024

I repeated the test above, this time using preferredDuringSchedulingIgnoredDuringExecution nodeAffinity:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  labels:
    app: my-app
spec:
  containers:
  - command:
    - bash
    - -c
    - sleep 10000
    name: my-container
    image: toolkits-centos:1.0.0.amd64
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: a
            operator: In
            values:
            - b

With preferredDuringSchedulingIgnoredDuringExecution nodeAffinity, the pod is not killed by kubelet in the reported case. I think we can move kubevirt.io/schedulable: "true" from the pod's nodeSelector to a preferredDuringSchedulingIgnoredDuringExecution nodeAffinity to fix this.

But with preferredDuringSchedulingIgnoredDuringExecution, if a matching node is not available, the scheduler still schedules the Pod, which may not be what we want. https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/
/cc @rmohr @victortoso

@rmohr
Member

rmohr commented May 9, 2024

I think going back to kubernetes/kubernetes#124586 (comment) (as you did) makes sense. Let's see what is said there. requiredDuringSchedulingIgnoredDuringExecution should, in my opinion, not lead to evictions, but as you showed, it seems to.
