
Virt-handler heartbeat causes the vm to be restarted when kubelet is restarted. #11843

Open
Homura222 opened this issue May 3, 2024 · 14 comments

@Homura222

Homura222 commented May 3, 2024

What happened:
Virt-handler heartbeat causes the vm to be restarted when kubelet is restarted.

The virt-handler device_controller watches the kubelet.sock file, and when kubelet restarts, dpi.initialized is set to false. If virt-handler runs a heartbeat at that moment, the node's kubevirt.io/schedulable label is set to "false". After the restart, kubelet kills pods that do not match the node labels (the virt-launcher pod's nodeSelector is kubevirt.io/schedulable: "true"), so the virt-launcher pod is killed by kubelet and the VM is restarted.

related:
kubernetes/kubernetes#123980
kubernetes/kubernetes#124586
#11662

How to reproduce it (as minimally and precisely as possible):
Restart kubelet every 1s until the kubevirt.io/schedulable label on the node becomes "false". Alternatively, delete the virt-handler pod, set the node's kubevirt.io/schedulable label to "false", and then restart kubelet.

Environment:

  • KubeVirt version (use virtctl version): 0.42.1
  • Kubernetes version (use kubectl version): 1.22
  • VM or VMI specifications: N/A
  • Cloud provider or hardware configuration: N/A
  • OS (e.g. from /etc/os-release): N/A
  • Kernel (e.g. uname -a): N/A
  • Install tools: N/A
  • Others: N/A

How I believe this could be solved:
Fix kubelet so that it does not kill pods that no longer match the node labels when it restarts (kubernetes/kubernetes#124367), or improve the virt-handler heartbeat.
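
For the second option, one possible direction is sketched below. This is purely a hypothetical illustration, not KubeVirt's actual code; schedulableDebouncer, initializedChecker, and the threshold are made up for this example. The idea is to flip kubevirt.io/schedulable to "false" only after the device plugins have been uninitialized for several consecutive heartbeats, so a brief kubelet restart does not change the label.

// heartbeat_debounce_sketch.go (hypothetical, for illustration only)
package main

import "fmt"

type initializedChecker interface {
	Initialized() bool
}

// schedulableDebouncer only reports "false" after the device plugins have been
// uninitialized for `threshold` consecutive heartbeats.
type schedulableDebouncer struct {
	checker             initializedChecker
	uninitializedStreak int
	threshold           int
}

func (d *schedulableDebouncer) value() string {
	if d.checker.Initialized() {
		d.uninitializedStreak = 0
		return "true"
	}
	d.uninitializedStreak++
	if d.uninitializedStreak < d.threshold {
		// Device plugins are (re)starting, e.g. because kubelet restarted;
		// keep the node schedulable for now.
		return "true"
	}
	return "false"
}

// fakeChecker simulates the device manager being uninitialized for two heartbeats.
type fakeChecker struct{ calls int }

func (f *fakeChecker) Initialized() bool { f.calls++; return f.calls > 2 }

func main() {
	d := &schedulableDebouncer{checker: &fakeChecker{}, threshold: 3}
	for i := 0; i < 4; i++ {
		fmt.Printf("heartbeat %d: kubevirt.io/schedulable=%s\n", i, d.value())
	}
}

Whether a fixed threshold, a grace period, or explicit kubelet-restart detection is the right policy is exactly the design question raised here.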

@victortoso
Member

I think it makes sense for virt-handler to be aware that kubelet has been restarted and to avoid disruptions if possible.
I'm not sure, for instance, what the consequences of a kubelet restart are for the device plugins and kubelet's device manager; do you know?
Thanks for the report!

@victortoso
Member

Also, @Homura222, do you see a case not covered by Fabian's #11662 (comment)? That is, a kubelet restart during which we should keep the VM running.

@Homura222
Author

Homura222 commented May 3, 2024

I think it makes sense for virt-handler to be aware that kubelet has been restarted and to avoid disruptions if possible. I'm not sure, for instance, what the consequences of a kubelet restart are for the device plugins and kubelet's device manager; do you know? Thanks for the report!

Take a look at the Start and healthCheck functions in pkg/virt-handler/device-manager/generic_device.go. I believe dpi.initialized is set to false when kubelet restarts.

First, kubelet rebuilds the /var/lib/kubelet/device-plugins directory when it is restarted. Next, the healthCheck function, which watches for changes to the socket file (socketPath is /var/lib/kubelet/device-plugins/kubevirt-${deviceName}.sock), sees the socket removed and returns nil. Then the Start function runs the deferred dpi.Stop(). Finally, dpi.initialized is set to false.

// pkg/virt-handler/device-manager/generic_device.go
func (dpi *GenericDevicePlugin) healthCheck() error {
	...
	dirName = filepath.Dir(dpi.socketPath)
	err = watcher.Add(dirName)

	if err != nil {
		return fmt.Errorf("failed to add the device-plugin kubelet path to the watcher: %v", err)
	}
	// socketPath is /var/lib/kubelet/device-plugins/kubevirt-${deviceName}.sock
	_, err = os.Stat(dpi.socketPath)
	if err != nil {
		return fmt.Errorf("failed to stat the device-plugin socket: %v", err)
	}

	for {
		select {
		case <-dpi.stop:
			return nil
		case err := <-watcher.Errors:
			logger.Reason(err).Errorf("error watching devices and device plugin directory")
		case event := <-watcher.Events:
			logger.V(4).Infof("health Event: %v", event)
			if event.Name == devicePath {
				// Health in this case is if the device path actually exists
				if event.Op == fsnotify.Create {
					logger.Infof("monitored device %s appeared", dpi.deviceName)
					dpi.health <- pluginapi.Healthy
				} else if (event.Op == fsnotify.Remove) || (event.Op == fsnotify.Rename) {
					logger.Infof("monitored device %s disappeared", dpi.deviceName)
					dpi.health <- pluginapi.Unhealthy
				}
			} else if event.Name == dpi.socketPath && event.Op == fsnotify.Remove {
				logger.Infof("device socket file for device %s was removed, kubelet probably restarted.", dpi.deviceName)
				return nil
			}
		}
	}
}
// Start starts the device plugin
func (dpi *GenericDevicePlugin) Start(stop chan struct{}) (err error) {
	...
	defer dpi.Stop()
	...
	go func() {
		errChan <- dpi.healthCheck()
	}()

	dpi.setInitialized(true)
	logger.Infof("%s device plugin started", dpi.deviceName)
	err = <-errChan

	return err
}
// Stop stops the gRPC server
func (dpi *GenericDevicePlugin) Stop() error {
	defer func() {
		if !IsChanClosed(dpi.done) {
			close(dpi.done)
		}
	}()
	dpi.server.Stop()
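	// From this point until the plugin is started again and Start() calls
	// setInitialized(true), GetInitialized() returns false, which is what makes
	// DeviceController.Initialized() below report false to the heartbeat.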
	dpi.setInitialized(false)
	return dpi.cleanup()
}

If virt-handler executes a heartbeat at this time, the node's kubevirt.io/schedulable label will be set to "false".

// pkg/virt-handler/device-manager/device_controller.go
func (c *DeviceController) Initialized() bool {
	c.startedPluginsMutex.Lock()
	defer c.startedPluginsMutex.Unlock()
	for _, dev := range c.startedPlugins {
		if !dev.devicePlugin.GetInitialized() {
			return false
		}
	}

	return true
}

// pkg/virt-handler/heartbeat/heartbeat.go
func (h *HeartBeat) do() {
	...
	kubevirtSchedulable := "true"
	if !h.deviceManagerController.Initialized() {
		kubevirtSchedulable = "false"
	}

	var data []byte
	// Label the node if cpu manager is running on it
	// This is a temporary workaround until k8s bug #66525 is resolved
	cpuManagerEnabled := false
	if h.clusterConfig.CPUManagerEnabled() {
		cpuManagerEnabled = h.isCPUManagerEnabled(h.cpuManagerPaths)
	}
	data = []byte(fmt.Sprintf(`{"metadata": { "labels": {"%s": "%s", "%s": "%t"}, "annotations": {"%s": %s}}}`,
		v1.NodeSchedulable, kubevirtSchedulable,
		v1.CPUManager, cpuManagerEnabled,
		v1.VirtHandlerHeartbeat, string(now),
	))
	_, err = h.clientset.Nodes().Patch(context.Background(), h.host, types.StrategicMergePatchType, data, metav1.PatchOptions{})
	if err != nil {
		log.DefaultLogger().Reason(err).Errorf("Can't patch node %s", h.host)
		return
	}
	...
}

After kubelet restarts, it kills pods that do not match the node labels (the virt-launcher pod's nodeSelector is kubevirt.io/schedulable: "true"). At this point the node's kubevirt.io/schedulable label is "false", so the virt-launcher pod is killed by kubelet and the VM is restarted.

Some issues about kubelet behavior:
kubernetes/kubernetes#123980
kubernetes/kubernetes#124586
/cc @victortoso

@Homura222
Author

Also, @Homura222, do you see a case not covered by Fabian's #11662 (comment)? That is, a kubelet restart during which we should keep the VM running.

I also think the VM should keep running across a kubelet restart.

@Homura222
Author


/cc @victortoso @rmohr

@rmohr
Member

rmohr commented May 6, 2024

I think we should ensure that we are using node affinity with IgnoredDuringExecution and closely follow kubernetes/kubernetes#124586. Once that is fixed and we use node affinity properly, we should be good. Thanks for raising this.

I am not very keen on changing the labeling behaviour as such. It solved quite a few kubelet restart issues for us where pods got rescheduled to the node before the device plugins were ready again, kubelet's admission control rejected them all, and they got rescheduled again (or not, because of run strategy Once).

@rmohr
Member

rmohr commented May 6, 2024

OK, after checking in on kubernetes/kubernetes#124586, I think our path forward would be to request the heartbeat label through an affinity with IgnoredDuringExecution on the pod; that should already give us exactly what we want.
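
For illustration only, here is a sketch of what that could look like if the virt-launcher pod requested the label via node affinity instead of a nodeSelector, using the k8s.io/api/core/v1 types. This is an assumption about the wiring, not the actual virt-controller code; withSchedulableAffinity is a made-up helper.

// affinity_sketch.go (illustration only, not the actual virt-controller code)
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// withSchedulableAffinity requests kubevirt.io/schedulable=true through
// requiredDuringSchedulingIgnoredDuringExecution node affinity rather than a
// nodeSelector.
func withSchedulableAffinity(spec *corev1.PodSpec) {
	spec.Affinity = &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchExpressions: []corev1.NodeSelectorRequirement{{
						Key:      "kubevirt.io/schedulable",
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{"true"},
					}},
				}},
			},
		},
	}
}

func main() {
	spec := &corev1.PodSpec{}
	withSchedulableAffinity(spec)
	fmt.Printf("%+v\n", spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution)
}

Whether IgnoredDuringExecution actually protects an already running virt-launcher pod across a kubelet restart is exactly what the following comments try to pin down.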

@Homura222
Author

Homura222 commented May 7, 2024


OK, after checking in on kubernetes/kubernetes#124586, I think our path forward would be to request the heartbeat label through an affinity with IgnoredDuringExecution on the pod; that should already give us exactly what we want.

@rmohr Do you mean moving kubevirt.io/schedulable: "true" from the pod's nodeSelector to a pod affinity? But after kubelet restarts, kubelet will kill pods whose affinity or nodeSelector does not match the node labels, even if it is an IgnoredDuringExecution affinity. kubernetes/kubernetes#124586 (comment)

@rmohr
Member

rmohr commented May 7, 2024

@rmohr Do you mean moving kubevirt.io/schedulable: "true" from the pod's nodeSelector to a pod affinity? But after kubelet restarts, kubelet will kill pods whose affinity or nodeSelector does not match the node labels, even if it is an IgnoredDuringExecution affinity. kubernetes/kubernetes#124586 (comment)

Well, I did not try it myself, but I asked for clarification, and according to kubernetes/kubernetes#124586 (comment), in the reported case the pod was not yet running and was therefore pushed out after the restart. If it had been running, it would not have been stopped.

@Homura222
Author

Homura222 commented May 7, 2024

@rmohr Do you mean moving kubevirt.io/schedulable: "true" from the pod's nodeSelector to a pod affinity? But after kubelet restarts, kubelet will kill pods whose affinity or nodeSelector does not match the node labels, even if it is an IgnoredDuringExecution affinity. kubernetes/kubernetes#124586 (comment)

Well, I did not try it myself, but I asked for clarification, and according to kubernetes/kubernetes#124586 (comment), in the reported case the pod was not yet running and was therefore pushed out after the restart. If it had been running, it would not have been stopped.

In the reported case, kubelet also kills running pods when it is restarted. kubernetes/kubernetes#123980

kubernetes/kubernetes#124367 could fix this case in kubelet, but it hasn't been approved yet.

@rmohr
Member

rmohr commented May 7, 2024

In the reported case, kubelet also kills running pods when it is restarted. kubernetes/kubernetes#123980

This is definitely happening with nodeSelector. Not sure though about affinity, considering kubernetes/kubernetes#124586 (comment), where it was said that IgnoredDuringExecution works if the pods are already running (and not just scheduled at the point where the kubelet is restarted). If we don't have any new information, it looks to me like someone would have to try it out.

Let me also point out that you did not seem to add affinity-related tests, so I am not sure whether you checked this on your PR.

@Homura222
Author

Homura222 commented May 7, 2024

In the reported case, kubelet also kills running pods when it is restarted. kubernetes/kubernetes#123980

This is definitely happening with nodeSelector. Not sure though about affinity, considering kubernetes/kubernetes#124586 (comment), where it was said that IgnoredDuringExecution works if the pods are already running (and not just scheduled at the point where the kubelet is restarted). If we don't have any new information, it looks to me like someone would have to try it out.

Let me also point out that you did not seem to add affinity-related tests, so I am not sure whether you checked this on your PR.

  1. kubectl label node ${ip} a=b
  2. Create a pod with requiredDuringSchedulingIgnoredDuringExecution nodeAffinity:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  labels:
    app: my-app
spec:
  containers:
  - command:
    - bash
    - -c
    - sleep 10000
    name: my-container
    image: toolkits-centos:1.0.0
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: a
            operator: In
            values:
            - b
  3. Wait for the pod to be Running.
  4. kubectl label node ${ip} a-
  5. systemctl restart kubelet
  6. Watch the pod go into an error state.
[screenshot: pod goes into an error state]

pod event:
[screenshot: pod events]

@rmohr you can try it

@Homura222
Author

Homura222 commented May 9, 2024

I repeated the test above, this time using preferredDuringSchedulingIgnoredDuringExecution nodeAffinity:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  labels:
    app: my-app
spec:
  containers:
  - command:
    - bash
    - -c
    - sleep 10000
    name: my-container
    image: toolkits-centos:1.0.0.amd64
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: a
            operator: In
            values:
            - b

With preferredDuringSchedulingIgnoredDuringExecution nodeAffinity, the pod is not killed by kubelet in the reported case. I think we can move kubevirt.io/schedulable: "true" from the pod's nodeSelector to a preferredDuringSchedulingIgnoredDuringExecution nodeAffinity to fix this.

But with preferredDuringSchedulingIgnoredDuringExecution, if a matching node is not available, the scheduler still schedules the Pod, which may not be what we want. https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/
/cc @rmohr @victortoso

@rmohr
Member

rmohr commented May 9, 2024

I think going back to kubernetes/kubernetes#124586 (comment) (as you did) makes sense. Let's see what is said there. requiredDuringSchedulingIgnoredDuringExecution should, in my opinion, not lead to evictions, but as you showed, it seems to.
