firecracker-containerd in-VM agent is not responsive after snapshot creation when using vanilla firecracker setup #818

CuriousGeorgiy · 2023-09-12T15:10:04Z

Describe the bug
After a VM snapshot is created, the firecracker-containerd in-VM agent stops responding:

an attempt to kill the running task using containerd API hangs;
an attempt to stop the VM using firecracker-containerd API returns a forcefully terminated VM error.

ustiugov · 2023-09-14T12:56:48Z

Please provide details and comment on the severity of the issue.

CuriousGeorgiy · 2023-09-14T18:28:27Z

Clarification

Even though firecracker-containerd's in-VM agent does not respond, the container running inside the VM works fine and is responsive.

Severity

Since firecracker-containerd's in-VM agent does not respond, forceful termination of the VM is required. As a consequence, some containerd resources are leaked (the container and the container snapshot, at least).

AFAICT, cleaning up the resources requires the regular cleanup procedure to succeed, i.e., killing the task, deleting the task, and deleting the container, since the resources are held by them:

vHive/ctriface/iface.go

Lines 246 to 265 in 899893a

    
    task := *vm.Task
    if err := task.Kill(ctx, syscall.SIGKILL); err != nil {
        logger.WithError(err).Error("Failed to kill the task")
        return err
    }

    <-vm.TaskCh
    //FIXME: Seems like some tasks need some extra time to die Issue#15, lr_training
    time.Sleep(500 * time.Millisecond)

    if _, err := task.Delete(ctx); err != nil {
        logger.WithError(err).Error("failed to delete task")
        return err
    }

    container := *vm.Container
    if err := container.Delete(ctx, containerd.WithSnapshotCleanup); err != nil {
        logger.WithError(err).Error("failed to delete container")
        return err
    }

I am not aware of a procedure to forcefully delete the task and the container (as the procedure described above hangs on the first step).

The reset procedure via scripts/clean_fcctr.sh works fine, but for the duration of a vHive session, the resources associated with containers disposed of by the orchestrator will leak.

CuriousGeorgiy · 2023-09-15T14:12:01Z

@ustiugov this may be related to firecracker-microvm/firecracker#4099.

CuriousGeorgiy added the bug label on Sep 12, 2023

This was referenced Sep 14, 2023

Move to vanilla firecracker snapshots #816

Merged

[Bug] Processes get stuck after resuming VM from snapshot firecracker-microvm/firecracker#4099

Open
