firecracker-containerd in-VM agent is not responsive after snapshot creation when using vanilla firecracker setup #818

Open
CuriousGeorgiy opened this issue Sep 12, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@CuriousGeorgiy
Member

CuriousGeorgiy commented Sep 12, 2023

Describe the bug
After a VM snapshot is created, the firecracker-containerd in-VM agent stops responding:

  • an attempt to kill the running task using containerd API hangs;
  • an attempt to stop the VM using firecracker-containerd API returns an error stating that the VM was forcefully terminated.
CuriousGeorgiy added the bug label on Sep 12, 2023
@ustiugov
Member

Please provide details and comment on the severity of the issue.

@CuriousGeorgiy
Member Author

CuriousGeorgiy commented Sep 14, 2023

Clarification

Even though firecracker-containerd's in-VM agent does not respond, the container running inside the VM works fine and is responsive.

Severity

Since firecracker-containerd's in-VM agent does not respond, forceful termination of the VM is required. As a consequence, some containerd resources are leaked (the container and the container snapshot, at least).

AFAICT, cleaning up the resources requires the regular cleanup procedure to succeed (killing the task, deleting the task, and deleting the container), because these objects hold the resources:

vHive/ctriface/iface.go

Lines 246 to 265 in 899893a

task := *vm.Task
if err := task.Kill(ctx, syscall.SIGKILL); err != nil {
	logger.WithError(err).Error("Failed to kill the task")
	return err
}

<-vm.TaskCh
//FIXME: Seems like some tasks need some extra time to die Issue#15, lr_training
time.Sleep(500 * time.Millisecond)

if _, err := task.Delete(ctx); err != nil {
	logger.WithError(err).Error("failed to delete task")
	return err
}

container := *vm.Container
if err := container.Delete(ctx, containerd.WithSnapshotCleanup); err != nil {
	logger.WithError(err).Error("failed to delete container")
	return err
}

I am not aware of a procedure to forcefully delete the task and the container (as the procedure described above hangs on the first step).

The reset procedure via scripts/clean_fcctr.sh works fine, but for the duration of a vHive session, the resources associated with containers disposed of by the orchestrator will leak.

@CuriousGeorgiy
Member Author

CuriousGeorgiy commented Sep 15, 2023

@ustiugov this may be related to firecracker-microvm/firecracker#4099.
