Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: describe pod on k8s-create-pod wait failure #9557

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

portersrc
Copy link
Contributor

Currently attempting to debug a nydus pod launch, and this should give more insight.

Temporary PR without an issue yet. NOT a full fix for #9378, though we need to debug this to do that kind of SNP work.

@katacontainersbot katacontainersbot added the size/tiny Smallest and simplest task label Apr 25, 2024
@portersrc portersrc marked this pull request as draft April 25, 2024 14:26
@portersrc portersrc marked this pull request as ready for review April 25, 2024 20:02
@Apokleos
Copy link
Contributor

Hi @portersrc Thx for your contribution.
the required CI reports error message "error_msg: See the document below for help on formatting commits for the project." I checked your commitCI: extra cleanup step for nydus myabe, you should add commit message to pass this CI.

@portersrc portersrc added the do-not-merge PR has problems or depends on another label Apr 26, 2024
@portersrc
Copy link
Contributor Author

Hey @Apokleos, thanks for catching this! I'll correct the commit message as you describe when we finalize it. In fact, We're not 100% sure on this second commit now anyway. Part of that discussion will probably be tracked in a slack thread here.

@wainersm
Copy link
Contributor

Hi @portersrc , do you need this PR merged or using just to debug the errors?

@portersrc
Copy link
Contributor Author

@wainersm Both, I think. We need to get SNP CI working, and to do that, we need a little more debug info. But the fix is possibly some nydus cleanup, which we've been putting in this PR as well. When we figure this out, then if we want two separate PRs (one for debug info, one for nydus cleanup (assuming that's the issue)), I'm OK closing this and making two PRs out of it.

Copy link
Member

@fidencio fidencio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, take a look at the nydus snapshotter's cleanup script.

for i in `sudo ctr -n k8s.io snapshot --snapshotter nydus list | grep -v KEY | cut -d' ' -f1 || true`; do
sudo ctr -n k8s.io snapshot --snapshotter nydus rm $i
done
sudo rm -rf "/var/lib/containerd-nydus" || true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be done by the nydus-snapshotter.
Any specific reason to enforce it here?

In fact, please, take a look at: https://github.com/containerd/nydus-snapshotter/blob/ce4848e4439b79723b0635e994b5d9991aa88819/misc/snapshotter/snapshotter.sh#L195

So, isn't the snapshotter doing exactly the same thing on its own side?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these guys ran into the Error: failed to create containerd container: create instance 6: object with key "6" already exists: unknown that you also saw on the TDX node. Did you end up doing anything to fix that besides local cleanup?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regardless of the error they hit, this should be fixed on the nydus snapshotter side, not on the Kata Containers side.

I didn't hit the issue on TDX, but that could happen there as well. I hit the issue on a local VM when trying the snapshotter before the release. That cleanup was enough to get me moving from the error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wainersm pointed out in the slack thread a snippet to clean up the content, that's not relevant for the error that Tobin mentioned to be happening, but rather to remove the image from overlayfs snapshotter and allowing to pull the image with nydus, which is an approach we do not want to take here, as that's just papering over real issues.

@danmihai1 danmihai1 self-requested a review April 30, 2024 19:48
Currently attempting to debug a nydus pod launch, and this
should give more insight.

Fixes: kata-containers#9378
Signed-off-by: Chris Porter <porter@ibm.com>
This provides both ctr cleaning commands and /var/lib cleanup

Signed-off-by: Chris Porter <porter@ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge PR has problems or depends on another ok-to-test size/tiny Smallest and simplest task
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

None yet

9 participants