New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CI: describe pod on k8s-create-pod wait failure #9557
base: main
Are you sure you want to change the base?
CI: describe pod on k8s-create-pod wait failure #9557
Conversation
64ca12a
to
7c35fcb
Compare
f5b73ab
to
4b1b25d
Compare
Hi @portersrc Thx for your contribution. |
Hi @portersrc , do you need this PR merged or using just to debug the errors? |
@wainersm Both, I think. We need to get SNP CI working, and to do that, we need a little more debug info. But the fix is possibly some nydus cleanup, which we've been putting in this PR as well. When we figure this out, then if we want two separate PRs (one for debug info, one for nydus cleanup (assuming that's the issue)), I'm OK closing this and making two PRs out of it. |
4b1b25d
to
7c2ba77
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, take a look at the nydus snapshotter's cleanup script.
for i in `sudo ctr -n k8s.io snapshot --snapshotter nydus list | grep -v KEY | cut -d' ' -f1 || true`; do | ||
sudo ctr -n k8s.io snapshot --snapshotter nydus rm $i | ||
done | ||
sudo rm -rf "/var/lib/containerd-nydus" || true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be done by the nydus-snapshotter.
Any specific reason to enforce it here?
In fact, please, take a look at: https://github.com/containerd/nydus-snapshotter/blob/ce4848e4439b79723b0635e994b5d9991aa88819/misc/snapshotter/snapshotter.sh#L195
So, isn't the snapshotter doing exactly the same thing on its own side?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these guys ran into the Error: failed to create containerd container: create instance 6: object with key "6" already exists: unknown
that you also saw on the TDX node. Did you end up doing anything to fix that besides local cleanup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regardless of the error they hit, this should be fixed on the nydus snapshotter side, not on the Kata Containers side.
I didn't hit the issue on TDX, but that could happen there as well. I hit the issue on a local VM when trying the snapshotter before the release. That cleanup was enough to get me moving from the error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wainersm pointed out in the slack thread a snippet to clean up the content
, that's not relevant for the error that Tobin mentioned to be happening, but rather to remove the image from overlayfs snapshotter and allowing to pull the image with nydus, which is an approach we do not want to take here, as that's just papering over real issues.
Currently attempting to debug a nydus pod launch, and this should give more insight. Fixes: kata-containers#9378 Signed-off-by: Chris Porter <porter@ibm.com>
This provides both ctr cleaning commands and /var/lib cleanup Signed-off-by: Chris Porter <porter@ibm.com>
7c2ba77
to
5691758
Compare
Currently attempting to debug a nydus pod launch, and this should give more insight.
Temporary PR without an issue yet. NOT a full fix for #9378, though we need to debug this to do that kind of SNP work.