RFE: cgroup delegation #7623
@giuseppe this is the case for me. Are you sure the annotation is properly configured in CRI-O so that it takes effect on that pod? Here is the output, working just fine for me:
This works for dind, etc. The pod I'm using is very simple:
But look at the annotations. You need to configure crio to match that, for example I have in
This means pods with the "crio-workload-userns" annotation are allowed to use the hierarchy-rw annotation too. I'm using crio 1.28:
@haircommander but that isn't needed with
Thank you @rata for showing that example Pod, that certainly helps. I changed it a bit to say
to see if the Pod runs in the user namespace and if the I tested on OpenShift, namely
In its
I had to add I then added
instead of that When I then uncommented that
things started to work:
However, I also noticed that when I add I guess my questions are:
I'm not a Red Hat employee, so I don't know whether they have more information about this. With only the information in this public issue, it is hard to tell what you mean on several points. Re 3: I don't know whether you want to push for that annotation or not. It really depends on what you want to do.
Yes, this is expected, as the feature is still in alpha and OpenShift doesn't enable alpha features.
CRI-O mounts a privileged container's cgroup hierarchy differently, so this could be expected. Like @rata, I'm not clear on what you're seeing.
Technically speaking, (note: if you do this, it technically makes the node unsupported as an official OpenShift product, though OKD doesn't have support anyway. I'm mainly saying this for anyone who comes across this on the internet and does want support.)
Correct, I didn't read it fully, my bad 😅
Ah, sorry for not being precise. With just
I get
When I add
When you use privileged, the container actually gets the host's cgroup mount rather than its own. For instance, the first container, which just has SYS_ADMIN, sees a view of the container's cgroup (or a child of the root). The second container should be looking at the host's cgroups.
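A minimal sketch for comparing what each container actually got, assuming a POSIX shell with awk: it lists the cgroup2 entries of /proc/self/mountinfo, where field 4 is the root of the mount within the hierarchy, field 5 the mount point and field 6 the per-mount options (rw or ro). The function name cgroup2_mounts is hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: list cgroup2 mounts from a mountinfo file
# (default: /proc/self/mountinfo).
# Output per mount: <mount point> <root within the hierarchy> <mount options>
cgroup2_mounts() {
    awk '
        {
            # Optional fields (field 7 onward) end with a lone "-";
            # the filesystem type is the field right after it.
            for (i = 7; i <= NF; i++)
                if ($i == "-")
                    break
            if ($(i + 1) == "cgroup2")
                print $5, $4, $6
        }
    ' "${1:-/proc/self/mountinfo}"
}
```

Running cgroup2_mounts in the SYS_ADMIN-only container and in the privileged one and comparing the output shows whether both got the same view of the hierarchy and whether it is mounted read-write.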
Do we know the timeline for the feature getting out of alpha and thus potentially making it into OpenShift?
Is there a way to make the setup for privileged containers similar / the same as for unprivileged ones? At this point it seems unprivileged containers actually have better functionality than the privileged ones (but that might be specific to OpenShift).
The
Well what I wonder about is the general plan for OpenShift. Is it going to eventually support The reason I'm piggybacking on that
Right. I'm obviously not looking for a supported setup at this point, but I'd like to keep hacking in the general direction that OpenShift will eventually evolve in.
I am hoping we can move it out of alpha in 1.30, which would target OpenShift 4.17 🤞
Uh, I think this is a precedent Docker set years ago, so it may be tricky to change. We could consider an annotation
Yeah. The one piece we need to figure out is the right way to support rw cgroups. There's currently no upstream proposal for how to fix that, though we're considering introducing a "sysMountType" similar to "procMountType" to achieve this.
Right. And the problem is that processes running in that container are already user-namespaced, so root in the container cannot create new cgroups itself: it cannot write to the cgroups owned by root on the host. @giuseppe, should we update this issue to make it clear we are after the solution for privileged containers?
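A quick way to check this from inside the container is to resolve the container's own cgroup directory from /proc/self/cgroup (the same substitution as the reproducer in the issue description) and print its numeric owner. This sketch assumes cgroup v2 and a stat that supports -c (GNU coreutils or BusyBox); the helper names are hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: turn the cgroup v2 entry ("0::<path>") of a
# /proc/<pid>/cgroup file into the matching directory under /sys/fs/cgroup.
# Lines for cgroup v1 controllers are ignored.
cgroup_dir() {
    sed -n -e 's|^0::|/sys/fs/cgroup|p' "${1:-/proc/self/cgroup}"
}

# Hypothetical helper: print the numeric owner of a path as seen from the
# current user namespace. An owner that is not mapped into the namespace
# is reported as the overflow UID (usually 65534) rather than 0.
owner_uid() {
    stat -c '%u' "$1"
}
```

Inside the pod, owner_uid "$(cgroup_dir)" printing 0 would mean the cgroup is owned by the container's root; the overflow UID means it still belongs to root on the host.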
I would consider privileged containers with hostUsers == false a bit of a strange case; I would expect privileged containers to usually have hostUsers == true. Is there something privileged containers can do that vanilla ones can't? I think we'll want a good reason to customize the behavior for privileged + !hostUsers.
I admit I'm not fluent in all the things a privileged container does differently, or whether we should be able to emulate or configure them accordingly. The use case I'm after is running systemd in Podman inside a user-namespaced Pod, so that I can run Kind in a Pod with that Podman. That makes it very easy to test, for example, ACM (Advanced Cluster Management for Kubernetes), because creating another cluster takes just minutes. In containers/podman#21008 (comment) we see some mount / FUSE related failures from a
gets replaced with
I think we should pursue this without the privileged flag, but with as many capabilities as needed. Privileged is a very heavy hammer, and I think we can get away without using or changing it for this. This is an interesting use case!
Ok, so to sum up, the issue here is:
IMHO, let's see if we can make that work without making the cgroup delegation aware of "privileged" containers, and if that doesn't work, let's revisit how we can make it work here.
@rata said:
To confuse matters further, that annotation actually isn't needed in some cases, because CRI-O looks for the magic string (see cri-o/internal/factory/container/container.go, line 607 at commit 521fdfb)
The original report on podman did have @adelton said:
This is made extra confusing by what privileged actually means: in this case it is hitting the logic whereby a privileged container does not enter a cgroup namespace, which has bitten people before (see kubernetes/kubernetes#119669 (comment)). For this use case I don't think it makes sense to use privileged, as discussed above, but if desired a privileged pod can enter a cgroup namespace with
FWIW, in our environment we do something a bit like this, where we drop a script as
@dgl, thanks. I have made some progress on the non-privileged front, but I would like to understand the privileged situation as well, so I tried to come up with a minimal reproducer implementing the above suggestions, for example for environments where tweaking the CRI-O configuration on the worker nodes might not be desirable. When in OpenShift 4.14 with CRI-O 1.27.2-2.rhaos4.14.git9d684e2.el9.x86_64 where the privileged container's entrypoint (is that
but not
-- should we see I have
and
and I build a container image and push it to the internal repository of the OpenShift cluster and then create Pod
-- I see
So the Is that expected or should the When I replace
with
and uncomment that
So not only is Is that achievable with the
In your podman-init.sh try adding at the end:
You'll obviously need the unshare command-line tool in your image. (To actually use this you probably want that sh command to run another script, but I think this should be enough to demonstrate the difference that entering a cgroup namespace makes.)
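To confirm that a process really ended up in a new cgroup namespace, one can compare the namespace link under /proc/<pid>/ns/cgroup between two processes; the link reads cgroup:[<inode>]. A minimal sketch with hypothetical helper names, assuming a kernel with cgroup namespace support:

```sh
#!/bin/sh
# Hypothetical helper: print the cgroup namespace link of a process
# (default: self), which has the form "cgroup:[<inode>]".
cgroupns_of() {
    readlink "/proc/${1:-self}/ns/cgroup"
}

# Hypothetical helper: succeed only if both links can be read and they
# are identical, i.e. the two processes share a cgroup namespace.
same_cgroupns() {
    ns_a=$(cgroupns_of "$1") && ns_b=$(cgroupns_of "$2") && [ "$ns_a" = "$ns_b" ]
}
```

In podman-init.sh, for instance, readlink /proc/self/ns/cgroup followed by unshare --cgroup readlink /proc/self/ns/cgroup should print two different values, and inside the new namespace the 0:: line of /proc/self/cgroup reads 0::/.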
I've added this to the script, rebuilt and pushed the image. I can see
so we got a new cgroup namespace. But when I add
Is it expected that further cgroups cannot be created?
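Whether a child cgroup can be created comes down to whether mkdir succeeds in the cgroup directory: on cgroup2 a new directory is a new child cgroup, and mkdir fails with EACCES when the directory is not writable by the caller (for example, because it is owned by an unmapped host user) or with EROFS when the hierarchy is mounted read-only. A minimal probe; the function name is hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: try to create and immediately remove a child cgroup
# under the given cgroup directory. Succeeds only if mkdir works; mkdir's
# own error message is kept so the reason (EACCES, EROFS, ...) is visible.
can_create_child_cgroup() {
    probe="$1/probe.$$"
    mkdir "$probe" || return 1
    # An empty child cgroup can be removed with a plain rmdir.
    rmdir "$probe"
}
```

For example, can_create_child_cgroup "$(sed -n 's|^0::|/sys/fs/cgroup|p' /proc/self/cgroup)" probes the caller's own cgroup.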
I also seem to be able to get to exactly this state without the
What exactly is the use of the
For the record / note to myself: I was able to observe the
so unprivileged containers seem to always create a cgroup namespace, and that When I then add the annotations and
and use When I use
to overcome the So it looks like the only difference is in the SELinux context.
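To compare the SELinux contexts of the two containers, one can read /proc/self/attr/current inside each and look at the type, the third colon-separated field of user:role:type:level. A minimal sketch with hypothetical helper names, assuming an SELinux-enabled host:

```sh
#!/bin/sh
# Hypothetical helper: print the SELinux context of a process (default:
# self). SELinux terminates the value with a NUL byte, which is stripped.
# On hosts without SELinux the file may be missing or hold another LSM's label.
selinux_context() {
    tr -d '\000' 2>/dev/null < "/proc/${1:-self}/attr/current"
}

# Hypothetical helper: extract the type from a context of the form
# user:role:type:level (the level may itself contain colons).
selinux_type() {
    printf '%s\n' "$1" | cut -d: -f3
}
```

Running selinux_type "$(selinux_context)" in each container prints the type to compare.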
@adelton eventually I'd like the type
A friendly reminder that this issue had no activity for 30 days.
Closing this issue since it had no activity in the past 90 days.
What happened?
When creating a user namespace, it is currently not possible to chown the cgroup to a user in the user namespace.
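For reference, the kernel's cgroup v2 documentation describes delegating a sub-hierarchy to a less privileged user as granting that user write access to the cgroup directory and to its cgroup.procs, cgroup.threads and cgroup.subtree_control files, so a chown done for the user namespace would need to cover at least those. A minimal sketch of a check, meant to be run as root inside the user namespace, that lists whichever of those paths the caller cannot write; the function name is hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: print every path that must be writable for a cgroup v2
# sub-hierarchy to count as delegated to the caller, but is not.
# No output means the directory looks delegated.
missing_delegation() {
    for f in "$1" "$1/cgroup.procs" "$1/cgroup.threads" "$1/cgroup.subtree_control"; do
        [ -w "$f" ] || printf '%s\n' "$f"
    done
}
```

Empty output for the container's cgroup directory would match the behavior the report expects from the annotation.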
What did you expect to happen?
Using the io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw annotation should result in the cgroup used by the container being owned by root in the user namespace.
How can we reproduce it (as minimally and precisely as possible)?
Just create a user namespace and try:
CGROUP=$(sed -e "s|0::|/sys/fs/cgroup|" < /proc/self/cgroup); ls -ld "$CGROUP"
The directory is owned by an unknown user, since it is owned by root on the host, which is not mapped into the user namespace.
Anything else we need to know?
No response
CRI-O and Kubernetes version
any
OS version
any
Additional environment details (AWS, VirtualBox, physical, etc.)
cgroup v2