RFE: cgroup delegation #7623

Closed
giuseppe opened this issue Dec 21, 2023 · 25 comments
Labels
kind/bug, lifecycle/rotten, lifecycle/stale

Comments

@giuseppe
Member

What happened?

When creating a user namespace, it is currently not possible to chown the cgroup to a user in the user namespace.

What did you expect to happen?

Using the io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw annotation should result in the cgroup used by the container being owned by root inside the user namespace.

How can we reproduce it (as minimally and precisely as possible)?

Just create a user namespace and run CGROUP=$(sed -e "s|0::|/sys/fs/cgroup|" < /proc/self/cgroup); ls -ld $CGROUP. The directory shows up as owned by the overflow ("unknown") user, since on the host it is owned by root.
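For illustration, the kind of output seen from inside such a user namespace (path, date, and ids here are illustrative; the point is the nobody/nogroup ownership):

$ ls -ld $CGROUP
drwxr-xr-x. 2 nobody nogroup 0 Dec 21 10:00 /sys/fs/cgroup/kubepods.slice/.../crio-<container-id>.scope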

Anything else we need to know?

No response

CRI-O and Kubernetes version

any

OS version

any

Additional environment details (AWS, VirtualBox, physical, etc.)

cgroup v2

@rata
Contributor

rata commented Jan 3, 2024

using the io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw annotation means the cgroup used by the container is owned by root in the user namespace

@giuseppe this is the case for me. Are you sure the annotation is properly configured in crio to take effect on that pod?

Here is the output working just fine for me:

$ CGROUP=$(sed -e "s|0::|/sys/fs/cgroup|" < /proc/self/cgroup); ls -ld $CGROUP
drwxr-xr-x 3 root root 0 Jan  3 11:15 /sys/fs/cgroup/
# You can also create directories there just fine
$ mkdir /sys/fs/cgroup/asd

This works for dind, etc.

The pod I'm using is very simple:

apiVersion: v1
kind: Pod
metadata:
  name: mypod
  annotations:
    # This only works with /etc/crio/crio.conf.d/01-userns-workload.conf
    # to opt in to this workload and allow these annotations.
    crio-workload-userns: "true"
    #io.kubernetes.cri-o.userns-mode: "auto"
    io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw: "true"
spec:
  hostUsers: false
  dnsPolicy: Default
  terminationGracePeriodSeconds: 1
  restartPolicy: Always
  containers:
  - name: container1
    imagePullPolicy: IfNotPresent
    image: debian
    command: ["sh"]
    args: ["-c", "sleep infinity"]
    securityContext:
      capabilities:
        add: ["NET_ADMIN", "SYS_ADMIN"]

But look at the annotations. You need to configure crio to match that; for example, I have this in /etc/crio/crio.conf.d/01-userns-workload.conf:

[crio.runtime.workloads.userns]
activation_annotation = "crio-workload-userns"
allowed_annotations = ["io.kubernetes.cri-o.userns-mode", "io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw"]

This means pods with the "crio-workload-userns" annotation are allowed to use the hierarchy-rw annotation too.

I'm using crio 1.28:

rodrigo@lindsay: ~/src/kinvolk/cri-o/cri-o :rata/release-1.28$ bin/crio --version
crio version 1.28.0
Version:        1.28.0
GitCommit:      169acf7bf2a59ad64e093fb3c029cd153835791e
GitCommitDate:  2023-09-12T10:14:31Z
GitTreeState:   dirty
BuildDate:      2023-09-12T10:14:47Z
GoVersion:      go1.21.1
Compiler:       gc
Platform:       linux/amd64
Linkmode:       dynamic
BuildTags:      
  containers_image_ostree_stub
  libdm_no_deferred_remove
  seccomp
  selinux
LDFlags:          unknown
SeccompEnabled:   true
AppArmorEnabled:  false

@haircommander
Member

The #io.kubernetes.cri-o.userns-mode: "auto" line being commented out looks to me like the annotation is not enabled, @rata

@rata
Contributor

rata commented Jan 5, 2024

@haircommander but that isn't needed with hostUsers: false. When you do that, you enable userns at the k8s level. See the pod I shared; the annotation is commented out there and it works as expected. Am I missing something?
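For illustration, assuming the cluster enables the (still alpha at the time) user namespace feature gates, a pod gets a user namespace from Kubernetes alone with nothing more than:

apiVersion: v1
kind: Pod
metadata:
  name: userns-only        # minimal sketch, no CRI-O annotations involved
spec:
  hostUsers: false
  containers:
  - name: container1
    image: debian
    command: ["sh", "-c", "sleep infinity"]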

@adelton
Contributor

adelton commented Jan 5, 2024

Thank you @rata for showing that example Pod, that certainly helps. I changed it a bit to say

    args: ["-c", "set -x ; id ; cat /proc/self/uid_map ; mount | grep cgroup ; ls -la /sys/fs/cgroup"]

to see if the Pod runs in the user namespace and if /sys/fs/cgroup is writable and owned by root in the Pod's container (rather than by nobody, to which the host's root would be mapped).

I tested on OpenShift, namely

Client Version: 4.14.0-202310201027.p0.g0c63f9d.assembly.stream-0c63f9d
Kustomize Version: v5.0.1
Kubernetes Version: v1.27.8+4fab27b

In its /etc/crio/crio.conf.d/00-default, OpenShift nodes define

[crio.runtime.workloads.openshift-builder]
activation_annotation = "io.openshift.builder"
allowed_annotations = [
  "io.kubernetes.cri-o.userns-mode",
  "io.kubernetes.cri-o.Devices"
]

I had to add "io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw" to the list to make any progress, and then restarted the crio systemd service. For the record, the version on the worker node was cri-o-1.27.2-2.rhaos4.14.git9d684e2.el9.x86_64.
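With that change, the drop-in on the node ends up looking roughly like this (the last entry is the addition), followed by a systemctl restart crio:

[crio.runtime.workloads.openshift-builder]
activation_annotation = "io.openshift.builder"
allowed_annotations = [
  "io.kubernetes.cri-o.userns-mode",
  "io.kubernetes.cri-o.Devices",
  "io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw"
]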

I then added

metadata:
  annotations:
    io.openshift.builder: "true"

instead of that crio-workload-userns: "true".

When I then uncommented that

metadata:
  annotations:
    io.kubernetes.cri-o.userns-mode: "auto"

things started to work:

$ oc logs -f pod/mypod
+ id
uid=0(root) gid=0(root) groups=0(root)
+ cat /proc/self/uid_map
         0     265536      65536
+ mount
+ grep cgroup
cgroup on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel)
+ ls -la /sys/fs/cgroup
total 0
drwxr-xr-x. 2 root   nogroup 0 Jan  5 17:08 .
drwxr-xr-x. 9 nobody nogroup 0 Jan  5 17:08 ..
-r--r--r--. 1 nobody nogroup 0 Jan  5 17:08 cgroup.controllers
-r--r--r--. 1 nobody nogroup 0 Jan  5 17:08 cgroup.events
[...]

However, hostUsers: false did not seem to have any effect.

I also noticed that when I add privileged: true to the container's securityContext, the ownership of /sys/fs/cgroup gets broken again.

I guess my questions are:

  1. Is it expected that hostUsers: false does not work and io.kubernetes.cri-o.userns-mode: "auto" is needed on this 1.27 Kubernetes distribution?
  2. Is it expected that for privileged containers, the logic fails?
  3. Should we be pushing for io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw to get added to OpenShift's io.openshift.builder's default allowed_annotations list, is that something that will be needed long-term?

@rata
Contributor

rata commented Jan 5, 2024

I'm not a RH employee, so I don't know if they have more info about this. With only the info in this public issue, it is hard to know what you mean on several points.

Re 3: I don't know if you want to push for that annotation or not. It really depends on what you want to do.
Re 2: I don't understand what exactly you see when you say "the logic breaks again". Can you paste the output verbatim?
Re 1: For hostUsers to work, check the k8s documentation: you need to enable feature gates, a specific crun version, etc. You can instead use the userns annotation to create a userns; it is not exactly the same, but it will create a userns (see the illustration below).
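For illustration (and as an assumption about the exact name on your cluster version), enabling the upstream user namespace support means turning on a feature gate on the kubelet and API server; on Kubernetes 1.28+ the gate is called UserNamespacesSupport, while earlier alpha releases used UserNamespacesStatelessPodsSupport:

--feature-gates=UserNamespacesSupport=true

On top of that, the k8s documentation lists the runtime requirements (a recent crun/runc and kernel).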

@haircommander
Member

haircommander commented Jan 5, 2024

Is it expected that hostUsers: false does not work and io.kubernetes.cri-o.userns-mode: "auto" is needed on this 1.27 Kubernetes distribution?

yes this is expected, as the feature is still in alpha and openshift doesn't enable alpha features

Is it expected that for privileged containers, the logic fails?

crio mounts a privileged container's cgroup hierarchy differently, so this could be expected. Same as @rata, it's not clear to me what you're seeing.

Should we be pushing for io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw to get added to OpenShift's io.openshift.builder's default allowed_annotations list, is that something that will be needed long-term?

technically speaking, io.openshift.builder is intended primarily for unprivileged builds, not for container in container. I don't know if it's correct to add the annotation there by default. I would recommend making your own crio config that defines the workload in the way you need
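For example, such a drop-in could look like this (the file name, workload name, and activation annotation below are illustrative; only the allowed_annotations list matters for this use case):

# /etc/crio/crio.conf.d/02-podman-in-pod.conf (illustrative path)
[crio.runtime.workloads.podman-in-pod]
activation_annotation = "example.com/podman-in-pod"
allowed_annotations = [
  "io.kubernetes.cri-o.userns-mode",
  "io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw"
]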

(note: if you do this, it technically makes the node unsupported as an official openshift product, though OKD doesn't have this anyway. Mainly saying this for anyone who may come across this on the internet that does want support)

@haircommander
Member

@haircommander but that isn't needed with hostUsers: false. When you do that, you enable userns at k8s. See the pod I shared, that is commented out and works as expected. Am I missing something?

correct, I didn't read fully, my bad 😅

@adelton
Contributor

adelton commented Jan 5, 2024

Re 2: I don't understand what you see exactly when you say "the logic breaks again". Can you paste output verbatim?

Ah, sorry for not being precise.

With just

    securityContext:
      capabilities:
        add: ["NET_ADMIN", "SYS_ADMIN"]

I get

+ id
uid=0(root) gid=0(root) groups=0(root)
+ cat /proc/self/uid_map
         0     200000      65536
+ mount
+ grep cgroup
cgroup on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel)
+ ls -la /sys/fs/cgroup
total 0
drwxr-xr-x. 2 root   nogroup 0 Jan  5 20:23 .
drwxr-xr-x. 9 nobody nogroup 0 Jan  5 20:23 ..
-r--r--r--. 1 nobody nogroup 0 Jan  5 20:23 cgroup.controllers
-r--r--r--. 1 nobody nogroup 0 Jan  5 20:23 cgroup.events
-rw-r--r--. 1 nobody nogroup 0 Jan  5 20:23 cgroup.freeze
--w-------. 1 nobody nogroup 0 Jan  5 20:23 cgroup.kill
-rw-r--r--. 1 nobody nogroup 0 Jan  5 20:23 cgroup.max.depth
-rw-r--r--. 1 nobody nogroup 0 Jan  5 20:23 cgroup.max.descendants
-rw-r--r--. 1 root   nogroup 0 Jan  5 20:23 cgroup.procs
-r--r--r--. 1 nobody nogroup 0 Jan  5 20:23 cgroup.stat
-rw-r--r--. 1 root   nogroup 0 Jan  5 20:23 cgroup.subtree_control
-rw-r--r--. 1 root   nogroup 0 Jan  5 20:23 cgroup.threads
-rw-r--r--. 1 nobody nogroup 0 Jan  5 20:23 cgroup.type
-rw-r--r--. 1 nobody nogroup 0 Jan  5 20:23 cpu.idle
-rw-r--r--. 1 nobody nogroup 0 Jan  5 20:23 cpu.max
[...]

When I add privileged: true, I get

+ id
uid=0(root) gid=0(root) groups=0(root)
+ cat /proc/self/uid_map
         0     200000      65536
+ mount
+ grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel)
+ ls -la /sys/fs/cgroup
total 0
dr-xr-xr-x. 12 nobody nogroup 0 Jan  5 20:09 .
drwxr-xr-x.  9 nobody nogroup 0 Jan  5 20:24 ..
-r--r--r--.  1 nobody nogroup 0 Jan  5 20:09 cgroup.controllers
-rw-r--r--.  1 nobody nogroup 0 Jan  5 20:09 cgroup.max.depth
-rw-r--r--.  1 nobody nogroup 0 Jan  5 20:09 cgroup.max.descendants
-rw-r--r--.  1 nobody nogroup 0 Jan  5 20:09 cgroup.procs
-r--r--r--.  1 nobody nogroup 0 Jan  5 20:09 cgroup.stat
-rw-r--r--.  1 nobody nogroup 0 Jan  5 20:15 cgroup.subtree_control
-rw-r--r--.  1 nobody nogroup 0 Jan  5 20:09 cgroup.threads
-rw-r--r--.  1 nobody nogroup 0 Jan  5 20:09 cpu.pressure
-r--r--r--.  1 nobody nogroup 0 Jan  5 20:09 cpu.stat
[...]

@haircommander
Member

When you use privileged, it's actually getting the cgroup mount of the host rather than of the container. For instance, the first container, which just has SYS_ADMIN, sees a view of the container's own cgroup (or a child of the root); the second container should be looking at the host's cgroups.
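For illustration, the difference shows up directly in /proc/self/cgroup (outputs abbreviated):

# unprivileged container (cri-o puts it in its own cgroup namespace):
$ cat /proc/self/cgroup
0::/
# privileged container (host view, no cgroup namespace entered):
$ cat /proc/self/cgroup
0::/kubepods.slice/kubepods-besteffort.slice/.../crio-<container-id>.scope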

@adelton
Contributor

adelton commented Jan 5, 2024

yes this is expected, as the feature is still in alpha and openshift doesn't enable alpha features

Do we know the timeline for the feature getting out of Alpha and thus potentially getting to OpenShift?

crio mounts privileged container's cgroup hierarchy differently, so this could be expected.

Is there a way to make the setup for privileged containers similar / the same as for unprivileged ones? At this point it seems unprivileged containers actually have better functionality than the privileged ones (but that might be specific to OpenShift).

Same as @rata it's not clear to me what you're seeing

The /sys/fs/cgroup permissions / ownership are dr-xr-xr-x. 12 nobody nobody, rather than drwxr-xr-x. 2 root nogroup.

technically speaking, io.openshift.builder is intended primarily for unprivileged builds, not for container in container. I don't know if it's correct to add the annotation there by default. I would recommend making your own crio config that defines the workload in the way you need

Well, what I wonder about is the general plan for OpenShift. Is it eventually going to support hostUsers: false, for both privileged and unprivileged containers, out of the box and with read-write cgroups, so that all the io.kubernetes.cri-o.* annotations that are currently needed become obsolete?

The reason I'm piggybacking on that io.openshift.builder with this investigation is that it's actually the only place where user namespaces seem exposed to users in current OpenShifts. So when looking for the limits of that current functionality, starting with that one is the easiest way.

(note: if you do this, it technically makes the node unsupported as an official openshift product, though OKD doesn't have this anyway. Mainly saying this for anyone who may come across this on the internet that does want support)

Right. I'm obviously not looking for supported setup at this point. But I'd like to keep hacking in the general direction that OpenShift will eventually evolve.

@haircommander
Member

Do we know the timeline for the feature getting out of Alpha and thus potentially getting to OpenShift?

I am hoping we can move it out in 1.30, which would target openshift 4.17 🤞

Is there a way to make the setup for privileged containers similar / the same as for unprivileged ones? At this point it seems unprivileged containers actually have better functionality than the privileged ones (but that might be specific to OpenShift).

Uh I think this is a precedent docker set years ago so it may be tricky to change. We could consider an annotation

Is it eventually going to support hostUsers: false, for both privileged and unprivileged containers, out of the box and with read-write cgroups, so that all the io.kubernetes.cri-o.* annotations that are currently needed become obsolete?

Yeah. The one piece we need to figure out is the right way to support rw cgroups. There's currently no proposal upstream for how to fix that, though we're considering introducing a "sysMountType", similar to "procMountType", to achieve this.
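For context, the existing per-container knob looks like this; the sys/cgroup analogue is purely hypothetical at this point (the commented field below does not exist, it only illustrates the shape being discussed):

    securityContext:
      procMount: Unmasked    # existing field, behind the ProcMountType feature gate
      # sysMount: Unmasked   # hypothetical analogue for /sys and cgroup mounts; not a real field today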

@adelton
Contributor

adelton commented Jan 5, 2024

when you use privilege it's actually getting the cgroup mount of the host, rather than of the container. for instance, the first container that just has SYS_ADMIN, that's actually a view of the container's cgroup (or a child of the root). the second container should be looking at the host's cgroups.

Right. And the problem is that processes running in that container are already user-namespaced, so root in the container cannot create new cgroups themselves -- they cannot access the host-root-owned cgroups.

@giuseppe, should we update this issue to make it clear we are after the solution for privileged containers?

@haircommander
Member

I would kinda consider privileged containers with hostUsers == false to be a strange case. I would expect privileged containers to usually have hostUsers == true. Is there something privileged containers can do that vanilla ones can't? I think we'll want a good reason to customize the behavior for privileged + !hostUsers

@adelton
Contributor

adelton commented Jan 5, 2024

I admit I'm not fluent in all the things that a privileged container does differently and whether we should be able to emulate / configure things accordingly.

The use case I'm after is running systemd in a podman in a user-namespaced Pod, for the purpose of running Kind in a Pod with that podman. That makes it very easy to test for example ACM (Advanced Cluster Management for Kubernetes) because creating another cluster takes just minutes.

In containers/podman#21008 (comment) we see some mount / fuse related failures from a podman in that user-namespaced OpenShift Pod when

    securityContext:
      privileged: true

gets replaced with

    securityContext:
      capabilities:
        add: ["NET_ADMIN", "SYS_ADMIN"]

@haircommander
Member

I think we should pursue this without the privileged flag but with as many capabilities as needed. privileged is a very heavy hammer and I think we can get away without using or changing it for this. This is an interesting use case!

@rata
Contributor

rata commented Jan 8, 2024

Ok, so to sum up, the issue here is:

  • The annotation: io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw doesn't work with userns AND privileged containers
  • This seems to be needed due to the errors reported here
  • @haircommander would like to have that use case work without privileged (that usually escapes seccomp and apparmor, among other things). I would love for that to work unprivileged too and proposed something on the podman issue 🤞

IMHO, I'd say let's see if we can make that work without making the cgroup delegation aware of "privileged" containers, and if that doesn't work, let's revisit how we can make it work here.

@dgl
Contributor

dgl commented Jan 9, 2024

@rata said:

The annotation: io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw doesn't work with userns AND privileged containers

To confuse matters further, that annotation actually isn't needed in some cases -- because cri-o looks for the magic string /sbin/init (or systemd):

return strings.Contains(entrypoint, "/sbin/init") || (filepath.Base(entrypoint) == "systemd")

The original report on podman did have /sbin/init as part of the command so I think it would trigger that logic.

@adelton said:

Is there a way to make the setup for privileged containers similar / the same as for unprivileged ones? At this point it seems unprivileged containers actually have better functionality than the privileged ones (but that might be specific to OpenShift).

I think this is extra confusing given what privileged actually means; in this case it is hitting the logic that a privileged container does not enter a cgroup namespace, which has bitten people before (see kubernetes/kubernetes#119669 (comment)).

For this use case I am thinking it doesn't make sense to use privileged, as discussed above, but if desired a privileged pod can enter a cgroup namespace with unshare -C, so with that understanding it is possible to make it work for a privileged pod.

FWIW, in our environment we do something a bit like this, where we drop a script as /sbin/init.sh that does some setup before exec unshare ... /sbin/init. I think for some of this we don't necessarily need full support from cri-o, particularly while it is annotation based (but we should work out the semantics for a potential CRI / k8s level future...).
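A minimal sketch of such a wrapper (paths illustrative; it assumes util-linux's unshare is available in the image):

#!/bin/sh
# /sbin/init.sh: do whatever setup is needed here, then enter a new
# cgroup namespace and hand off to the real init.
exec unshare --cgroup -- /sbin/init "$@"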

@adelton
Contributor

adelton commented Jan 10, 2024

@dgl Thanks. I have made some progress on the non-privileged front, but I would like to understand the privileged situation as well, so I'm trying to come up with a minimal reproducer implementing the above suggestions, for example for environments where tweaking the CRI-O configuration on the worker nodes might not be desirable.

When, on OpenShift 4.14 with CRI-O 1.27.2-2.rhaos4.14.git9d684e2.el9.x86_64, the privileged container's entrypoint (is that ENTRYPOINT in a Dockerfile, or spec.containers[*].command[0], or either of them?) contains the /sbin/init substring, for example /sbin/init-podman, and the Pod uses annotations

    io.openshift.builder: "true"
    io.kubernetes.cri-o.userns-mode: auto

but not

    io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw: "true"

-- should we see /sys/fs/cgroup writable or not?

I have podman-init.sh

#!/bin/bash
set -x
id
cat /proc/self/uid_map
mount | grep cgroup
ls -lad /sys/fs/cgroup

and Dockerfile

FROM debian
COPY podman-init.sh /sbin/init-podman
ENTRYPOINT [ "/sbin/init-podman" ]

and I build a container image, push it to the internal registry of the OpenShift cluster, and then create this Pod

apiVersion: v1
kind: Pod
metadata:
  name: test-podman
  annotations:
    io.openshift.builder: "true"
    io.kubernetes.cri-o.userns-mode: auto
#     io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw: "true"
spec:
  restartPolicy: Never
  containers:
  - name: container
    image: image-registry.openshift-image-registry.svc:5000/test-1/podman-privileged
    imagePullPolicy: Always
    securityContext:
      privileged: true
#      capabilities:
#        add: ["SYS_ADMIN"]
    command:
    - /sbin/init-podman

-- I see

+ id
uid=0(root) gid=0(root) groups=0(root)
+ cat /proc/self/uid_map
         0     200000      65535
     65535     527679          1
+ mount
+ grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel)
+ ls -lad /sys/fs/cgroup
dr-xr-xr-x. 12 nobody nogroup 0 Jan 10 11:08 /sys/fs/cgroup

So the /sys/fs/cgroup mountpoint is rw, but the owner is clearly the host uid 0, not the container uid 0 (== host uid 200000).

Is that expected, or should the WillRunSystemd logic not only make /sys/fs/cgroup rw but also set the ownership to match the container's uid 0?

When I replace

      privileged: true

with

      capabilities:
        add: ["SYS_ADMIN"]

and uncomment that io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw: "true", I do get

+ id
uid=0(root) gid=0(root) groups=0(root)
+ cat /proc/self/uid_map
         0     200000      65535
     65535     265536          1
+ mount
+ grep cgroup
cgroup on /sys/fs/cgroup type cgroup2 (rw,relatime,seclabel)
+ ls -lad /sys/fs/cgroup
drwxr-xr-x. 2 root nogroup 0 Jan 10 17:07 /sys/fs/cgroup

So not only is /sys/fs/cgroup mounted rw, the ownership and permissions on it also make it usable for root in the container.

Is that achievable with the privileged: true as well, somehow?

@dgl
Contributor

dgl commented Jan 10, 2024

In your podman-init.sh try adding at the end:

cat /proc/self/cgroup
echo creating cgroup namespace
exec unshare -C sh -c 'cat /proc/self/cgroup; ls -ld /sys/fs/cgroup'

You'll obviously need the unshare command-line tool in your image.

(To actually use this you probably want that sh command to run another script, but I think this should be enough to demonstrate the difference of entering a cgroup namespace.)

@adelton
Contributor

adelton commented Jan 11, 2024

I've added this to the script, rebuilt and pushed the image.

I can see

+ id
uid=0(root) gid=0(root) groups=0(root)
+ cat /proc/self/uid_map
         0     200000      65536
+ mount
+ grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel)
+ ls -lad /sys/fs/cgroup
dr-xr-xr-x. 12 nobody nobody 0 Jan 11 08:51 /sys/fs/cgroup
+ cat /proc/self/cgroup
0::/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod78fe5ad3_808a_4bfb_a3a1_14b6d3eabda8.slice/crio-b3cc012fadda7d2078e29d7d636a63d7c3975b7a7006773c785cc1d6271c4f88.scope
+ echo creating cgroup namespace
creating cgroup namespace
+ exec unshare -C sh -c 'cat /proc/self/cgroup; ls -ld /sys/fs/cgroup'
0::/
dr-xr-xr-x. 12 nobody nobody 0 Jan 11 08:51 /sys/fs/cgroup

so we got a new cgroup namespace.

But when I add mkdir /sys/fs/cgroup/test-1 to that internal shell, it fails:

+ id
uid=0(root) gid=0(root) groups=0(root)
+ cat /proc/self/uid_map
         0     265536      65536
+ mount
+ grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,seclabel)
+ ls -lad /sys/fs/cgroup
dr-xr-xr-x. 12 nobody nobody 0 Jan 11 08:51 /sys/fs/cgroup
+ cat /proc/self/cgroup
0::/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod9f38d751_79d0_4267_b88e_307e1c643da7.slice/crio-439840b5a5a637ca4e3752e834459c113e74d5debc5132b3c4cbff2bcc05d2d5.scope
+ echo creating cgroup namespace
creating cgroup namespace
+ exec unshare -C sh -c 'cat /proc/self/cgroup; ls -ld /sys/fs/cgroup; mkdir /sys/fs/cgroup/test-1'
0::/
dr-xr-xr-x. 12 nobody nobody 0 Jan 11 08:51 /sys/fs/cgroup
mkdir: cannot create directory '/sys/fs/cgroup/test-1': Permission denied

Is it expected that further cgroups cannot be created?

@adelton
Contributor

adelton commented Jan 11, 2024

I also seem to be able to get to exactly this state without the /sbin/init logic:

apiVersion: v1
kind: Pod
metadata:
  name: test-podman
  annotations:
    io.openshift.builder: "true"
    io.kubernetes.cri-o.userns-mode: auto
spec:
  restartPolicy: Never
  containers:
  - name: container
    image: docker.io/library/debian
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
    command:
    - bash
    - -c
    - set -x ; id ; cat /proc/self/uid_map ; mount | grep cgroup ; ls -lad /sys/fs/cgroup ; cat /proc/self/cgroup ; echo creating cgroup namespace ; exec unshare -C sh -c 'cat /proc/self/cgroup; ls -ld /sys/fs/cgroup; mkdir /sys/fs/cgroup/test-1'

What exactly is using the /sbin/init-style entrypoint expected to change for privileged containers?

@adelton
Contributor

adelton commented Jan 11, 2024

For the record / note to myself:

I was able to observe the /sbin/init vs non-/sbin/init difference only in unprivileged containers. Out of the box, no matter what the entrypoint is, I see that even without user namespaces,

+ cat /proc/self/cgroup
0::/

so unprivileged containers always seem to get a cgroup namespace, and unshare -C is not needed.

When I then add the annotations and

      capabilities:
        add: ["SYS_ADMIN"]

and use /sbin/init-something command, I can then mkdir /sys/fs/cgroup/test-1 right away.

When I use /sbin/xinit-something (to disable the /sbin/init logic) but the same script, I need to add

      seLinuxOptions:
        type: spc_t

to overcome the Permission denied.

So it looks like the only difference is in the SELinux context.
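Putting the pieces from this thread together, a sketch of a Pod that gets a writable, correctly-owned /sys/fs/cgroup without privileged (assuming the CRI-O workload config allows these annotations, as discussed above):

apiVersion: v1
kind: Pod
metadata:
  name: userns-cgroup-rw             # name is illustrative
  annotations:
    io.openshift.builder: "true"
    io.kubernetes.cri-o.userns-mode: "auto"
    io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw: "true"
spec:
  restartPolicy: Never
  containers:
  - name: container
    image: docker.io/library/debian
    command: ["sh", "-c", "ls -ld /sys/fs/cgroup && mkdir /sys/fs/cgroup/test-1"]
    securityContext:
      capabilities:
        add: ["SYS_ADMIN"]
      seLinuxOptions:
        type: spc_t                  # needed when the entrypoint does not match the /sbin/init heuristic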

@haircommander
Member

@adelton eventually I'd like the type container_engine_t to serve this purpose, though I'm still working out issues with it.
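(If that lands, the assumption is that it would be requested the same way spc_t is above:)

    securityContext:
      seLinuxOptions:
        type: container_engine_t   # not yet fully working, per the comment above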


A friendly reminder that this issue had no activity for 30 days.

@github-actions github-actions bot added the lifecycle/stale label Feb 24, 2024

Closing this issue since it had no activity in the past 90 days.

@github-actions github-actions bot added the lifecycle/rotten label May 24, 2024
@github-actions github-actions bot closed this as not planned May 24, 2024