
Discrepancy between crun and runc when disallowing access by default to devices with cgroups v1 #1438

Open
Madeeks opened this issue Mar 14, 2024 · 5 comments


@Madeeks

Madeeks commented Mar 14, 2024

Hello, thank you for developing crun!

I use Docker containers as CI environments for developing container tools, so I often use OCI runtimes within privileged Docker containers.
I noticed that on systems with cgroups v1, when the bundle's config.json is set to disallow access to all devices by default, crun apparently leaves access to all devices allowed, while runc abides by the config (apart from the essential special devices it sets up on its own).

For example, within a Fedora 39 Docker container:

[root@39f2b2db9bb6 /]# runc --version
runc version 1.1.12
spec: 1.0.2-dev
go: go1.21.6
libseccomp: 2.5.3
[root@39f2b2db9bb6 /]# crun --version
crun version 1.14.4
commit: a220ca661ce078f2c37b38c92e66cf66c012d9c1
rundir: /run/crun
spec: 1.0.0
+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL

[root@39f2b2db9bb6 /]# cat /sys/fs/cgroup/devices/devices.list 
a *:* rwm

# cd to an OCI bundle with an Ubuntu rootfs
[root@39f2b2db9bb6 /]# cd oci-bundle/
[root@39f2b2db9bb6 oci-bundle]# ls -l
total 4
-rw-r--r-- 1 1000 users 2700 Mar 13 18:54 config.json
drwxr-xr-x 1 1000 users  154 Mar 13 16:24 rootfs

[root@39f2b2db9bb6 oci-bundle]# runc run test
docker@39f2b2db9bb6:/$ cat /sys/fs/cgroup/devices/devices.list 
b *:* m
c *:* m
c 1:3 rwm
c 1:5 rwm
c 1:7 rwm
c 1:8 rwm
c 1:9 rwm
c 5:0 rwm
c 5:2 rwm
c 10:200 rwm
c 136:* rwm
docker@39f2b2db9bb6:/$ 
exit

[root@39f2b2db9bb6 oci-bundle]# crun run test
docker@39f2b2db9bb6:/$ cat /sys/fs/cgroup/devices/devices.list 
a *:* rwm
docker@39f2b2db9bb6:/$       
exit

The config.json is the following:

{
   "ociVersion": "1.0.0",
   "process": {
      "terminal": true,
      "user": {
         "uid": 1000,
         "gid": 1000,
         "additionalGids": [
            1000
         ]
      },
      "args": [
         "bash"
      ],
      "env": [
         "SHLVL=1",
         "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
         "TERM=xterm",
         "HOME=/home/docker",
         "PWD=/home/docker"
      ],
      "cwd": "/",
      "capabilities": {},
      "noNewPrivileges": true
   },
   "root": {
      "path": "rootfs",
      "readonly": false
   },
   "mounts": [
      {
         "destination": "/proc",
         "type": "proc",
         "source": "proc"
      },
      {
         "destination": "/dev/pts",
         "type": "devpts",
         "source": "devpts",
         "options": [
            "nosuid",
            "noexec",
            "newinstance",
            "ptmxmode=0666",
            "mode=0620",
            "gid=5"
         ]
      },
      {
         "destination": "/dev/shm",
         "type": "bind",
         "source": "/dev/shm",
         "options": [
            "nosuid",
            "noexec",
            "nodev",
            "rbind",
            "slave",
            "rw"
         ]
      },
      {
         "destination": "/dev/mqueue",
         "type": "mqueue",
         "source": "mqueue",
         "options": [
            "nosuid",
            "noexec",
            "nodev"
         ]
      },
      {
         "destination": "/sys",
         "type": "sysfs",
         "source": "sysfs",
         "options": [
            "nosuid",
            "noexec",
            "nodev",
            "ro"
         ]
      },
      {
         "destination": "/sys/fs/cgroup",
         "type": "cgroup",
         "source": "cgroup",
         "options": [
            "nosuid",
            "noexec",
            "nodev",
            "relatime",
            "ro"
         ]
      }
   ],
   "linux": {
      "resources": {
         "cpu": {
            "cpus": "0,1,2,3,4,5,6,7"
         },
         "devices": [
            {
               "allow": false,
               "access": "rwm"
            }
         ]
      },
      "namespaces": [
         {
            "type": "mount"
         }
      ],
      "rootfsPropagation": "slave",
      "maskedPaths": [
         "/proc/kcore",
         "/proc/latency_stats",
         "/proc/timer_list",
         "/proc/timer_stats",
         "/proc/sched_debug",
         "/sys/firmware",
         "/proc/scsi"
      ],
      "readonlyPaths": [
         "/proc/asound",
         "/proc/bus",
         "/proc/fs",
         "/proc/irq",
         "/proc/sys",
         "/proc/sysrq-trigger"
      ]
   }
}

The configuration of a privileged container (no user namespace) is intentional in this case.
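For reference, my understanding is that the "allow": false, "access": "rwm" entry should map onto the cgroup v1 devices controller roughly as follows (a minimal sketch with an illustrative cgroup path, not what either runtime literally executes):

CG=/sys/fs/cgroup/devices/test-container   # illustrative path
echo 'a *:* rwm' > $CG/devices.deny        # revoke the inherited allow-all
echo 'c 1:3 rwm' > $CG/devices.allow       # then re-allow essentials, e.g. /dev/null
cat $CG/devices.list                       # should no longer show "a *:* rwm"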

I can reproduce the behavior described above only when calling crun within Docker containers, not when using it on a native host.
What am I missing?

Thanks in advance for any help provided!
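P.S. For completeness, I'm confirming the cgroup version on each system with a quick filesystem-type check:

stat -fc %T /sys/fs/cgroup    # tmpfs here (cgroup v1); cgroup2fs would mean v2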

@giuseppe (Member)

I tried reproducing it using a Podman container created with podman run --privileged -v /root:/root --rm -ti fedora:39 bash, but in both cases the inner container does not create a cgroup.

How did you create the outer Docker container?

Can you please verify the cgroup of the container process (cat /proc/$PID/cgroup) from the host in both cases?
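For example, something like this from the host (a sketch assuming the container is named test and jq is available; any way of getting the container PID works):

PID=$(runc state test | jq -r .pid)    # or: crun state test
cat /proc/$PID/cgroup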

@giuseppe (Member)

> I tried reproducing it using a Podman container created with podman run --privileged -v /root:/root --rm -ti fedora:39 bash, but in both cases the inner container does not create a cgroup.

EDIT: I was looking at the wrong thing.

They both create a cgroup, but I see the same configuration:

# cat /sys/fs/cgroup/devices/machine.slice/libpod-bcf881874d62ce2cf2226eb8598e0a1dd2bc4d1ea96c9fa9e577872720aca34c.scope/container/runc-container/devices.list
b *:* m
c *:* m
c 1:3 rwm
c 1:5 rwm
c 1:7 rwm
c 1:8 rwm
c 1:9 rwm
c 5:0 rwm
c 5:2 rwm
c 10:200 rwm
c 136:* rwm

# cat /sys/fs/cgroup/devices/machine.slice/libpod-bcf881874d62ce2cf2226eb8598e0a1dd2bc4d1ea96c9fa9e577872720aca34c.scope/container/crun-container/devices.list
c *:* m
b *:* m
c 1:3 rwm
c 1:8 rwm
c 1:7 rwm
c 5:0 rwm
c 1:5 rwm
c 1:9 rwm
c 5:1 rwm
c 136:* rwm
c 5:2 rwm
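A quick way to compare the two lists while ignoring ordering (same paths as above):

SCOPE=/sys/fs/cgroup/devices/machine.slice/libpod-bcf881874d62ce2cf2226eb8598e0a1dd2bc4d1ea96c9fa9e577872720aca34c.scope
diff <(sort $SCOPE/container/runc-container/devices.list) \
     <(sort $SCOPE/container/crun-container/devices.list)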

@Madeeks (Author)

Madeeks commented Mar 20, 2024

Hi @giuseppe, thanks for your reply.
The outer container is created with a command like

> docker run --rm -it -v $(pwd):/oci-bundle --privileged fedora:39 bash

The Docker config I'm running on my laptop is

> docker info
Client:
 Version:    24.0.7-ce
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  0.11.2
    Path:     /usr/lib/docker/cli-plugins/docker-buildx

Server:
 Containers: 6
  Running: 1
  Paused: 0
  Stopped: 5
 Images: 274
 Server Version: 24.0.7-ce
 Storage Driver: btrfs
  Btrfs: 
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 oci runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8e4b0bde866788eec76735cc77c4720144248fb7
 runc version: v1.1.10-0-g18a0cb0f32bc
 init version: 
 Security Options:
  apparmor
  seccomp
   Profile: builtin
 Kernel Version: 5.14.21-150400.24.81-default
 Operating System: openSUSE Leap 15.4
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 31.06GiB
 Name: carbon
 ID: 7IY2:7RUT:5VJZ:QQKN:S75T:CZBI:VM4J:UHRR:JT5K:75EH:FCU5:IKA7
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: madeeks
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

I thought the behavior could be related to the cgroup driver, but I obtained the same results (i.e. different allowed-device lists) when using runc/crun on an Ubuntu 20.04 VM running Docker with the cgroupfs driver.
I can also reproduce the results with Podman, using both the systemd and cgroupfs values for --cgroup-manager (in this case I ran rootful Podman, since the option is not supported with rootless Podman on cgroups v1).

I'll keep digging into how my container engines set up cgroups in the outer containers.

@giuseppe (Member)

> I thought the behavior could be related to the cgroup driver, but I obtained the same results (i.e. different allowed-device lists) when using runc/crun on an Ubuntu 20.04 VM running Docker with the cgroupfs driver.

Do you get the same results with runc and crun?

@Madeeks (Author)

Madeeks commented Mar 25, 2024

Apologies for the ambiguous wording.

I get the same results on my OpenSUSE laptop and the Ubuntu 20.04 VM.
That is, on both systems the device cgroups produced by runc and crun are different (when the runtimes are started inside a Docker container).

The Docker cgroup driver is different between the two platforms: systemd on the openSUSE laptop, cgroupfs on the Ubuntu VM.

If I create the outer container with Podman, I still observe the crun/runc device cgroup differences.
This happens even when using different values for Podman's --cgroup-manager option (invocations sketched below).
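For completeness, the Podman invocations look like this (assuming the same bundle mount as the Docker command above; --cgroup-manager is a global Podman flag):

> sudo podman --cgroup-manager=systemd run --rm -it --privileged -v $(pwd):/oci-bundle fedora:39 bash
> sudo podman --cgroup-manager=cgroupfs run --rm -it --privileged -v $(pwd):/oci-bundle fedora:39 bash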
