Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsexec: cloned binary rework #3987

Merged
merged 8 commits into from
Sep 24, 2023
Merged

nsexec: cloned binary rework #3987

merged 8 commits into from
Sep 24, 2023

Conversation

cyphar
Copy link
Member

@cyphar cyphar commented Aug 18, 2023

Split out from #3953 to be reviewed separately.


First, Migrate all of the memfd /proc/self/exe logic to Go code.
This allow us to remove the amount of C code in runc quite
substantially, as well as removing a whole execve(2) from the nsexec
path because we no longer spawn "runc init" only to re-exec "runc init"
after doing the clone.

Then, carry the code from #3983 and combine it with the above to
keep all of the runc-dmz logic in the Go code as well. The runc-dmz
feature is optional (controlled by the runc_nodmz build tag at
build time, or RUNC_DMZ=legacy at runtime), with a fallback to the
classic /proc/self/exe cloning. If /proc/self/exe is already a
cloned binary then neither runc-dmz nor /proc/self/exe are
cloned.

This also adds a memfd-bind helper binary for users which really
need to avoid any excess copies. Unfortunately, the semantics of
bind-mounting a memfd are quite ugly (they cannot be bind-mounted
directly -- instead, you must bind-mount the magic-link of a file
descriptor pointing to the memfd, which means you need a long-running
process to keep the magic-link alive).

Fixes #3965
Fixes #3973
Signed-off-by: Aleksa Sarai cyphar@cyphar.com

@cyphar cyphar added this to the 1.2.0 milestone Aug 18, 2023
@cyphar cyphar marked this pull request as ready for review August 20, 2023 13:58
README.md Outdated Show resolved Hide resolved
}
path := "/proc/self/fd/" + strconv.Itoa(int(f.Fd()))
if err := unix.Access(path, unix.X_OK); err == nil {
return true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The binary might be still unexecutable depending on LSM?
Maybe we should just execute the binary and see whether it returns a magic ret code?

Copy link
Member Author

@cyphar cyphar Aug 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, though doing a maximum of ~10 extra fork+execs in the absolute worst case seems a bit expensive... This also can't detect every LSM issue (it's possible executing a host /tmp program will be blocked by the container's LSM setup -- and we cannot detect that this early into container setup).

I think the only true solution for a really serious LSM setup is to disable runc-dmz entirely, or we'll have to create a tmpfs mountfd that allows executables, which would add more mount_lock lock contention again....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the new RUNC_DMZ=legacy option means that users can work around this if they hit an issue here. I don't think there's a nicer way of dealing with this problem completely tbh.

Makefile Outdated Show resolved Hide resolved
Makefile Show resolved Hide resolved
@cyphar cyphar requested a review from lifubang August 23, 2023 07:10
@cyphar cyphar dismissed lifubang’s stale review August 23, 2023 08:07

switched to LANG=C.UTF-8

Makefile Outdated Show resolved Hide resolved
libcontainer/dmz/dmz.go Outdated Show resolved Hide resolved
Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall; left some nits but those can be addressed in a followup.

if err != nil {
return nil, err
}
copied, err := io.Copy(file, src)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cyphar Do you really want to use io.Copy instead of unix.Sendfile?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's necessary. I only used sendfile(2) in C because it is easier than doing buffered copies. io.Copy is more than fast enough in Go (it might even use sendfile(2) as an optimisation, I'm not sure).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. sendfile() copies data between one file descriptor and another. Because this copying is done within the kernel, sendfile() is more efficient than the combination of read(2) and write(2), which would require transferring data to and from user space.
  2. ‘io.Copy’ may needs more 32kb memory than sendfile().

But maybe no need for a 14MB file copy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assumed Go had an (*os.File).ReadFrom implementation that used sendfile(2) (which would make io.Copy use it without a buffer) but it turns out they only have one which uses splice(2) (which does efficient copies to and from pipes). I might send a patch to fix that...

I'm not sure whether it'll makes a real difference, I can write a quick benchmark to check...

This comment was marked as outdated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it seems that on some systems the difference can be pretty significant. I've added sendfile(2) acceleration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assumed Go had an (*os.File).ReadFrom implementation that used sendfile(2) (which would make io.Copy use it without a buffer) but it turns out they only have one which uses splice(2) (which does efficient copies to and from pipes). I might send a patch to fix that...

@cyphar hmm, from what I see, Go runtime uses copy_file_range(2) to copy files since Go 1.15 (https://go-review.googlesource.com/c/go/+/229101). I haven't checked but I assume that copy_file_range(2) is as efficient as sendfile(2) for this case.

cyphar and others added 8 commits September 22, 2023 15:13
We need these to match the Makefile detection of the right gcc for
runc-dmz, as well as making sure that everything builds properly for our
cross-i386 tests. While we're at it, add x86 to the list of build
targets for release builds (presumably nobody will use it, but since we
do test builds of this anyway it probably won't hurt).

In addition, clean up the handling of the native architecture build by
treating it the same as any other build (ensuring that building runc
from a different platform will work the same way regardless of the
native architecture). In practice, the build works the same way as
before.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This allow us to remove the amount of C code in runc quite
substantially, as well as removing a whole execve(2) from the nsexec
path because we no longer spawn "runc init" only to re-exec "runc init"
after doing the clone.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Previously, if /var/run was mounted noexec, our cloned binary logic
would not work if memfd_create(2) was not available because we would try
to exec a binary that is on a noexec filesystem.

We cannot guarantee there will be an executable filesystem on the system
(other than mounting one ourselves, which would cause a bunch of other
headaches) but we can at least try the obvious options (/tmp, /bin, and
/). If none of these work, we will have to fail.

Reported-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
The idea is to remove the need for cloning the entire runc binary by
replacing the final execve() call of the container process with an
execve() call to a clone of a small C binary which just does an execve()
of its arguments.

This provides similar protection against CVE-2019-5736 but without
requiring a >10MB binary copy for each "runc init". When compiled with
musl, runc-dmz is 13kB (though unfortunately with glibc, it is 1.1MB
which is still quite large).

It should be noted that there is still a window where the container
processes could get access to the host runc binary, but because we set
ourselves as non-dumpable the container would need CAP_SYS_PTRACE (which
is not enabled by default in Docker) in order to get around the
proc_fd_access_allowed() checks. In addition, since Linux 4.10[1] the
kernel blocks access entirely for user namespaced containers in this
scenario. For those cases we cannot use runc-dmz, but most containers
won't have this issue.

This new runc-dmz binary can be opted out of at compile time by setting
the "runc_nodmz" buildtag, and at runtime by setting the RUNC_DMZ=legacy
environment variable. In both cases, runc will fall back to the classic
/proc/self/exe-based cloning trick. If /proc/self/exe is already a
sealed memfd (namely if the user is using contrib/cmd/memfd-bind to
create a persistent sealed memfd for runc), neither runc-dmz nor
/proc/self/exe cloning will be used because they are not necessary.

[1]: torvalds/linux@bfedb58

Co-authored-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: lifubang <lifubang@acmcoder.com>
[cyphar: address various review nits]
[cyphar: fix runc-dmz cross-compilation]
[cyphar: embed runc-dmz into runc binary and clone in Go code]
[cyphar: make runc-dmz optional, with fallback to /proc/self/exe cloning]
[cyphar: do not use runc-dmz when the container has certain privs]
Co-authored-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This really isn't ideal but it can be used to avoid the largest issues
with the memfd-based runc binary protection. There are several caveats
with using this tool, see the help page for the new binary for details.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Due to the way capabilities have to be set by runc, capabilities need to
be included in the inheritable and ambient sets anyway. Otherwise, the
container process would not have the correct privileges. This test only
functioned because adding CAP_DAC_OVERRIDE to the inherited,
permissible, and bounding sets means that only "runc init" has these
capabilities -- everything other than the bounding set is cleared on the
first execve(). This breaks with runc-dmz, but the behaviour was broken
from the outset.

Docker appears to not handle this properly at all (the logic for
capability sets changed with the introduction of ambient capabilities,
and while Docker was updated it seems the behaviour is still incorrect
for non-root users).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This results in a 5-20% speedup of dmz.CloneBinary(), depending on the
machine.

io.Copy:

  goos: linux
  goarch: amd64
  pkg: github.com/opencontainers/runc/libcontainer/dmz
  cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
  BenchmarkCloneBinary
  BenchmarkCloneBinary-8               139           8075074 ns/op
  PASS
  ok      github.com/opencontainers/runc/libcontainer/dmz 2.286s

unix.Sendfile:

  goos: linux
  goarch: amd64
  pkg: github.com/opencontainers/runc/libcontainer/dmz
  cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
  BenchmarkCloneBinary
  BenchmarkCloneBinary-8               192           6382121 ns/op
  PASS
  ok      github.com/opencontainers/runc/libcontainer/dmz 2.415s

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
9 participants