Skip to content
This repository has been archived by the owner on Feb 24, 2020. It is now read-only.

stage1: upgrade to systemd-v241 from Flatcar 2219.99.0 #4002

Open
wants to merge 15 commits into
base: master
Choose a base branch
from

Conversation

agermain
Copy link
Collaborator

@agermain agermain commented Jun 20, 2019

This PR is part of Google Summer of Code 2019.

Summary of work included within this PR:

  • Update to the latest version of systemd (v241)
  • Fix broken tests due to systemd upgrade (one of the changes being systemd adding an extra directory in the cgroupv1 hierarchy)
  • Switch to Flatcar as a drop-in replacement of Container Linux as some of the tests were failing due to unexpected behaviour of systemd and Flatcar's Edge channel contains the needed bug fixes.
  • Make rkt compatible with latest Go versions (in reference to gofmt and keyed composite literals being more future-proof)
  • Update both SemaphoreCI and Travis to use newer Go versions as well as fixing some of the dependency issues, thus making both platforms pass their builds successfully after this PR.

@agermain
Copy link
Collaborator Author

@iaguis PTAL.

@iaguis
Copy link
Member

iaguis commented Jun 21, 2019

Thanks for this! Now we need to fix the CI.

As @alban says in #3790 there was a change in system that made rkt not work when systemd<205 is used on the host. There were two suggestions: updating the SemaphoreCI platform to ubuntu 16.04 which has a newer systemd and skip registration with machined (more details here) when we detect systemd on the host has a version <205.

Initially, I thought we could do both. However it seems the function (allocate_scope()) that calls StartTransientUnit (which is the one that fails on old systemds) is called when machined registration is disabled and keep-unit (which rkt enables when it is run in a systemd service) is also disabled. It doesn't make sense to enable machined registration when systemd is that old because it will fail calling the registration methods and it doesn't make sense to enable keep-unit when we're not running in a systemd service so I don't think we can do anything here that doesn't require changing systemd-nspawn itself.

At this point I think we can ignore legacy systems so I just updated the Semaphore platform and it now complains about some package not being available, please fix that and we'll continue going through issues the CI finds until we fix them all.

@iaguis
Copy link
Member

iaguis commented Jun 21, 2019

it doesn't make sense to enable keep-unit when we're not running in a systemd service

Now that I look at the commit that added this allocate_scope function, maybe it does make sense to pass --keep-unit and register=no on old systems since for that combination it says "do not allocate nor register anything". However, keep-unit might have other bad effects so we should check it doesn't break other stuff.

In any case, this is something we can do later so don't worry about it for now.

@agermain agermain force-pushed the master branch 4 times, most recently from 99e3a18 to 1cac248 Compare June 28, 2019 16:26
@agermain agermain mentioned this pull request Jun 28, 2019
@agermain agermain force-pushed the master branch 3 times, most recently from b097878 to 77c224b Compare July 2, 2019 14:10
@agermain
Copy link
Collaborator Author

agermain commented Jul 2, 2019

Currently both jobs on semaphoreCI are failing, status is as follows:

Job 1:

stage1/usr_from_kvm/qemu.mk:71: recipe for target '/home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.30.0+git/stamps/__stage1_usr_from_kvm_qemu_mk_conf.stamp' failed
make: *** [/home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.30.0+git/stamps/__stage1_usr_from_kvm_qemu_mk_conf.stamp] Error 1

Job 2:

kt-coreos/build-rkt-1.30.0+git/stamps/__tests_functional_mk_functional_tests.stamp' failed
make: *** [/home/runner/rkt/builds/build-rkt-coreos/build-rkt-1.30.0+git/stamps/__tests_functional_mk_functional_tests.stamp] Error 1

It seems related to dependency issues (see #3698), however the aim is to fix it for the CI so I'm unsure how we can port that solution in this case, thoughts? @iaguis

@iaguis
Copy link
Member

iaguis commented Jul 3, 2019

They are two different failures, you often need to go further up to see the actual error.

Job 1 (KVM)

I assume this KVM build failure happens now because we're using the "Docker Light" Semaphore platform which misses some dependencies needed to build qemu. I suggest getting SSH access to the SemaphoreCI machine. Once there you can build with V=3 to see all the output and check what dependency is missing, install it, and continue until it builds. Then you need to modify the install-deps.sh script to add the new dependencies.

Job 2

Here it seems pretty much all the tests fail with a message like this:

Failed to allocate scope: Cannot set property CollectMode, or unknown property.

This means systemd-nspawn (through rkt) started the container specifying a property that systemd on the host doesn't understand (CollectMode was added on systemd v236 and Semaphore has systemd v229). This is another option set when allocate_scope() is called.

As mentioned in #4002 (comment), allocate_scope() is called when machined registration is not supported and keep-unit is disabled. Functional tests run rkt in a regular shell, so keep-unit is disabled (it's only enabled when rkt runs in a systemd service), but machined registration should work. Doing some tests, it seems machined registration doesn't work on Semaphore because we're missing machined itself on the host.

So an easy solution for this error is installing machined, which is part of the systemd-container package.

@agermain
Copy link
Collaborator Author

@iaguis Thanks for the feedback!

I have now solved issues related to the dependencies (libglib2.0-dev and libpixman-1-dev) for Job 1, as well as fixing the Failed to allocate scope: Cannot set property CollectMode, or unknown property. issue by installing machined as suggested.

Issues related to TestConfig have also been solved, the problem was that the json package had been relying on a bug within the reflect package, where it was able to allocate the unexported field, so after the bug was fixed it just needed modification.

Job 1 seems to be stuck at TestAppIsolatorMemory consistently despite of rebuilding (there was a possibility that maybe we were getting a slower VM on Semaphore), whereas Job 2 just fails the test, could there possibly be something that makes it timeout for the KVM flavor, but not for the coreOS?

Is it safe to assume that I should just keep on trying to make the rest of the tests pass? Mostly referring to Job 2, which as of now are:

  • TestAppIsolatorMemory (times out on Job 1)
  • TestAppIsolatorCPU
  • TestAppIsolatorCPUShares
  • TestDevices
  • TestPrepareAppCheckEtcHostname
  • TestNoNewPrivileges
  • TestPodManifest
  • TestAppIsolatorSeccomp
  • TestVolumeMount

@iaguis
Copy link
Member

iaguis commented Jul 15, 2019

Cool! Hopefully we're getting close.

For the KVM flavor, let's worry about it after we fix all the issues in the CoreOS flavor.

Yes, we should fix those Job 2 tests pass.

I'll have a look at them tomorrow to provide some guidance.

@iaguis
Copy link
Member

iaguis commented Jul 16, 2019

It seems systemd added an extra directory level in the cgroupv1 hierarchy (systemd/systemd@720f0a2) whereas before they only had that extra directory in the cgroupv2 hierarchy.

This breaks rkt's cgroup settings because we assume that the pod will be in /sys/fs/cgroup/$CONTROLLER/machine.slice/machine-rkt-.../ and it's now in /sys/fs/cgroup/$CONTROLLER/machine.slice/machine-rkt-.../payload/.

Adding the payload subdirectory here seems to make the isolator tests pass. We also need to remove the if here because systemd expects the rkt pod to be inside payload too when running inside a systemd service.

About TestPrepareAppCheckEtcHostname, it seems to be about new systemds removing underscores from the hostname, since we get customhostnamesetting and we expect custom_hostname_setting so we should change the expected hostname to something without underscores. It's probably this commit: systemd/systemd@d65652f

With that we still need to fix:

  • TestDevices
  • TestNoNewPrivileges
  • TestAppIsolatorSeccomp
  • TestVolumeMount

I haven't had time to look at those.

@agermain agermain force-pushed the master branch 2 times, most recently from 19cd83c to 0a77e96 Compare July 18, 2019 13:46
@agermain
Copy link
Collaborator Author

For the record, instead of committing the commented volume mount tests I am now awaiting on Flatcar's Edge build which will have reverted the change that broke the test, which was to mount recursively instead of non-recursively, reference: systemd/systemd#13170

Once the build is released we will update rkt to use Flatcar Edge instead of CoreOS Container Linux Stable.

@agermain agermain changed the title stage1: upgrade to systemd-v241 from CoreOS 2079.6.0 stage1: upgrade to systemd-v241 from Flatcar 2205.99.1 Jul 30, 2019
@agermain agermain force-pushed the master branch 2 times, most recently from 4898c0a to a2b5a05 Compare August 5, 2019 18:46
@agermain agermain changed the title stage1: upgrade to systemd-v241 from Flatcar 2205.99.1 stage1: upgrade to systemd-v241 from Flatcar 2219.99.0 Aug 5, 2019
Copy link
Member

@iaguis iaguis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's great that tests pass now! 🎉

Some small things:

  • Could you rework commits so only the ones needed are present? For example, there's one commit that introduces Flatcar and another that updates the Flatcar version, the first one is not needed. Something similar with the several commits that deal with the payload addition. The idea is to keep changes divided logically in different commits so future developers have a good history.

  • Can you limit commit messages to 80 columns like mentioned in the contributing guidelines?

  • Please change the PR description summarizing the PR changes and mention this work was done as a part of GSOC.

Thanks for this!

stage1/gc/gc.go Outdated Show resolved Hide resolved
tests/build-and-run-tests.sh Show resolved Hide resolved
tests/build-and-run-tests.sh Outdated Show resolved Hide resolved
tests/build-and-run-tests.sh Show resolved Hide resolved
tests/rkt_devices_test.go Outdated Show resolved Hide resolved
tests/rkt_cat_manifest_test.go Show resolved Hide resolved
tests/rkt_devices_test.go Show resolved Hide resolved
tests/rkt_seccomp_test.go Show resolved Hide resolved
stage1/usr_from_coreos/cache.sh Show resolved Hide resolved
"make unit-check" throws a "gofmt checking failed", this resolves the issue.

Refers to rkt#4001
"make unit-check" throws many "KeyValue composite literal uses unkeyed fields", 
even though these checks can be disabled manually, this solution seems to be
more future-proof.

Refers to rkt#4001
Because of the switch to Docker Light from the Semaphore platform to have a 
newer distro, this is required to successfully build rkt on SemaphoreCI.
There are three arguments but call only takes two, which stops the 
test from passing.
iaguis and others added 3 commits August 7, 2019 19:28
New versions of systemd allow creating the ptmx device so we're switching to mem.
Since go1.11, `syscall.Stat()` is implemented with `newfstatat()`
(golang/go@1073256).

However we are testing that blocking `stat()` resulted in `syscall.Stat()`
failing, which doesn't work anymore.

We were already blocking `newfstatat()` in arm64 architectures so we
just block that syscall unconditionally now.
Currently functional tests are being run for the host flavor,
this will disable them if it detects the semaphoreCI environment,
as its systemd version is too old there.
@agermain agermain force-pushed the master branch 2 times, most recently from cad5f20 to d492b3a Compare August 7, 2019 19:05
@agermain agermain mentioned this pull request Aug 7, 2019
@agermain
Copy link
Collaborator Author

agermain commented Aug 7, 2019

Thanks for the feedback @iaguis! I believe all of the comments have been addressed, let me know.

Currently the build with SemaphoreCI fails as it can't find missing/obsolete
packages, this aims to fix libsystemd-journal-dev.
Arguments of wrong type are being passed to Fatalf, which 
stops the test from passing.
Some dependencies are required to compile QEMU and
also to address the "cannot set property CollectMode" 
in regards to allocate_scope().
Test was able to pass in the past due to a bug on the reflect
package, but now it cannot set an embedded pointer to an unexported
struct.
Due to systemd adding an extra directory in the cgroupv1 hierarchy, 
rkt's cgroup settings broke as it assumed that the pod
would be in another path.
Newer versions of systemd removes underscores from the hostname,
therefore expected value needs changed.
As CoreOS Container Linux is in maintenance mode
(coreos/bugs#2559 (comment)) and won't
see many new features and we need some systemd bugfixes that rkt needs
(specifically systemd/systemd#13173 and
systemd/systemd#12860) let's switch to Flatcar Linux.

Flatcar Linux us a drop-in replacement of Container Linux and its Edge channel
contains the needed bugfixes.

To do that, we need to change the GPG key to verify the images through the script.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants