stage1: upgrade to systemd-v241 from Flatcar 2219.99.0 #4002

agermain · 2019-06-20T16:57:42Z

This PR is part of Google Summer of Code 2019.

Summary of work included within this PR:

Update to the latest version of systemd (v241)
Fix broken tests due to systemd upgrade (one of the changes being systemd adding an extra directory in the cgroupv1 hierarchy)
Switch to Flatcar as a drop-in replacement of Container Linux as some of the tests were failing due to unexpected behaviour of systemd and Flatcar's Edge channel contains the needed bug fixes.
Make rkt compatible with latest Go versions (in reference to gofmt and keyed composite literals being more future-proof)
Update both SemaphoreCI and Travis to use newer Go versions as well as fixing some of the dependency issues, thus making both platforms pass their builds successfully after this PR.

agermain · 2019-06-20T20:45:49Z

iaguis · 2019-06-21T14:29:06Z

Thanks for this! Now we need to fix the CI.

As @alban says in #3790 there was a change in system that made rkt not work when systemd<205 is used on the host. There were two suggestions: updating the SemaphoreCI platform to ubuntu 16.04 which has a newer systemd and skip registration with machined (more details here) when we detect systemd on the host has a version <205.

Initially, I thought we could do both. However it seems the function (allocate_scope()) that calls StartTransientUnit (which is the one that fails on old systemds) is called when machined registration is disabled and keep-unit (which rkt enables when it is run in a systemd service) is also disabled. It doesn't make sense to enable machined registration when systemd is that old because it will fail calling the registration methods and it doesn't make sense to enable keep-unit when we're not running in a systemd service so I don't think we can do anything here that doesn't require changing systemd-nspawn itself.

At this point I think we can ignore legacy systems so I just updated the Semaphore platform and it now complains about some package not being available, please fix that and we'll continue going through issues the CI finds until we fix them all.

iaguis · 2019-06-21T14:31:19Z

it doesn't make sense to enable keep-unit when we're not running in a systemd service

Now that I look at the commit that added this allocate_scope function, maybe it does make sense to pass --keep-unit and register=no on old systems since for that combination it says "do not allocate nor register anything". However, keep-unit might have other bad effects so we should check it doesn't break other stuff.

In any case, this is something we can do later so don't worry about it for now.

agermain · 2019-07-02T16:17:23Z

Currently both jobs on semaphoreCI are failing, status is as follows:

Job 1:

stage1/usr_from_kvm/qemu.mk:71: recipe for target '/home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.30.0+git/stamps/__stage1_usr_from_kvm_qemu_mk_conf.stamp' failed
make: *** [/home/runner/rkt/builds/build-rkt-kvm/build-rkt-1.30.0+git/stamps/__stage1_usr_from_kvm_qemu_mk_conf.stamp] Error 1

Job 2:

kt-coreos/build-rkt-1.30.0+git/stamps/__tests_functional_mk_functional_tests.stamp' failed
make: *** [/home/runner/rkt/builds/build-rkt-coreos/build-rkt-1.30.0+git/stamps/__tests_functional_mk_functional_tests.stamp] Error 1

It seems related to dependency issues (see #3698), however the aim is to fix it for the CI so I'm unsure how we can port that solution in this case, thoughts? @iaguis

iaguis · 2019-07-03T14:45:20Z

They are two different failures, you often need to go further up to see the actual error.

Job 1 (KVM)

I assume this KVM build failure happens now because we're using the "Docker Light" Semaphore platform which misses some dependencies needed to build qemu. I suggest getting SSH access to the SemaphoreCI machine. Once there you can build with V=3 to see all the output and check what dependency is missing, install it, and continue until it builds. Then you need to modify the install-deps.sh script to add the new dependencies.

Job 2

Here it seems pretty much all the tests fail with a message like this:

Failed to allocate scope: Cannot set property CollectMode, or unknown property.

This means systemd-nspawn (through rkt) started the container specifying a property that systemd on the host doesn't understand (CollectMode was added on systemd v236 and Semaphore has systemd v229). This is another option set when allocate_scope() is called.

As mentioned in #4002 (comment), allocate_scope() is called when machined registration is not supported and keep-unit is disabled. Functional tests run rkt in a regular shell, so keep-unit is disabled (it's only enabled when rkt runs in a systemd service), but machined registration should work. Doing some tests, it seems machined registration doesn't work on Semaphore because we're missing machined itself on the host.

So an easy solution for this error is installing machined, which is part of the systemd-container package.

agermain · 2019-07-15T13:05:46Z

@iaguis Thanks for the feedback!

I have now solved issues related to the dependencies (libglib2.0-dev and libpixman-1-dev) for Job 1, as well as fixing the Failed to allocate scope: Cannot set property CollectMode, or unknown property. issue by installing machined as suggested.

Issues related to TestConfig have also been solved, the problem was that the json package had been relying on a bug within the reflect package, where it was able to allocate the unexported field, so after the bug was fixed it just needed modification.

Job 1 seems to be stuck at TestAppIsolatorMemory consistently despite of rebuilding (there was a possibility that maybe we were getting a slower VM on Semaphore), whereas Job 2 just fails the test, could there possibly be something that makes it timeout for the KVM flavor, but not for the coreOS?

Is it safe to assume that I should just keep on trying to make the rest of the tests pass? Mostly referring to Job 2, which as of now are:

TestAppIsolatorMemory (times out on Job 1)
TestAppIsolatorCPU
TestAppIsolatorCPUShares
TestDevices
TestPrepareAppCheckEtcHostname
TestNoNewPrivileges
TestPodManifest
TestAppIsolatorSeccomp
TestVolumeMount

iaguis · 2019-07-15T17:01:45Z

Cool! Hopefully we're getting close.

For the KVM flavor, let's worry about it after we fix all the issues in the CoreOS flavor.

Yes, we should fix those Job 2 tests pass.

I'll have a look at them tomorrow to provide some guidance.

iaguis · 2019-07-16T16:39:09Z

It seems systemd added an extra directory level in the cgroupv1 hierarchy (systemd/systemd@720f0a2) whereas before they only had that extra directory in the cgroupv2 hierarchy.

This breaks rkt's cgroup settings because we assume that the pod will be in /sys/fs/cgroup/$CONTROLLER/machine.slice/machine-rkt-.../ and it's now in /sys/fs/cgroup/$CONTROLLER/machine.slice/machine-rkt-.../payload/.

Adding the payload subdirectory here seems to make the isolator tests pass. We also need to remove the if here because systemd expects the rkt pod to be inside payload too when running inside a systemd service.

About TestPrepareAppCheckEtcHostname, it seems to be about new systemds removing underscores from the hostname, since we get customhostnamesetting and we expect custom_hostname_setting so we should change the expected hostname to something without underscores. It's probably this commit: systemd/systemd@d65652f

With that we still need to fix:

TestDevices
TestNoNewPrivileges
TestAppIsolatorSeccomp
TestVolumeMount

I haven't had time to look at those.

agermain · 2019-07-27T14:22:18Z

For the record, instead of committing the commented volume mount tests I am now awaiting on Flatcar's Edge build which will have reverted the change that broke the test, which was to mount recursively instead of non-recursively, reference: systemd/systemd#13170

Once the build is released we will update rkt to use Flatcar Edge instead of CoreOS Container Linux Stable.

iaguis

It's great that tests pass now! 🎉

Some small things:

Could you rework commits so only the ones needed are present? For example, there's one commit that introduces Flatcar and another that updates the Flatcar version, the first one is not needed. Something similar with the several commits that deal with the payload addition. The idea is to keep changes divided logically in different commits so future developers have a good history.
Can you limit commit messages to 80 columns like mentioned in the contributing guidelines?
Please change the PR description summarizing the PR changes and mention this work was done as a part of GSOC.

Thanks for this!

stage1/gc/gc.go

tests/build-and-run-tests.sh

tests/rkt_devices_test.go

tests/rkt_cat_manifest_test.go

tests/rkt_devices_test.go

tests/rkt_seccomp_test.go

stage1/usr_from_coreos/cache.sh

"make unit-check" throws a "gofmt checking failed", this resolves the issue. Refers to rkt#4001

"make unit-check" throws many "KeyValue composite literal uses unkeyed fields", even though these checks can be disabled manually, this solution seems to be more future-proof. Refers to rkt#4001

Because of the switch to Docker Light from the Semaphore platform to have a newer distro, this is required to successfully build rkt on SemaphoreCI.

There are three arguments but call only takes two, which stops the test from passing.

New versions of systemd allow creating the ptmx device so we're switching to mem.

Since go1.11, `syscall.Stat()` is implemented with `newfstatat()` (golang/go@1073256). However we are testing that blocking `stat()` resulted in `syscall.Stat()` failing, which doesn't work anymore. We were already blocking `newfstatat()` in arm64 architectures so we just block that syscall unconditionally now.

Currently functional tests are being run for the host flavor, this will disable them if it detects the semaphoreCI environment, as its systemd version is too old there.

agermain · 2019-08-07T21:24:32Z

Thanks for the feedback @iaguis! I believe all of the comments have been addressed, let me know.

Currently the build with SemaphoreCI fails as it can't find missing/obsolete packages, this aims to fix libsystemd-journal-dev.

Arguments of wrong type are being passed to Fatalf, which stops the test from passing.

Some dependencies are required to compile QEMU and also to address the "cannot set property CollectMode" in regards to allocate_scope().

Test was able to pass in the past due to a bug on the reflect package, but now it cannot set an embedded pointer to an unexported struct.

Due to systemd adding an extra directory in the cgroupv1 hierarchy, rkt's cgroup settings broke as it assumed that the pod would be in another path.

Newer versions of systemd removes underscores from the hostname, therefore expected value needs changed.

As CoreOS Container Linux is in maintenance mode (coreos/bugs#2559 (comment)) and won't see many new features and we need some systemd bugfixes that rkt needs (specifically systemd/systemd#13173 and systemd/systemd#12860) let's switch to Flatcar Linux. Flatcar Linux us a drop-in replacement of Container Linux and its Edge channel contains the needed bugfixes. To do that, we need to change the GPG key to verify the images through the script.

agermain force-pushed the master branch from 5db71ca to d188deb Compare June 20, 2019 20:43

agermain force-pushed the master branch 4 times, most recently from 99e3a18 to 1cac248 Compare June 28, 2019 16:26

agermain mentioned this pull request Jun 28, 2019

Fix "make unit-check" #4003

Closed

agermain force-pushed the master branch 3 times, most recently from b097878 to 77c224b Compare July 2, 2019 14:10

agermain force-pushed the master branch from d582dae to 77ce6b7 Compare July 12, 2019 17:54

agermain force-pushed the master branch 2 times, most recently from 19cd83c to 0a77e96 Compare July 18, 2019 13:46

agermain changed the title ~~stage1: upgrade to systemd-v241 from CoreOS 2079.6.0~~ stage1: upgrade to systemd-v241 from Flatcar 2205.99.1 Jul 30, 2019

agermain force-pushed the master branch 2 times, most recently from 4898c0a to a2b5a05 Compare August 5, 2019 18:46

agermain changed the title ~~stage1: upgrade to systemd-v241 from Flatcar 2205.99.1~~ stage1: upgrade to systemd-v241 from Flatcar 2219.99.0 Aug 5, 2019

iaguis suggested changes Aug 7, 2019

View reviewed changes

agermain added 4 commits August 7, 2019 18:50

rkt: fix gofmt checks in "make unit-check"

cbc47e1

"make unit-check" throws a "gofmt checking failed", this resolves the issue. Refers to rkt#4001

rkt: fix KeyValue unkeyed fields in "make unit-check"

46d8e1a

"make unit-check" throws many "KeyValue composite literal uses unkeyed fields", even though these checks can be disabled manually, this solution seems to be more future-proof. Refers to rkt#4001

tests: fix Go installation for SemaphoreCI

24c1063

Because of the switch to Docker Light from the Semaphore platform to have a newer distro, this is required to successfully build rkt on SemaphoreCI.

tests: fix rkt_cat_manifest_test

b267e40

There are three arguments but call only takes two, which stops the test from passing.

iaguis and others added 3 commits August 7, 2019 19:28

tests: fix TestDevices

3c14346

New versions of systemd allow creating the ptmx device so we're switching to mem.

tests: disable functional tests for host flavor on CI

57570cc

Currently functional tests are being run for the host flavor, this will disable them if it detects the semaphoreCI environment, as its systemd version is too old there.

agermain force-pushed the master branch 2 times, most recently from cad5f20 to d492b3a Compare August 7, 2019 19:05

agermain mentioned this pull request Aug 7, 2019

make unit-check fails #4001

Closed

agermain added 8 commits August 9, 2019 17:54

tests: Fix libsystemd-journal-dev for CI

8120d4d

Currently the build with SemaphoreCI fails as it can't find missing/obsolete packages, this aims to fix libsystemd-journal-dev.

tests: fix rkt_volume_mount_test

449bd28

Arguments of wrong type are being passed to Fatalf, which stops the test from passing.

tests: add dependencies for SemaphoreCI

6c14bbe

Some dependencies are required to compile QEMU and also to address the "cannot set property CollectMode" in regards to allocate_scope().

tests: fix rkt_config_test

ebca8d4

Test was able to pass in the past due to a bug on the reflect package, but now it cannot set an embedded pointer to an unexported struct.

travis: update go versions

1adc3e2

stage1: fix cgroup cleanup and directories

59c9dc0

Due to systemd adding an extra directory in the cgroupv1 hierarchy, rkt's cgroup settings broke as it assumed that the pod would be in another path.

tests: fix TestPrepareAppCheckEtcHostname

c3b86f5

Newer versions of systemd removes underscores from the hostname, therefore expected value needs changed.

agermain force-pushed the master branch from d492b3a to 19ede75 Compare August 9, 2019 16:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

stage1: upgrade to systemd-v241 from Flatcar 2219.99.0 #4002

stage1: upgrade to systemd-v241 from Flatcar 2219.99.0 #4002

agermain commented Jun 20, 2019 •

edited

agermain commented Jun 20, 2019

iaguis commented Jun 21, 2019

iaguis commented Jun 21, 2019

agermain commented Jul 2, 2019

iaguis commented Jul 3, 2019

agermain commented Jul 15, 2019

iaguis commented Jul 15, 2019

iaguis commented Jul 16, 2019

agermain commented Jul 27, 2019

iaguis left a comment

agermain commented Aug 7, 2019

stage1: upgrade to systemd-v241 from Flatcar 2219.99.0 #4002

Are you sure you want to change the base?

stage1: upgrade to systemd-v241 from Flatcar 2219.99.0 #4002

Conversation

agermain commented Jun 20, 2019 • edited

agermain commented Jun 20, 2019

iaguis commented Jun 21, 2019

iaguis commented Jun 21, 2019

agermain commented Jul 2, 2019

iaguis commented Jul 3, 2019

Job 1 (KVM)

Job 2

agermain commented Jul 15, 2019

iaguis commented Jul 15, 2019

iaguis commented Jul 16, 2019

agermain commented Jul 27, 2019

iaguis left a comment

Choose a reason for hiding this comment

agermain commented Aug 7, 2019

agermain commented Jun 20, 2019 •

edited