Bug: Updates after 0.5.18 breaks Gitlab CI (CNI) #1160

Closed
fullykubed opened this issue Aug 13, 2021 · 9 comments

@fullykubed

I recently tried to upgrade our GitLab runners from Earthly 0.5.18 to 0.5.22, but the new default BuildKit backend appears to break on GitLab CI. The version bundled with 0.5.18 worked.

This appears to be due to some CNI configuration issues. I am investigating further, but I wanted to share here in case anyone runs into similar issues.

Error logs

           buildkitd | Starting buildkit daemon as a docker container (earthly-buildkitd)...
      buildkitd-pull | Pulling buildkitd image...
      buildkitd-pull | ...Done
Error stack trace:
github.com/earthly/earthly/buildkitd.init
	/earthly/buildkitd/buildkitd.go:28
runtime.doInit
	/usr/local/go/src/runtime/proc.go:6308
runtime.doInit
	/usr/local/go/src/runtime/proc.go:6285
runtime.main
	/usr/local/go/src/runtime/proc.go:208
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1371
Error: build new buildkitd client: maybe start buildkitd: wait until started: buildkitd crashed
It seems that buildkitd is shutting down or it has crashed. You can report crashes at https://github.com/earthly/earthly/issues/new.
==================================System Info===================================
version: v0.5.22
build-sha: bdd2a82a47e9249f7ecba05cd24f4483f07c7101
platform: linux/amd64; Debian 10
=================================Docker Version=================================
Client:
 Version:           20.10.6
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        370c289
 Built:             Fri Apr  9 22:42:10 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.8
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.6
  Git commit:       75249d8
  Built:            Fri Jul 30 19:55:09 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.4.9
  GitCommit:        e25210fe30a0a703442421b0f60afac609f950a3
 runc:
  Version:          1.0.1
  GitCommit:        v1.0.1-0-g4144b638
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

=================================Buildkit Logs==================================
starting earthly-buildkit with EARTHLY_GIT_HASH=bdd2a82a47e9249f7ecba05cd24f4483f07c7101 BUILDKIT_BASE_IMAGE=github.com/earthly/buildkit:199ad6a5c213d2a6937ced9e2b52b5b57e0a3a37+build
Legacy ip_tables is not loaded.
Switching to iptables-nft.
Module                  Size  Used by    Tainted: G  
xt_nat                 16384  1 
nf_log_ipv4            16384  1 
nf_log_common          16384  1 nf_log_ipv4
xt_LOG                 16384  1 
xt_limit               16384  1 
xt_hashlimit           20480  4 
veth                   24576  0 
xt_MASQUERADE          16384  3 
xt_addrtype            16384  4 
iptable_nat            16384  2 
nf_nat                 61440  3 xt_nat,xt_MASQUERADE,iptable_nat
br_netfilter           24576  0 
ip6table_filter        16384  1 
ip6_tables             28672  1 ip6table_filter
aesni_intel           368640  0 
glue_helper            20480  1 aesni_intel
crypto_simd            16384  1 aesni_intel
cryptd                 24576  1 crypto_simd
loadpin_trigger        12288  0 [permanent]
BUILDKIT_ROOT_DIR=/tmp/earthly/buildkit
CACHE_SIZE_MB=0
EARTHLY_ADDITIONAL_BUILDKIT_CONFIG=
CNI_MTU=1500

======== CNI config ==========
{
	"cniVersion": "0.3.0",
	"name": "buildkitbuild",
	"type": "bridge",
	"bridge": "cni0",
	"isGateway": true,
	"ipMasq": true,
	"mtu": 1500,
	"ipam": {
		"type": "host-local",
		"subnet": "172.30.0.0/16",
		"routes": [
			{ "dst": "0.0.0.0/0" }
		]
	}
}
======== End CNI config ==========

======== Buildkitd config ==========
debug = false
root = "/tmp/earthly/buildkit"
insecure-entitlements = [ "security.insecure" ]




[worker.oci]
  enabled = true
  snapshotter = "auto"
  max-parallelism = 20
  gc = true
  networkMode = "cni"
  cniBinaryPath = "/usr/libexec/cni"
  cniConfigPath = "/etc/cni/cni-conf.json"
  


======== End buildkitd config ==========
Detected container architecture is x86_64
starting shellrepeater
time="2021-08-13T16:09:46Z" level=debug msg="starting debugger server" addr="0.0.0.0:8373" app=shellrepeater
time="2021-08-13T16:09:46Z" level=info msg="auto snapshotter: using overlayfs"
buildkitd: failed to list chains: running [/sbin/iptables -t nat -S --wait]: exit status 4: iptables v1.8.7 (nf_tables): Could not fetch rule set generation id: Invalid argument


CNI setup error
github.com/moby/buildkit/util/network/cniprovider.(*cniProvider).New
	/src/util/network/cniprovider/cni.go:81
github.com/moby/buildkit/util/network/cniprovider.(*cniProvider).initNetwork
	/src/util/network/cniprovider/cni.go:65
github.com/moby/buildkit/util/network/cniprovider.New
	/src/util/network/cniprovider/cni.go:46
github.com/moby/buildkit/util/network/netproviders.Providers
	/src/util/network/netproviders/network.go:22
github.com/moby/buildkit/worker/runc.NewWorkerOpt
	/src/worker/runc/runc.go:44
main.ociWorkerInitializer
	/src/cmd/buildkitd/main_oci_worker.go:290
main.newWorkerController
	/src/cmd/buildkitd/main.go:712
main.newController
	/src/cmd/buildkitd/main.go:655
main.main.func3
	/src/cmd/buildkitd/main.go:276
github.com/urfave/cli.HandleAction
	/src/vendor/github.com/urfave/cli/app.go:523
github.com/urfave/cli.(*App).Run
	/src/vendor/github.com/urfave/cli/app.go:285
main.main
	/src/cmd/buildkitd/main.go:338
runtime.main
	/usr/local/go/src/runtime/proc.go:225
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1371
Error: buildkit process has exited

Warning: timedout while sending analytics
Message sent!
section_end:1628870988:step_script
section_start:1628870988:cleanup_file_variables
Cleaning up file based variables
section_end:1628870989:cleanup_file_variables
ERROR: Job failed: exit code 1
@fullykubed fullykubed changed the title Bug: Updated Buildkit fails to work in Gitlab CI (CNI) Bug: Updates after 0.5.18 breaks Gitlab CI (CNI) Aug 13, 2021
@fullykubed
Author

I have confirmed that the breakage begins in 0.5.19. I see the release notes mention a new iptables autodetection mechanism. Continuing to investigate.

@fullykubed
Author

I believe this commit is the culprit: 27ad909#diff-3c2736a240a36fe2aa60abcac682b73839b9cecbe0822e35a5ce12f281e31102

@fullykubed
Author

I think it will be difficult to auto-detect the appropriate binary in all use cases. It looks like this already ran into trouble with WSL as well: 0bb37ac#diff-3c2736a240a36fe2aa60abcac682b73839b9cecbe0822e35a5ce12f281e31102.

I'd propose a fix that allows users to explicitly choose which version of iptables to use in case autodetection makes the incorrect decision.

@dchw
Collaborator

dchw commented Aug 13, 2021

Yeah; I agree. This was introduced in response to a user having the opposite problem; if we shaped the fix like the override we have for CNI_MTU, would that be acceptable?

Though I am curious. What module does the kernel show in your case for ip_tables? And what distro/kernel?

@dchw dchw self-assigned this Aug 13, 2021
@fullykubed
Author

Specifically, the GitLab issue appears to be caused by iptables being statically compiled into the kernel. In that case, ip_tables does not appear in the output of lsmod. For example, here is the lsmod output for the GitLab-provided kernel:

Module                  Size  Used by    Tainted: G  
nf_log_ipv4            16384  1 
nf_log_common          16384  1 nf_log_ipv4
xt_LOG                 16384  1 
xt_limit               16384  1 
xt_hashlimit           20480  4 
veth                   24576  0 
xt_MASQUERADE          16384  2 
xt_addrtype            16384  4 
iptable_nat            16384  2 
nf_nat                 61440  2 xt_MASQUERADE,iptable_nat
br_netfilter           24576  0 
ip6table_filter        16384  1 
ip6_tables             28672  1 ip6table_filter
aesni_intel           368640  0 
glue_helper            20480  1 aesni_intel
crypto_simd            16384  1 aesni_intel
cryptd                 24576  1 crypto_simd
loadpin_trigger        12288  0 [permanent]

Notice that neither ip_tables nor nf_tables is listed as a module.

So I believe that the auto-detection script needs to be updated to be a bit more intelligent to handle cases like this. I will submit a proposal here after a bit more research.
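
As a rough illustration of one way to spot a statically compiled module, the sketch below parses the kernel's modules.builtin file, which lists built-in modules as paths ending in .ko. This is only an assumption about how detection could be extended, not what Earthly does: the helper names are hypothetical, and /lib/modules may not be visible from inside the buildkitd container, in which case the lookup simply returns an empty set.

```python
import os
from pathlib import Path


def builtin_modules(text):
    """Parse modules.builtin content into a set of module names.

    Each line looks like 'kernel/net/ipv4/netfilter/ip_tables.ko'.
    Hyphens are normalised to underscores, matching how lsmod prints names.
    """
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        base = line.rsplit("/", 1)[-1]
        if base.endswith(".ko"):
            base = base[: -len(".ko")]
        names.add(base.replace("-", "_"))
    return names


def read_builtin_modules(release=None):
    """Read modules.builtin for the running kernel (hypothetical helper).

    Returns an empty set when the file is missing or unreadable.
    """
    release = release or os.uname().release
    path = Path("/lib/modules") / release / "modules.builtin"
    try:
        return builtin_modules(path.read_text())
    except OSError:
        return set()
```

With something like this, a check for "ip_tables" could consult both the lsmod output and the built-in list before falling back to other heuristics.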

Unfortunately, this is not an MTU issue.

@dchw
Collaborator

dchw commented Aug 13, 2021

> Unfortunately, this is not an MTU issue.

I hear you. When I said "shaped like", I meant providing an environment variable plus a config file option for this setting, as we do for MTU.

We could either:

  1. Provide a setting to specifically choose what ip_tables implementation to use.
  2. Provide a setting to just bypass the check and use what is already configured.
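
To make the first option concrete, here is a minimal sketch of an explicit override read from the environment. The variable name IPTABLES_BACKEND and the values "legacy" and "nft" are hypothetical and are not taken from Earthly; only the two binary paths come from this thread.

```python
# Hypothetical mapping from a user-facing setting to the iptables binary.
BINARIES = {
    "legacy": "/sbin/iptables",
    "nft": "/sbin/iptables-nft",
}


def iptables_override(env):
    """Return the binary chosen by the user, or None to fall back to auto-detection.

    `env` is any mapping of environment variables, e.g. os.environ.
    IPTABLES_BACKEND is an assumed name used only for illustration.
    """
    value = env.get("IPTABLES_BACKEND", "").strip().lower()
    if not value:
        return None
    if value not in BINARIES:
        raise ValueError(
            "IPTABLES_BACKEND must be one of: " + ", ".join(sorted(BINARIES))
        )
    return BINARIES[value]
```

An unset or empty value leaves auto-detection in charge, so existing setups would behave as before.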

@fullykubed
Author

Ah gotchya. I misinterpreted. My recommendation would be the first option as it allows a bit more control.

@fullykubed
Author

Here is my proposal for the auto-detection algo:

  1. If the user specifically selects one, use that. Otherwise, continue.
  2. Check lsmod for ip_tables. If present, use /sbin/iptables. Otherwise, continue.
  3. Check lsmod for nf_tables. If present, use /sbin/iptables-nft. Otherwise, continue.
  4. Try both /sbin/iptables -t nat -S --wait and /sbin/iptables-nft -t nat -S --wait:
    4.1 If one of those fails, use the other.
    4.2 If both succeed, use the one that prints more lines (hacky but appears to be what is suggested).
    4.3 If both fail, exit with an error message.
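
The steps above could be sketched roughly as follows. This is an illustration of the proposal only, not the fix that was eventually merged. The function names are hypothetical, and the tie-break when both binaries print the same number of lines (prefer the legacy binary) is an assumption, since the proposal does not cover that case.

```python
import subprocess

LEGACY = "/sbin/iptables"
NFT = "/sbin/iptables-nft"


def probe_binary(binary):
    """Run `<binary> -t nat -S --wait`; return (succeeded, lines printed)."""
    try:
        result = subprocess.run(
            [binary, "-t", "nat", "-S", "--wait"],
            capture_output=True,
            text=True,
        )
    except OSError:
        # Binary missing or not executable counts as a failed probe.
        return (False, 0)
    return (result.returncode == 0, len(result.stdout.splitlines()))


def choose_iptables(lsmod_output, override=None, probe=probe_binary):
    """Pick an iptables binary following the four proposed steps."""
    # Step 1: an explicit user choice always wins.
    if override:
        return override

    # First column of every lsmod line; the header word "Module" is harmless.
    modules = {line.split()[0] for line in lsmod_output.splitlines() if line.strip()}

    # Steps 2 and 3: trust loaded kernel modules when they are visible.
    if "ip_tables" in modules:
        return LEGACY
    if "nf_tables" in modules:
        return NFT

    # Step 4: neither module is listed, so try both binaries.
    legacy_ok, legacy_lines = probe(LEGACY)
    nft_ok, nft_lines = probe(NFT)
    if legacy_ok and nft_ok:
        # Assumption: ties go to the legacy binary.
        return LEGACY if legacy_lines >= nft_lines else NFT
    if legacy_ok:
        return LEGACY
    if nft_ok:
        return NFT
    raise RuntimeError("neither iptables nor iptables-nft is usable")
```

Passing `probe` as a parameter keeps the decision logic testable without root privileges or a real iptables install.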

@vladaionescu vladaionescu added the type:bug Something isn't working label Aug 13, 2021
@dchw
Collaborator

dchw commented Aug 19, 2021

Should be fixed here: #1172

@dchw dchw closed this as completed Aug 19, 2021