Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: Conformance Ginkgo: Client pods not ready after timeout: timed out waiting for pods with filter -l type=client to be ready: 8m0s timeout expired #32404

Open
christarazi opened this issue May 7, 2024 · 2 comments
Labels
area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me!

Comments

@christarazi
Copy link
Member

CI failure


• Failure [1086.350 seconds]
K8sDatapathConfig
/home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:461
  High-scale IPcache
  /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:461
    Test ingress policy enforcement with GENEVE and endpoint routes [It]
    /home/runner/work/cilium/cilium/test/ginkgo-ext/scopes.go:515

    Client pods not ready after timeout
    Unexpected error:
        <*fmt.wrapError | 0xc003562e20>: 
        timed out waiting for pods with filter -l type=client to be ready: 8m0s timeout expired
        {
            msg: "timed out waiting for pods with filter -l type=client to be ready: 8m0s timeout expired",
            err: <*errors.errorString | 0xc003561e90>{
                s: "8m0s timeout expired",
            },
        }
    occurred

    /home/runner/work/cilium/cilium/test/k8s/datapath_configuration.go:695

Sysdump is too large to upload here.

https://github.com/cilium/cilium/actions/runs/8980253691

@christarazi christarazi added area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me! labels May 7, 2024
@mhofstetter
Copy link
Member

mhofstetter commented May 16, 2024

catched another one and detected the following in the sysdump

https://github.com/cilium/cilium/actions/runs/9107971706/job/25037869655

(inner sysdump cilium-sysdump.zip)

kube-system         cilium-9n6jt                                 0/1     CrashLoopBackOff   4 (81s ago)     4m1s    172.18.0.2   kind-worker          <none>           <none>
    containerStatuses:
    - containerID: containerd://67dd5a15659f8ca69aed97b688564f8f8ac2bda6d4a235738c47055737125ac8
      image: quay.io/cilium/cilium-ci:4aa6a7fa004b3147bb0cf6f368bebb8f4df450df
      imageID: quay.io/cilium/cilium-ci@sha256:39c4823838cae593227de56881164efd72b15534cf82395999c83d2746c43d97
      lastState:
        terminated:
          containerID: containerd://67dd5a15659f8ca69aed97b688564f8f8ac2bda6d4a235738c47055737125ac8
          exitCode: 1
          finishedAt: "2024-05-16T07:32:49Z"
          message: |-
            time="2024-05-16T07:32:48Z" level=debug msg="NAME clang STATUS R PID 398 CPU: 118.75% MEM: 0.61% CMDLINE: [clang -I/var/run/cilium/state/globals -I/var/run/cilium/state/templates/baf7f5ca798fcaa63467381be04513cbeb161c9041d734680d12cf82541b2db0 -I/var/lib/cilium/bpf -I/var/lib/cilium/bpf/include -g -O2 --target=bpf -std=gnu89 -nostdinc -Wall -Wextra -Werror -Wshadow -Wno-address-of-packed-member -Wno-unknown-warning-option -Wno-gnu-variable-sized-type-not-at-end -Wdeclaration-after-statement -Wimplicit-int-conversion -Wenum-conversion -D__NR_CPUS__=4 -mcpu=v3 -c /var/lib/cilium/bpf/bpf_lxc.c -o -] MEM-EXT: RSS: 73 VMS: 125 Data: 20 Stack: 0 Locked: 0 Swap: 0" endpointID=2645 subsys=endpoint
            time="2024-05-16T07:32:48Z" level=debug msg="Load 1-min: 1.48 5-min: 3.11 15min: 2.79" subsys=loadinfo type=background
            time="2024-05-16T07:32:48Z" level=debug msg="Memory: Total: 11969 Used: 2138 (17.86%) Free: 2335 Buffers: 145 Cached: 7155" subsys=loadinfo type=background
            time="2024-05-16T07:32:48Z" level=debug msg="Swap: Total: 0 Used: 0 (NaN%) Free: 0" subsys=loadinfo type=background
            time="2024-05-16T07:32:48Z" level=debug msg="NAME cilium-agent STATUS S PID 1 CPU: 30.13% MEM: 1.17% CMDLINE: [cilium-agent --config-dir=/tmp/cilium/config-map] MEM-EXT: RSS: 140 VMS: 1308 Data: 108 Stack: 0 Locked: 0 Swap: 0" subsys=loadinfo type=background
            time="2024-05-16T07:32:48Z" level=debug msg="NAME clang STATUS R PID 398 CPU: 128.75% MEM: 0.61% CMDLINE: [clang -I/var/run/cilium/state/globals -I/var/run/cilium/state/templates/baf7f5ca
          reason: Error
          startedAt: "2024-05-16T07:32:36Z"
      name: cilium-agent
      ready: false
      restartCount: 4
      started: false
      state:
        waiting:
          message: back-off 1m20s restarting failed container=cilium-agent pod=cilium-9n6jt_kube-system(3726000e-e008-4b5c-b82d-55b295cf8cd6)
          reason: CrashLoopBackOff

another one (https://github.com/cilium/cilium/actions/runs/9101307271) with the same ending loglines in the prev logfile of the agents. (inner sysdump: cilium-sysdump.zip)

2024-05-15T19:36:27.076881948Z time="2024-05-15T19:36:27Z" level=debug msg="Controller run succeeded; waiting for next controller update or stop" name=ipcache-inject-labels subsys=controller uuid=9208713a-1b4f-4c51-977e-977c62191690
2024-05-15T19:36:27.076884613Z time="2024-05-15T19:36:27Z" level=debug msg="Load 1-min: 1.90 5-min: 2.96 15min: 2.90" endpointID=347 subsys=endpoint
2024-05-15T19:36:27.076887719Z time="2024-05-15T19:36:27Z" level=debug msg="Memory: Total: 11969 Used: 2116 (17.68%) Free: 2361 Buffers: 143 Cached: 7150" endpointID=347 subsys=endpoint
2024-05-15T19:36:27.076890233Z time="2024-05-15T19:36:27Z" level=debug msg="Swap: Total: 0 Used: 0 (NaN%) Free: 0" endpointID=347 subsys=endpoint
2024-05-15T19:36:27.076892908Z time="2024-05-15T19:36:27Z" level=debug msg="NAME cilium-agent STATUS R PID 1 CPU: 31.71% MEM: 1.16% CMDLINE: [cilium-agent --config-dir=/tmp/cilium/config-map] MEM-EXT: RSS: 138 VMS: 1308 Data: 104 Stack: 0 Locked: 0 Swap: 0" endpointID=347 subsys=endpoint
2024-05-15T19:36:27.080792604Z time="2024-05-15T19:36:27Z" level=debug msg="Controller func execution time: 5.009658018s" name="sync-to-k8s-ciliumendpoint (347)" subsys=controller uuid=0ef00285-88af-4b92-8cd4-296fbf43fde3
2024-05-15T19:36:27.080978030Z time="2024-05-15T19:36:27Z" level=debug msg="upserting health status" reporter-id="agent.controlplane.endpoint-manager.cilium-endpoint-347 (kube-system/coredns-7db6d8ff4d-hjnjp).cep-k8s-sync" status="agent.controlplane.endpoint-manager.cilium-endpoint-347 (kube-system/coredns-7db6d8ff4d-hjnjp).cep-k8s-sync: [OK] sync-to-k8s-ciliumendpoint (347)" subsys=healthv2
2024-05-15T19:36:27.084355080Z time="2024-05-15T19:36:27Z" level=debug msg="Compiling datapath" clang="clang version 17.0.6 (https://github.com/llvm/llvm-project.git 6009708b4367171ccdbf4b5905cb6a803753fe18)\nTarget: unknown\nThread model: posix\nInstalledDir: /usr/local/bin\n" debug=true subsys=datapath-loader
2024-05-15T19:36:27.084462357Z time="2024-05-15T19:36:27Z" level=debug msg="Launching compiler" args="[-I/var/run/cilium/state/globals -I/var/run/cilium/state/templates/0d86b75579809eb47cc1d13454eb958dd3a16aa2e791b7ef9f145f8306442043 -I/var/lib/cilium/bpf -I/var/lib/cilium/bpf/include -E -O2 --target=bpf -std=gnu89 -nostdinc -Wall -Wextra -Werror -Wshadow -Wno-address-of-packed-member -Wno-unknown-warning-option -Wno-gnu-variable-sized-type-not-at-end -Wdeclaration-after-statement -Wimplicit-int-conversion -Wenum-conversion -D__NR_CPUS__=4 -mcpu=v3 -c /var/lib/cilium/bpf/bpf_lxc.c -o -]" subsys=datapath-loader target=clang
2024-05-15T19:36:27.123704426Z time="2024-05-15T19:36:27Z" level=debug msg="Compilation had peak RSS of 141772 bytes" compiler-pid=397 output=/var/run/cilium/state/templates/0d86b75579809eb47cc1d13454eb958dd3a16aa2e791b7ef9f145f8306442043/bpf_lxc.c subsys=datapath-loader
2024-05-15T19:36:27.123854806Z time="2024-05-15T19:36:27Z" level=debug msg="Launching compiler" args="[-I/var/run/cilium/state/globals -I/var/run/cilium/state/templates/0d86b75579809eb47cc1d13454eb958dd3a16aa2e791b7ef9f145f8306442043 -I/var/lib/cilium/bpf -I/var/lib/cilium/bpf/include -g -O2 --target=bpf -std=gnu89 -nostdinc -Wall -Wextra -Werror -Wshadow -Wno-address-of-packed-member -Wno-unknown-warning-option -Wno-gnu-variable-sized-type-not-at-end -Wdeclaration-after-statement -Wimplicit-int-conversion -Wenum-conversion -D__NR_CPUS__=4 -mcpu=v3 -c /var/lib/cilium/bpf/bpf_lxc.c -o -]" subsys=datapath-loader target=clang
2024-05-15T19:36:27.590204451Z time="2024-05-15T19:36:27Z" level=debug msg="CES deleted, calling endpointDeleted" CEPName=kube-system/coredns-7db6d8ff4d-hjnjp CESName=ces-2qrzhpbx4-vp65v subsys=k8s-watcher
2024-05-15T19:36:27.599922248Z time="2024-05-15T19:36:27Z" level=debug msg="CEP inserted, calling endpointUpdated" CEPName=kube-system/coredns-7db6d8ff4d-hjnjp CESName=ces-xdkmhv42f-4j2d2 subsys=k8s-watcher
2024-05-15T19:36:27.600134524Z time="2024-05-15T19:36:27Z" level=debug msg="Upserting IP into ipcache layer" identity="{112 custom-resource [] false false}" ipAddr=10.0.0.107 k8sNamespace=kube-system k8sPodName=coredns-7db6d8ff4d-hjnjp key=0 namedPorts="map[dns:{53 17} dns-tcp:{53 6} metrics:{9153 6}]" subsys=ipcache
2024-05-15T19:36:27.600380186Z time="2024-05-15T19:36:27Z" level=debug msg="Daemon notified of IP-Identity cache state change" identity="{112 custom-resource [] false false}" ipAddr="{10.0.0.107 ffffffff}" modification=Upsert subsys=datapath-ipcache
2024-05-15T19:36:27.721631463Z time="2024-05-15T19:36:27Z" level=debug msg="Load 1-min: 1.90 5-min: 2.96 15min: 2.90" subsys=loadinfo type=background
2024-05-15T19:36:27.721940229Z time="2024-05-15T19:36:27Z" level=debug msg="Memory: Total: 11969 Used: 2117 (17.69%) Free: 2359 Buffers: 143 Cached: 7150" subsys=loadinfo type=background
2024-05-15T19:36:27.723146119Z time="2024-05-15T19:36:27Z" level=debug msg="Swap: Total: 0 Used: 0 (NaN%) Free: 0" subsys=loadinfo type=background
2024-05-15T19:36:27.723159083Z time="2024-05-15T19:36:27Z" level=debug msg="NAME cilium-agent STATUS R PID 1 CPU: 31.79% MEM: 1.16% CMDLINE: [cilium-agent --config-dir=/tmp/cilium/config-map] MEM-EXT: RSS: 138 VMS: 1308 Data: 104 Stack: 0 Locked: 0 Swap: 0" subsys=loadinfo type=background
2024-05-15T19:36:27.723163091Z time="2024-05-15T19:36:27Z" level=debug msg="NAME clang STATUS R PID 398 CPU: 122.92% MEM: 0.61% CMDLINE: [clang -I/var/run/cilium/state/globals -I/var/run/cilium/state/templates/0d86b75579809eb47cc1d13454eb958dd3a16aa2e791b7ef9f145f8306442043 -I/var/lib/cilium/bpf -I/var/lib/cilium/bpf/include -g -O2 --target=bpf -std=gnu89 -nostdinc -Wall -Wextra -Werror -Wshadow -Wno-address-of-packed-member -Wno-unknown-warning-option -Wno-gnu-variable-sized-type-not-at-end -Wdeclaration-after-statement -Wimplicit-int-conversion -Wenum-conversion -D__NR_CPUS__=4 -mcpu=v3 -c /var/lib/cilium/bpf/bpf_lxc.c -o -] MEM-EXT: RSS: 73 VMS: 125 Data: 20 Stack: 0 Locked: 0 Swap: 0" subsys=loadinfo type=background
2024-05-15T19:36:27.780576147Z time="2024-05-15T19:36:27Z" level=info msg="starting xDS stream processing" subsys=xds xdsStreamID=1
2024-05-15T19:36:27.782123083Z time="2024-05-15T19:36:27Z" level=debug msg="received request from xDS stream" subsys=xds xdsAckedVersion=1 xdsClientNode="host~127.0.0.1~no-id~localdomain" xdsNonce= xdsStreamID=1 xdsTypeURL=type.googleapis.com/envoy.config.listener.v3.Listener
2024-05-15T19:36:27.782134003Z time="2024-05-15T19:36:27Z" level=debug msg="xDS was restarted, setting nonce to 1" subsys=xds xdsAckedVersion=1 xdsClientNode="host~127.0.0.1~no-id~localdomain" xdsNonce= xdsStreamID=1 xdsTypeURL=type.googleapis.com/envoy.config.listener.v3.Listener
2024-05-15T19:36:27.782297037Z time="2024-05-15T19:36:27Z" level=debug msg="notifying observers of ACKs" subsys=xds xdsAckedVersion=1 xdsClientNode="host~127.0.0.1~no-id~localdomain" xdsNonce= xdsStreamID=1 xdsTypeURL=type.googleapis.com/envoy.config.listener.v3.Listener
2024-05-15T19:36:27.782306946Z time="2024-05-15T19:36:27Z" level=debug msg="starting watch on 0 resources" subsys=xds xdsAckedVersion=1 xdsClientNode="host~127.0.0.1~no-id~localdomain" xdsNonce= xdsStreamID=1 xdsTypeURL=type.googleapis.com/envoy.config.listener.v3.Listener
2024-05-15T19:36:27.782399248Z time="2024-05-15T19:36:27Z" level=debug msg="current resource version is 1, waiting for it to become > 1" subsys=xds xdsAckedVersion=1 xdsClientNode=127.0.0.1 xdsTypeURL=type.googleapis.com/envoy.config.listener.v3.Listener
2024-05-15T19:36:28.076216711Z time="2024-05-15T19:36:28Z" level=debug msg="Load 1-min: 1.90 5-min: 2.96 15min: 2.90" endpointID=347 subsys=endpoint
2024-05-15T19:36:28.076522622Z time="2024-05-15T19:36:28Z" level=debug msg="Memory: Total: 11969 Used: 2117 (17.69%) Free: 2359 Buffers: 143 Cached: 7150" endpointID=347 subsys=endpoint
2024-05-15T19:36:28.076528853Z time="2024-05-15T19:36:28Z" level=debug msg="Swap: Total: 0 Used: 0 (NaN%) Free: 0" endpointID=347 subsys=endpoint
2024-05-15T19:36:28.077147909Z time="2024-05-15T19:36:28Z" level=debug msg="NAME cilium-agent STATUS S PID 1 CPU: 29.48% MEM: 1.16% CMDLINE: [cilium-agent --config-dir=/tmp/cilium/config-map] MEM-EXT: RSS: 139 VMS: 1308 Data: 104 Stack: 0 Locked: 0 Swap: 0" endpointID=347 subsys=endpoint
2024-05-15T19:36:28.077624929Z time="2024-05-15T19:36:28Z" level=debug msg="NAME clang STATUS R PID 398 CPU: 62.84% MEM: 0.62% CMDLINE: [clang -I/var/run/cilium/state/globals -I/var/run/cilium/state/templates/0d86b75579809eb47cc1d13454eb958dd3a16aa2e791b7ef9f145f8306442043 -I/var/lib/cilium/bpf -I/var/lib/cilium/bpf/include -g -O2 --target=bpf -std=gnu89 -nostdinc -Wall -Wextra -Werror -Wshadow -Wno-address-of-packed-member -Wno-unknown-warning-option -Wno-gnu-variable-sized-type-not-at-end -Wdeclaration-after-statement -Wimplicit-int-conversion -Wenum-conversion -D__NR_CPUS__=4 -mcpu=v3 -c /var/lib/cilium/bpf/bpf_lxc.c -o -] MEM-EXT: RSS: 73 VMS: 124 Data: 19 Stack: 0 Locked: 0 Swap: 0" endpointID=347 subsys=endpoint

=> clang issue? (maybe also red herring as cilium seems to print the status of long-running clang every 5 seconds)

@mhofstetter
Copy link
Member

maybe the timeout of 240s from creating the Cilium Agent DaemonSet to readiness of all its Pods is no longer enough? 🫤

We also have similar ci flake issues - e.g. #32431

it seems to take quite a long time until the health endpoint is up and ready - and this is used for the Pods readiness probe.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me!
Projects
None yet
Development

No branches or pull requests

2 participants