CI: Cilium E2E Upgrade: cilium-health-ep is failing #32431

Open
thorn3r opened this issue May 9, 2024 · 6 comments
Labels
area/CI (Continuous Integration testing issue or flake), ci/flake (This is a known failure that occurs in the tree. Please investigate me!)

Comments


thorn3r commented May 9, 2024

E2E Upgrade: cilium-health-ep is failing; connect: no route to host

This is occurring in the E2E Upgrade workflow, especially on config 7

Output of Upgrade Cilium step:

    /¯¯\
 /¯¯\__/¯¯\    Cilium:             1 errors, 1 warnings
 \__/¯¯\__/    Operator:           OK
 /¯¯\__/¯¯\    Envoy DaemonSet:    OK
 \__/¯¯\__/    Hubble Relay:       disabled
    \__/       ClusterMesh:        disabled

DaemonSet              cilium-envoy       Desired: 3, Ready: 3/3, Available: 3/3
DaemonSet              cilium             Desired: 3, Ready: 3/3, Available: 3/3
Deployment             cilium-operator    Desired: 2, Ready: 2/2, Available: 2/2
Containers:            cilium             Running: 3
                       cilium-operator    Running: 2
                       cilium-envoy       Running: 3
Cluster Pods:          16/16 managed by Cilium
Helm chart version:    
Image versions         cilium             quay.io/cilium/cilium-ci:b5f1910567d79f30d1ff8e9847e562612b74d65a: 3
                       cilium-operator    quay.io/cilium/operator-generic-ci:b5f1910567d79f30d1ff8e9847e562612b74d65a: 2
                       cilium-envoy       quay.io/cilium/cilium-envoy:v1.29.4-e2f41ce0e07065f353b8ce28f36c01de3cfdd819@sha256:ec288baaac3758b2b0cfe2e146feb49204543961bd04f0195aae67f7b85d00f4: 3
Errors:                cilium             cilium-c4qkb    controller cilium-health-ep is failing since 2m7s (15x): Get "http://10.244.2.186:4240/hello": dial tcp 10.244.2.186:4240: connect: no route to host
Warnings:              cilium             cilium-c4qkb    2 endpoints are not ready

Error: Unable to determine status:  timeout while waiting for status to become successful: context deadline exceeded
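
For context, the "timeout while waiting for status to become successful" error is what the cilium-cli status wait reports when it gives up, so the failing check appears to be equivalent to the commands below (a sketch only; the exact flags used by the workflow are an assumption, and 5m is the cilium-cli default wait duration):

# Approximate post-upgrade check performed by the workflow (invocation is an assumption)
cilium status --wait --wait-duration 5m

# Inspect the failing controller directly on the affected agent pod
kubectl -n kube-system exec cilium-c4qkb -c cilium-agent -- cilium-dbg status --all-controllers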

There is a long delay between agent startup and the Cilium Health API being served. Snippet of relevant logs below, full output in this gist.

 grep -E 'Serving cilium health API|Initializing daemon|cilium-health' logs-cilium-tlrz8-cilium-agent-20240508-184709.log
2024-05-08T18:41:10.440637584Z time="2024-05-08T18:41:10Z" level=info msg="  --enable-cilium-health-api-server-access='*'" subsys=daemon
2024-05-08T18:42:01.177957396Z time="2024-05-08T18:42:01Z" level=info msg="Initializing daemon" subsys=daemon
2024-05-08T18:42:01.435587992Z time="2024-05-08T18:42:01Z" level=debug msg="Starting new controller" name=cilium-health-ep subsys=controller uuid=9d7ad1c5-bb82-42b5-a87d-49018525d9a1
2024-05-08T18:42:01.436458888Z time="2024-05-08T18:42:01Z" level=debug msg="Killing old health endpoint process" pidfile=/var/run/cilium/state/health-endpoint.pid subsys=cilium-health-launcher
2024-05-08T18:42:01.436837434Z time="2024-05-08T18:42:01Z" level=debug msg="Didn't find existing cilium-health endpoint to delete" subsys=daemon
2024-05-08T18:42:01.438491372Z time="2024-05-08T18:42:01Z" level=debug msg="Didn't find existing device" error="Link not found" subsys=cilium-health-launcher veth=cilium_health
2024-05-08T18:42:01.439104327Z time="2024-05-08T18:42:01Z" level=debug msg="Didn't find existing device" error="Link not found" subsys=cilium-health-launcher veth=lxc_health
2024-05-08T18:42:01.531656170Z time="2024-05-08T18:42:01Z" level=debug msg="Spawning health endpoint with command \"cilium-health-responder\" [\"--listen\" \"4240\" \"--pidfile\" \"/var/run/cilium/state/health-endpoint.pid\"]" subsys=cilium-health-launcher
2024-05-08T18:42:02.548037952Z time="2024-05-08T18:42:02Z" level=debug msg="cilium-health agent running" pidfile=/var/run/cilium/state/health-endpoint.pid subsys=cilium-health-launcher
2024-05-08T18:42:02.548299771Z time="2024-05-08T18:42:02Z" level=debug msg="Running \"ip [route add 10.244.3.36/32 dev cilium]\"" subsys=cilium-health-launcher
2024-05-08T18:42:02.566198612Z time="2024-05-08T18:42:02Z" level=debug msg="Running \"ip [route add 0.0.0.0/0 via 10.244.3.36 mtu 1450 dev cilium]\"" subsys=cilium-health-launcher
2024-05-08T18:42:02.576873301Z time="2024-05-08T18:42:02Z" level=debug msg="Running \"ip [-6 route add fd00:10:244:3::7de1/128 dev cilium]\"" subsys=cilium-health-launcher
2024-05-08T18:42:02.590310618Z time="2024-05-08T18:42:02Z" level=debug msg="Running \"ip [-6 route add ::/0 via fd00:10:244:3::7de1 mtu 1450 dev cilium]\"" subsys=cilium-health-launcher
2024-05-08T18:42:02.613757232Z time="2024-05-08T18:42:02Z" level=debug msg="Controller func execution time: 1.177539141s" name=cilium-health-ep subsys=controller uuid=9d7ad1c5-bb82-42b5-a87d-49018525d9a1
2024-05-08T18:43:02.652446256Z time="2024-05-08T18:43:02Z" level=debug msg="Controller func execution time: 37.757972ms" name=cilium-health-ep subsys=controller uuid=9d7ad1c5-bb82-42b5-a87d-49018525d9a1
2024-05-08T18:43:02.659163175Z time="2024-05-08T18:43:02Z" level=debug msg="Controller run failed" consecutiveErrors=1 error="Get \"http://10.244.3.83:4240/hello\": dial tcp 10.244.3.83:4240: connect: no route to host" name=cilium-health-ep subsys=controller uuid=9d7ad1c5-bb82-42b5-a87d-49018525d9a1
2024-05-08T18:43:03.702615627Z time="2024-05-08T18:43:03Z" level=debug msg="Controller func execution time: 41.720535ms" name=cilium-health-ep subsys=controller uuid=9d7ad1c5-bb82-42b5-a87d-49018525d9a1
2024-05-08T18:43:03.702695787Z time="2024-05-08T18:43:03Z" level=debug msg="Controller run failed" consecutiveErrors=2 error="Get \"http://10.244.3.83:4240/hello\": dial tcp 10.244.3.83:4240: connect: no route to host" name=cilium-health-ep subsys=controller uuid=9d7ad1c5-bb82-42b5-a87d-49018525d9a1
2024-05-08T18:43:05.737520073Z time="2024-05-08T18:43:05Z" level=debug msg="Controller func execution time: 34.47376ms" name=cilium-health-ep subsys=controller uuid=9d7ad1c5-bb82-42b5-a87d-49018525d9a1
2024-05-08T18:43:05.737772044Z time="2024-05-08T18:43:05Z" level=debug msg="Controller run failed" consecutiveErrors=3 error="Get \"http://10.244.3.83:4240/hello\": dial tcp 10.244.3.83:4240: connect: no route to host" name=cilium-health-ep subsys=controller uuid=9d7ad1c5-bb82-42b5-a87d-49018525d9a1
2024-05-08T18:43:08.764440909Z time="2024-05-08T18:43:08Z" level=debug msg="Controller func execution time: 25.873905ms" name=cilium-health-ep subsys=controller uuid=9d7ad1c5-bb82-42b5-a87d-49018525d9a1
2024-05-08T18:43:08.765230654Z time="2024-05-08T18:43:08Z" level=debug msg="Controller run failed" consecutiveErrors=4 error="Get \"http://10.244.3.83:4240/hello\": dial tcp 10.244.3.83:4240: connect: no route to host" name=cilium-health-ep subsys=controller uuid=9d7ad1c5-bb82-42b5-a87d-49018525d9a1
2024-05-08T18:43:12.782919522Z time="2024-05-08T18:43:12Z" level=debug msg="Controller func execution time: 15.746758ms" name=cilium-health-ep subsys=controller uuid=9d7ad1c5-bb82-42b5-a87d-49018525d9a1
2024-05-08T18:43:12.782953916Z time="2024-05-08T18:43:12Z" level=debug msg="Controller run failed" consecutiveErrors=5 error="Get \"http://10.244.3.83:4240/hello\": dial tcp 10.244.3.83:4240: connect: no route to host" name=cilium-health-ep subsys=controller uuid=9d7ad1c5-bb82-42b5-a87d-49018525d9a1
2024-05-08T18:43:17.127322793Z time="2024-05-08T18:43:17Z" level=debug msg="Cannot find socket" error="stat /var/run/cilium/health.sock: no such file or directory" file-path=/var/run/cilium/health.sock subsys=cilium-health-launcher
2024-05-08T18:43:17.801472167Z time="2024-05-08T18:43:17Z" level=debug msg="Controller func execution time: 18.313179ms" name=cilium-health-ep subsys=controller uuid=9d7ad1c5-bb82-42b5-a87d-49018525d9a1
2024-05-08T18:43:17.803203579Z time="2024-05-08T18:43:17Z" level=debug msg="Controller run failed" consecutiveErrors=6 error="Get \"http://10.244.3.83:4240/hello\": dial tcp 10.244.3.83:4240: connect: no route to host" name=cilium-health-ep subsys=controller uuid=9d7ad1c5-bb82-42b5-a87d-49018525d9a1
---snip---
2024-05-08T18:45:14.417628672Z time="2024-05-08T18:45:14Z" level=debug msg="Cannot find socket" error="stat /var/run/cilium/health.sock: no such file or directory" file-path=/var/run/cilium/health.sock subsys=cilium-health-launcher
2024-05-08T18:45:15.418268611Z time="2024-05-08T18:45:15Z" level=debug msg="Cannot find socket" error="stat /var/run/cilium/health.sock: no such file or directory" file-path=/var/run/cilium/health.sock subsys=cilium-health-launcher
2024-05-08T18:45:16.435185026Z time="2024-05-08T18:45:16Z" level=debug msg="Cannot find socket" error="stat /var/run/cilium/health.sock: no such file or directory" file-path=/var/run/cilium/health.sock subsys=cilium-health-launcher
2024-05-08T18:45:17.422605754Z time="2024-05-08T18:45:17Z" level=debug msg="Cannot find socket" error="stat /var/run/cilium/health.sock: no such file or directory" file-path=/var/run/cilium/health.sock subsys=cilium-health-launcher
2024-05-08T18:45:18.363998216Z time="2024-05-08T18:45:18Z" level=info msg="Serving cilium health API at unix:///var/run/cilium/health.sock" subsys=health-server
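
For reference, the delay called out above can be computed directly from the two log timestamps (a rough sketch assuming GNU date and the same log file name; the extracted field is the container-runtime timestamp at the start of each line):

# Hypothetical helper: gap between daemon init and the health API becoming available (GNU date required)
t1=$(grep -m1 'Initializing daemon' logs-cilium-tlrz8-cilium-agent-20240508-184709.log | cut -d' ' -f1)
t2=$(grep -m1 'Serving cilium health API' logs-cilium-tlrz8-cilium-agent-20240508-184709.log | cut -d' ' -f1)
echo "health API served $(( $(date -d "$t2" +%s) - $(date -d "$t1" +%s) ))s after daemon init"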

Hit on PR #32403
Workflow: https://github.com/cilium/cilium/actions/runs/9003017786/attempts/2
Sysdump:
cilium-sysdumps.zip

thorn3r added the area/CI and ci/flake labels on May 9, 2024

thorn3r commented May 9, 2024

Some more relevant timestamps:

2024-05-08T18:42:10.320873309Z time="2024-05-08T18:42:10Z" level=info msg="Launching Cilium health endpoint" subsys=daemon

2024-05-08T18:45:12.815790534Z time="2024-05-08T18:45:12Z" level=debug msg="Removing existing cilium-health endpoint" subsys=daemon

It looks like the health endpoint is restored, deleted, and recreated prior to the daemon running initHealth. This spans ~4 minutes:

grep 'identity=4 ' logs-cilium-c4qkb-cilium-agent-20240508-184709.log
2024-05-08T18:41:09.739540543Z time="2024-05-08T18:41:09Z" level=debug msg="Endpoint restoring" ciliumEndpointName=/ code=OK containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=261 endpointState=restoring identity=4 ipv4=10.244.2.186 ipv6="fd00:10:244:2::4270" k8sPodName=/ policyRevision=0 subsys=endpoint type=0
2024-05-08T18:42:10.144742528Z time="2024-05-08T18:42:10Z" level=debug msg="Deleting endpoint" ciliumEndpointName=/ code=OK containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=261 endpointState=disconnecting identity=4 ipv4=10.244.2.186 ipv6="fd00:10:244:2::4270" k8sPodName=/ policyRevision=0 subsys=endpoint type=0
2024-05-08T18:42:10.159097848Z time="2024-05-08T18:42:10Z" level=debug msg="removing directory" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 directory=261 endpointID=261 identity=4 ipv4=10.244.2.186 ipv6="fd00:10:244:2::4270" k8sPodName=/ subsys=endpoint
2024-05-08T18:42:10.159100814Z time="2024-05-08T18:42:10Z" level=debug msg="removing directory" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 directory=261_next_fail endpointID=261 identity=4 ipv4=10.244.2.186 ipv6="fd00:10:244:2::4270" k8sPodName=/ subsys=endpoint
2024-05-08T18:42:10.159103068Z time="2024-05-08T18:42:10Z" level=debug msg="removing directory" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 directory=261_next endpointID=261 identity=4 ipv4=10.244.2.186 ipv6="fd00:10:244:2::4270" k8sPodName=/ subsys=endpoint
2024-05-08T18:42:10.159105432Z time="2024-05-08T18:42:10Z" level=debug msg="removing directory" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 directory=261_stale endpointID=261 identity=4 ipv4=10.244.2.186 ipv6="fd00:10:244:2::4270" k8sPodName=/ subsys=endpoint
2024-05-08T18:42:10.222398359Z time="2024-05-08T18:42:10Z" level=debug msg="Endpoint removed" ciliumEndpointName=/ code=OK containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=261 endpointState=disconnected identity=4 ipv4=10.244.2.186 ipv6="fd00:10:244:2::4270" k8sPodName=/ policyRevision=0 subsys=endpoint type=0
2024-05-08T18:42:10.222549561Z time="2024-05-08T18:42:10Z" level=info msg="Removed endpoint" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=261 identity=4 ipv4=10.244.2.186 ipv6="fd00:10:244:2::4270" k8sPodName=/ subsys=endpoint
2024-05-08T18:42:10.222708718Z time="2024-05-08T18:42:10Z" level=debug msg="Waiting for proxy updates to complete..." ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=261 identity=4 ipv4=10.244.2.186 ipv6="fd00:10:244:2::4270" k8sPodName=/ subsys=endpoint
2024-05-08T18:42:10.222988230Z time="2024-05-08T18:42:10Z" level=debug msg="Wait time for proxy updates: 133.279µs" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=261 identity=4 ipv4=10.244.2.186 ipv6="fd00:10:244:2::4270" k8sPodName=/ subsys=endpoint
2024-05-08T18:42:11.518206894Z time="2024-05-08T18:42:11Z" level=debug msg="Assigned new identity to endpoint" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=94 identity=4 identityLabels="reserved:health" ipv4=10.244.2.186 ipv6="fd00:10:244:2::4270" k8sPodName=/ subsys=endpoint
2024-05-08T18:42:11.518855184Z time="2024-05-08T18:42:11Z" level=info msg="Identity of endpoint changed" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=94 identity=4 identityLabels="reserved:health" ipv4=10.244.2.186 ipv6="fd00:10:244:2::4270" k8sPodName=/ oldIdentity="no identity" subsys=endpoint
2024-05-08T18:42:11.519163970Z time="2024-05-08T18:42:11Z" level=debug msg="Triggering endpoint regeneration due to updated security labels" ciliumEndpointName=/ code=OK containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=94 endpointState=waiting-to-regenerate identity=4 ipv4=10.244.2.186 ipv6="fd00:10:244:2::4270" k8sPodName=/ policyRevision=0 subsys=endpoint type=0
2024-05-08T18:42:11.523354815Z time="2024-05-08T18:42:11Z" level=debug msg="Endpoint labels unchanged, skipping resolution of identity" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=94 identity=4 identityLabels="reserved:health" ipv4=10.244.2.186 ipv6="fd00:10:244:2::4270" k8sPodName=/ subsys=endpoint
2024-05-08T18:43:10.082860804Z time="2024-05-08T18:43:10Z" level=debug msg="Endpoint has no envoy redirects, skipping Envoy update" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=94 identity=4 ipv4=10.244.2.186 ipv6="fd00:10:244:2::4270" k8sPodName=/ subsys=endpoint
2024-05-08T18:44:10.111408144Z time="2024-05-08T18:44:10Z" level=debug msg="Endpoint has no envoy redirects, skipping Envoy update" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=94 identity=4 ipv4=10.244.2.186 ipv6="fd00:10:244:2::4270" k8sPodName=/ subsys=endpoint
2024-05-08T18:44:36.678314751Z time="2024-05-08T18:44:36Z" level=debug msg="Dequeued endpoint from build queue" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=94 identity=4 ipv4=10.244.2.186 ipv6="fd00:10:244:2::4270" k8sPodName=/ subsys=endpoint
2024-05-08T18:44:36.678318308Z time="2024-05-08T18:44:36Z" level=debug msg="Regenerating endpoint" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=94 identity=4 ipv4=10.244.2.186 ipv6="fd00:10:244:2::4270" k8sPodName=/ reason="updated security labels" startTime="2024-05-08 18:44:36.677880921 +0000 UTC m=+217.211875273" subsys=endpoint
2024-05-08T18:44:36.678321383Z time="2024-05-08T18:44:36Z" level=debug msg="Regenerating endpoint: updated security labels" ciliumEndpointName=/ code=OK containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=94 endpointState=regenerating identity=4 ipv4=10.244.2.186 ipv6="fd00:10:244:2::4270" k8sPodName=/ policyRevision=0 subsys=endpoint type=0
2024-05-08T18:44:36.678324259Z time="2024-05-08T18:44:36Z" level=debug msg="removing directory" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 directory=94_next endpointID=94 identity=4 ipv4=10.244.2.186 ipv6="fd00:10:244:2::4270" k8sPodName=/ subsys=endpoint
2024-05-08T18:45:10.094829476Z time="2024-05-08T18:45:10Z" level=debug msg="Endpoint has no envoy redirects, skipping Envoy update" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=94 identity=4 ipv4=10.244.2.186 ipv6="fd00:10:244:2::4270" k8sPodName=/ subsys=endpoint
2024-05-08T18:45:12.822775279Z time="2024-05-08T18:45:12Z" level=debug msg="exiting retrying regeneration goroutine due to endpoint being deleted" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=94 identity=4 ipv4=10.244.2.186 ipv6="fd00:10:244:2::4270" k8sPodName=/ subsys=endpoint
2024-05-08T18:47:11.525717619Z time="2024-05-08T18:47:11Z" level=debug msg="Endpoint labels unchanged, skipping resolution of identity" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=94 identity=4 identityLabels="reserved:health" ipv4=10.244.2.186 ipv6="fd00:10:244:2::4270" k8sPodName=/ subsys=endpoint

Seems like endpoint restoration might be taking a long time
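
The span can be measured directly from the c4qkb log in the same way (a sketch; assumes both messages appear in this log and GNU date is available):

# Hypothetical helper: time from the health endpoint being restored to it being removed/recreated
t_restore=$(grep -m1 -E 'Endpoint restoring.*identity=4' logs-cilium-c4qkb-cilium-agent-20240508-184709.log | cut -d' ' -f1)
t_remove=$(grep -m1 'Removing existing cilium-health endpoint' logs-cilium-c4qkb-cilium-agent-20240508-184709.log | cut -d' ' -f1)
echo "restore -> removal took $(( $(date -d "$t_remove" +%s) - $(date -d "$t_restore" +%s) ))s"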



mhofstetter commented May 14, 2024

#32528 increases the wait-timeout for the Cilium upgrade task from 5 to 10 minutes.

This should help fix the failures on that specific job, but it might not address the root cause (EP restoration logic, compile times? maybe increased a bit by the recent LLVM 17 update?). Hence, let's keep this open and monitor the situation.
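
For reference, that wait-timeout corresponds to the cilium-cli wait duration (default 5 minutes), so the change in #32528 is roughly equivalent to bumping the flag below (a sketch of the effect only; the actual workflow change may look different):

# Hypothetical equivalent of the timeout bump: wait up to 10 minutes instead of the 5-minute default
cilium status --wait --wait-duration 10m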

@giorio94

Hmm, hit again even with the 10-minute timeout: https://github.com/cilium/cilium/actions/runs/9188138331/job/25267307890

Errors: cilium cilium-cbkl2 controller cilium-health-ep is failing since 22s (31x): Get "http://10.244.2.6:4240/hello": dial tcp 10.244.2.6:4240: connect: no route to host
Warnings: cilium cilium-cbkl2 1 endpoints are not ready

There seems to be something weird going on with the health endpoint restoration.
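
When this reproduces, it may help to probe the health endpoint from inside the affected agent pod (a troubleshooting sketch; the pod name is taken from the error above, and the flags are assumed to be available in this build):

# Show the cilium-health-ep controller state and synchronously probe health connectivity
kubectl -n kube-system exec cilium-cbkl2 -c cilium-agent -- cilium-dbg status --all-controllers
kubectl -n kube-system exec cilium-cbkl2 -c cilium-agent -- cilium-health status --probe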
