
CI: ci-external-workload: Cilium API handler panicked #19425

Closed
borkmann opened this issue Apr 13, 2022 · 6 comments · Fixed by #19496
Labels
area/CI (Continuous Integration testing issue or flake) · priority/medium (This is considered important, but not urgent.) · release-blocker/1.12 (This issue will prevent the release of the next version of Cilium.)

Comments

@borkmann (Member)

CI failure

Seen in one of the agent logs in the zip attached below:

2022-04-13T08:05:39.897310085Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
2022-04-13T08:07:15.761674513Z level=info msg="regenerating all endpoints" reason="one or more identities created or deleted" subsys=endpoint-manager
2022-04-13T08:08:21.258858526Z level=warning msg="Cilium API handler panicked" client=@ method=GET panic_message="json: unsupported value: NaN" subsys=api url=/v1/metrics/
2022-04-13T08:08:21.259518780Z 2022/04/13 08:08:21 http: superfluous response.WriteHeader call from github.com/cilium/cilium/pkg/api.(*APIPanicHandler).ServeHTTP.func1 (apipanic.go:36)

https://github.com/cilium/cilium/actions/runs/2158036465
cilium-sysdump-out.zip (23).zip
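
For context on the panic message itself: Go's encoding/json cannot encode NaN, so any metric carrying a NaN value breaks serialisation of the /v1/metrics response. A minimal standalone sketch (not Cilium code) that reproduces the same error string:

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
)

func main() {
	// encoding/json has no representation for NaN (or +/-Inf); Marshal
	// returns a *json.UnsupportedValueError, which is the error string
	// seen in the panic above.
	sample := map[string]interface{}{
		"name":  "go_sched_latencies_seconds",
		"value": math.NaN(),
	}
	_, err := json.Marshal(sample)
	fmt.Println(err) // json: unsupported value: NaN
}
```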

@borkmann added the area/CI, needs/triage (This issue requires triaging to establish severity and next steps.), and ci/flake (This is a known failure that occurs in the tree. Please investigate me!) labels on Apr 13, 2022
@aanm added the release-blocker/1.12 label on Apr 13, 2022
@aanm (Member) commented Apr 13, 2022

Marking as a release blocker because the `cilium metrics list` file in the sysdump is empty, and this file is important for debugging purposes.

@ti-mo (Contributor) commented Apr 19, 2022

I'm not sure this is a flake; it's breaking CI for #19159 on every run, and it's causing so much noise that I can't see the actual failures anymore.

It took me a while to narrow this down to `cilium metrics list` being the culprit. APIPanicHandler recovers the JSON panic, but the WriteHeader() call at apipanic.go:36 seems to cause another panic.

I think 6c5e2d6 is the culprit. If I dump var metrics in getMetrics.Handle() while running `cilium metrics list`, I see these NaNs popping up:

{map[] go_gc_pauses_seconds_total NaN}
{map[] go_sched_latencies_seconds NaN}

As suggested in golang/go#25721 (comment), this might require implementing a custom codec for models.Metric.
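
For illustration, a custom codec along those lines could sanitise the value before encoding. This is only a sketch against a hypothetical Metric struct; the real models.Metric is generated by go-swagger, so the field names and tags below are assumptions:

```go
package models

import (
	"encoding/json"
	"math"
)

// Metric is a stand-in for the generated models.Metric (field names assumed).
type Metric struct {
	Labels map[string]string `json:"labels,omitempty"`
	Name   string            `json:"name,omitempty"`
	Value  float64           `json:"value"`
}

// MarshalJSON replaces non-finite values, which encoding/json rejects,
// with zero so the /v1/metrics response can always be serialised.
func (m Metric) MarshalJSON() ([]byte, error) {
	type plain Metric // defined type without MarshalJSON, avoids recursion
	p := plain(m)
	if math.IsNaN(p.Value) || math.IsInf(p.Value, 0) {
		p.Value = 0 // alternatively, drop the sample entirely
	}
	return json.Marshal(p)
}
```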

@ti-mo added the priority/medium label and removed the ci/flake and needs/triage labels on Apr 19, 2022
@chancez (Contributor) commented Apr 19, 2022

I don't believe 6c5e2d6 is the cause, but the recent bump to prometheus/client_golang is. See: d8e3f28.
Newer versions of prometheus/client_golang introduced new metrics, and they seem to have some problems.

This bug has been reported upstream, and it seems they reverted adding the new metrics:
prometheus/client_golang#995
prometheus/client_golang#1033
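
To see which of the new families are affected after the bump, a quick diagnostic like the one below (a sketch, not Cilium code) registers just the Go collector and prints every histogram whose sample sum is NaN; on the affected versions it should surface the two families quoted above:

```go
package main

import (
	"fmt"
	"math"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
)

func main() {
	// Register only the Go runtime collector that gained the new metrics.
	reg := prometheus.NewRegistry()
	reg.MustRegister(collectors.NewGoCollector())

	families, err := reg.Gather()
	if err != nil {
		panic(err)
	}
	for _, fam := range families {
		for _, m := range fam.GetMetric() {
			// The problematic families surface as histograms whose
			// sample sum is reported as NaN.
			if h := m.GetHistogram(); h != nil && math.IsNaN(h.GetSampleSum()) {
				fmt.Println("NaN sample sum:", fam.GetName())
			}
		}
	}
}
```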

@chancez (Contributor) commented Apr 19, 2022

I suggest reverting the update to prometheus/client_golang so we're back on 1.11 until a new release fixes the new metrics introduced in 1.12.

@ti-mo (Contributor) commented Apr 20, 2022

Fix proposed at: #19496.

ti-mo added a commit to ti-mo/cilium that referenced this issue Apr 27, 2022
This reverts commit d8e3f28.

---

See cilium#19425.

Extra Go metrics were added to the Prometheus client, some having
NaN values, breaking json marshaling on the server side of `cilium
metrics list`:

msg="Cilium API handler panicked" panic_message="json: unsupported
value: NaN" url=/v1/metrics/ ...

Signed-off-by: Timo Beckers <timo@isovalent.com>
joestringer pushed a commit that referenced this issue Apr 29, 2022
This reverts commit d8e3f28.

---

See #19425.

Extra Go metrics were added to the Prometheus client, some having
NaN values, breaking json marshaling on the server side of `cilium
metrics list`:

msg="Cilium API handler panicked" panic_message="json: unsupported
value: NaN" url=/v1/metrics/ ...

Signed-off-by: Timo Beckers <timo@isovalent.com>
@ti-mo (Contributor) commented May 31, 2022

This issue was reintroduced by f321a6a, which upgraded client_golang to 1.12.1 again. 1.12.2 is out, which supposedly fixes this.
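
A regression guard along these lines could catch a reintroduction at gather time, before anything reaches the JSON encoder in the API handler. This is a hypothetical test, not something from the tree; it only checks the Go collector that ships the offending metrics:

```go
package metrics_test

import (
	"math"
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
)

// TestNoNonFiniteGoMetrics fails if the Go collector exports any value
// that encoding/json would refuse to serialise (NaN or +/-Inf).
func TestNoNonFiniteGoMetrics(t *testing.T) {
	reg := prometheus.NewRegistry()
	reg.MustRegister(collectors.NewGoCollector())

	families, err := reg.Gather()
	if err != nil {
		t.Fatal(err)
	}
	for _, fam := range families {
		for _, m := range fam.GetMetric() {
			// The dto getters are nil-safe, so unset metric kinds read as 0.
			for _, v := range []float64{
				m.GetGauge().GetValue(),
				m.GetCounter().GetValue(),
				m.GetUntyped().GetValue(),
				m.GetHistogram().GetSampleSum(),
				m.GetSummary().GetSampleSum(),
			} {
				if math.IsNaN(v) || math.IsInf(v, 0) {
					t.Errorf("family %s exports non-finite value %v", fam.GetName(), v)
				}
			}
		}
	}
}
```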
