Duplicate labels on telemetry metric export #10236

Closed
cknowles opened this issue Dec 3, 2018 · 21 comments
@cknowles commented Dec 3, 2018

Describe the bug
We installed Istio 1.0.4 using the Helm chart instructions on k8s 1.11.4, and it has been running for several days. We also hooked it up to Datadog metric collection using their instructions. Metrics came through fine for a couple of days and then suddenly stopped. Looking into why, it appears the telemetry service is returning dozens of errors like the ones below about duplicate labels. The only place I have found such an error so far is at https://github.com/prometheus/node_exporter/blob/f9dd8e9b8c29f6c9da676036d8a8c587326bb710/vendor/github.com/prometheus/client_golang/prometheus/registry.go#L845

$ curl -v http://istio-telemetry.istio-system:42422/metrics
*   Trying 10.3.237.129...
* TCP_NODELAY set
* Connected to istio-telemetry.istio-system (10.3.237.129) port 42422 (#0)
> GET /metrics HTTP/1.1
> Host: istio-telemetry.istio-system:42422
> User-Agent: curl/7.59.0
> Accept: */*
> 
< HTTP/1.1 500 Internal Server Error
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Mon, 03 Dec 2018 05:14:42 GMT
< Transfer-Encoding: chunked
< 
An error has occurred during metrics gathering:

47 error(s) occurred:
* collected metric "istio_tcp_sent_bytes_total" { label:<name:"connection_security_policy" value:"none" > label:<name:"connection_security_policy" value:"none" > label:<name:"destination_app" value:"cms" > label:<name:"destination_service" value:"unknown" > label:<name:"destination_service_name" value:"unknown" > label:<name:"destination_service_namespace" value:"unknown" > label:<name:"destination_version" value:"81bb5c3" > label:<name:"destination_workload_namespace" value:"int" > label:<name:"destination_principal" value:"unknown" > label:<name:"reporter" value:"destination" > label:<name:"source_app" value:"frontend" > label:<name:"source_principal" value:"unknown" > label:<name:"source_version" value:"06195b2" > label:<name:"source_workload" value:"frontend-fix-update-blog-article-layout" > label:<name:"source_workload_namespace" value:"int" > counter:<value:1.088915e+06 > } has two or more labels with the same name: connection_security_policy
* collected metric "istio_tcp_sent_bytes_total" { label:<name:"connection_security_policy" value:"unknown" > label:<name:"destination_app" value:"unknown" > label:<name:"destination_app" value:"unknown" > label:<name:"destination_service" value:"aerospike-3" > label:<name:"destination_service_name" value:"aerospike-3" > label:<name:"destination_service_namespace" value:"int-shared" > label:<name:"destination_version" value:"unknown" > label:<name:"destination_workload" value:"unknown" > label:<name:"destination_workload_namespace" value:"unknown" > label:<name:"reporter" value:"source" > label:<name:"source_app" value:"entities-manager" > label:<name:"source_principal" value:"unknown" > label:<name:"source_version" value:"unknown" > label:<name:"source_workload" value:"entities-manager-statefulset-v1-next" > label:<name:"source_workload_namespace" value:"int-shared" > counter:<value:2.362107154e+09 > } has two or more labels with the same name: destination_app
* collected metric "istio_tcp_sent_bytes_total" { label:<name:"connection_security_policy" value:"none" > label:<name:"destination_app" value:"be-api" > label:<name:"destination_app" value:"be-api" > label:<name:"destination_principal" value:"unknown" > label:<name:"destination_service" value:"unknown" > label:<name:"destination_service_name" value:"unknown" > label:<name:"destination_service_namespace" value:"unknown" > label:<name:"destination_workload" value:"be-api-next" > label:<name:"destination_workload_namespace" value:"int" > label:<name:"reporter" value:"destination" > label:<name:"source_app" value:"unknown" > label:<name:"source_principal" value:"unknown" > label:<name:"source_version" value:"unknown" > label:<name:"source_workload" value:"unknown" > label:<name:"source_workload_namespace" value:"unknown" > counter:<value:3065 > } has two or more labels with the same name: destination_app
* collected metric "istio_tcp_sent_bytes_total" { label:<name:"connection_security_policy" value:"none" > label:<name:"destination_app" value:"client" > label:<name:"destination_app" value:"client" > label:<name:"destination_principal" value:"unknown" > label:<name:"destination_service" value:"unknown" > label:<name:"destination_service_name" value:"unknown" > label:<name:"destination_service_namespace" value:"unknown" > label:<name:"destination_version" value:"5d746da" > label:<name:"destination_workload" value:"client-master" > label:<name:"reporter" value:"destination" > label:<name:"source_app" value:"frontend" > label:<name:"source_principal" value:"unknown" > label:<name:"source_version" value:"6559340" > label:<name:"source_workload" value:"frontend-fix-outrights-loading" > label:<name:"source_workload_namespace" value:"int" > counter:<value:44973 > } has two or more labels with the same name: destination_app
* collected metric "istio_tcp_received_bytes_total" { label:<name:"connection_security_policy" value:"unknown" > label:<name:"destination_app" value:"unknown" > label:<name:"destination_app" value:"unknown" > label:<name:"destination_app" value:"unknown" > label:<name:"destination_principal" value:"unknown" > label:<name:"destination_service" value:"aerospike-3" > label:<name:"destination_service_name" value:"aerospike-3" > label:<name:"destination_workload" value:"unknown" > label:<name:"destination_workload_namespace" value:"unknown" > label:<name:"reporter" value:"source" > label:<name:"source_app" value:"distribution-engine" > label:<name:"source_principal" value:"unknown" > label:<name:"source_version" value:"9ec06fb" > label:<name:"source_workload" value:"distribution-engine-next" > label:<name:"source_workload_namespace" value:"int" > counter:<value:1.15981052e+08 > } has two or more labels with the same name: destination_app
* collected metric "istio_tcp_received_bytes_total" { label:<name:"destination_app" value:"unknown" > label:<name:"destination_principal" value:"unknown" > label:<name:"destination_service" value:"aerospike-2" > label:<name:"destination_service_name" value:"aerospike-2" > label:<name:"destination_service_namespace" value:"int" > label:<name:"destination_service_namespace" value:"int" > label:<name:"destination_workload" value:"unknown" > label:<name:"connection_security_policy" value:"unknown" > label:<name:"destination_workload_namespace" value:"unknown" > label:<name:"reporter" value:"source" > label:<name:"source_app" value:"api" > label:<name:"source_principal" value:"unknown" > label:<name:"source_version" value:"8b52afe" > label:<name:"source_workload" value:"api-chore-clusterization" > label:<name:"source_workload_namespace" value:"int" > counter:<value:4425 > } has two or more labels with the same name: destination_service_namespace

[...]
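
For context, a minimal sketch (not the actual client_golang registry.go code) of the kind of duplicate-label-name check that produces this error at scrape time; any one failing metric fails the whole /metrics response:

    // A minimal sketch (not the real registry implementation) of a
    // duplicate-label-name check applied to each gathered metric.
    package main

    import (
        "fmt"

        dto "github.com/prometheus/client_model/go"
    )

    func checkLabelNames(metricName string, m *dto.Metric) error {
        seen := make(map[string]struct{}, len(m.GetLabel()))
        for _, lp := range m.GetLabel() {
            if _, dup := seen[lp.GetName()]; dup {
                return fmt.Errorf(
                    "collected metric %q has two or more labels with the same name: %s",
                    metricName, lp.GetName(),
                )
            }
            seen[lp.GetName()] = struct{}{}
        }
        return nil
    }

    func main() {
        s := func(v string) *string { return &v } // helper for the *string proto fields
        // Hypothetical metric with "destination_app" present twice, as in the output above.
        m := &dto.Metric{Label: []*dto.LabelPair{
            {Name: s("destination_app"), Value: s("cms")},
            {Name: s("destination_app"), Value: s("cms")},
            {Name: s("reporter"), Value: s("destination")},
        }}
        fmt.Println(checkLabelNames("istio_tcp_sent_bytes_total", m))
    }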

Expected behavior

HTTP 200 from the metrics endpoint, as during the first few days of collection.

Steps to reproduce the bug

Not totally sure how to reproduce this yet; I thought I would file now in case anyone else experiences the same issue. I'd like to help however I can to try to reproduce it.

Version

Istio 1.0.4
k8s 1.11.4

Installation

Official Helm chart

Environment

AWS, Container Linux (kube-aws installer)

@cknowles (Author) commented Dec 3, 2018

Might be related to the same underlying issue: at the same time as this issue, some app deployments, including existing ones, had only some of their replicas marked as healthy. The other pods had liveness or readiness check failures even though they were running fine. We have mTLS set to false, i.e. permissive mode, so no changes were needed to the health checks.

I suspected the sidecars, mostly because the app logs looked fine, the apps appeared to be running, and the app health checks have not changed in months. Since I was not entirely sure what was wrong after some initial debugging, I killed all pods in the istio-system namespace. Once the Istio pods came back up, both the issue described above and the app health check issue disappeared.

@cknowles (Author) commented Dec 3, 2018

I found some comments that appear to be related in #9043 (comment)

@douglas-reid (Contributor)

@c-knowles there was a report of a similar issue (that we were never able to repro) in #8906. However, after some back and forth with the Prometheus folks, we believed the issue to be fixed with the client-library updates for 1.0.4. It is strange/troubling to see you still experiencing this issue.

@mandarjog @geeknoid do we possibly have an issue with instance creation, memory reuse, or similar?

@douglas-reid (Contributor)

Nope. Even if the instance somehow had multiple labels with the same name, the prom adapter code uses a map for labels, so this should be impossible.

@c-knowles do you have any way to recreate this?
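
For reference, a minimal sketch (not the actual Mixer prometheus adapter code) of why labels held in a map cannot, on their own, yield duplicate names; each name is a unique map key, so it appears at most once when converted to label pairs:

    // A minimal sketch, assuming label handling along these lines: building
    // label pairs from a map[string]string can never emit the same name twice,
    // because map keys are unique.
    package main

    import (
        "fmt"
        "sort"
    )

    type labelPair struct{ name, value string }

    func pairsFromMap(labels map[string]string) []labelPair {
        pairs := make([]labelPair, 0, len(labels))
        for name, value := range labels {
            pairs = append(pairs, labelPair{name, value})
        }
        // Sort for deterministic output; names stay unique because they were map keys.
        sort.Slice(pairs, func(i, j int) bool { return pairs[i].name < pairs[j].name })
        return pairs
    }

    func main() {
        fmt.Println(pairsFromMap(map[string]string{
            "destination_app": "cms",
            "reporter":        "destination",
            "source_app":      "frontend",
        }))
    }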

@cknowles (Author) commented Dec 4, 2018

@douglas-reid I was looking at this for a while yesterday, but no luck so far. Interestingly, it fell over again last night with the same error, so it is reproducible; I'm just not sure of the cause, as there are a lot of moving pieces. Any advice on where to start? Here is a gist of the Helm values I am using: https://gist.github.com/c-knowles/f606fb0e0462759a0354ef737c9e7cc8. The ${} style parameters are substituted for real values.

Not sure if it's related, but I found it interesting that the request count metric seems to be an ever-increasing value that occasionally drops (just not what I'd expect):

[screenshot 2018-12-04 09 31 12: Datadog chart of the request count metric]

@douglas-reid (Contributor)

@c-knowles how are you generating that chart? Do the drops in request count totals correspond with restarts of the telemetry pods?

Can you post the output from a :9093/version lookup on one of the istio-telemetry pods here just to be sure we have the right version of everything?

@beorn7 any ideas here? This should be with:

  name = "github.com/prometheus/client_golang"
...
  revision = "1cafe34db7fdec6022e17e00e1c1ea501022f3e4"
  version = "v0.9.0"

@beorn7 commented Dec 4, 2018

I don't have any theory that hasn't been falsified already. I haven't found any way to create the above error with the Prometheus client library alone, and my look at the Istio code didn't yield any abusive usage that would explain the error. It must be something subtle...

@douglas-reid (Contributor)

@beorn7 wild guess: is it possible that the metric.Write() in Gather races with some other update to the label pairs (maybe in a Sort)? I can't quite see how at the moment.
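
To illustrate the kind of race being hypothesized here (a standalone sketch, not Istio or client_golang code): a writer that re-populates a shared label-pair slice by position can collide with a sorter that reorders the same slice in place, leaving a reader to see one label name at two positions.

    // A standalone sketch of the hypothesized race: one goroutine refreshes a
    // shared label-pair slice by position (standing in for metric.Write), while
    // another sorts the same slice in place (standing in for a gather-side
    // consistency check). Because the writer assumes a fixed layout and the
    // sorter moves elements around, a reader can observe a duplicated name.
    // Run with -race to confirm the data race.
    package main

    import (
        "fmt"
        "sort"
        "time"
    )

    type labelPair struct{ name, value string }

    func main() {
        shared := []labelPair{
            {"source_app", "frontend"},
            {"destination_app", "cms"},
            {"reporter", "destination"},
        }

        go func() { // "Write": refresh the pairs in their assumed, unsorted order
            for {
                shared[0] = labelPair{"source_app", "frontend"}
                shared[1] = labelPair{"destination_app", "cms"}
                shared[2] = labelPair{"reporter", "destination"}
            }
        }()

        go func() { // "Gather": sort the shared slice in place
            for {
                sort.Slice(shared, func(i, j int) bool { return shared[i].name < shared[j].name })
            }
        }()

        deadline := time.Now().Add(5 * time.Second)
        for time.Now().Before(deadline) { // reader: look for a duplicated label name
            seen := map[string]bool{}
            for _, lp := range shared {
                if seen[lp.name] {
                    fmt.Println("duplicate label name observed:", lp.name)
                    return
                }
                seen[lp.name] = true
            }
        }
        fmt.Println("no duplicate observed in this run")
    }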

@beorn7 commented Dec 4, 2018

That's what I tried to provoke in this unit test:
https://github.com/prometheus/client_golang/blob/master/prometheus/registry_test.go#L781

It all behaved as expected.

@cknowles (Author) commented Dec 5, 2018

@douglas-reid the chart is a screenshot from Datadog; it's the average of the reported values scraped from the Prometheus endpoints of the telemetry service:

istio_mesh_endpoint: http://istio-telemetry.istio-system:42422/metrics
mixer_endpoint: http://istio-telemetry.istio-system:9093/metrics

Selecting max or sum gives different values but similar charts; sum is perhaps slightly less spiky, but it still bottoms out after some time. Is this value expected to increase continually, or is it meant to act more like a request rate?

Still not sure if this is related to the original problem, but it looks suspicious nonetheless. Maybe the original metrics-gathering problem is the cause of the sudden drops.
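
On the ever-increasing value: istio_requests_total is exposed as a Prometheus counter, which is cumulative by design and only falls back to zero when whatever exports it is recreated; rates are derived by the scraper from successive samples. A minimal sketch of that behaviour with the Go client (not Istio code):

    // A minimal sketch of Prometheus counter semantics: the exported value is
    // cumulative and only resets when the exporting process/registry restarts,
    // so charts of the raw value climb until a restart and then drop.
    package main

    import (
        "fmt"

        "github.com/prometheus/client_golang/prometheus"
        dto "github.com/prometheus/client_model/go"
    )

    func main() {
        requests := prometheus.NewCounter(prometheus.CounterOpts{
            Name: "example_requests_total",
            Help: "cumulative request count (illustrative only)",
        })

        for i := 0; i < 3; i++ {
            requests.Inc()
            var m dto.Metric
            _ = requests.Write(&m) // what a scrape would read at this point
            fmt.Println("scrape sees:", m.GetCounter().GetValue()) // 1, 2, 3 ...
        }
        // A requests-per-second rate is computed by the scraper from successive
        // cumulative samples; it is not exported directly.
    }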

Version info:

$ curl localhost:9093/version

root@8c2feba0b568-docker.io/istio-1.0.4-d5cb99f479ad9da88eebb8bb3637b17c323bc50b-Clean

@bobbytables (Contributor)

Seeing this as well (I opened the original issue #8906 for this too).

Istio 1.0.4 on K8s 1.9.3. We're using SignalFx to scrape Prometheus endpoints. I've annotated the telemetry pod with the Prometheus annotations so SignalFx picks up the metrics. This works for a couple of hours, and then we see the same issue.

This is a screenshot of it dying in SignalFx

[image: SignalFx chart showing the metrics dropping off]

The error we see now is:

* collected metric "istio_requests_total" { label:<name:"connection_security_policy" value:"unknown" > label:<name:"destination_app" value:"locator" > label:<name:"destination_service" value:"locator.applications.svc.cluster.local" > label:<name:"destination_service_name" value:"locator" > label:<name:"destination_service_namespace" value:"applications" > label:<name:"destination_version" value:"unknown" > label:<name:"destination_version" value:"unknown" > label:<name:"destination_workload" value:"locator-grpc" > label:<name:"destination_workload_namespace" value:"applications" > label:<name:"reporter" value:"source" > label:<name:"request_protocol" value:"grpc" > label:<name:"response_code" value:"200" > label:<name:"source_app" value:"hcm-web" > label:<name:"source_principal" value:"unknown" > label:<name:"source_version" value:"unknown" > label:<name:"source_workload" value:"hcm-web-public" > label:<name:"source_workload_namespace" value:"applications" > counter:<value:8 > } has two or more labels with the same name: destination_version

For now I think we'll just try using the SignalFx adapter in Istio, but this is super frustrating.

@bobbytables (Contributor)

@douglas-reid would it be helpful to jump on a Google Hangout or something? I'd be happy to show you what we're doing instead of going back and forth on this issue.

@douglas-reid (Contributor)

@bobbytables perhaps -- thanks for the offer. Maybe we should sync on times, etc. via a different channel.

I have an idea I want to chase down to see if I can first repro and then fix the issue.

@c-knowles and @bobbytables, when you see the failure case, do you by any chance see time-correlated entries in the istio-telemetry logs like "Built new config.Snapshot: id=", "Reloading adapter", or even "adapter did not close all the scheduled daemons {"adapter": "prometheus.istio-system"}"?

@douglas-reid (Contributor) commented Dec 5, 2018

@beorn7 I modified that test to use 16 labels instead of 3 and ran it for 60 seconds. I was able to repro the issue.

Sample logs:

 * collected metric "test_histogram" { label:<name:"one" value:"b" > label:<name:"five" value:"a" > label:<name:"foo" value:"bar" > label:<name:"fourteen" value:"b" > label:<name:"fourteen" value:"b" > label:<name:"fourteen" value:"b" > label:<name:"nine" value:"beta" > label:<name:"one" value:"b" > label:<name:"eight" value:"beta" > label:<name:"seven" value:"beta" > label:<name:"six" value:"beth" > label:<name:"sixteen" value:"c" > label:<name:"ten" value:"a" > label:<name:"thirteen" value:"beth" > label:<name:"three" value:"a" > label:<name:"twelve" value:"b" > label:<name:"two" value:"c" > histogram:<sample_count:1 sample_sum:0.10041906053093258 bucket:<cumulative_count:0 upper_bound:0.005 > bucket:<cumulative_count:0 upper_bound:0.01 > bucket:<cumulative_count:0 upper_bound:0.025 > bucket:<cumulative_count:0 upper_bound:0.05 > bucket:<cumulative_count:0 upper_bound:0.1 > bucket:<cumulative_count:1 upper_bound:0.25 > bucket:<cumulative_count:1 upper_bound:0.5 > bucket:<cumulative_count:1 upper_bound:1 > bucket:<cumulative_count:1 upper_bound:2.5 > bucket:<cumulative_count:1 upper_bound:5 > bucket:<cumulative_count:1 upper_bound:10 > > } has two or more labels with the same name: fourteen
* collected metric "test_histogram" { label:<name:"eight" value:"alpha" > label:<name:"fifteen" value:"beta" > label:<name:"five" value:"b" > label:<name:"foo" value:"bar" > label:<name:"four" value:"beth" > label:<name:"fourteen" value:"alpha" > label:<name:"fourteen" value:"alpha" > label:<name:"nine" value:"c" > label:<name:"one" value:"gimel" > label:<name:"seven" value:"gamma" > label:<name:"six" value:"c" > label:<name:"sixteen" value:"aleph" > label:<name:"ten" value:"c" > label:<name:"thirteen" value:"gamma" > label:<name:"three" value:"gamma" > label:<name:"twelve" value:"beta" > label:<name:"two" value:"b" > histogram:<sample_count:1 sample_sum:0.1747269037010694 bucket:<cumulative_count:0 upper_bound:0.005 > bucket:<cumulative_count:0 upper_bound:0.01 > bucket:<cumulative_count:0 upper_bound:0.025 > bucket:<cumulative_count:0 upper_bound:0.05 > bucket:<cumulative_count:0 upper_bound:0.1 > bucket:<cumulative_count:1 upper_bound:0.25 > bucket:<cumulative_count:1 upper_bound:0.5 > bucket:<cumulative_count:1 upper_bound:1 > bucket:<cumulative_count:1 upper_bound:2.5 > bucket:<cumulative_count:1 upper_bound:5 > bucket:<cumulative_count:1 upper_bound:10 > > } has two or more labels with the same name: fourteen
* collected metric "test_histogram" { label:<name:"eight" value:"c" > label:<name:"eleven" value:"gamma" > label:<name:"eleven" value:"gamma" > label:<name:"fifteen" value:"a" > label:<name:"five" value:"b" > label:<name:"foo" value:"bar" > label:<name:"fourteen" value:"b" > label:<name:"nine" value:"gamma" > label:<name:"one" value:"gimel" > label:<name:"seven" value:"beth" > label:<name:"six" value:"c" > label:<name:"sixteen" value:"aleph" > label:<name:"ten" value:"alpha" > label:<name:"thirteen" value:"b" > label:<name:"three" value:"beta" > label:<name:"twelve" value:"alpha" > label:<name:"two" value:"gimel" > histogram:<sample_count:1 sample_sum:0.2885442155939729 bucket:<cumulative_count:0 upper_bound:0.005 > bucket:<cumulative_count:0 upper_bound:0.01 > bucket:<cumulative_count:0 upper_bound:0.025 > bucket:<cumulative_count:0 upper_bound:0.05 > bucket:<cumulative_count:0 upper_bound:0.1 > bucket:<cumulative_count:0 upper_bound:0.25 > bucket:<cumulative_count:1 upper_bound:0.5 > bucket:<cumulative_count:1 upper_bound:1 > bucket:<cumulative_count:1 upper_bound:2.5 > bucket:<cumulative_count:1 upper_bound:5 > bucket:<cumulative_count:1 upper_bound:10 > > } has two or more labels with the same name: eleven

So it appears that the number of labels makes a difference. I have to bounce between a few meetings, but I'll follow up with more info later today.
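
For anyone who wants to poke at this, here is a rough standalone approximation of that setup (not the actual registry_test.go change): a histogram with 16 label dimensions, concurrent observations, and two goroutines gathering at the same time, roughly mirroring two scrapers hitting the same /metrics endpoint.

    // A rough standalone approximation of the stress test described above. With
    // an affected client_golang version this can surface the "two or more labels
    // with the same name" error; on a fixed version it should stay quiet.
    package main

    import (
        "fmt"
        "strconv"
        "strings"
        "time"

        "github.com/prometheus/client_golang/prometheus"
    )

    func main() {
        labelNames := make([]string, 16) // 16 labels, as in the modified test
        for i := range labelNames {
            labelNames[i] = "label_" + strconv.Itoa(i)
        }

        hist := prometheus.NewHistogramVec(prometheus.HistogramOpts{
            Name: "test_histogram",
            Help: "stress-test histogram",
        }, labelNames)

        reg := prometheus.NewRegistry()
        reg.MustRegister(hist)

        // Writers: keep observing with varying label values.
        for w := 0; w < 4; w++ {
            go func(seed int) {
                values := make([]string, len(labelNames))
                for i := 0; ; i++ {
                    for j := range values {
                        values[j] = strconv.Itoa((i + seed + j) % 3)
                    }
                    hist.WithLabelValues(values...).Observe(float64(i%10) / 10)
                }
            }(w)
        }

        // Two concurrent gatherers, like two scrapers (e.g. Prometheus plus Datadog).
        errs := make(chan error, 2)
        for g := 0; g < 2; g++ {
            go func() {
                for {
                    if _, err := reg.Gather(); err != nil &&
                        strings.Contains(err.Error(), "two or more labels") {
                        errs <- err
                        return
                    }
                }
            }()
        }

        select {
        case err := <-errs:
            fmt.Println("reproduced:", err)
        case <-time.After(60 * time.Second):
            fmt.Println("no duplicate-label error within 60s")
        }
    }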

@beorn7 commented Dec 5, 2018

@douglas-reid Perfect. I'll try to do the same. Once we can reproduce it in a test, we are halfway to solving this problem.

@douglas-reid (Contributor)

@beorn7 I sent a provisional PR for your consideration and to help in tracking the issue: prometheus/client_golang#511.

@cknowles (Author) commented Dec 6, 2018

@douglas-reid I had a look through the last day or so. It seems that every time the chart bottoms out we get several Built new config.Snapshot: id= logs. The max ID is around 4 or 6 (it varies). I don't see any Reloading adapter logs at all for the last day. There are some adapter did not close all the scheduled daemons {"adapter": "prometheus.istio-system"} entries, but their timings vary; some are aligned with the drops and some are not.

@beorn7 commented Dec 6, 2018

Good news: The issue in client_golang has been identified. prometheus/client_golang#513 should fix it.
It will be part of v0.9.2, which I'll release once the fix is merged.

@mhamrah commented Dec 7, 2018 via email

@douglas-reid (Contributor)

As this has been patched in the 1.0 and 1.1 branches, I think it is now safe to close this issue. Please re-open if you feel that is not appropriate.

@mhamrah commented Jan 9, 2019 via email
