Duplicate labels on telemetry metric export #10236

Closed
cknowles opened this issue Dec 3, 2018 · 21 comments
@cknowles commented Dec 3, 2018

Describe the bug
We installed Istio 1.0.4 using the Helm chart instructions on k8s 1.11.4, and it has been running for several days. We also hooked it up to Datadog metric collection using their instructions. Metrics came through fine for a couple of days and then suddenly stopped. Looking into why, it appears the telemetry service is returning dozens of errors like the ones below about duplicate labels. The only place I have found such an error so far is at https://github.com/prometheus/node_exporter/blob/f9dd8e9b8c29f6c9da676036d8a8c587326bb710/vendor/github.com/prometheus/client_golang/prometheus/registry.go#L845

$ curl -v http://istio-telemetry.istio-system:42422/metrics
*   Trying 10.3.237.129...
* TCP_NODELAY set
* Connected to istio-telemetry.istio-system (10.3.237.129) port 42422 (#0)
> GET /metrics HTTP/1.1
> Host: istio-telemetry.istio-system:42422
> User-Agent: curl/7.59.0
> Accept: */*
> 
< HTTP/1.1 500 Internal Server Error
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Mon, 03 Dec 2018 05:14:42 GMT
< Transfer-Encoding: chunked
< 
An error has occurred during metrics gathering:

47 error(s) occurred:
* collected metric "istio_tcp_sent_bytes_total" { label:<name:"connection_security_policy" value:"none" > label:<name:"connection_security_policy" value:"none" > label:<name:"destination_app" value:"cms" > label:<name:"destination_service" value:"unknown" > label:<name:"destination_service_name" value:"unknown" > label:<name:"destination_service_namespace" value:"unknown" > label:<name:"destination_version" value:"81bb5c3" > label:<name:"destination_workload_namespace" value:"int" > label:<name:"destination_principal" value:"unknown" > label:<name:"reporter" value:"destination" > label:<name:"source_app" value:"frontend" > label:<name:"source_principal" value:"unknown" > label:<name:"source_version" value:"06195b2" > label:<name:"source_workload" value:"frontend-fix-update-blog-article-layout" > label:<name:"source_workload_namespace" value:"int" > counter:<value:1.088915e+06 > } has two or more labels with the same name: connection_security_policy
* collected metric "istio_tcp_sent_bytes_total" { label:<name:"connection_security_policy" value:"unknown" > label:<name:"destination_app" value:"unknown" > label:<name:"destination_app" value:"unknown" > label:<name:"destination_service" value:"aerospike-3" > label:<name:"destination_service_name" value:"aerospike-3" > label:<name:"destination_service_namespace" value:"int-shared" > label:<name:"destination_version" value:"unknown" > label:<name:"destination_workload" value:"unknown" > label:<name:"destination_workload_namespace" value:"unknown" > label:<name:"reporter" value:"source" > label:<name:"source_app" value:"entities-manager" > label:<name:"source_principal" value:"unknown" > label:<name:"source_version" value:"unknown" > label:<name:"source_workload" value:"entities-manager-statefulset-v1-next" > label:<name:"source_workload_namespace" value:"int-shared" > counter:<value:2.362107154e+09 > } has two or more labels with the same name: destination_app
* collected metric "istio_tcp_sent_bytes_total" { label:<name:"connection_security_policy" value:"none" > label:<name:"destination_app" value:"be-api" > label:<name:"destination_app" value:"be-api" > label:<name:"destination_principal" value:"unknown" > label:<name:"destination_service" value:"unknown" > label:<name:"destination_service_name" value:"unknown" > label:<name:"destination_service_namespace" value:"unknown" > label:<name:"destination_workload" value:"be-api-next" > label:<name:"destination_workload_namespace" value:"int" > label:<name:"reporter" value:"destination" > label:<name:"source_app" value:"unknown" > label:<name:"source_principal" value:"unknown" > label:<name:"source_version" value:"unknown" > label:<name:"source_workload" value:"unknown" > label:<name:"source_workload_namespace" value:"unknown" > counter:<value:3065 > } has two or more labels with the same name: destination_app
* collected metric "istio_tcp_sent_bytes_total" { label:<name:"connection_security_policy" value:"none" > label:<name:"destination_app" value:"client" > label:<name:"destination_app" value:"client" > label:<name:"destination_principal" value:"unknown" > label:<name:"destination_service" value:"unknown" > label:<name:"destination_service_name" value:"unknown" > label:<name:"destination_service_namespace" value:"unknown" > label:<name:"destination_version" value:"5d746da" > label:<name:"destination_workload" value:"client-master" > label:<name:"reporter" value:"destination" > label:<name:"source_app" value:"frontend" > label:<name:"source_principal" value:"unknown" > label:<name:"source_version" value:"6559340" > label:<name:"source_workload" value:"frontend-fix-outrights-loading" > label:<name:"source_workload_namespace" value:"int" > counter:<value:44973 > } has two or more labels with the same name: destination_app
* collected metric "istio_tcp_received_bytes_total" { label:<name:"connection_security_policy" value:"unknown" > label:<name:"destination_app" value:"unknown" > label:<name:"destination_app" value:"unknown" > label:<name:"destination_app" value:"unknown" > label:<name:"destination_principal" value:"unknown" > label:<name:"destination_service" value:"aerospike-3" > label:<name:"destination_service_name" value:"aerospike-3" > label:<name:"destination_workload" value:"unknown" > label:<name:"destination_workload_namespace" value:"unknown" > label:<name:"reporter" value:"source" > label:<name:"source_app" value:"distribution-engine" > label:<name:"source_principal" value:"unknown" > label:<name:"source_version" value:"9ec06fb" > label:<name:"source_workload" value:"distribution-engine-next" > label:<name:"source_workload_namespace" value:"int" > counter:<value:1.15981052e+08 > } has two or more labels with the same name: destination_app
* collected metric "istio_tcp_received_bytes_total" { label:<name:"destination_app" value:"unknown" > label:<name:"destination_principal" value:"unknown" > label:<name:"destination_service" value:"aerospike-2" > label:<name:"destination_service_name" value:"aerospike-2" > label:<name:"destination_service_namespace" value:"int" > label:<name:"destination_service_namespace" value:"int" > label:<name:"destination_workload" value:"unknown" > label:<name:"connection_security_policy" value:"unknown" > label:<name:"destination_workload_namespace" value:"unknown" > label:<name:"reporter" value:"source" > label:<name:"source_app" value:"api" > label:<name:"source_principal" value:"unknown" > label:<name:"source_version" value:"8b52afe" > label:<name:"source_workload" value:"api-chore-clusterization" > label:<name:"source_workload_namespace" value:"int" > counter:<value:4425 > } has two or more labels with the same name: destination_service_namespace

[...]
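
For context, a minimal sketch (not the actual client_golang registry.go code) of the kind of duplicate-label-name check that produces this error at scrape time; any one failing metric fails the whole /metrics response:

    // A minimal sketch (not the real registry implementation) of a
    // duplicate-label-name check applied to each gathered metric.
    package main

    import (
        "fmt"

        dto "github.com/prometheus/client_model/go"
    )

    func checkLabelNames(metricName string, m *dto.Metric) error {
        seen := make(map[string]struct{}, len(m.GetLabel()))
        for _, lp := range m.GetLabel() {
            if _, dup := seen[lp.GetName()]; dup {
                return fmt.Errorf(
                    "collected metric %q has two or more labels with the same name: %s",
                    metricName, lp.GetName(),
                )
            }
            seen[lp.GetName()] = struct{}{}
        }
        return nil
    }

    func main() {
        s := func(v string) *string { return &v } // helper for the *string proto fields
        // Hypothetical metric with "destination_app" present twice, as in the output above.
        m := &dto.Metric{Label: []*dto.LabelPair{
            {Name: s("destination_app"), Value: s("cms")},
            {Name: s("destination_app"), Value: s("cms")},
            {Name: s("reporter"), Value: s("destination")},
        }}
        fmt.Println(checkLabelNames("istio_tcp_sent_bytes_total", m))
    }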

Expected behavior

HTTP 200 from the metrics endpoint, as during the first few days of collection.

Steps to reproduce the bug

Not totally sure how to reproduce this yet; I thought I would file now in case anyone else experiences the same issue. I'd like to help however I can to try to reproduce it.

Version

Istio 1.0.4
k8s 1.11.4

Installation

Official Helm chart

Environment

AWS, Container Linux (kube-aws installer)

@cknowles (Author) commented Dec 3, 2018

Might be related to the same underlying issue: at the same time as this issue, some app deployments, including existing ones, had only some of their replicas marked as healthy. The other pods had liveness or readiness check failures even though they were running fine. We have mTLS set to false, i.e. permissive mode, so no changes were needed to the health checks.

I suspected the sidecars, mostly because the app logs looked fine, the apps appeared to be running, and the app health checks have not changed in months. Since I was not entirely sure what was wrong after some initial debugging, I killed all pods in the istio-system namespace. Once the Istio pods came back up, both the issue described above and the app health check issue disappeared.

@cknowles (Author) commented Dec 3, 2018

I found some comments that appear to be related in #9043 (comment)

@douglas-reid (Contributor)

@c-knowles there was a report of a similar issue (that we were never able to repro) in #8906. However, after some back and forth with the Prometheus folks, we believed the issue to be fixed with the client-library updates for 1.0.4. It is strange/troubling to see you still experiencing this issue.

@mandarjog @geeknoid do we possibly have an issue with instance creation, memory reuse, or similar?

@douglas-reid (Contributor)

Nope. Even if the instance somehow had multiple labels with the same name, the prom adapter code uses a map for labels, so this should be impossible.

@c-knowles do you have any way to recreate this?
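
For reference, a minimal sketch (not the actual Mixer prometheus adapter code) of why labels held in a map cannot, on their own, yield duplicate names; each name is a unique map key, so it appears at most once when converted to label pairs:

    // A minimal sketch, assuming label handling along these lines: building
    // label pairs from a map[string]string can never emit the same name twice,
    // because map keys are unique.
    package main

    import (
        "fmt"
        "sort"
    )

    type labelPair struct{ name, value string }

    func pairsFromMap(labels map[string]string) []labelPair {
        pairs := make([]labelPair, 0, len(labels))
        for name, value := range labels {
            pairs = append(pairs, labelPair{name, value})
        }
        // Sort for deterministic output; names stay unique because they were map keys.
        sort.Slice(pairs, func(i, j int) bool { return pairs[i].name < pairs[j].name })
        return pairs
    }

    func main() {
        fmt.Println(pairsFromMap(map[string]string{
            "destination_app": "cms",
            "reporter":        "destination",
            "source_app":      "frontend",
        }))
    }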

@cknowles (Author) commented Dec 4, 2018

@douglas-reid I was looking at this for a while yesterday, but no luck so far. Interestingly, it fell over again last night with the same error, so it is reproducible; I'm just not sure of the cause, as there are a lot of moving pieces. Any advice on where to start? Here is a gist of the Helm values I am using: https://gist.github.com/c-knowles/f606fb0e0462759a0354ef737c9e7cc8. The ${} style parameters are substituted for real values.

Not sure if it's related, but I found it interesting that the request count metric seems to be an ever-increasing value that occasionally drops (just not what I'd expect):

[screenshot 2018-12-04 09 31 12: Datadog chart of the request count metric]

@douglas-reid (Contributor)

@c-knowles how are you generating that chart? Do the drops in request count totals correspond with restarts of the telemetry pods?

Can you post the output from a :9093/version lookup on one of the istio-telemetry pods here just to be sure we have the right version of everything?

@beorn7 any ideas here? This should be with:

  name = "github.com/prometheus/client_golang"
...
  revision = "1cafe34db7fdec6022e17e00e1c1ea501022f3e4"
  version = "v0.9.0"

@beorn7 commented Dec 4, 2018

I don't have any theory that hasn't been falsified already. I haven't found any way to create the above error with the Prometheus client library alone, and my look at the Istio code didn't yield any abusive usage that would explain the error. It must be something subtle...

@douglas-reid (Contributor)

@beorn7 wild guess: is it possible that the metric.Write() in Gather races with some other update to the label pairs (maybe in a Sort)? I can't quite see how at the moment.
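
To illustrate the kind of race being hypothesized here (a standalone sketch, not Istio or client_golang code): a writer that re-populates a shared label-pair slice by position can collide with a sorter that reorders the same slice in place, leaving a reader to see one label name at two positions.

    // A standalone sketch of the hypothesized race: one goroutine refreshes a
    // shared label-pair slice by position (standing in for metric.Write), while
    // another sorts the same slice in place (standing in for a gather-side
    // consistency check). Because the writer assumes a fixed layout and the
    // sorter moves elements around, a reader can observe a duplicated name.
    // Run with -race to confirm the data race.
    package main

    import (
        "fmt"
        "sort"
        "time"
    )

    type labelPair struct{ name, value string }

    func main() {
        shared := []labelPair{
            {"source_app", "frontend"},
            {"destination_app", "cms"},
            {"reporter", "destination"},
        }

        go func() { // "Write": refresh the pairs in their assumed, unsorted order
            for {
                shared[0] = labelPair{"source_app", "frontend"}
                shared[1] = labelPair{"destination_app", "cms"}
                shared[2] = labelPair{"reporter", "destination"}
            }
        }()

        go func() { // "Gather": sort the shared slice in place
            for {
                sort.Slice(shared, func(i, j int) bool { return shared[i].name < shared[j].name })
            }
        }()

        deadline := time.Now().Add(5 * time.Second)
        for time.Now().Before(deadline) { // reader: look for a duplicated label name
            seen := map[string]bool{}
            for _, lp := range shared {
                if seen[lp.name] {
                    fmt.Println("duplicate label name observed:", lp.name)
                    return
                }
                seen[lp.name] = true
            }
        }
        fmt.Println("no duplicate observed in this run")
    }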

@beorn7 commented Dec 4, 2018

That's what I tried to provoke in this unit test:
https://github.com/prometheus/client_golang/blob/master/prometheus/registry_test.go#L781

It all behaved as expected.

@cknowles (Author) commented Dec 5, 2018

@douglas-reid the chart is a screenshot from Datadog; it's the average of the reported values scraped from the Prometheus endpoints of the telemetry service:

istio_mesh_endpoint: http://istio-telemetry.istio-system:42422/metrics
mixer_endpoint: http://istio-telemetry.istio-system:9093/metrics

Selecting max or sum gives different values but similar charts; sum is perhaps slightly less spiky, but it still bottoms out after some time. Is this value expected to increase continually, or is it meant to act more like a request rate?

Still not sure if this is related to the original problem, but it looks suspicious nonetheless. Maybe the original metrics-gathering problem is the cause of the sudden drops.
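
On the ever-increasing value: istio_requests_total is exposed as a Prometheus counter, which is cumulative by design and only falls back to zero when whatever exports it is recreated; rates are derived by the scraper from successive samples. A minimal sketch of that behaviour with the Go client (not Istio code):

    // A minimal sketch of Prometheus counter semantics: the exported value is
    // cumulative and only resets when the exporting process/registry restarts,
    // so charts of the raw value climb until a restart and then drop.
    package main

    import (
        "fmt"

        "github.com/prometheus/client_golang/prometheus"
        dto "github.com/prometheus/client_model/go"
    )

    func main() {
        requests := prometheus.NewCounter(prometheus.CounterOpts{
            Name: "example_requests_total",
            Help: "cumulative request count (illustrative only)",
        })

        for i := 0; i < 3; i++ {
            requests.Inc()
            var m dto.Metric
            _ = requests.Write(&m) // what a scrape would read at this point
            fmt.Println("scrape sees:", m.GetCounter().GetValue()) // 1, 2, 3 ...
        }
        // A requests-per-second rate is computed by the scraper from successive
        // cumulative samples; it is not exported directly.
    }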

Version info:

$ curl localhost:9093/version

root@8c2feba0b568-docker.io/istio-1.0.4-d5cb99f479ad9da88eebb8bb3637b17c323bc50b-Clean

@bobbytables (Contributor)

Seeing this as well (I opened the original issue #8906 for this too).

Istio 1.0.4 on K8s 1.9.3. We're using SignalFx to scrape Prometheus endpoints. I've annotated the telemetry pod with the Prometheus annotations so SignalFx picks up the metrics. This works for a couple of hours, and then we see the same issue.

This is a screenshot of it dying in SignalFx

[image: SignalFx chart showing the metrics dropping off]

The error we see now is:

* collected metric "istio_requests_total" { label:<name:"connection_security_policy" value:"unknown" > label:<name:"destination_app" value:"locator" > label:<name:"destination_service" value:"locator.applications.svc.cluster.local" > label:<name:"destination_service_name" value:"locator" > label:<name:"destination_service_namespace" value:"applications" > label:<name:"destination_version" value:"unknown" > label:<name:"destination_version" value:"unknown" > label:<name:"destination_workload" value:"locator-grpc" > label:<name:"destination_workload_namespace" value:"applications" > label:<name:"reporter" value:"source" > label:<name:"request_protocol" value:"grpc" > label:<name:"response_code" value:"200" > label:<name:"source_app" value:"hcm-web" > label:<name:"source_principal" value:"unknown" > label:<name:"source_version" value:"unknown" > label:<name:"source_workload" value:"hcm-web-public" > label:<name:"source_workload_namespace" value:"applications" > counter:<value:8 > } has two or more labels with the same name: destination_version

For now I think we'll just try using the SignalFx adapter in Istio, but this is super frustrating.

@bobbytables (Contributor)

@douglas-reid would it be helpful to jump on a Google Hangout or something? I'd be happy to show you what we're doing instead of going back and forth on this issue.

@douglas-reid (Contributor)

@bobbytables perhaps -- thanks for the offer. Maybe we should sync on times, etc. via a different channel.

I have an idea I want to chase down to see if I can first repro and then fix the issue.

@c-knowles and @bobbytables, when you see the failure case, do you by any chance see time-correlated entries in the istio-telemetry logs like "Built new config.Snapshot: id=", "Reloading adapter", or even "adapter did not close all the scheduled daemons {"adapter": "prometheus.istio-system"}"?

@douglas-reid (Contributor) commented Dec 5, 2018

@beorn7 I modified that test to use 16 labels instead of 3 and ran it for 60 seconds. I was able to repro the issue.

Sample logs:

 * collected metric "test_histogram" { label:<name:"one" value:"b" > label:<name:"five" value:"a" > label:<name:"foo" value:"bar" > label:<name:"fourteen" value:"b" > label:<name:"fourteen" value:"b" > label:<name:"fourteen" value:"b" > label:<name:"nine" value:"beta" > label:<name:"one" value:"b" > label:<name:"eight" value:"beta" > label:<name:"seven" value:"beta" > label:<name:"six" value:"beth" > label:<name:"sixteen" value:"c" > label:<name:"ten" value:"a" > label:<name:"thirteen" value:"beth" > label:<name:"three" value:"a" > label:<name:"twelve" value:"b" > label:<name:"two" value:"c" > histogram:<sample_count:1 sample_sum:0.10041906053093258 bucket:<cumulative_count:0 upper_bound:0.005 > bucket:<cumulative_count:0 upper_bound:0.01 > bucket:<cumulative_count:0 upper_bound:0.025 > bucket:<cumulative_count:0 upper_bound:0.05 > bucket:<cumulative_count:0 upper_bound:0.1 > bucket:<cumulative_count:1 upper_bound:0.25 > bucket:<cumulative_count:1 upper_bound:0.5 > bucket:<cumulative_count:1 upper_bound:1 > bucket:<cumulative_count:1 upper_bound:2.5 > bucket:<cumulative_count:1 upper_bound:5 > bucket:<cumulative_count:1 upper_bound:10 > > } has two or more labels with the same name: fourteen
* collected metric "test_histogram" { label:<name:"eight" value:"alpha" > label:<name:"fifteen" value:"beta" > label:<name:"five" value:"b" > label:<name:"foo" value:"bar" > label:<name:"four" value:"beth" > label:<name:"fourteen" value:"alpha" > label:<name:"fourteen" value:"alpha" > label:<name:"nine" value:"c" > label:<name:"one" value:"gimel" > label:<name:"seven" value:"gamma" > label:<name:"six" value:"c" > label:<name:"sixteen" value:"aleph" > label:<name:"ten" value:"c" > label:<name:"thirteen" value:"gamma" > label:<name:"three" value:"gamma" > label:<name:"twelve" value:"beta" > label:<name:"two" value:"b" > histogram:<sample_count:1 sample_sum:0.1747269037010694 bucket:<cumulative_count:0 upper_bound:0.005 > bucket:<cumulative_count:0 upper_bound:0.01 > bucket:<cumulative_count:0 upper_bound:0.025 > bucket:<cumulative_count:0 upper_bound:0.05 > bucket:<cumulative_count:0 upper_bound:0.1 > bucket:<cumulative_count:1 upper_bound:0.25 > bucket:<cumulative_count:1 upper_bound:0.5 > bucket:<cumulative_count:1 upper_bound:1 > bucket:<cumulative_count:1 upper_bound:2.5 > bucket:<cumulative_count:1 upper_bound:5 > bucket:<cumulative_count:1 upper_bound:10 > > } has two or more labels with the same name: fourteen
* collected metric "test_histogram" { label:<name:"eight" value:"c" > label:<name:"eleven" value:"gamma" > label:<name:"eleven" value:"gamma" > label:<name:"fifteen" value:"a" > label:<name:"five" value:"b" > label:<name:"foo" value:"bar" > label:<name:"fourteen" value:"b" > label:<name:"nine" value:"gamma" > label:<name:"one" value:"gimel" > label:<name:"seven" value:"beth" > label:<name:"six" value:"c" > label:<name:"sixteen" value:"aleph" > label:<name:"ten" value:"alpha" > label:<name:"thirteen" value:"b" > label:<name:"three" value:"beta" > label:<name:"twelve" value:"alpha" > label:<name:"two" value:"gimel" > histogram:<sample_count:1 sample_sum:0.2885442155939729 bucket:<cumulative_count:0 upper_bound:0.005 > bucket:<cumulative_count:0 upper_bound:0.01 > bucket:<cumulative_count:0 upper_bound:0.025 > bucket:<cumulative_count:0 upper_bound:0.05 > bucket:<cumulative_count:0 upper_bound:0.1 > bucket:<cumulative_count:0 upper_bound:0.25 > bucket:<cumulative_count:1 upper_bound:0.5 > bucket:<cumulative_count:1 upper_bound:1 > bucket:<cumulative_count:1 upper_bound:2.5 > bucket:<cumulative_count:1 upper_bound:5 > bucket:<cumulative_count:1 upper_bound:10 > > } has two or more labels with the same name: eleven

So it appears that the number of labels makes a difference. I have to bounce between a few meetings, but I'll follow up with more info later today.
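
For anyone who wants to poke at this, here is a rough standalone approximation of that setup (not the actual registry_test.go change): a histogram with 16 label dimensions, concurrent observations, and two goroutines gathering at the same time, roughly mirroring two scrapers hitting the same /metrics endpoint.

    // A rough standalone approximation of the stress test described above. With
    // an affected client_golang version this can surface the "two or more labels
    // with the same name" error; on a fixed version it should stay quiet.
    package main

    import (
        "fmt"
        "strconv"
        "strings"
        "time"

        "github.com/prometheus/client_golang/prometheus"
    )

    func main() {
        labelNames := make([]string, 16) // 16 labels, as in the modified test
        for i := range labelNames {
            labelNames[i] = "label_" + strconv.Itoa(i)
        }

        hist := prometheus.NewHistogramVec(prometheus.HistogramOpts{
            Name: "test_histogram",
            Help: "stress-test histogram",
        }, labelNames)

        reg := prometheus.NewRegistry()
        reg.MustRegister(hist)

        // Writers: keep observing with varying label values.
        for w := 0; w < 4; w++ {
            go func(seed int) {
                values := make([]string, len(labelNames))
                for i := 0; ; i++ {
                    for j := range values {
                        values[j] = strconv.Itoa((i + seed + j) % 3)
                    }
                    hist.WithLabelValues(values...).Observe(float64(i%10) / 10)
                }
            }(w)
        }

        // Two concurrent gatherers, like two scrapers (e.g. Prometheus plus Datadog).
        errs := make(chan error, 2)
        for g := 0; g < 2; g++ {
            go func() {
                for {
                    if _, err := reg.Gather(); err != nil &&
                        strings.Contains(err.Error(), "two or more labels") {
                        errs <- err
                        return
                    }
                }
            }()
        }

        select {
        case err := <-errs:
            fmt.Println("reproduced:", err)
        case <-time.After(60 * time.Second):
            fmt.Println("no duplicate-label error within 60s")
        }
    }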

@beorn7 commented Dec 5, 2018

@douglas-reid Perfect. I'll try to do the same. Once we can reproduce it in a test, we are halfway to solving this problem.

@douglas-reid (Contributor)

@beorn7 I sent a provisional PR for your consideration and to help in tracking the issue: prometheus/client_golang#511.

@cknowles (Author) commented Dec 6, 2018

@douglas-reid I had a look through the last day or so. It seems that every time the chart bottoms out we get several Built new config.Snapshot: id= logs. The max ID is around 4 or 6 (it varies). I don't see any Reloading adapter logs at all for the last day. There are some adapter did not close all the scheduled daemons {"adapter": "prometheus.istio-system"} entries, but their timings vary; some are aligned with the drops and some are not.

@beorn7 commented Dec 6, 2018

Good news: The issue in client_golang has been identified. prometheus/client_golang#513 should fix it.
It will be part of v0.9.2, which I'll release once the fix is merged.

@mhamrah commented Dec 7, 2018 via email

@douglas-reid (Contributor)

As this has been patched in the 1.0 and 1.1 branches, I think it is now safe to close this issue. Please re-open if you feel that is not appropriate.

@mhamrah commented Jan 9, 2019 via email
