Duplicate labels on telemetry metric export #10236
Comments
This might be related to the same underlying issue: at the same time as this issue, some app deployments (including existing ones) only had some of their replicas marked as healthy. The other pods had liveness or readiness check failures even though they were running fine. We have mTLS set to false (permissive mode), i.e. no changes were needed to the health checks. I suspected the sidecars, mostly because the app logs looked fine, the apps appeared to be running, and the app health checks have not changed in months. Since I was not entirely sure what was wrong after some initial debugging, I killed all pods in the …
I found some comments that appear to be related in #9043 (comment).
@c-knowles there was a report of a similar issue (that we were never able to repro) in #8906. However, after some back and forth with the Prometheus folks, we believed the issue to be fixed with the client-library updates for 1.0.4. It is strange/troubling to see you still experiencing this issue. @mandarjog @geeknoid do we possibly have an issue with instance creation, memory reuse, or similar?
Nope. Even if the instance somehow has multiple labels with the same name, the prom adapter code uses a map for labels, so this should be impossible. @c-knowles do you have any way to recreate it?
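For illustration, here is a minimal sketch (not the actual Mixer prometheus adapter code; the dimension and metric names are made up) of the point above: when labels are built through a map, a repeated label name simply overwrites the earlier entry, so the label set handed to the Prometheus client can never contain duplicates.

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Hypothetical instance dimensions; note that "source" appears twice.
	dims := []struct{ name, value string }{
		{"source", "svc-a"},
		{"destination", "svc-b"},
		{"source", "svc-a-again"},
	}

	// prometheus.Labels is a map[string]string, so assigning the same key
	// twice just overwrites the first value instead of duplicating the label.
	labels := prometheus.Labels{}
	for _, d := range dims {
		labels[d.name] = d.value
	}
	fmt.Println(len(labels), labels) // 2 map[destination:svc-b source:svc-a-again]

	cv := prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "requests_total", Help: "example counter"},
		[]string{"source", "destination"},
	)
	cv.With(labels).Inc() // With takes prometheus.Labels, i.e. a map
}
```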
@douglas-reid I was looking at this for a while yesterday but no luck so far. Interestingly, it fell over again last night with the same error, so it is reproducible; I'm just not sure of the cause as there are a lot of moving pieces. Any advice on where to start? Here is a gist of the Helm values I am using: https://gist.github.com/c-knowles/f606fb0e0462759a0354ef737c9e7cc8. Not sure if it's related, but I found it interesting that the metric for request count seems to be an ever-increasing value that occasionally drops (just not what I'd expect):
@c-knowles how are you generating that chart? Do the drops in request count totals correspond with restarts of the telemetry pods? Can you post the output from a …?

@beorn7 any ideas here? This should be with:

```toml
name = "github.com/prometheus/client_golang"
...
revision = "1cafe34db7fdec6022e17e00e1c1ea501022f3e4"
version = "v0.9.0"
```
I don't have any theory that hasn't been falsified already. I haven't found any way to create the above error with the Prometheus client library alone, and my look at the Istio code didn't yield any abusive usage that would explain the error. It must be something subtle...
@beorn7 wild guess: is it possible that the metric.Write() in Gather is racy with some other update to the label pairs (maybe in a Sort)? I can't quite see how at the moment.
That's what I tried to provoke in this unit test: … It all behaved as expected.
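For reference, here is a rough sketch of the kind of concurrent Gather-versus-update exercise being discussed: many labels, continuous counter updates, and a loop that gathers and checks each scraped metric for repeated label names. This is not the actual client_golang unit test; the package, metric name, label count, and durations are all illustrative.

```go
package dupelabels

import (
	"fmt"
	"sync"
	"testing"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

func TestGatherWithManyLabelsUnderConcurrentUpdates(t *testing.T) {
	const numLabels = 16 // the repro reportedly needed far more than 3 labels

	labelNames := make([]string, numLabels)
	labelValues := make([]string, numLabels)
	for i := range labelNames {
		labelNames[i] = fmt.Sprintf("label_%d", i)
		labelValues[i] = fmt.Sprintf("value_%d", i)
	}

	reg := prometheus.NewRegistry()
	cv := prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "test_requests_total", Help: "test counter"},
		labelNames,
	)
	reg.MustRegister(cv)

	stop := make(chan struct{})
	var wg sync.WaitGroup

	// Writer goroutines: keep incrementing the same many-label child.
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-stop:
					return
				default:
					cv.WithLabelValues(labelValues...).Inc()
				}
			}
		}()
	}

	// Reader loop: gather repeatedly and look for duplicated label names.
	deadline := time.Now().Add(2 * time.Second) // the original repro ran for ~60s
	for time.Now().Before(deadline) {
		mfs, err := reg.Gather()
		if err != nil {
			t.Fatalf("Gather failed: %v", err)
		}
		for _, mf := range mfs {
			for _, m := range mf.GetMetric() {
				seen := map[string]bool{}
				for _, lp := range m.GetLabel() {
					if seen[lp.GetName()] {
						t.Fatalf("duplicate label name %q in metric %q", lp.GetName(), mf.GetName())
					}
					seen[lp.GetName()] = true
				}
			}
		}
	}
	close(stop)
	wg.Wait()
}
```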
@douglas-reid the chart is a screenshot from Datadog; it's the average of the reported values scraped from the Prometheus endpoints of the telemetry service:

Selecting max or sum gives different values but similar charts, perhaps slightly less spiky with sum, but it still bottoms out after some time. Is it expected that this value continually increases, or is it meant to act more like a request rate? Still not sure if this is related to the original problem, but it looks suspicious nonetheless. Maybe this problem is the cause of the sudden drop. Version info:
Seeing this as well (I opened the original issue #8906 for this too). Istio 1.0.4 on K8s 1.9.3. We're using SignalFx to scrape Prometheus endpoints. I've annotated the telemetry pod with the Prometheus annotations so SignalFx picks up the metrics. This works for a couple of hours and then we see the same issue. This is a screenshot of it dying in SignalFx. The error we see now is:

For now, I think we'll just try to use the SignalFx adapter in Istio, but this is super frustrating.
@douglas-reid would it be helpful to jump on a Google Hangout or something? I'd be happy to show you what we're doing instead of going back and forth on this issue.
@bobbytables perhaps -- thanks for the offer. Maybe we should sync via a different mechanism about times, etc. I have an idea I want to chase down to see if I can first repro and then fix the issue. @c-knowles and @bobbytables, when you see the failure case, do you by any chance see a time-correlated log line in the istio-telemetry logs like "Built new config.Snapshot: id=", "Reloading adapter", or even "adapter did not close all the scheduled daemons {"adapter": "prometheus.istio-system"}"?
@beorn7 I modified that test by adding 16 labels instead of 3 and ran it for 60 seconds. I was able to repro the issue. Sample logs:
So it appears that the number of labels makes a difference. I have to bounce between a few meetings, but I'll follow up with more info later today.
@douglas-reid Perfect. I'll try to do the same. Once we can reproduce it in a test, we are half-way to solving this problem.
@beorn7 I sent a provisional PR for your consideration and to help in tracking the issue: prometheus/client_golang#511.
@douglas-reid I had a look through the last day or so. It seems every time that chart bottoms out we get several …
Good news: the issue in client_golang has been identified. prometheus/client_golang#513 should fix it. It will be part of v0.9.2, which I'll release once the fix is merged.
Nice
As this has been patched in the 1.0 and 1.1 branches, I think it is now safe to close this issue. Please re-open if you feel that is not appropriate.
thank you!
Describe the bug
We just installed Istio 1.0.4 using the Helm chart instructions on k8s 1.11.4 and it has been running for several days. We also hooked it up to Datadog metric collection using their instructions. Metrics were coming through OK for a couple of days and then all of a sudden stopped. Looking into why, it appears the telemetry service is returning dozens of the below error about duplicate labels. The only place I have found such an error so far is at https://github.com/prometheus/node_exporter/blob/f9dd8e9b8c29f6c9da676036d8a8c587326bb710/vendor/github.com/prometheus/client_golang/prometheus/registry.go#L845
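For context, here is a simplified sketch of the kind of duplicate-label-name check that the linked registry.go code performs while gathering metrics. This is only an illustration of why the error fires, not the actual client_golang implementation, and the label name used is made up.

```go
package main

import (
	"fmt"

	dto "github.com/prometheus/client_model/go"
)

// checkLabelNames rejects a gathered metric whose label names repeat,
// mirroring (in spirit) the consistency check linked above.
func checkLabelNames(m *dto.Metric) error {
	seen := make(map[string]struct{}, len(m.GetLabel()))
	for _, lp := range m.GetLabel() {
		if _, dup := seen[lp.GetName()]; dup {
			return fmt.Errorf("collected metric has two or more labels with the same name: %q", lp.GetName())
		}
		seen[lp.GetName()] = struct{}{}
	}
	return nil
}

func main() {
	name, value := "destination_service", "svc-b"
	m := &dto.Metric{Label: []*dto.LabelPair{
		{Name: &name, Value: &value},
		{Name: &name, Value: &value}, // duplicated label name
	}}
	fmt.Println(checkLabelNames(m)) // prints the duplicate-label error
}
```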
Expected behavior
HTTP 200 for metrics as per the first few days of collection.
Steps to reproduce the bug
Not totally sure how to reproduce yet; I thought I would file this now in case anyone else experiences the same issue. I'd like to help in whatever way I can to try to reproduce it.
Version
Istio 1.0.4
k8s 1.11.4
Installation
Official Helm chart
Environment
AWS, Container Linux (kube-aws installer)