Prometheus Stackdriver Sidecar error component="Prometheus reader" msg="target not found" #229

dgdevops opened this issue Apr 27, 2020 · 3 comments


@dgdevops

The Stackdriver sidecar injected into the Prometheus pod (deployed by prometheus-operator) does not send application metrics to Stackdriver. With the default sidecar configuration I can see the default metrics for alertmanager, prometheus, go, and kubedns in Stackdriver, but the application metrics are missing. With log level set to debug, the sidecar container logs constantly show errors like this:

level=debug ts=2020-04-27T08:58:48.172Z caller=series_cache.go:354 component="Prometheus reader" msg="target not found" labels="{__name__="cluster_quantile:apiserver_request_duration_seconds:histogram_quantile",component="apiserver",endpoint="https",group="scheduling.k8s.io",job="apiserver",namespace="default",quantile="0.9",resource="priorityclasses",scope="cluster",service="kubernetes",verb="WATCH",version="v1"}"

Technical details:

  • Sidecar version: 0.7.3
  • Prometheus operator version: v0.30.1
  • relabel_config is configured (multiple entries)
  • Sidecar args:
    • "--stackdriver.project-id=${GCP_PROJECT}"
    • "--prometheus.wal-directory=/prometheus/wal"
    • "--stackdriver.kubernetes.location=${GCP_REGION}"
    • "--stackdriver.kubernetes.cluster-name=${KUBE_CLUSTER}"
    • "--log.level=debug"
@arturgspb

I have the same trouble. Any ideas?

@arturgspb

I found #104 (comment)

@jinnovation

jinnovation commented Feb 8, 2022

I assume, from the colons in the name, that this is a recording rule.

My team recently sank ~3 days into debugging exactly this issue. I think our findings will be helpful here.

Basically: the sidecar makes assumptions about the relationship between scrape targets and metrics that don't universally hold when the metric in question is a recording rule.

The key here is targets/cache#targetMatch, reproduced below:

// targetMatch returns the first target in the entry that matches all labels of the input
// set iff it has them set.
// This way metric labels are skipped while consistent target labels are considered.
func targetMatch(targets []*Target, lset labels.Labels) (*Target, bool) {
Outer:
	for _, t := range targets {
		for _, tl := range t.Labels {
			if lset.Get(tl.Name) != tl.Value {
				continue Outer
			}
		}
		return t, true
	}
	return nil, false
}

Basically, for every metric, the sidecar tries to find at least one scrape target whose labels are all present, with the same values, in the metric's label set (i.e., the metric's label set is a superset of the target's). If it can't find one, the sidecar refuses to write the metric to Stackdriver, resulting in the target not found error.

With "normal" metrics, this assumption might hold just fine. However, consider a scrape target target0 that provides label set {job, instance, foo, bar}. Say a metric metric0 gets scraped from target0, and we have recording rule recorded:metric0 that "removes" foo and bar from metric0 via some "downsampling" operation like sum by (job, instance) (metroc0).

When the sidecar tries to write recorded:metric0 to Cloud Monitoring (Stackdriver), it'll comb through the set of known scrape targets, see that the label set in target0 is not fully represented in recorded:metric0, emit the target not found error, and ultimately refuse to write recorded:metric0 to Cloud Monitoring.

The solution is to make sure that your recording rule is always "associated" (in the label-set sense described above) with at least one scrape target. If that scrape target only confers the labels job and instance, then both the sidecar and Cloud Monitoring (which requires those two labels) are happy, and so are you. 😁
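To make the matching behaviour concrete, here is a minimal, self-contained Go sketch that mimics the targetMatch logic quoted above. The types (Label, labelSet, target) and label values are simplified stand-ins I made up for illustration, not the sidecar's actual code:

package main

import "fmt"

// Simplified stand-ins for the sidecar's Target and Prometheus labels.Labels types.
type Label struct{ Name, Value string }
type labelSet map[string]string
type target struct{ Labels []Label }

// targetMatch mirrors the quoted logic: a target matches only if every one of
// its labels is present with the same value in the metric's label set.
func targetMatch(targets []*target, lset labelSet) (*target, bool) {
Outer:
	for _, t := range targets {
		for _, tl := range t.Labels {
			if lset[tl.Name] != tl.Value {
				continue Outer
			}
		}
		return t, true
	}
	return nil, false
}

func main() {
	// Hypothetical scrape target exposing job, instance, foo, bar.
	target0 := &target{Labels: []Label{
		{"job", "app"}, {"instance", "10.0.0.1:9090"},
		{"foo", "x"}, {"bar", "y"},
	}}

	// Raw metric scraped from target0: carries all of the target's labels.
	raw := labelSet{"job": "app", "instance": "10.0.0.1:9090", "foo": "x", "bar": "y"}
	// Recording rule output aggregated with sum by (job, instance): foo and bar are gone.
	recorded := labelSet{"job": "app", "instance": "10.0.0.1:9090"}

	_, ok := targetMatch([]*target{target0}, raw)
	fmt.Println("raw metric matches target0:", ok) // true

	_, ok = targetMatch([]*target{target0}, recorded)
	fmt.Println("recorded metric matches target0:", ok) // false -> "target not found"

	// A target that only confers job and instance does match the recorded series.
	slim := &target{Labels: []Label{{"job", "app"}, {"instance", "10.0.0.1:9090"}}}
	_, ok = targetMatch([]*target{slim}, recorded)
	fmt.Println("recorded metric matches slim target:", ok) // true
}

The last case is the "happy" situation described above: once at least one known target's label set is fully contained in the recorded series' labels, the sidecar finds a match and the series is written.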

If any member of the sidecar maintenance team would like to chime in to confirm/refute/refine any of this, that'd be excellent. Otherwise, thanks, team, for enabling Prometheus-to-GCP metrics up to now; I look forward to migrating to GKE workload metrics as called out in #296.
