Prometheus-operator stackdriver sidecar sharding events #233

Open
dgdevops opened this issue Apr 30, 2020 · 3 comments
dgdevops commented Apr 30, 2020

I am using ServiceMonitor k8s resources to add targets to Prometheus.
Metrics keep arriving in Stackdriver from the sidecar until I add a ServiceMonitor to my k8s cluster that adds 220 targets to my Prometheus. Once those targets come up, ALL metrics in Stackdriver stop at the same time and no new metric values appear. Based on the sidecar container logs, shard calculation takes place:

level=debug ts=2020-04-30T08:51:20.975Z caller=queue_manager.go:317 component=queue_manager msg=QueueManager.updateShardsLoop lowerBound=0.7 desiredShards=9.107276804519778e-05 upperBound=1.1
level=debug ts=2020-04-30T08:51:35.975Z caller=queue_manager.go:306 component=queue_manager msg=QueueManager.calculateDesiredShards samplesIn=0.028438730446968884 samplesOut=0.035548413058711106 samplesOutDuration=27897.643824423412 timePerSample=784778.8810810816 sizeRate=70059.18401954918 offsetRate=260863.64812517414 desiredShards=7.020667105478262e-05
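
For context on the sharding question, here is a rough sketch of how Prometheus-derived queue managers turn these logged rates into a shard count. It is an approximation, not the sidecar's exact code: the sidecar additionally feeds the WAL sizeRate and offsetRate into its estimate, and the function names and constants below are illustrative only.

// Illustrative sketch only: roughly how a Prometheus-style queue manager
// turns the rates in the log lines above into a shard count. The real
// sidecar also folds the WAL sizeRate/offsetRate into its estimate, so its
// exact formula differs.
package main

import "fmt"

func desiredShards(samplesIn, samplesOut, samplesOutDuration float64, currentShards int) float64 {
	if samplesOut <= 0 {
		return float64(currentShards) // nothing sent recently; keep the current count
	}
	// Average time spent sending one sample, in nanoseconds
	// (e.g. 27897.64 / 0.0355484 ≈ 784778 in the first log line).
	timePerSample := samplesOutDuration / samplesOut
	// Shards needed to keep up with the inbound rate (1e9 ns per second).
	return timePerSample * samplesIn / 1e9
}

func main() {
	current := 1
	desired := desiredShards(0.0284387, 0.0355484, 27897.64, current)
	// Resharding only happens when the desired value falls outside roughly
	// [0.7, 1.1] x the current shard count, and never below one shard, so a
	// tiny desiredShards like the ones logged simply leaves one shard running.
	fmt.Printf("desired=%g current=%d\n", desired, current)
}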

These shard-calculation messages keep appearing for hours and hours, but the metrics do not return to Stackdriver.
Could you please help me understand the sharding?
Additionally, how could I speed up the process?

Thanks


jmacd commented Feb 1, 2021

I strongly suspect this is caused by particular data points producing an unrecoverable error that looks recoverable. Explaining it requires some kind of request that never succeeds, but when that happens the sidecar logic absolutely can fall into a permanent retry loop and block the WAL reader. This is documented in the downstream repository:

lightstep/opentelemetry-prometheus-sidecar#88

It is also partly mitigated by:

https://github.com/lightstep/opentelemetry-prometheus-sidecar/pull/87
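
To illustrate what an error that "looks recoverable" means here, the sketch below shows the kind of gRPC status-code classification a remote-write client typically applies. It is a hypothetical example, not the sidecar's actual StorageClient code, and the set of codes is an assumption: if the backend keeps rejecting the same batch with one of the codes treated as transient, the send loop retries that identical request indefinitely.

// Hypothetical sketch of a recoverable-error classifier in the style of
// gRPC-based remote-write clients. The sidecar's real classification may
// use a different set of status codes.
package sidecar

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// recoverableError marks errors that the send loop should retry with backoff.
type recoverableError struct{ error }

func classify(err error) error {
	if err == nil {
		return nil
	}
	switch status.Code(err) {
	case codes.Unavailable, codes.DeadlineExceeded, codes.ResourceExhausted:
		// Treated as transient and retried with backoff. If the backend
		// answers with one of these codes on every attempt for a bad batch,
		// the retry loop never exits and the WAL reader stays blocked.
		return recoverableError{err}
	default:
		// Treated as permanent: logged once and the batch is dropped.
		return err
	}
}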


jmacd commented Feb 1, 2021

This is the function that never returns:

// sendSamples to the remote storage with backoff for recoverable errors.
func (s *shardCollection) sendSamplesWithBackoff(client StorageClient, samples []*monitoring_pb.TimeSeries) {
	backoff := s.qm.cfg.MinBackoff
	// Retry loop: it exits only when the send succeeds or the error is
	// classified as unrecoverable. A request that keeps failing with a
	// "recoverable" error is retried forever.
	for {
		begin := time.Now()
		err := client.Store(&monitoring_pb.CreateTimeSeriesRequest{TimeSeries: samples})

		sentBatchDuration.WithLabelValues(s.qm.queueName).Observe(time.Since(begin).Seconds())
		if err == nil {
			succeededSamplesTotal.WithLabelValues(s.qm.queueName).Add(float64(len(samples)))
			return
		}

		if _, ok := err.(recoverableError); !ok {
			level.Warn(s.qm.logger).Log("msg", "Unrecoverable error sending samples to remote storage", "err", err)
			break
		}
		// Exponential backoff, capped at MaxBackoff; there is no limit on the
		// number of attempts or on total elapsed time.
		time.Sleep(time.Duration(backoff))
		backoff = backoff * 2
		if backoff > s.qm.cfg.MaxBackoff {
			backoff = s.qm.cfg.MaxBackoff
		}
	}

	failedSamplesTotal.WithLabelValues(s.qm.queueName).Add(float64(len(samples)))
}
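
One way to keep a never-succeeding request from blocking the WAL reader forever is to bound the retries. The variant below is an illustrative sketch only, not the change from the pull request linked above; maxRetries is a hypothetical cap, not an existing sidecar option.

// Illustrative variant (not the sidecar's actual fix): give up after a bounded
// number of attempts so a permanently failing batch cannot block the WAL
// reader indefinitely. maxRetries is hypothetical, not a real config field.
func (s *shardCollection) sendSamplesWithBoundedBackoff(client StorageClient, samples []*monitoring_pb.TimeSeries) {
	const maxRetries = 10 // hypothetical cap on attempts
	backoff := s.qm.cfg.MinBackoff
	for attempt := 0; attempt < maxRetries; attempt++ {
		err := client.Store(&monitoring_pb.CreateTimeSeriesRequest{TimeSeries: samples})
		if err == nil {
			succeededSamplesTotal.WithLabelValues(s.qm.queueName).Add(float64(len(samples)))
			return
		}
		if _, ok := err.(recoverableError); !ok {
			level.Warn(s.qm.logger).Log("msg", "Unrecoverable error sending samples to remote storage", "err", err)
			break
		}
		time.Sleep(time.Duration(backoff))
		backoff = backoff * 2
		if backoff > s.qm.cfg.MaxBackoff {
			backoff = s.qm.cfg.MaxBackoff
		}
	}
	// Either an unrecoverable error or the retry limit was hit: count the
	// batch as failed and move on instead of blocking the WAL reader.
	failedSamplesTotal.WithLabelValues(s.qm.queueName).Add(float64(len(samples)))
}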

@varun-krishna

I see the same behaviour, with the same messages from the Stackdriver sidecar:

level=debug ts=2021-02-09T07:25:54.294Z caller=queue_manager.go:306 component=queue_manager msg=QueueManager.calculateDesiredShards samplesIn=0.00173154100250915 samplesOut=0.00173154100250915 samplesOutDuration=5557.854004867412 timePerSample=3.2097732579324483e+06 sizeRate=4890.771316463715 offsetRate=2.134860677902194 desiredShards=0.019098805764792753

level=debug ts=2021-02-09T07:25:54.294Z caller=queue_manager.go:317 component=queue_manager msg=QueueManager.updateShardsLoop lowerBound=0.7 desiredShards=0.019098805764792753 upperBound=1.1
