
Add flag to throttle ingestion requests per minute #259
Open
sagar-infinitus-ai opened this issue Dec 15, 2020 · 0 comments

@sagar-infinitus-ai

We had an incident (which seemed to coincide with the recent Google outage on Dec 14) where, for a period of about 50 minutes, ALL requests to MetricService.CreateTimeSeries were failing.

When the API eventually recovered, the Stackdriver sidecar attempted to send all of the outstanding data, hitting the quota limit for time series ingestion requests per minute.

Once this quota was hit, the sidecar was never able to recover. Eventually, the Stackdriver container effectively stopped working (high CPU usage; statusz not responding). The final few log messages kept repeating:

QueueManager.updateShardsLoop
"Currently resharding, skipping"
QueueManager.calculateDesiredShards

At this point, there was no option other than to restart the whole pod (prometheus-server + stackdriver).

Is there anything we're missing? Is this situation recoverable other than by restarting the pod (and losing all unsent metrics)?
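
For reference, one possible shape for the requested feature would be a flag that caps how many CreateTimeSeries requests the sidecar issues per minute, so that a backlog drains at a controlled rate after an outage instead of immediately exhausting the quota. Below is a minimal Go sketch using golang.org/x/time/rate; the flag name, default value, and wiring are assumptions for illustration only, not the sidecar's actual code.

```go
// Hypothetical sketch of a per-minute throttle around CreateTimeSeries calls.
// Flag name and default are assumptions, not an existing sidecar option.
package main

import (
	"context"
	"flag"
	"log"
	"time"

	"golang.org/x/time/rate"
)

var ingestionRequestsPerMinute = flag.Int("ingestion-requests-per-minute", 6000,
	"Maximum CreateTimeSeries requests per minute (0 = unlimited). Hypothetical flag.")

func main() {
	flag.Parse()

	var limiter *rate.Limiter
	if *ingestionRequestsPerMinute > 0 {
		// Spread the per-minute budget evenly over time, allowing a small burst.
		limiter = rate.NewLimiter(
			rate.Every(time.Minute/time.Duration(*ingestionRequestsPerMinute)), 10)
	}

	send := func(ctx context.Context) error {
		if limiter != nil {
			// Block until the limiter allows another request, so a backlog
			// drains at a controlled rate instead of slamming the quota.
			if err := limiter.Wait(ctx); err != nil {
				return err
			}
		}
		// ... call MetricService.CreateTimeSeries here ...
		return nil
	}

	if err := send(context.Background()); err != nil {
		log.Fatal(err)
	}
}
```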
