
add a metric that can be used to notice stuck worker threads #70884

Merged
merged 8 commits into kubernetes:master on Nov 13, 2018

Conversation

@lavalamp (Member) commented Nov 9, 2018

"unfinished_work_microseconds" is added to the workqueue metrics; it can be used to detect stuck worker threads. (kube-controller-manager runs many workqueues.)

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 9, 2018
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lavalamp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 9, 2018
@lavalamp (Member, Author) commented Nov 9, 2018

I can probably test this better (existing metrics don't seem to be tested at all?).

lmk if there's a problem with the approach.

/assign @mml
cc @logicalhan

unfinished := prometheus.NewGauge(prometheus.GaugeOpts{
	Subsystem: name,
	Name:      "unfinished_work_microseconds",
	Help:      "How many microseconds of work has " + name + " done that is still in progress and hasn't yet been observed by work_duration.",
})
Contributor:

What is work_duration?

Member Author:

It's a metric defined a few lines up.

@mml (Contributor) commented Nov 9, 2018

Thinking about this... I prefer having a counter that counts the number of times the thread has "finished". This lets you notice when it's stuck (and for how long), but also lets you compute rates and deltas, even if you miss some samples. And the data downsamples elegantly.

With a "microseconds since" metric, stuck is pretty easy to detect, but you can't derive rates/deltas reliably. And downsampling tends to lose information about intermediate events.

@lavalamp (Member, Author) commented Nov 9, 2018

We already track the number of completions as part of the processing duration statistic. You can't tell from that how many things haven't completed, which is what I am worried about.

The problem is we don't know how many threads there are; it's not even guaranteed to be a constant number.

You can tell how many threads are stuck by the rate at which this new metric is increasing.

@lavalamp (Member, Author) commented Nov 9, 2018

Looking for differences in the completion rate also doesn't work because the queue is probably far under capacity. If it is always processing everything that is added, then the completion rate will be the same whether it's 19 workers (1 stuck) or 20.

If it's not processing everything that's added, then there's a major problem :)

@lavalamp (Member, Author) commented Nov 9, 2018

Also, you can't look at the difference between adds and completions, because adds are de-duped in the queue. The ratio could easily be 100:1 or more and that's fine, good even, as it means less work is done.

@lavalamp (Member, Author) commented Nov 9, 2018

Hm, actually it looks like the add metric is incremented post-deduping. I'm not convinced the code is right. Maybe I'll add some more tests.

@mml (Contributor) commented Nov 9, 2018

> We already track the number of completions as part of the processing duration statistic. You can't tell from that how many things haven't completed, which is what I am worried about.

To rephrase, deltas on the duration metric would tell you if you're not doing any work, but that could be either because you're stuck or because there's nothing to do?

@logicalhan (Member)

The logic seems okay to me, but I'm not super familiar with how we consume Prometheus data, so I don't have much helpful input, unfortunately. In general it seems to make sense to me, though.

@logicalhan (Member) commented Nov 9, 2018

Re: @mml, yeah, I think the problem is that there would be no difference between not working and being stuck.

@lavalamp (Member, Author) commented Nov 9, 2018

If all threads get stuck then the metric would go to zero and you'd notice that. If only some threads get stuck, then you wouldn't necessarily notice that.

Deduping actually does appear to happen before the add metric is incremented, so in steady state the add and done metrics should be the same value. Unfortunately, if they're off by one, you can't tell whether that's because of something that got added a second ago or a week ago.

var total float64
if m.processingStartTimes != nil {
	for _, t := range m.processingStartTimes {
		total += sinceInMicroseconds(t)
	}
}
Contributor:

Ah, so this is (items * time). This is kind of a confusing unit. Don't we really just care about the oldest thing in the queue?

Member Author:

I could report that, but you wouldn't be able to tell the difference between one thing stuck and two things stuck.

Member:

I agree that the unit is a little confusing, but emitting this as raw data is probably okay. Whatever thing is responsible for scraping this data can post-process before turning it into an actual (and more useful and less confusing) metric datapoint.

Member Author:

It is a confusing unit, but I think all the things that we can do to make it less confusing are lossy (or require making assumptions about the duration of the things that are being done) and therefore best done in the monitoring system.

Contributor:

In this case, we can't tell the difference between 1 thing stuck for an hour and 2 things stuck for 30 minutes.

If we want information on both ages and the count of things at certain ages, we'd normally use a distribution/histogram (which I think is "summary" in prometheus?).

Member Author:

GitHub stuck my response in the wrong place:

If you plot this over time, the slope will tell you the difference between those scenarios.

I'm open to reporting a distribution, of course, but I don't have a single thing to observe and it wasn't clear how to get around that.
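(Concretely: one worker stuck for an hour and two workers stuck for 30 minutes each report about an hour of total unfinished work, but the first scenario grows at roughly one second of unfinished work per wall-clock second while the second grows at roughly two, so the slope approximates the number of stuck workers.)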

Member Author:

I think I might change the unit to be seconds, though. Microseconds is far more granular than necessary.

Member Author:

Unit changed, and I expanded the help string for the metric since it wasn't obvious how to use it.

@lavalamp (Member, Author)

OK, I greatly improved the testing, PTAL

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Nov 10, 2018
@lavalamp (Member, Author) commented Nov 10, 2018 via email

@logicalhan (Member)

It looks good to me. The clock injection for testing was a nice touch.

@lavalamp (Member, Author) commented Nov 10, 2018 via email

	metricsProvider: noopMetricsProvider{},
}

type metricsFactory struct {
Member:

Maybe rename to queueMetricsFactory or queueMetricsProvider? I don't know whether we're actively making a distinction between factories and providers; I find it confusing to have metricsFactory with an embedded metricsProvider.


func (f *metricsFactory) newQueueMetrics(name string, clock clock.Clock) queueMetrics {
	mp := f.metricsProvider
	if len(name) == 0 || mp == (noopMetricsProvider{}) {
Member:

Does `mp == (noopMetricsProvider{})` work? I would have thought you'd have to do a type check.

Member Author:

Surprisingly, Go does the comparison correctly (otherwise the test wouldn't pass / compile).
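A minimal standalone illustration of why that comparison compiles and behaves as expected (the interface and method here are stand-ins, not the real workqueue types):

```go
package main

import "fmt"

// Stand-in interface; the real workqueue MetricsProvider is larger.
type MetricsProvider interface {
	Name() string
}

type noopMetricsProvider struct{}

func (noopMetricsProvider) Name() string { return "noop" }

func main() {
	var mp MetricsProvider = noopMetricsProvider{}
	// Go permits comparing an interface value with a concrete value when the
	// concrete type is comparable and implements the interface; the result is
	// true when the interface's dynamic type and value match.
	fmt.Println(mp == (noopMetricsProvider{})) // prints: true
}
```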

	metricsFactory.setProviders.Do(func() {
		metricsFactory.metricsProvider = metricsProvider
	})
	globalMetricsFactory.set(metricsProvider)
Member:

Maybe rename set to setProvider, for consistency?

Member Author:

done
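As an aside for readers skimming the snippet above: the setProviders.Do(...) call is the usual sync.Once pattern for letting the provider be set only once. A generic standalone illustration of that pattern (names here are made up, not the PR's code):

```go
package main

import (
	"fmt"
	"sync"
)

type provider interface{ Describe() string }

// factory lets a provider be installed at most once.
type factory struct {
	setOnce  sync.Once
	provider provider
}

func (f *factory) setProvider(p provider) {
	// Only the first call takes effect; later calls are silently ignored.
	f.setOnce.Do(func() { f.provider = p })
}

type fakeProvider struct{ name string }

func (p fakeProvider) Describe() string { return p.name }

func main() {
	var f factory
	f.setProvider(fakeProvider{name: "first"})
	f.setProvider(fakeProvider{name: "second"}) // ignored
	fmt.Println(f.provider.Describe())          // prints: first
}
```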

@@ -64,6 +82,9 @@ type Type struct {
shuttingDown bool

metrics queueMetrics

unfinishedWorkUpdatePeriod time.Duration
Member:

Nit: we don't need to store it in the struct; we could just pass it into updateUnfinishedWorkLoop. Not sure if that makes things better or worse in your mind.

Member Author:

I think I prefer it when things are introspectable after the fact, although it doesn't matter in this case.

@@ -170,3 +191,22 @@ func (q *Type) ShuttingDown() bool {

return q.shuttingDown
}

func (q *Type) updateUnfinishedWorkLoop() {
t := q.clock.NewTicker(q.unfinishedWorkUpdatePeriod)
Member:

I'd probably add a comment here noting that we do this because it's not safe to update metrics without holding the q.cond.L lock.

Member Author:

Added a note to defaultQueueMetrics.
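For context, roughly what the loop under discussion looks like (a simplified sketch of the approach, not the exact merged code):

```go
// updateUnfinishedWorkLoop periodically refreshes the unfinished-work gauge,
// taking q.cond.L because the metrics bookkeeping isn't safe to touch without
// it, and exits once the queue shuts down.
func (q *Type) updateUnfinishedWorkLoop() {
	t := q.clock.NewTicker(q.unfinishedWorkUpdatePeriod)
	defer t.Stop()
	for range t.C() {
		q.cond.L.Lock()
		shuttingDown := q.shuttingDown
		if !shuttingDown {
			q.metrics.updateUnfinishedWork()
		}
		q.cond.L.Unlock()
		if shuttingDown {
			return
		}
	}
}
```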

@justinsb (Member)

A few code comments. I like the metric, though: although we're packing a few failure modes into one metric, I think it gives us signal where we didn't have signal before. If we need to add more metrics to disambiguate later, we can do so. (Do we have any sort of guarantee on metric compatibility?)

@lavalamp (Member, Author)

PTAL

I'm not sure whether we've set a metrics deprecation policy or not; I think 2 releases / 6 months is probably the minimum reasonable deprecation time.

change units to seconds
@lavalamp (Member, Author)

/retest

@lavalamp lavalamp added this to the v1.13 milestone Nov 11, 2018
@lavalamp (Member, Author)

/retest

@lavalamp (Member, Author)

Added a second metric for @mml.

@lavalamp (Member, Author)

Can I get an LGTM so I can squash? :)

func (prometheusMetricsProvider) NewLongestRunningProcessorMicrosecondsMetric(name string) workqueue.SettableGaugeMetric {
	unfinished := prometheus.NewGauge(prometheus.GaugeOpts{
		Subsystem: name,
		Name:      "longest_running_procesor_microseconds",
Member:

*processor

@logicalhan (Member) left a comment

/LGTM

@k8s-ci-robot (Contributor)

@logicalhan: changing LGTM is restricted to assignees, and only kubernetes/kubernetes repo collaborators may be assigned issues.

In response to this:

/LGTM

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@deads2k (Contributor) commented Nov 13, 2018

@p0lyn0mial fyi. This sort of thing would also be useful in controllers and perhaps in the server (if there's a spot we missed)

@deads2k (Contributor) commented Nov 13, 2018

> PTAL
>
> I'm not sure whether we've set a metrics deprecation policy or not; I think 2 releases / 6 months is probably the minimum reasonable deprecation time.

I think it depends. We try to be polite, but I don't think we've ever enforced a hard rule on metric naming and the like.

	}
	// Convert to seconds; microseconds is unhelpfully granular for this.
	total /= 1000000
	m.unfinishedWorkSeconds.Set(total)
Contributor:

Why not just "Set(total / 1000000)"?

Member Author:

Not sure; if I make another change I'll update this, otherwise it doesn't seem worth it :)

Thanks for the review!

@mml (Contributor) commented Nov 13, 2018

/lgtm

Thanks.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 13, 2018
@lavalamp lavalamp added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Nov 13, 2018
@k8s-ci-robot k8s-ci-robot removed the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Nov 13, 2018
@k8s-ci-robot k8s-ci-robot merged commit bc6aee1 into kubernetes:master Nov 13, 2018