
add a metric that can be used to notice stuck worker threads #70884

Merged
merged 8 commits into kubernetes:master on Nov 13, 2018

Conversation

@lavalamp (Member) commented Nov 9, 2018

"unfinished_work_microseconds" is added to the workqueue metrics; it can be used to detect stuck worker threads. (kube-controller-manager runs many workqueues.)

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 9, 2018
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lavalamp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 9, 2018
@lavalamp (Member, Author) commented Nov 9, 2018

I can probably test this better (existing metrics don't seem to be tested at all?).

lmk if there's a problem with the approach.

/assign @mml
cc @logicalhan

unfinished := prometheus.NewGauge(prometheus.GaugeOpts{
	Subsystem: name,
	Name:      "unfinished_work_microseconds",
	Help:      "How many microseconds of work has " + name + " done that is still in progress and hasn't yet been observed by work_duration.",
})
Contributor:

What is work_duration?

Member Author:

It's a metric defined a few lines up.

@mml (Contributor) commented Nov 9, 2018

Thinking about this... I prefer having a counter that counts the number of times the thread has "finished". This lets you notice when it's stuck (and for how long), but also lets you compute rates and deltas, even if you miss some samples. And the data downsamples elegantly.

With a "microseconds since" metric, stuck is pretty easy to detect, but you can't derive rates/deltas reliably. And downsampling tends to lose information about intermediate events.

@lavalamp (Member, Author) commented Nov 9, 2018

We already track the number of completions as part of the processing duration statistic. You can't tell from that how many things haven't completed, which is what I am worried about.

The problem is we don't know how many threads there are; it's not even guaranteed to be a constant number.

You can tell how many threads are stuck by the rate at which this new metric is increasing.

@lavalamp (Member, Author) commented Nov 9, 2018

Looking for differences in the completion rate also doesn't work because the queue is probably far under capacity. If it is always processing everything that is added, then the completion rate will be the same whether it's 19 workers (1 stuck) or 20.

If it's not processing everything that's added, then there's a major problem :)

@lavalamp (Member, Author) commented Nov 9, 2018

Also, you can't look at the difference between adds and completions, because adds are de-duped in the queue. The ratio could easily be 100:1 or more and that's fine, good even, as it means less work is done.

@lavalamp (Member, Author) commented Nov 9, 2018

Hm, actually it looks like the add metric is incremented post-deduping. I'm not convinced the code is right. Maybe I'll add some more tests.

@mml (Contributor) commented Nov 9, 2018

> We already track the number of completions as part of the processing duration statistic. You can't tell from that how many things haven't completed, which is what I am worried about.

To rephrase, deltas on the duration metric would tell you if you're not doing any work, but that could be either because you're stuck or because there's nothing to do?

@logicalhan (Member)

The logic seems okay to me, but I'm not super familiar with how we consume Prometheus data, so I don't have much helpful input, unfortunately. In general it seems to make sense to me, though.

@logicalhan (Member) commented Nov 9, 2018

Re: @mml, yeah, I think the problem is that there would be no difference between not working and being stuck.

@lavalamp (Member, Author) commented Nov 9, 2018

If all threads get stuck then the metric would go to zero and you'd notice that. If only some threads get stuck, then you wouldn't necessarily notice that.

Deduping actually does appear to happen before the add metric is incremented, so in steady state the add and done metrics should be the same value. Unfortunately, if they're off by one, you can't tell whether that's because of something that got added a second ago or a week ago.

var total float64
if m.processingStartTimes != nil {
	for _, t := range m.processingStartTimes {
		total += sinceInMicroseconds(t)
	}
}
Contributor:

Ah, so this is (items * time). This is kind of a confusing unit. Don't we really just care about the oldest thing in the queue?

Member Author:

I could report that, but you wouldn't be able to tell the difference between one thing stuck and two things stuck.

Member:

I agree that the unit is a little confusing, but emitting this as raw data is probably okay. Whatever thing is responsible for scraping this data can post-process before turning it into an actual (and more useful and less confusing) metric datapoint.

Member Author:

It is a confusing unit, but I think all the things that we can do to make it less confusing are lossy (or require making assumptions about the duration of the things that are being done) and therefore best done in the monitoring system.

Contributor:

In this case, we can't tell the difference between 1 thing stuck for an hour and 2 things stuck for 30 minutes.

If we want information on both ages and the count of things at certain ages, we'd normally use a distribution/histogram (which I think is "summary" in prometheus?).

Member Author:

GitHub stuck my response in the wrong place:

If you plot this over time, the slope will tell you the difference between those scenarios.

I'm open to reporting a distribution, of course, but I don't have a single thing to observe and it wasn't clear how to get around that.
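(Concretely: one worker stuck for an hour and two workers stuck for 30 minutes each report about an hour of total unfinished work, but the first scenario grows at roughly one second of unfinished work per wall-clock second while the second grows at roughly two, so the slope approximates the number of stuck workers.)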

Member Author:

I think I might change the unit to be seconds, though. Microseconds is far more granular than necessary.

Member Author:

Unit changed, and I expanded the help string for the metric since it wasn't obvious how to use it.

@lavalamp (Member, Author)

OK, I greatly improved the testing, PTAL

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Nov 10, 2018
@lavalamp (Member, Author) commented Nov 10, 2018 via email

@logicalhan (Member)

It looks good to me. The clock injection for testing was a nice touch.

@lavalamp (Member, Author) commented Nov 10, 2018 via email

	metricsProvider: noopMetricsProvider{},
}

type metricsFactory struct {
Member:

Maybe rename to queueMetricsFactory or queueMetricsProvider? I don't know whether we're actively making a distinction between factories and providers; I find it confusing to have metricsFactory with an embedded metricsProvider.


func (f *metricsFactory) newQueueMetrics(name string, clock clock.Clock) queueMetrics {
	mp := f.metricsProvider
	if len(name) == 0 || mp == (noopMetricsProvider{}) {
Member:

Does `mp == (noopMetricsProvider{})` work? I would have thought you'd have to do a type check.

Member Author:

Surprisingly, Go does the comparison correctly (otherwise the test wouldn't pass / compile).
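A minimal standalone illustration of why that comparison compiles and behaves as expected (the interface and method here are stand-ins, not the real workqueue types):

```go
package main

import "fmt"

// Stand-in interface; the real workqueue MetricsProvider is larger.
type MetricsProvider interface {
	Name() string
}

type noopMetricsProvider struct{}

func (noopMetricsProvider) Name() string { return "noop" }

func main() {
	var mp MetricsProvider = noopMetricsProvider{}
	// Go permits comparing an interface value with a concrete value when the
	// concrete type is comparable and implements the interface; the result is
	// true when the interface's dynamic type and value match.
	fmt.Println(mp == (noopMetricsProvider{})) // prints: true
}
```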

	metricsFactory.setProviders.Do(func() {
		metricsFactory.metricsProvider = metricsProvider
	})
	globalMetricsFactory.set(metricsProvider)
Member:

Maybe rename set to setProvider, for consistency?

Member Author:

done
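As an aside for readers skimming the snippet above: the setProviders.Do(...) call is the usual sync.Once pattern for letting the provider be set only once. A generic standalone illustration of that pattern (names here are made up, not the PR's code):

```go
package main

import (
	"fmt"
	"sync"
)

type provider interface{ Describe() string }

// factory lets a provider be installed at most once.
type factory struct {
	setOnce  sync.Once
	provider provider
}

func (f *factory) setProvider(p provider) {
	// Only the first call takes effect; later calls are silently ignored.
	f.setOnce.Do(func() { f.provider = p })
}

type fakeProvider struct{ name string }

func (p fakeProvider) Describe() string { return p.name }

func main() {
	var f factory
	f.setProvider(fakeProvider{name: "first"})
	f.setProvider(fakeProvider{name: "second"}) // ignored
	fmt.Println(f.provider.Describe())          // prints: first
}
```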

@@ -64,6 +82,9 @@ type Type struct {
shuttingDown bool

metrics queueMetrics

unfinishedWorkUpdatePeriod time.Duration
Member:

Nit: we don't need to store it in the struct; we could just pass it into updateUnfinishedWorkLoop. Not sure if that makes things better or worse in your mind.

Member Author:

I think I prefer it when things are introspectable after the fact, although it doesn't matter in this case.

@@ -170,3 +191,22 @@ func (q *Type) ShuttingDown() bool {

return q.shuttingDown
}

func (q *Type) updateUnfinishedWorkLoop() {
t := q.clock.NewTicker(q.unfinishedWorkUpdatePeriod)
Member:

I'd probably add a comment here noting that we do this because it's not safe to update metrics without holding the q.cond.L lock.

Member Author:

Added a note to defaultQueueMetrics.
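For context, roughly what the loop under discussion looks like (a simplified sketch of the approach, not the exact merged code):

```go
// updateUnfinishedWorkLoop periodically refreshes the unfinished-work gauge,
// taking q.cond.L because the metrics bookkeeping isn't safe to touch without
// it, and exits once the queue shuts down.
func (q *Type) updateUnfinishedWorkLoop() {
	t := q.clock.NewTicker(q.unfinishedWorkUpdatePeriod)
	defer t.Stop()
	for range t.C() {
		q.cond.L.Lock()
		shuttingDown := q.shuttingDown
		if !shuttingDown {
			q.metrics.updateUnfinishedWork()
		}
		q.cond.L.Unlock()
		if shuttingDown {
			return
		}
	}
}
```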

@justinsb (Member)

A few code comments. I like the metric, though: although we're packing a few failure modes into one metric, I think it gives us signal where we didn't have signal before. If we need to add more metrics to disambiguate later, we can do so. (Do we have any sort of guarantee on metric compatibility?)

@lavalamp (Member, Author)

PTAL

I'm not sure whether we've set a metrics deprecation policy or not; I think 2 releases / 6 months is probably the minimum reasonable deprecation time.

change units to seconds
@lavalamp (Member, Author)

/retest

@lavalamp lavalamp added this to the v1.13 milestone Nov 11, 2018
@lavalamp (Member, Author)

/retest

@lavalamp (Member, Author)

Added a second metric for @mml.

@lavalamp (Member, Author)

Can I get an LGTM so I can squash? :)

func (prometheusMetricsProvider) NewLongestRunningProcessorMicrosecondsMetric(name string) workqueue.SettableGaugeMetric {
	unfinished := prometheus.NewGauge(prometheus.GaugeOpts{
		Subsystem: name,
		Name:      "longest_running_procesor_microseconds",
Member:

*processor

@logicalhan (Member) left a comment

/LGTM

@k8s-ci-robot (Contributor)

@logicalhan: changing LGTM is restricted to assignees, and only kubernetes/kubernetes repo collaborators may be assigned issues.

In response to this:

/LGTM

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@deads2k (Contributor) commented Nov 13, 2018

@p0lyn0mial fyi. This sort of thing would also be useful in controllers and perhaps in the server (if there's a spot we missed)

@deads2k (Contributor) commented Nov 13, 2018

> PTAL
>
> I'm not sure whether we've set a metrics deprecation policy or not; I think 2 releases / 6 months is probably the minimum reasonable deprecation time.

I think it depends. We try to be polite, but I don't think we've ever enforced a hard rule on metric naming and the like.

	}
	// Convert to seconds; microseconds is unhelpfully granular for this.
	total /= 1000000
	m.unfinishedWorkSeconds.Set(total)
Contributor:

Why not just "Set(total / 1000000)"?

Member Author:

Not sure; if I make another change I'll update this, otherwise it doesn't seem worth it :)

Thanks for the review!

@mml (Contributor) commented Nov 13, 2018

/lgtm

Thanks.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 13, 2018
@lavalamp lavalamp added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Nov 13, 2018
@k8s-ci-robot k8s-ci-robot removed the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Nov 13, 2018
@k8s-ci-robot k8s-ci-robot merged commit bc6aee1 into kubernetes:master Nov 13, 2018