Start collecting etcd metrics in our performance tests #64030

Closed
2 tasks done
shyamjvs opened this issue May 18, 2018 · 37 comments
Labels: area/etcd, sig/api-machinery, sig/scalability

Comments

@shyamjvs (Member) commented May 18, 2018:

This is a discussion that has come up a few times already in the past. It would be really helpful to have these metrics in light of the recent regressions we've seen due to etcd changes.

Etcd exposes a bunch of metrics, of which the ones that are most interesting to us IMO are:

  • backend_commit_duration_seconds
  • snapshot_save_total_duration_seconds
  • peer_round_trip_time_seconds (probably not so important for our scalability tests, which are all single-node etcd setups, except the gke-regional-performance one)
  • anything else?

We should:

  • collect those metrics through our test framework (similar to how we already do it for the apiserver) - see the sketch below
  • plot those metrics on our performance dashboard
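
For concreteness, here is a rough sketch (not the actual e2e-framework code) of what the collection step could look like: scrape etcd's /metrics endpoint and keep only the metric families we care about, using the Prometheus expfmt parser. The ETCD_METRICS_URL variable and the exact family names are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"net/http"
	"os"

	"github.com/prometheus/common/expfmt"
)

func main() {
	// Placeholder endpoint; in the real tests this would be etcd's
	// client URL on the master, e.g. http://127.0.0.1:2379/metrics.
	url := os.Getenv("ETCD_METRICS_URL")

	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Parse the Prometheus text exposition format into metric families.
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		panic(err)
	}

	// The families discussed in this issue (the peer RTT name is assumed
	// to carry the etcd_network_ prefix in the full exposition).
	interesting := []string{
		"etcd_disk_backend_commit_duration_seconds",
		"etcd_debugging_snap_save_total_duration_seconds",
		"etcd_disk_wal_fsync_duration_seconds",
		"etcd_network_peer_round_trip_time_seconds",
	}
	for _, name := range interesting {
		if mf, ok := families[name]; ok {
			// In the test framework this would be written into the
			// artifacts directory rather than printed.
			fmt.Println(mf.String())
		}
	}
}
```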

cc @kubernetes/sig-scalability-misc @kubernetes/sig-api-machinery-misc @wojtek-t @jpbetz @jberkus

@k8s-ci-robot added the sig/scalability and sig/api-machinery labels May 18, 2018
@shyamjvs (Member, Author):

/assign @krzysied
Who seemed interested in taking this up.

@jberkus commented May 18, 2018:

Other metrics I'd like to track:

  • wal_fsync_duration_seconds: if at some scalability level disk contention is a factor, it would be nice to see it in the graphs.
  • proposals_committed_total: we really need a count of the number of writes being done to Etcd.
  • proposals_pending: we also want to see if etcd requests start queueing up.

I don't see any etcd metric for the average size of a change. I think this is possibly significant; I can definitely imagine that some recent patches which make object definitions longer on average could affect overall latency. Bytes are bytes. I'll see if I can track down an etcd engineer to ask about this.

@jberkus commented May 18, 2018:

Ah, here we go:

  • client_grpc_sent_bytes_total
  • client_grpc_received_bytes_total

Those give us request size information.

@jennybuckley:

/cc @jpbetz @wenjiaswe

@jpbetz (Contributor) commented May 24, 2018:

etcd 3.1+ emits https://github.com/grpc-ecosystem/go-grpc-prometheus metrics when the --metrics extensive flag is set. grpc_server_started_total and similar metrics might be helpful for answering questions about load (in conjunction with the client_grpc_*_bytes_total metrics @jberkus already mentioned above).
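
Purely for illustration (this is the generic wiring pattern from the go-grpc-prometheus README, not etcd's actual server code), a gRPC server instrumented this way is what produces grpc_server_started_total and friends; enabling the handling-time histogram is roughly what the extensive mode adds on top of the defaults. Ports and service registration are placeholders.

```go
package main

import (
	"net"
	"net/http"

	grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"google.golang.org/grpc"
)

func main() {
	// Instrument every unary and streaming RPC with Prometheus counters.
	server := grpc.NewServer(
		grpc.UnaryInterceptor(grpc_prometheus.UnaryServerInterceptor),
		grpc.StreamInterceptor(grpc_prometheus.StreamServerInterceptor),
	)
	// ...register gRPC services on `server` here...

	grpc_prometheus.Register(server)              // initializes grpc_server_started_total etc.
	grpc_prometheus.EnableHandlingTimeHistogram() // per-RPC latency histograms ("extensive"-style detail)

	go func() {
		// Expose everything on a /metrics endpoint.
		http.Handle("/metrics", promhttp.Handler())
		http.ListenAndServe(":9090", nil) // hypothetical metrics port
	}()

	lis, _ := net.Listen("tcp", ":50051") // hypothetical gRPC port
	server.Serve(lis)
}
```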

@shyamjvs (Member, Author) commented May 29, 2018:

Thanks @jberkus and @jpbetz - we'll keep those points in mind.

So one question that popped up while @krzysied was looking into it is: how do we reset the etcd metrics (so that our e2e tests are measured independently)?
It seems to me like we have a few options here:

  • Restart etcd between tests (which will automatically reset the metrics)
  • Measure the values of the metrics after test-1 and subtract them from the values after test-2 (this should work, with the metrics mostly being histograms and counters)
  • Change upstream etcd to allow DELETE operation on the metrics handler
  • Not care about resetting at all

I'm inclined towards 1 or 3, but maybe someone has different ideas/concerns? Or maybe there's some totally different approach here?
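
For reference, option 2 boils down to diffing monotonic values: counters and histogram bucket counts only grow, so snapshotting the raw samples after each test and subtracting consecutive snapshots yields per-test numbers without restarting etcd. A minimal sketch, where the Snapshot type and sample keys are made up for illustration:

```go
package main

import "fmt"

// Snapshot maps a metric sample key (name plus labels, e.g.
// `etcd_disk_backend_commit_duration_seconds_bucket{le="0.004"}`) to its value.
type Snapshot map[string]float64

// delta returns after-minus-before for every sample in the later snapshot.
func delta(before, after Snapshot) Snapshot {
	d := Snapshot{}
	for key, afterVal := range after {
		if beforeVal, ok := before[key]; ok {
			d[key] = afterVal - beforeVal
		} else {
			d[key] = afterVal // sample appeared between the two scrapes
		}
	}
	return d
}

func main() {
	afterTest1 := Snapshot{"etcd_disk_wal_fsync_duration_seconds_count": 1200}
	afterTest2 := Snapshot{"etcd_disk_wal_fsync_duration_seconds_count": 1950}
	fmt.Println(delta(afterTest1, afterTest2)) // map[etcd_disk_wal_fsync_duration_seconds_count:750]
}
```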

@jpbetz (Contributor) commented May 29, 2018:

I'm in favor of restarting etcd between tests. We arguably should do this anyhow for control purposes.

@wojtek-t (Member):

There is a significant drawback to the "restart etcd" approach, which is: it's hard to do in managed environments (such as GKE) [though arguably, neither 1 nor 3 may be possible there...]

@shyamjvs (Member, Author):

I agree with your comment @wojtek-t. However, I think for testing purposes it might be OK to restart it (we just want to measure etcd performance in our tests). Wdyt?

@wojtek-t (Member):

> I agree with your comment @wojtek-t. However, I think for testing purposes it might be OK to restart it (we just want to measure etcd performance in our tests). Wdyt?

I think I wasn't clear - I'm not opposed to restarting etcd itself.
I was just trying to say that in managed solutions (like GKE) you may not be able (may not have enough permissions) to restart etcd, which means that you won't be able to run those tests the same way on GKE (and other managed solutions).

@shyamjvs (Member, Author):

Yeah, that's a good point that this won't work on managed solutions. That said, from my understanding we have an even bigger problem there with collecting the metrics at all (we can neither proxy to the pod, because the master isn't registered, nor SSH to the master) - it's possible that we may not even be able to access those from user space.

@wojtek-t (Member):

Yes - that's why I wrote " [though arguably, neither 1 nor 3 may be possible there...]" above :)

@shyamjvs (Member, Author):

Well.. then also 2, no? :)

@krzysied (Contributor) commented Jun 4, 2018:

As you can see above, I've created a PR for etcd metrics. However, I think we should discuss which metrics we really want to collect. Currently I've added support only for etcd_disk_backend_commit_duration_seconds, etcd_debugging_snap_save_total_duration_seconds, etcd_disk_wal_fsync_duration_seconds, and peer_round_trip_time_seconds.

  • Do we need all of those metrics?
  • Is there any other metric we want to collect?

@shyamjvs (Member, Author) commented Jun 4, 2018:

@krzysied Let's begin by collecting the ones you mentioned (I believe they're the key ones). We can think about extending to further metrics later.

k8s-github-robot pushed a commit that referenced this issue Jul 10, 2018
Automatic merge from submit-queue (batch tested with PRs 64695, 65982, 65908). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Collecting etcd metrics

**What this PR does / why we need it**:
Adding etcd metrics to performance test log.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
ref #64030

**Special notes for your reviewer**:

**Release note**:

```release-note
NONE
```
@shyamjvs (Member, Author):

@krzysied Your PR #64695 is merged, but it seems like we're still not collecting the etcd metrics (i.e. I don't see an EtcdMetrics file in the artifacts). Could you PTAL and make sure it works?

@krzysied (Contributor):

I see there is a problem. I'm working on it.
I'll write back as soon as I find the cause.

@krzysied (Contributor):

I've found out why there are no etcd metrics collected: none of our executed runs contain my PR. I verified that etcd metrics are collected during presubmit checks, so it should work for our normal test runs.

The issue here is that, for some reason, all of the recent test runs are using kubernetes commit 7bf40d2 (which at this point is 2 days old). We should investigate why the newest ones aren't used.

@shyamjvs (Member, Author):

The commit seems to have been stuck for our gce-100 job for a long time. @krzyzacy @BenTheElder Any idea what's happening?

@krzysied (Contributor):

cc @kubernetes/test-infra-maintainers

@BenTheElder (Member):

The build job is failing, so it's getting the last good build. No idea why we didn't get an alert yet, investigating. https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-build/41170/

@ixdy (Member) commented Jul 11, 2018:

Fix for the build job is in #66076.

@BenTheElder (Member):

The fix is merged; the next run (very soon) should pick it up and put up a new build.
I'm also following up on getting a SIG Testing alerts group and expanding our alerting.

@krzysied (Contributor):

@ixdy @BenTheElder Thanks for the fix!

@shyamjvs Etcd metrics are available in the artifacts directory now.

@shyamjvs (Member, Author):

Thanks everyone for looking into it.

@krzysied - The values seem to be rounded to integers and a bit crude. Is it possible to make them more accurate, e.g. by using smaller units or something?

@krzysied (Contributor):

@shyamjvs I'm afraid that is impossible without making changes to etcd. The problem is that we are using histograms as metrics: we have the number of samples in each bucket but not the distribution of the samples within it, so we can only calculate which bucket the x-th percentile falls into.
Also, those histogram buckets start at 1ms and grow exponentially, so 1ms is our lower limit.
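
To make the limitation concrete, here is a minimal sketch (not the framework's actual code) of what this amounts to: with only cumulative bucket counts, the best we can report for a percentile is the upper bound of the bucket it lands in. The example bucket values are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// Bucket mirrors one `..._bucket{le="X"}` sample of a Prometheus histogram:
// UpperBound is the `le` label and Count is cumulative (it includes all
// smaller buckets).
type Bucket struct {
	UpperBound float64 // seconds
	Count      uint64
}

// percentileUpperBound returns the upper bound of the bucket that contains
// the q-th quantile (0 < q <= 1) - the best estimate available without the
// per-sample distribution.
func percentileUpperBound(buckets []Bucket, q float64) float64 {
	total := buckets[len(buckets)-1].Count // the +Inf bucket holds the total count
	rank := uint64(q * float64(total))
	for _, b := range buckets {
		if b.Count >= rank {
			return b.UpperBound
		}
	}
	return buckets[len(buckets)-1].UpperBound
}

func main() {
	// Hypothetical cumulative counts; the real buckets start at 1ms and
	// double, which is why 1ms is the lower resolution limit.
	buckets := []Bucket{
		{0.001, 500}, {0.002, 900}, {0.004, 980}, {0.008, 998}, {math.Inf(1), 1000},
	}
	fmt.Printf("p99 <= %gs\n", percentileUpperBound(buckets, 0.99)) // p99 <= 0.008s
}
```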

@shyamjvs (Member, Author):

You're right that it's hard to do much with the percentiles. We may try to improve it a bit using linear approximations, for example, but now that I think of it, maybe it's just better to keep it as a histogram. We can try graphing the 'fraction of count in each bucket' on perf-dash instead. Wdyt?

@krzysied (Contributor):

> We may try to improve it a bit using linear approximations

The result will look better (every number with a decimal point looks more reliable), but I don't think it will give us more information about the actual distribution.

> We can try graphing the 'fraction of count in each bucket' on perf-dash instead

This one sounds really good, especially the "fraction of count" part. Plotting these graphs on perf-dash is doable.
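
A rough sketch of that transformation (reusing the Bucket shape from the earlier example; again an illustration, not the perf-dash code): cumulative bucket counts are turned into the share of samples that fell into each individual bucket, which is the per-bucket value that would be plotted.

```go
// bucketFractions converts cumulative histogram bucket counts into the
// fraction of all samples that landed in each individual bucket.
func bucketFractions(buckets []Bucket) []float64 {
	fractions := make([]float64, len(buckets))
	total := float64(buckets[len(buckets)-1].Count) // the +Inf bucket holds the total count
	prev := uint64(0)
	for i, b := range buckets {
		// Counts are cumulative, so the difference against the previous
		// bucket is the number of samples in exactly this bucket.
		fractions[i] = float64(b.Count-prev) / total
		prev = b.Count
	}
	return fractions
}
```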

@shyamjvs (Member, Author):

Alright then, let's proceed that way. Btw - this approach in general would unlock a whole new class of metrics that could be plotted on perf-dash.

k8s-github-robot pushed a commit that referenced this issue Jul 16, 2018
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Collecting etcd histogram metrics

**What this PR does / why we need it**:
Changes collected etcd metrics from quantiles to histograms.
ref #64030

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:

**Special notes for your reviewer**:

**Release note**:

```release-note
NONE
```
@krzysied (Contributor):

PR for perfdash changes: kubernetes/perf-tests#160

@shyamjvs (Member, Author):

Thanks @krzysied. Once we have verified that the final graphs for the etcd metrics are available on perf-dash, feel free to close this issue.
For completeness, I'd suggest providing screenshots of the graphs.

@krzysied (Contributor):

I've just updated perf-dash. It looks like it works fine.
[screenshot of the etcd metrics graphs on perf-dash]

@stevekuznetsov (Contributor):

Why is that using an interpolated spline for the data? That's implying there are valid values for the metrics between test runs and that the values are continuous and somehow related run-to-run, but neither of those is true, right?

@shyamjvs (Member, Author):

You're right, there's no logical connection - it's just for aesthetics :)

@fejta (Contributor) commented Jul 20, 2018 via email

@stevekuznetsov (Contributor):

Not sure I understand the comment @fejta

@shyamjvs (Member, Author):

Closing this issue with both action items done. Etcd metrics are now available on http://perf-dash.k8s.io. Thanks!
/close
