Start collecting etcd metrics in our performance tests #64030

Closed
2 tasks done
shyamjvs opened this issue May 18, 2018 · 37 comments
Labels: area/etcd, sig/api-machinery, sig/scalability

Comments

@shyamjvs (Member) commented May 18, 2018:

This is a discussion that has come up a few times already in the past. It would be really helpful to have these metrics in light of the recent regressions we've seen due to etcd changes.

Etcd exposes a bunch of metrics, of which the ones that are most interesting to us IMO are:

  • backend_commit_duration_seconds
  • snapshot_save_total_duration_seconds
  • peer_round_trip_time_seconds (probably not so important for our scalability tests, which are all single-node etcd setups, except the gke-regional-performance one)
  • anything else?

We should:

  • collect those metrics through our test framework (similar to how we already do it for the apiserver) - see the sketch below
  • plot those metrics on our performance dashboard
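
For concreteness, here is a rough sketch (not the actual e2e-framework code) of what the collection step could look like: scrape etcd's /metrics endpoint and keep only the metric families we care about, using the Prometheus expfmt parser. The ETCD_METRICS_URL variable and the exact family names are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"net/http"
	"os"

	"github.com/prometheus/common/expfmt"
)

func main() {
	// Placeholder endpoint; in the real tests this would be etcd's
	// client URL on the master, e.g. http://127.0.0.1:2379/metrics.
	url := os.Getenv("ETCD_METRICS_URL")

	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Parse the Prometheus text exposition format into metric families.
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		panic(err)
	}

	// The families discussed in this issue (the peer RTT name is assumed
	// to carry the etcd_network_ prefix in the full exposition).
	interesting := []string{
		"etcd_disk_backend_commit_duration_seconds",
		"etcd_debugging_snap_save_total_duration_seconds",
		"etcd_disk_wal_fsync_duration_seconds",
		"etcd_network_peer_round_trip_time_seconds",
	}
	for _, name := range interesting {
		if mf, ok := families[name]; ok {
			// In the test framework this would be written into the
			// artifacts directory rather than printed.
			fmt.Println(mf.String())
		}
	}
}
```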

cc @kubernetes/sig-scalability-misc @kubernetes/sig-api-machinery-misc @wojtek-t @jpbetz @jberkus

@k8s-ci-robot added the sig/scalability and sig/api-machinery labels May 18, 2018
@shyamjvs (Member, Author):

/assign @krzysied
Who seemed interested in taking this up.

@jberkus commented May 18, 2018:

Other metrics I'd like to track:

  • wal_fsync_duration_seconds: if at some scalability level disk contention is a factor, it would be nice to see it in the graphs.
  • proposals_committed_total: we really need a count of the number of writes being done to Etcd.
  • proposals_pending: we also want to see if etcd requests start queueing up.

I don't see any etcd metric for the average size of a change. I think this is possibly significant; I can definitely imagine that some recent patches which make object definitions longer on average could affect overall latency. Bytes are bytes. I'll see if I can track down an etcd engineer to ask about this.

@jberkus commented May 18, 2018:

Ah, here we go:

  • client_grpc_sent_bytes_total
  • client_grpc_received_bytes_total

Those give us request size information.

@jennybuckley:

/cc @jpbetz @wenjiaswe

@jpbetz (Contributor) commented May 24, 2018:

etcd 3.1+ emits https://github.com/grpc-ecosystem/go-grpc-prometheus metrics when the --metrics extensive flag is set. grpc_server_started_total and similar metrics might be helpful for answering questions about load (in conjunction with the client_grpc_*_bytes_total metrics @jberkus already mentioned above).
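
Purely for illustration (this is the generic wiring pattern from the go-grpc-prometheus README, not etcd's actual server code), a gRPC server instrumented this way is what produces grpc_server_started_total and friends; enabling the handling-time histogram is roughly what the extensive mode adds on top of the defaults. Ports and service registration are placeholders.

```go
package main

import (
	"net"
	"net/http"

	grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"google.golang.org/grpc"
)

func main() {
	// Instrument every unary and streaming RPC with Prometheus counters.
	server := grpc.NewServer(
		grpc.UnaryInterceptor(grpc_prometheus.UnaryServerInterceptor),
		grpc.StreamInterceptor(grpc_prometheus.StreamServerInterceptor),
	)
	// ...register gRPC services on `server` here...

	grpc_prometheus.Register(server)              // initializes grpc_server_started_total etc.
	grpc_prometheus.EnableHandlingTimeHistogram() // per-RPC latency histograms ("extensive"-style detail)

	go func() {
		// Expose everything on a /metrics endpoint.
		http.Handle("/metrics", promhttp.Handler())
		http.ListenAndServe(":9090", nil) // hypothetical metrics port
	}()

	lis, _ := net.Listen("tcp", ":50051") // hypothetical gRPC port
	server.Serve(lis)
}
```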

@shyamjvs (Member, Author) commented May 29, 2018:

Thanks @jberkus and @jpbetz - we'll keep those points in mind.

So one question that popped up while @krzysied was looking into it is: how do we reset the etcd metrics (so that our e2e tests are measured independently)?
It seems to me like we have a few options here:

  • Restart etcd between tests (which will automatically reset the metrics)
  • Measure the values of the metrics after test-1 and subtract them from the values after test-2 (this should work, with the metrics mostly being histograms and counters)
  • Change upstream etcd to allow DELETE operation on the metrics handler
  • Not care about resetting at all

I'm inclined towards 1 or 3, but maybe someone has different ideas/concerns? Or maybe there's some totally different approach here?
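
For reference, option 2 boils down to diffing monotonic values: counters and histogram bucket counts only grow, so snapshotting the raw samples after each test and subtracting consecutive snapshots yields per-test numbers without restarting etcd. A minimal sketch, where the Snapshot type and sample keys are made up for illustration:

```go
package main

import "fmt"

// Snapshot maps a metric sample key (name plus labels, e.g.
// `etcd_disk_backend_commit_duration_seconds_bucket{le="0.004"}`) to its value.
type Snapshot map[string]float64

// delta returns after-minus-before for every sample in the later snapshot.
func delta(before, after Snapshot) Snapshot {
	d := Snapshot{}
	for key, afterVal := range after {
		if beforeVal, ok := before[key]; ok {
			d[key] = afterVal - beforeVal
		} else {
			d[key] = afterVal // sample appeared between the two scrapes
		}
	}
	return d
}

func main() {
	afterTest1 := Snapshot{"etcd_disk_wal_fsync_duration_seconds_count": 1200}
	afterTest2 := Snapshot{"etcd_disk_wal_fsync_duration_seconds_count": 1950}
	fmt.Println(delta(afterTest1, afterTest2)) // map[etcd_disk_wal_fsync_duration_seconds_count:750]
}
```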

@jpbetz (Contributor) commented May 29, 2018:

I'm in favor of restarting etcd between tests. We arguably should do this anyhow for control purposes.

@wojtek-t (Member):

There is a significant drawback to the "restart etcd" approach, which is: it's hard to do in managed environments (such as GKE) [though arguably, neither 1 nor 3 may be possible there...]

@shyamjvs (Member, Author):

I agree with your comment @wojtek-t. However, I think for testing purposes it might be OK to restart it (we just want to measure etcd performance in our tests). Wdyt?

@wojtek-t (Member):

> I agree with your comment @wojtek-t. However, I think for testing purposes it might be OK to restart it (we just want to measure etcd performance in our tests). Wdyt?

I think I wasn't clear - I'm not opposed to restarting etcd itself.
I was just trying to say that in managed solutions (like GKE) you may not be able (may not have enough permissions) to restart etcd, which means that you won't be able to run those tests the same way on GKE (and other managed solutions).

@shyamjvs (Member, Author):

Yeah, that's a good point that this won't work on managed solutions. That said, from my understanding we have an even bigger problem there with collecting the metrics at all (we can neither proxy to the pod, because the master isn't registered, nor SSH to the master) - it's possible that we may not even be able to access those from user space.

@wojtek-t (Member):

Yes - that's why I wrote " [though arguably, neither 1 nor 3 may be possible there...]" above :)

@shyamjvs (Member, Author):

Well.. then also 2, no? :)

@krzysied (Contributor) commented Jun 4, 2018:

As you can see above, I've created a PR for etcd metrics. However, I think we should discuss which metrics we really want to collect. Currently I've added support only for etcd_disk_backend_commit_duration_seconds, etcd_debugging_snap_save_total_duration_seconds, etcd_disk_wal_fsync_duration_seconds, and peer_round_trip_time_seconds.

  • Do we need all of those metrics?
  • Is there any other metric we want to collect?

@shyamjvs (Member, Author) commented Jun 4, 2018:

@krzysied Let's begin by collecting the ones you mentioned (I believe they're the key ones). We can think about extending to further metrics later.

k8s-github-robot pushed a commit that referenced this issue Jul 10, 2018
Automatic merge from submit-queue (batch tested with PRs 64695, 65982, 65908). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Collecting etcd metrics

**What this PR does / why we need it**:
Adding etcd metrics to performance test log.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
ref #64030

**Special notes for your reviewer**:

**Release note**:

```release-note
NONE
```
@shyamjvs (Member, Author):

@krzysied Your PR #64695 is merged, but it seems like we're still not collecting the etcd metrics (i.e. I don't see an EtcdMetrics file in the artifacts). Could you PTAL and make sure it works?

@krzysied (Contributor):

I see there is a problem. I'm working on it.
I'll write back as soon as I find the cause.

@krzysied (Contributor):

I've found out why there are no etcd metrics collected: none of our executed runs contain my PR. I verified that etcd metrics are collected during presubmit checks, so it should work for our normal test runs.

The issue here is that, for some reason, all of the recent test runs are using kubernetes commit 7bf40d2 (which at this point is 2 days old). We should investigate why the newest ones aren't used.

@shyamjvs (Member, Author):

The commit seems to have been stuck for our gce-100 job for a long time. @krzyzacy @BenTheElder Any idea what's happening?

@krzysied (Contributor):

cc @kubernetes/test-infra-maintainers

@BenTheElder (Member):

The build job is failing, so it's getting the last good build. No idea why we didn't get an alert yet, investigating. https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-build/41170/

@ixdy (Member) commented Jul 11, 2018:

Fix for the build job is in #66076.

@BenTheElder (Member):

The fix is merged; the next run (very soon) should pick it up and put up a new build.
I'm also following up on getting a SIG Testing alerts group and expanding our alerting.

@krzysied (Contributor):

@ixdy @BenTheElder Thanks for the fix!

@shyamjvs Etcd metrics are available in the artifacts directory now.

@shyamjvs (Member, Author):

Thanks everyone for looking into it.

@krzysied - The values seem to be rounded to integers and a bit crude. Is it possible to make them more accurate, e.g. by using smaller units or something?

@krzysied (Contributor):

@shyamjvs I'm afraid that is impossible without making changes to etcd. The problem is that we are using histograms as metrics: we have the number of samples in each bucket but not the distribution of the samples within it, so we can only calculate which bucket the x-th percentile falls into.
Also, those histogram buckets start at 1ms and grow exponentially, so 1ms is our lower limit.
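
To make the limitation concrete, here is a minimal sketch (not the framework's actual code) of what this amounts to: with only cumulative bucket counts, the best we can report for a percentile is the upper bound of the bucket it lands in. The example bucket values are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// Bucket mirrors one `..._bucket{le="X"}` sample of a Prometheus histogram:
// UpperBound is the `le` label and Count is cumulative (it includes all
// smaller buckets).
type Bucket struct {
	UpperBound float64 // seconds
	Count      uint64
}

// percentileUpperBound returns the upper bound of the bucket that contains
// the q-th quantile (0 < q <= 1) - the best estimate available without the
// per-sample distribution.
func percentileUpperBound(buckets []Bucket, q float64) float64 {
	total := buckets[len(buckets)-1].Count // the +Inf bucket holds the total count
	rank := uint64(q * float64(total))
	for _, b := range buckets {
		if b.Count >= rank {
			return b.UpperBound
		}
	}
	return buckets[len(buckets)-1].UpperBound
}

func main() {
	// Hypothetical cumulative counts; the real buckets start at 1ms and
	// double, which is why 1ms is the lower resolution limit.
	buckets := []Bucket{
		{0.001, 500}, {0.002, 900}, {0.004, 980}, {0.008, 998}, {math.Inf(1), 1000},
	}
	fmt.Printf("p99 <= %gs\n", percentileUpperBound(buckets, 0.99)) // p99 <= 0.008s
}
```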

@shyamjvs (Member, Author):

You're right that it's hard to do much with the percentiles. We may try to improve it a bit using linear approximations, for example, but now that I think of it, maybe it's just better to keep it as a histogram. We can try graphing the 'fraction of count in each bucket' on perf-dash instead. Wdyt?

@krzysied (Contributor):

> We may try to improve it a bit using linear approximations

The result will look better (every number with a decimal point looks more reliable), but I don't think it will give us more information about the actual distribution.

> We can try graphing the 'fraction of count in each bucket' on perf-dash instead

This one sounds really good, especially the "fraction of count" part. Plotting these graphs on perf-dash is doable.
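
A rough sketch of that transformation (reusing the Bucket shape from the earlier example; again an illustration, not the perf-dash code): cumulative bucket counts are turned into the share of samples that fell into each individual bucket, which is the per-bucket value that would be plotted.

```go
// bucketFractions converts cumulative histogram bucket counts into the
// fraction of all samples that landed in each individual bucket.
func bucketFractions(buckets []Bucket) []float64 {
	fractions := make([]float64, len(buckets))
	total := float64(buckets[len(buckets)-1].Count) // the +Inf bucket holds the total count
	prev := uint64(0)
	for i, b := range buckets {
		// Counts are cumulative, so the difference against the previous
		// bucket is the number of samples in exactly this bucket.
		fractions[i] = float64(b.Count-prev) / total
		prev = b.Count
	}
	return fractions
}
```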

@shyamjvs (Member, Author):

Alright then, let's proceed that way. Btw - this approach in general would unlock a whole new class of metrics that could be plotted on perf-dash.

k8s-github-robot pushed a commit that referenced this issue Jul 16, 2018
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Collecting etcd histogram metrics

**What this PR does / why we need it**:
Changes collected etcd metrics from quantiles to histograms.
ref #64030

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:

**Special notes for your reviewer**:

**Release note**:

```release-note
NONE
```
@krzysied (Contributor):

PR for perfdash changes: kubernetes/perf-tests#160

@shyamjvs (Member, Author):

Thanks @krzysied. Once we have verified that the final graphs for the etcd metrics are available on perf-dash, feel free to close this issue.
For completeness, I'd suggest providing screenshots of the graphs.

@krzysied (Contributor):

I've just updated perf-dash. It looks like it works fine.
[screenshot of the etcd metrics graphs on perf-dash]

@stevekuznetsov (Contributor):

Why is that using an interpolated spline for the data? That's implying there are valid values for the metrics between test runs and that the values are continuous and somehow related run-to-run, but neither of those is true, right?

@shyamjvs (Member, Author):

You're right, there's no logical connection - it's just for aesthetics :)

@fejta (Contributor) commented Jul 20, 2018 via email

@stevekuznetsov (Contributor):

Not sure I understand the comment @fejta

@shyamjvs (Member, Author):

Closing this issue with both action items done. Etcd metrics are now available on http://perf-dash.k8s.io. Thanks!
/close
