Start collecting etcd metrics in our performance tests #64030
/assign @krzysied
Other metrics I'd like to track:
I don't see any etcd metric for the average size of a change. I think this is potentially significant; I can definitely imagine that some recent patches which make object definitions longer on average could affect overall latency. Bytes are bytes. I'll see if I can track down an etcd engineer on this.
Ah, here we go:
Those give us request size information.
/cc @jpbetz @wenjiaswe
etcd 3.1+ emits https://github.com/grpc-ecosystem/go-grpc-prometheus metrics when the …
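For context, here is a minimal sketch of how those go-grpc-prometheus interceptors are typically wired into a gRPC server (illustrative only, not etcd's actual code; the port is hypothetical):

```go
package main

import (
	"net/http"

	grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"google.golang.org/grpc"
)

func main() {
	// Attach the interceptors so every RPC is counted in
	// grpc_server_started_total, grpc_server_handled_total, etc.
	server := grpc.NewServer(
		grpc.UnaryInterceptor(grpc_prometheus.UnaryServerInterceptor),
		grpc.StreamInterceptor(grpc_prometheus.StreamServerInterceptor),
	)
	// Optionally also record per-RPC latency histograms.
	grpc_prometheus.EnableHandlingTimeHistogram()

	// ... register gRPC services here, then initialize the metrics ...
	grpc_prometheus.Register(server)

	// Expose everything on /metrics for scraping.
	http.Handle("/metrics", promhttp.Handler())
	go func() { _ = http.ListenAndServe(":9090", nil) }() // hypothetical port
	// ... serve gRPC ...
}
```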
Thanks @jberkus and @jpbetz - we'll keep those points in mind. So one question that popped up while @krzysied was looking into it is: how do we reset the etcd metrics (so that our e2e tests are independently measured)?
I'm inclined towards 1 or 3, but maybe someone has different ideas/concerns? Or maybe there's some totally different approach here?
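As an aside, one way to get per-test numbers without resetting anything (a sketch under the assumption that the metrics endpoint is reachable; the helper names are made up for illustration) is to scrape /metrics before and after each test and diff the counters:

```go
package etcdmetrics

import (
	"fmt"
	"net/http"

	dto "github.com/prometheus/client_model/go"
	"github.com/prometheus/common/expfmt"
)

// scrape fetches and parses a Prometheus text-format /metrics endpoint.
func scrape(url string) (map[string]*dto.MetricFamily, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var parser expfmt.TextParser
	return parser.TextToMetricFamilies(resp.Body)
}

// counterDelta isolates a single test's contribution to a counter by
// subtracting a pre-test snapshot from a post-test one.
// Assumes a single, label-free series for brevity.
func counterDelta(before, after map[string]*dto.MetricFamily, name string) (float64, error) {
	b, okB := before[name]
	a, okA := after[name]
	if !okA || !okB {
		return 0, fmt.Errorf("metric %q not found in both snapshots", name)
	}
	return a.Metric[0].GetCounter().GetValue() - b.Metric[0].GetCounter().GetValue(), nil
}
```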
I'm in favor of restarting etcd between tests. We arguably should do this anyhow for control purposes.
There is a significant drawback to the "restarting etcd" approach, which is: it's hard to do in managed environments (such as GKE) [though arguably, neither 1 nor 3 may be possible there...]
I agree with your comment @wojtek-t. However, I think for testing purposes it might be OK to restart it (we just want to measure etcd performance in our tests). Wdyt?
I think I wasn't clear - I'm not opposed to restarting etcd itself. |
Yeah, that's a good point that this won't work on managed solutions. That said, from my understanding we have an even bigger problem there with collecting the metrics at all (we can neither proxy to the pod, because the master isn't registered, nor SSH to the master) - it's possible that we may not even be able to access those from user space.
Yes - that's why I wrote " [though arguably, neither 1 nor 3 may be possible there...]" above :) |
Well.. then also 2, no? :) |
As you can see above, I've created a PR for etcd metrics. However, I think we should discuss which metrics we really want to collect. Currently I've added support only for etcd_disk_backend_commit_duration_seconds, etcd_debugging_snap_save_total_duration_seconds, etcd_disk_wal_fsync_duration_seconds, and peer_round_trip_time_seconds.
@krzysied Let's begin by collecting the ones you mentioned (I believe they're the key ones). We can think about extending to further metrics later.
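For illustration, trimming a scraped dump down to that whitelist could look like the following (a sketch reusing the hypothetical `scrape` helper above, not the actual test-framework code; note the last metric's full etcd name is etcd_network_peer_round_trip_time_seconds):

```go
package etcdmetrics

import dto "github.com/prometheus/client_model/go"

// interestingEtcdMetrics is the whitelist discussed above.
var interestingEtcdMetrics = []string{
	"etcd_disk_backend_commit_duration_seconds",
	"etcd_debugging_snap_save_total_duration_seconds",
	"etcd_disk_wal_fsync_duration_seconds",
	"etcd_network_peer_round_trip_time_seconds",
}

// filterEtcdMetrics keeps only the metric families we care about.
func filterEtcdMetrics(all map[string]*dto.MetricFamily) map[string]*dto.MetricFamily {
	out := make(map[string]*dto.MetricFamily, len(interestingEtcdMetrics))
	for _, name := range interestingEtcdMetrics {
		if mf, ok := all[name]; ok {
			out[name] = mf
		}
	}
	return out
}
```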
Automatic merge from submit-queue (batch tested with PRs 64695, 65982, 65908). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

**Collecting etcd metrics**

**What this PR does / why we need it**: Adding etcd metrics to the performance test log.

**Which issue(s) this PR fixes**: ref #64030

**Release note**:

```release-note
NONE
```
I see there is a problem. I'm working on it.
I've found out why there are no etcd metrics collected: none of our executed runs contain my PR. I verified that etcd metrics are collected during presubmit checks, so it should work for our normal test runs. The issue here is that for some reason all of the recent test runs are using kubernetes commit 7bf40d2 (which at this point is 2 days old). We should investigate why the newest ones aren't used.
The commit seems to have been stuck for our gce-100 job for a long time. @krzyzacy @BenTheElder Any idea what's happening?
cc @kubernetes/test-infra-maintainers
The build job is failing, so it's getting the last good build. No idea why we didn't get an alert yet; investigating. https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-build/41170/
Fix for the build job is in #66076.
The fix is merged; the next run (very soon) should pick it up and put up a new build.
@ixdy @BenTheElder Thanks for the fix! @shyamjvs Etcd metrics are available in the artifacts directory now.
@shyamjvs I'm afraid that's impossible without making changes to etcd. The problem is that we are using histograms as metrics. We have the number of samples in each bucket, but not the distribution of samples within it. We can only calculate which bucket the x-th percentile falls into.
You're right that it's hard to do much with percentiles. We could try to improve it a bit using e.g. linear approximation, but maybe it's just better to keep it as a histogram, now that I think of it. We could try graphing the 'fraction of count in each bucket' instead on perf-dash. Wdyt?
The result will look better (every number with a decimal point looks more reliable), but I don't think it will give us more information about the actual distribution.
This one sounds really good, especially the "fraction of count" part. Plotting these graphs on perf-dash is doable.
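To make both ideas concrete, here is a small sketch (illustrative only, not the code that was merged) of estimating a quantile by linear interpolation within its bucket, and of computing the fraction of samples per bucket from cumulative histogram counts:

```go
package etcdmetrics

// approxQuantile estimates the q-th quantile (0 < q < 1) from cumulative
// Prometheus histogram buckets, assuming samples are spread uniformly
// within the bucket that contains the quantile.
func approxQuantile(q float64, upperBounds []float64, cumCounts []uint64) float64 {
	total := cumCounts[len(cumCounts)-1]
	rank := q * float64(total)
	prevBound, prevCount := 0.0, uint64(0)
	for i, c := range cumCounts {
		if float64(c) >= rank && c > prevCount {
			frac := (rank - float64(prevCount)) / float64(c-prevCount)
			return prevBound + frac*(upperBounds[i]-prevBound)
		}
		prevBound, prevCount = upperBounds[i], c
	}
	return upperBounds[len(upperBounds)-1]
}

// bucketFractions turns cumulative bucket counts into the
// "fraction of count in each bucket" series discussed above.
func bucketFractions(cumCounts []uint64) []float64 {
	total := float64(cumCounts[len(cumCounts)-1])
	fracs := make([]float64, len(cumCounts))
	prev := uint64(0)
	for i, c := range cumCounts {
		fracs[i] = float64(c-prev) / total
		prev = c
	}
	return fracs
}
```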
Alright then, let's proceed that way. Btw - this approach in general would unlock a whole new class of metrics to be plottable on perf-dash.
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

**Collecting etcd histogram metrics**

**What this PR does / why we need it**: Changes collected etcd metrics from quantiles to histograms. ref #64030

**Release note**:

```release-note
NONE
```
PR for perf-dash changes: kubernetes/perf-tests#160
Thanks @krzysied. Once we have verified that the final graphs for the etcd metrics are available on perf-dash, feel free to close this issue.
I've just updated perf-dash. It looks like it works fine.
Why is that using an interpolated spline for the data? That implies there are valid values for the metrics between test runs, and that the values are continuous and somehow related run-to-run, but neither of those is true, right?
You're right, there's no logical connection - it's just for aesthetics :)
What wwwww
Not sure I understand the comment @fejta
Closing this issue with both action items done. Etcd metrics are now available on http://perf-dash.k8s.io. Thanks!
This is a discussion that has come up a few times already in the past. It'll be really helpful to have these metrics in light of recent regressions we've seen due to etcd changes, like:
Etcd exposes a bunch of metrics, out of which a few that are most interesting to us IMO are:
We should:
cc @kubernetes/sig-scalability-misc @kubernetes/sig-api-machinery-misc @wojtek-t @jpbetz @jberkus