add tokio metrics collector #5647

Draft: wants to merge 8 commits into base: main


conradludgate
Contributor

@conradludgate conradludgate commented Oct 24, 2023

Problem

It would be interesting to see runtime stats over time.

Summary of changes

Adds a Prometheus collector for tokio runtime metrics.

TODO:

  • Add more of the counters
  • Try and get tokio to use counters instead of gauges
  • Verify that it doesn't impact performance
  • Confirm that a 15-second Prometheus scrape interval is frequent enough to get interesting stats; otherwise, perform eager aggregations

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@github-actions

github-actions bot commented Oct 24, 2023

2340 tests run: 2225 passed, 0 failed, 115 skipped (full report)


Flaky tests (2)

Postgres 16

  • test_crafted_wal_end[last_wal_record_crossing_segment]: debug

Postgres 15

  • test_crafted_wal_end[last_wal_record_crossing_segment]: release

Code coverage (full report)

  • functions: 53.5% (8633 of 16128 functions)
  • lines: 81.5% (50207 of 61595 lines)

The comment gets automatically updated with the latest test results
ac0b314 at 2023-10-26T22:06:47.381Z

Contributor

@problame problame left a comment

There is

https://docs.rs/tokio-metrics

which, I know, doesn't implement the prometheus::core::Collector.

But it abstracts the idea of looking at the RuntimeMetrics at fixed intervals, which I think is a very good idea, because many of the tokio::RuntimeMetrics fields are aggregations already.

For example, I think to detect periods of increased executor stalls due to, say, slow blocking IO, you'd want to sample&reset max_busy_duration at a given frequency. (I could be wrong about this though, haven't looked deeply at RuntimeMetrics because they seemed out of reach due to the cfg tokio_unstable)
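The sample-and-reset idea above can be sketched with the standard library alone. This is a minimal illustration, not how tokio or tokio-metrics implement it: `WindowedMax`, `observe`, and `sample_and_reset` are hypothetical names, and the sketch assumes a separate sampler calls `sample_and_reset` at a fixed frequency.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

/// Tracks the longest duration observed since the last sample.
/// Hypothetical helper for illustration; not part of tokio or tokio-metrics.
pub struct WindowedMax {
    max_nanos: AtomicU64,
}

impl WindowedMax {
    pub const fn new() -> Self {
        Self { max_nanos: AtomicU64::new(0) }
    }

    /// Record one observation, keeping only the largest value in the window.
    pub fn observe(&self, d: Duration) {
        // Saturate instead of wrapping if the duration exceeds u64 nanoseconds.
        let nanos = u64::try_from(d.as_nanos()).unwrap_or(u64::MAX);
        self.max_nanos.fetch_max(nanos, Ordering::Relaxed);
    }

    /// Return the maximum seen since the previous call and start a new window.
    pub fn sample_and_reset(&self) -> Duration {
        Duration::from_nanos(self.max_nanos.swap(0, Ordering::Relaxed))
    }
}
```

Because each sample clears the stored maximum, a single long stall shows up only in the window where it occurred instead of pinning the value for the lifetime of the process.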

.cargo/config.toml (outdated, resolved)
libs/metrics/src/tokio_metrics.rs (outdated, resolved)
@conradludgate
Contributor Author

> There is https://docs.rs/tokio-metrics

I did look into this. It seemed like managing a timer task would add more overhead and complexity than sampling the runtime on demand.

> because many of the tokio::RuntimeMetrics fields are aggregations already.

Are they? I see only 2 metrics that fit this:

I'm not keen on enabling poll_count histograms as it's going to add more overhead:
https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.enable_metrics_poll_count_histogram

> Task poll times are not instrumented by default as doing so requires calling Instant::now() twice per task poll, which could add measurable overhead

So really it's only 1 aggregate in this case. The rest of the quantities are counters and gauges.

> For example, I think to detect periods of increased executor stalls due to, say, slow blocking IO, you'd want to sample&reset max_busy_duration at a given frequency

We can get worker_total_busy_duration as a counter. Couldn't you use a sum(rate()) construction to measure this, unless you think 15s granularity is far too low a resolution?
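As a rough illustration of what a rate over a busy-time counter yields, the sketch below computes the busy fraction between two scrapes. `Sample` and `busy_ratio` are hypothetical names, and the 15-second spacing is simply the scrape interval mentioned in this thread; the actual query would run in Prometheus, not in Rust.

```rust
use std::time::Duration;

/// One scrape of a monotonically increasing busy-time counter.
/// Hypothetical type for illustration.
#[derive(Clone, Copy)]
pub struct Sample {
    /// Time of the scrape, measured from an arbitrary fixed origin.
    pub at: Duration,
    /// Cumulative busy time reported by the counter at that moment.
    pub busy_total: Duration,
}

/// Fraction of the interval between two scrapes that the worker spent busy.
/// Returns None if the samples are out of order, coincide in time, or the
/// counter went backwards (for example after a process restart).
pub fn busy_ratio(prev: Sample, curr: Sample) -> Option<f64> {
    let elapsed = curr.at.checked_sub(prev.at)?;
    let busy = curr.busy_total.checked_sub(prev.busy_total)?;
    if elapsed.is_zero() {
        return None;
    }
    Some(busy.as_secs_f64() / elapsed.as_secs_f64())
}
```

The result is an average over the whole scrape interval, so a short stall inside a mostly idle interval is smoothed out. That is the granularity concern raised above.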

@conradludgate
Contributor Author

Regarding gauges and sampling rates, it's been requested to use counter pairs tokio-rs/tokio#4073 (comment) (which I know @koivunej will be happy about)
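For readers unfamiliar with the term, a counter pair replaces an "in flight" gauge with two monotonic counters, one for starts and one for finishes. The sketch below is a generic illustration under that assumption; `CounterPair` and `in_flight` are hypothetical names and do not reflect what the linked tokio discussion proposes in detail.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Two monotonic counters standing in for an "in flight" gauge.
/// Hypothetical helper for illustration.
pub struct CounterPair {
    started: AtomicU64,
    finished: AtomicU64,
}

impl CounterPair {
    pub const fn new() -> Self {
        Self {
            started: AtomicU64::new(0),
            finished: AtomicU64::new(0),
        }
    }

    /// Call when a unit of work begins.
    pub fn start(&self) {
        self.started.fetch_add(1, Ordering::Relaxed);
    }

    /// Call when a unit of work ends.
    pub fn finish(&self) {
        self.finished.fetch_add(1, Ordering::Relaxed);
    }

    /// Read both counters as (started, finished).
    pub fn snapshot(&self) -> (u64, u64) {
        (
            self.started.load(Ordering::Relaxed),
            self.finished.load(Ordering::Relaxed),
        )
    }
}

/// Derive the gauge value from a snapshot.
/// The two loads are not atomic as a pair, so saturate instead of underflowing.
pub fn in_flight((started, finished): (u64, u64)) -> u64 {
    started.saturating_sub(finished)
}
```

Unlike a sampled gauge, both counters only ever increase, so activity that starts and finishes between two scrapes still leaves a trace in the deltas.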

@koivunej
Contributor

> Regarding gauges and sampling rates, it's been requested to use counter pairs tokio-rs/tokio#4073 (comment) (which I know @koivunej will be happy about)

Excellent :) Yes, this brings joy for me :)

@problame
Contributor

problame commented Nov 3, 2023

> I'm not keen on enabling poll_count histograms as it's going to add more overhead:
> https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.enable_metrics_poll_count_histogram

Well this is the primary use case that I have for these metrics in pageserver.
We want to detect tasks that are stalling the executor.

We're ready to take a constant overhead in exchange for getting that observability.
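The constant overhead in question comes from reading the clock before and after each poll and recording the difference in a histogram. The sketch below shows that pattern in isolation using only the standard library. It is not tokio's implementation: `DurationHistogram`, its bucket layout, and the `time` helper are hypothetical.

```rust
use std::time::{Duration, Instant};

/// Fixed-bucket histogram of durations. Hypothetical type for illustration.
pub struct DurationHistogram {
    /// Inclusive upper bound of each bucket; must be sorted ascending.
    bounds: Vec<Duration>,
    /// Per-bucket counts (not cumulative); the final entry is the overflow bucket.
    counts: Vec<u64>,
}

impl DurationHistogram {
    pub fn new(bounds: Vec<Duration>) -> Self {
        let counts = vec![0; bounds.len() + 1];
        Self { bounds, counts }
    }

    /// Count `d` in the first bucket whose upper bound is >= `d`.
    pub fn record(&mut self, d: Duration) {
        // Yields bounds.len() (the overflow bucket) when `d` exceeds every bound.
        let idx = self.bounds.partition_point(|b| *b < d);
        self.counts[idx] += 1;
    }

    pub fn counts(&self) -> &[u64] {
        &self.counts
    }

    /// Run `f` and record how long it took: one clock read before the call
    /// and one after, which is the per-poll cost being discussed.
    pub fn time<T>(&mut self, f: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = f();
        self.record(start.elapsed());
        out
    }
}
```

With bounds such as 1 ms and 10 ms, counts accumulating in the overflow bucket would point at polls long enough to stall an executor thread.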

@conradludgate
Contributor Author

Hmm, fair enough.

@conradludgate
Contributor Author

conradludgate commented Nov 3, 2023

I was curious, so I ran a quick HTTP benchmark, and it showed only a 1% degradation in performance.


3 participants