add tokio metrics collector #5647

conradludgate · 2023-10-24T15:59:01Z

Problem

It would be interesting to see tokio runtime stats over time.

Summary of changes

Adds a prometheus collector for tokio runtime metrics.

TODO:

Add more of the counters
Try to get tokio to use counters instead of gauges
Verify that it doesn't impact performance
Confirm that a 15-second Prometheus scrape interval is frequent enough to get interesting stats; otherwise perform eager aggregations

Checklist before requesting a review

I have performed a self-review of my code.
If it is a core feature, I have added thorough tests.
Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

Do not forget to reformat commit message to not include the above checklist

github-actions · 2023-10-24T16:59:06Z

2340 tests run: 2225 passed, 0 failed, 115 skipped (full report)

Flaky tests (2)

Postgres 16

test_crafted_wal_end[last_wal_record_crossing_segment]: debug

Postgres 15

test_crafted_wal_end[last_wal_record_crossing_segment]: release

Code coverage (full report)

functions: 53.5% (8633 of 16128 functions)
lines: 81.5% (50207 of 61595 lines)

The comment gets automatically updated with the latest test results: ac0b314 at 2023-10-26T22:06:47.381Z

libs/metrics/src/tokio_metrics.rs

problame

There is https://docs.rs/tokio-metrics which, I know, doesn't implement prometheus::core::Collector.

But it abstracts the idea of looking at the RuntimeMetrics at fixed intervals, which I think is a very good idea, because many of the tokio::RuntimeMetrics fields are aggregations already.

For example, I think to detect periods of increased executor stalls due to, say, slow blocking IO, you'd want to sample&reset max_busy_duration at a given frequency. (I could be wrong about this though, haven't looked deeply at RuntimeMetrics because they seemed out of reach due to the cfg tokio_unstable)
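The "sample and reset at a given frequency" idea above can be sketched with the standard library alone. This is a minimal illustration, not the tokio-metrics implementation: `BusySnapshot`, `IntervalStats`, and `IntervalSampler` are hypothetical names, and the snapshot is assumed to hold cumulative per-worker busy time read from the runtime by whatever means is available.

```rust
use std::time::Duration;

/// Hypothetical snapshot of cumulative per-worker busy time at one instant.
/// This is an illustrative type, not part of tokio or tokio-metrics.
#[derive(Clone, Debug, Default)]
pub struct BusySnapshot {
    pub per_worker_busy: Vec<Duration>,
}

/// Figures for a single sampling interval, derived from two snapshots.
#[derive(Debug, PartialEq)]
pub struct IntervalStats {
    /// Busy time added across all workers during the interval.
    pub total_busy: Duration,
    /// Largest busy time added by any single worker during the interval.
    pub max_busy: Duration,
}

/// Keeps the previous snapshot so every call reports only what changed since
/// the last sample. Replacing the stored snapshot is the "reset" step.
#[derive(Default)]
pub struct IntervalSampler {
    previous: BusySnapshot,
}

impl IntervalSampler {
    pub fn sample(&mut self, current: BusySnapshot) -> IntervalStats {
        let mut total_busy = Duration::ZERO;
        let mut max_busy = Duration::ZERO;
        for (i, now) in current.per_worker_busy.iter().enumerate() {
            // A worker missing from the previous snapshot starts from zero.
            let before = self
                .previous
                .per_worker_busy
                .get(i)
                .copied()
                .unwrap_or(Duration::ZERO);
            // saturating_sub avoids a panic if a cumulative value went backwards.
            let delta = now.saturating_sub(before);
            total_busy += delta;
            max_busy = max_busy.max(delta);
        }
        self.previous = current;
        IntervalStats { total_busy, max_busy }
    }
}
```

Calling `sample` from a timer gives per-interval maxima; calling it from a scrape handler gives the same figures at scrape granularity, which is the trade-off discussed in this thread.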

.cargo/config.toml

libs/metrics/src/tokio_metrics.rs

conradludgate · 2023-10-25T08:38:34Z

> There is https://docs.rs/tokio-metrics

I did look into this; it seemed like managing a timer task would add more overhead and complexity than sampling the runtime on demand.

> because many of the tokio::RuntimeMetrics fields are aggregations already.

Are they? I see only 2 metrics that fit this:

I'm not keen on enabling poll_count histograms as it's going to add more overhead:
https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.enable_metrics_poll_count_histogram

> Task poll times are not instrumented by default as doing so requires calling Instant::now() twice per task poll, which could add measurable overhead.

So really it's only 1 aggregate in this case. The rest of the quantities are counters and gauges.
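To make the overhead concern concrete, here is a rough sketch of what timing a poll involves: one clock read before and one after, followed by a bucket increment. `PollHistogram` and its methods are hypothetical and use only the standard library; this is not how tokio implements its poll count histogram.

```rust
use std::time::{Duration, Instant};

/// Hypothetical fixed-bucket histogram of poll durations (not a tokio type).
pub struct PollHistogram {
    /// Inclusive upper bound of each bucket, in ascending order.
    bounds: Vec<Duration>,
    /// One count per bound, plus a final overflow bucket.
    counts: Vec<u64>,
}

impl PollHistogram {
    pub fn new(bounds: Vec<Duration>) -> Self {
        let counts = vec![0; bounds.len() + 1];
        Self { bounds, counts }
    }

    /// Increments the first bucket whose bound is >= `elapsed`,
    /// or the overflow bucket if none is.
    pub fn record(&mut self, elapsed: Duration) {
        let idx = self
            .bounds
            .iter()
            .position(|bound| elapsed <= *bound)
            .unwrap_or(self.bounds.len());
        self.counts[idx] += 1;
    }

    /// Runs `poll` and records how long it took. The two clock reads here are
    /// the per-poll cost that the quoted documentation warns about.
    pub fn timed<T>(&mut self, poll: impl FnOnce() -> T) -> T {
        let start = Instant::now(); // first clock read
        let out = poll();
        let elapsed = start.elapsed(); // second clock read
        self.record(elapsed);
        out
    }

    pub fn counts(&self) -> &[u64] {
        &self.counts
    }
}
```

The cost is constant per poll, so it matters most for workloads made of many very short polls.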

> For example, I think to detect periods of increased executor stalls due to, say, slow blocking IO, you'd want to sample&reset max_busy_duration at a given frequency

We can get worker_total_busy_duration as a counter. Couldn't you use a sum(rate()) construction to measure this? Unless you think 15s granularity is far too low a resolution.
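For readers unfamiliar with the sum(rate()) pattern, the arithmetic behind it can be sketched as follows. This is a simplified illustration in plain Rust, not Prometheus's actual `rate()` implementation (which also extrapolates and compensates for counter resets); `Scrape`, `busy_rate`, and `summed_busy_rate` are hypothetical names.

```rust
use std::time::Duration;

/// One scrape of a cumulative busy-time counter for a single worker.
/// Hypothetical type for illustration only.
#[derive(Clone, Copy, Debug)]
pub struct Scrape {
    /// When the scrape happened, measured from some fixed origin.
    pub at: Duration,
    /// Cumulative busy time reported by the worker at that moment.
    pub busy_total: Duration,
}

/// Busy seconds per wall-clock second between two scrapes, i.e. the fraction
/// of the window the worker spent busy. Returns None for an empty window or
/// if the counter went backwards (treated here as a reset and skipped).
pub fn busy_rate(prev: Scrape, curr: Scrape) -> Option<f64> {
    let window = curr.at.checked_sub(prev.at)?;
    if window.is_zero() {
        return None;
    }
    let delta = curr.busy_total.checked_sub(prev.busy_total)?;
    Some(delta.as_secs_f64() / window.as_secs_f64())
}

/// Sum of the per-worker rates: roughly "how many workers' worth" of busy
/// time the runtime used over the window.
pub fn summed_busy_rate(prev: &[Scrape], curr: &[Scrape]) -> f64 {
    prev.iter()
        .zip(curr)
        .filter_map(|(p, c)| busy_rate(*p, *c))
        .sum()
}
```

With a 15s window, a worker whose counter grew by 3s yields a rate of 0.2. A short stall inside that window is averaged out, which is the resolution question raised above.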

conradludgate · 2023-10-25T11:25:04Z

Regarding gauges and sampling rates, it's been requested to use counter pairs tokio-rs/tokio#4073 (comment) (which I know @koivunej will be happy about)
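As a rough illustration of why counter pairs are attractive here: a gauge sampled at scrape time misses anything that starts and finishes between two scrapes, while a pair of monotonic counters still records it, and the gauge can be derived from the pair. `TaskCounterPair` below is a hypothetical std-only sketch, not the API proposed in the linked tokio issue.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical pair of monotonic counters tracking task lifecycle events.
#[derive(Default)]
pub struct TaskCounterPair {
    spawned: AtomicU64,
    finished: AtomicU64,
}

impl TaskCounterPair {
    pub fn on_spawn(&self) {
        self.spawned.fetch_add(1, Ordering::Relaxed);
    }

    pub fn on_finish(&self) {
        self.finished.fetch_add(1, Ordering::Relaxed);
    }

    /// Total tasks ever spawned; only ever increases.
    pub fn spawned(&self) -> u64 {
        self.spawned.load(Ordering::Relaxed)
    }

    /// Total tasks ever finished; only ever increases.
    pub fn finished(&self) -> u64 {
        self.finished.load(Ordering::Relaxed)
    }

    /// The gauge derived from the pair: tasks currently alive.
    /// `finished` is read first and the subtraction saturates, so a racing
    /// update can overestimate slightly but never underflow.
    pub fn active(&self) -> u64 {
        let finished = self.finished();
        let spawned = self.spawned();
        spawned.saturating_sub(finished)
    }
}
```

If three tasks spawn and finish between two scrapes, `active()` reads 0 at both scrapes, yet `spawned()` has grown by 3, so the activity remains visible to a rate query.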

koivunej · 2023-10-26T15:33:20Z

> Regarding gauges and sampling rates, it's been requested to use counter pairs tokio-rs/tokio#4073 (comment) (which I know @koivunej will be happy about)

Excellent :) Yes, this brings joy for me :)

problame · 2023-11-03T09:26:37Z

> I'm not keen on enabling poll_count histograms as it's going to add more overhead:
> https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.enable_metrics_poll_count_histogram

Well this is the primary use case that I have for these metrics in pageserver.
We want to detect tasks that are stalling the executor.

We're ready to take a constant overhead in exchange for getting that observability.

conradludgate · 2023-11-03T09:31:48Z

Hmm, fair enough.

conradludgate · 2023-11-03T09:44:25Z

I was curious, so I ran a quick HTTP benchmark, and it showed only a 1% degradation in performance.

conradludgate added 3 commits October 24, 2023 16:58

add tokio metrics collector

ba16dfb

remove a troublesome test

84b389f

less duplication, more sharing

52479df

problame reviewed Oct 24, 2023

libs/metrics/src/tokio_metrics.rs

problame reviewed Oct 24, 2023

libs/metrics/src/tokio_metrics.rs

problame reviewed Oct 24, 2023

koivunej reviewed Oct 25, 2023

.cargo/config.toml

libs/metrics/src/tokio_metrics.rs

conradludgate added 4 commits October 25, 2023 14:55

stable runtime name

03e646c

update ci and register runtimes

755bff9

fix warnings

835dca6

disable test

a8805d8

immutable runtimes

ac0b314

conradludgate mentioned this pull request Oct 27, 2023

feat: add task counter pairs tokio-rs/tokio#6114

