pageserver: skip waiting for logical size on shard >0 #7744

jcsp · 2024-05-13T17:44:34Z

Problem

Shards with number >0 could hang waiting for await_initial_logical_size, as we don't calculate logical size on these shards. This causes them to hold onto semaphore units and starve other tenants out from proceeding with warmup activation.

That doesn't hurt availability (we still have on-demand activation), but it does mean that some background tasks like consumption metrics would omit some tenants.

Summary of changes

Skip waiting for logical size calculation on shards >0
Upgrade unexpected code paths to use debug_assert!(), which acts as an implicit regression test for this issue, and make the info() one into a warn()

Checklist before requesting a review

I have performed a self-review of my code.
If it is a core feature, I have added thorough tests.
Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

Do not forget to reformat commit message to not include the above checklist

koivunej

This Timeline::await_initial_logical_size does not seem to be called from anywhere except init, to my surprise. I cannot see how this could have any negative effect. Should go into the next release.

github-actions · 2024-05-13T18:37:29Z

3060 tests run: 2933 passed, 0 failed, 127 skipped (full report)

Flaky tests (1)

Postgres 16

test_statvfs_pressure_usage: debug

Code coverage* (full report)

functions: 31.4% (6338 of 20191 functions)
lines: 47.3% (47942 of 101267 lines)

* collected from Rust tests only

_{The comment gets automatically updated with the latest test results
4dc4c79 at 2024-05-14T13:17:55.639Z :recycle:}

jcsp · 2024-05-14T08:22:00Z

Delete resumption code path apparently relied on the "unexpected" code path, needs investigation.

koivunej · 2024-05-14T08:40:34Z

Delete resumption code path apparently relied on the "unexpected" code path, needs investigation.

I think it's all right, the error is:

AssertionError: First log error on pageserver_1: (594, "2024-05-13T18:22:00.935571Z  WARN attach{tenant_id=19908f9639a838f4482beedf286a8ee6 shard_id=0000 gen=00000002}: await_initial_logical_size: can't get semaphore cancel token, skipping\n")
Hint: use scripts/check_allowed_errors.sh to test any new allowed_error you add

Above in release builds, assertion failure in debug builds.

Now with the hint/ad to try out check_allowed_errors. This is what I was thinking of putting these into two different PRs, but I failed to put it in words yesterday.

However, these are not sharded tests, and the PR already contains a fix, which means that we are about to create another racy shutdown logging error situation. In #7733 I am growing tired of those (it'd be the second follow-up).

## Problem Shards with number >0 could hang waiting for `await_initial_logical_size`, as we don't calculate logical size on these shards. This causes them to hold onto semaphore units and starve other tenants out from proceeding with warmup activation. That doesn't hurt availability (we still have on-demand activation), but it does mean that some background tasks like consumption metrics would omit some tenants. ## Summary of changes - Skip waiting for logical size calculation on shards >0 - Upgrade unexpected code paths to use debug_assert!(), which acts as an implicit regression test for this issue, and make the info() one into a warn()

pageserver: skip waiting for logical size on shard >0

9408506

jcsp added t/bug Issue Type: Bug c/storage/pageserver Component: storage: pageserver labels May 13, 2024

Make unexpected cases more vocal

82b59c4

koivunej approved these changes May 13, 2024

View reviewed changes

jcsp marked this pull request as ready for review May 13, 2024 17:59

jcsp requested a review from a team as a code owner May 13, 2024 17:59

jcsp requested review from problame and removed request for problame May 13, 2024 17:59

pageserver: skip waiting for logical size on timeline delete resumption

4dc4c79

jcsp merged commit 82960b2 into main May 14, 2024
55 checks passed

jcsp deleted the jcsp/sharding-logical-size-wait branch May 14, 2024 15:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pageserver: skip waiting for logical size on shard >0 #7744

pageserver: skip waiting for logical size on shard >0 #7744

jcsp commented May 13, 2024 •

edited

koivunej left a comment

github-actions bot commented May 13, 2024 •

edited

Postgres 16

jcsp commented May 14, 2024

koivunej commented May 14, 2024 •

edited

pageserver: skip waiting for logical size on shard >0 #7744

pageserver: skip waiting for logical size on shard >0 #7744

Conversation

jcsp commented May 13, 2024 • edited

Problem

Summary of changes

Checklist before requesting a review

Checklist before merging

koivunej left a comment

Choose a reason for hiding this comment

github-actions bot commented May 13, 2024 • edited

3060 tests run: 2933 passed, 0 failed, 127 skipped (full report)

Postgres 16

Code coverage* (full report)

jcsp commented May 14, 2024

koivunej commented May 14, 2024 • edited

jcsp commented May 13, 2024 •

edited

github-actions bot commented May 13, 2024 •

edited

koivunej commented May 14, 2024 •

edited