Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pageserver: skip waiting for logical size on shard >0 #7744

Merged
merged 3 commits into from
May 14, 2024

Conversation

jcsp
Copy link
Contributor

@jcsp jcsp commented May 13, 2024

Problem

Shards with number >0 could hang waiting for await_initial_logical_size, as we don't calculate logical size on these shards. This causes them to hold onto semaphore units and starve other tenants out from proceeding with warmup activation.

That doesn't hurt availability (we still have on-demand activation), but it does mean that some background tasks like consumption metrics would omit some tenants.

Summary of changes

  • Skip waiting for logical size calculation on shards >0
  • Upgrade unexpected code paths to use debug_assert!(), which acts as an implicit regression test for this issue, and make the info() one into a warn()

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@jcsp jcsp added t/bug Issue Type: Bug c/storage/pageserver Component: storage: pageserver labels May 13, 2024
Copy link
Contributor

@koivunej koivunej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Timeline::await_initial_logical_size does not seem to be called from anywhere except init, to my surprise. I cannot see how this could have any negative effect. Should go into the next release.

@jcsp jcsp marked this pull request as ready for review May 13, 2024 17:59
@jcsp jcsp requested a review from a team as a code owner May 13, 2024 17:59
@jcsp jcsp requested review from problame and removed request for problame May 13, 2024 17:59
Copy link

github-actions bot commented May 13, 2024

3060 tests run: 2933 passed, 0 failed, 127 skipped (full report)


Flaky tests (1)

Postgres 16

  • test_statvfs_pressure_usage: debug

Code coverage* (full report)

  • functions: 31.4% (6338 of 20191 functions)
  • lines: 47.3% (47942 of 101267 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
4dc4c79 at 2024-05-14T13:17:55.639Z :recycle:

@jcsp
Copy link
Contributor Author

jcsp commented May 14, 2024

Delete resumption code path apparently relied on the "unexpected" code path, needs investigation.

@koivunej
Copy link
Contributor

koivunej commented May 14, 2024

Delete resumption code path apparently relied on the "unexpected" code path, needs investigation.

I think it's all right, the error is:

AssertionError: First log error on pageserver_1: (594, "2024-05-13T18:22:00.935571Z  WARN attach{tenant_id=19908f9639a838f4482beedf286a8ee6 shard_id=0000 gen=00000002}: await_initial_logical_size: can't get semaphore cancel token, skipping\n")
Hint: use scripts/check_allowed_errors.sh to test any new allowed_error you add

Above in release builds, assertion failure in debug builds.

Now with the hint/ad to try out check_allowed_errors. This is what I was thinking of putting these into two different PRs, but I failed to put it in words yesterday.

However, these are not sharded tests, and the PR already contains a fix, which means that we are about to create another racy shutdown logging error situation. In #7733 I am growing tired of those (it'd be the second follow-up).

@jcsp jcsp merged commit 82960b2 into main May 14, 2024
55 checks passed
@jcsp jcsp deleted the jcsp/sharding-logical-size-wait branch May 14, 2024 15:39
a-masterov pushed a commit that referenced this pull request May 20, 2024
## Problem

Shards with number >0 could hang waiting for
`await_initial_logical_size`, as we don't calculate logical size on
these shards. This causes them to hold onto semaphore units and starve
other tenants out from proceeding with warmup activation.

That doesn't hurt availability (we still have on-demand activation), but
it does mean that some background tasks like consumption metrics would
omit some tenants.

## Summary of changes

- Skip waiting for logical size calculation on shards >0
- Upgrade unexpected code paths to use debug_assert!(), which acts as an
implicit regression test for this issue, and make the info() one into a
warn()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants