/utilization endpoint times out surprisingly often #7734

Open
jcsp opened this issue May 13, 2024 · 2 comments
Labels: c/storage/pageserver (Component: storage: pageserver), t/bug (Issue Type: Bug), triaged (bugs that were already triaged)

jcsp (Contributor) commented May 13, 2024

This endpoint is meant to be lightweight enough to serve as a heartbeat, but in practice under load we are seeing it time out (the controller uses a 1000ms timeout) more often than expected.
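
For context, a minimal standalone sketch of what that heartbeat boils down to, assuming reqwest and tokio; the host, port and exact path here are placeholders, and this is illustrative code, not the storage controller's actual implementation:

```rust
use std::time::Duration;

// Hypothetical probe that mimics the heartbeat described above:
// GET /utilization with a 1000ms deadline. The base URL is a placeholder,
// not a real deployment address.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::builder()
        .timeout(Duration::from_millis(1000)) // same 1000ms deadline the controller uses
        .build()?;

    match client
        .get("http://pageserver.example:9898/utilization") // placeholder host/port
        .send()
        .await
    {
        Ok(resp) => println!("heartbeat ok: {}", resp.status()),
        Err(e) if e.is_timeout() => println!("heartbeat timed out after 1000ms"),
        Err(e) => println!("heartbeat failed: {e}"),
    }
    Ok(())
}
```

Whatever briefly stalls the server for longer than that client-side deadline, the controller just sees a timeout.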

@jcsp jcsp added t/bug Issue Type: Bug c/storage/pageserver Component: storage: pageserver labels May 13, 2024
@jcsp jcsp self-assigned this May 13, 2024
jcsp (Contributor, Author) commented May 13, 2024

I can get these to happen in tests under somewhat heavy load, e.g. 256 concurrent timeline creations. But it's not very interesting: we're running hundreds of concurrent initdb processes, and if Linux is timeslicing a core at 10ms/100ms granularity then yeah, we end up blocking and waiting.

FWIW, a normal run of test_storage_controller_many_tenants in release mode on my workstation doesn't generate any /utilization timeouts, so it's at least not ultra-noisy.

I see ~1s holes in the pageserver log around the same time that the controller is reporting an HTTP request timeout.
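
As a rough way to spot those holes, here is a throwaway sketch (not repo code) that assumes each log line begins with an RFC 3339 timestamp; adjust the parsing if the pageserver log format differs:

```rust
use chrono::{DateTime, FixedOffset};
use std::io::{self, BufRead};

// Read a pageserver log on stdin, parse the leading timestamp of each line,
// and report any gap between consecutive lines larger than one second.
// Assumes lines start with an RFC 3339 timestamp (an assumption, not a
// guarantee about the actual log format).
fn main() {
    let mut prev: Option<DateTime<FixedOffset>> = None;
    for line in io::stdin().lock().lines().flatten() {
        let Some(first) = line.split_whitespace().next() else { continue };
        let Ok(ts) = DateTime::parse_from_rfc3339(first) else { continue };
        if let Some(p) = prev {
            let gap = ts - p;
            if gap > chrono::Duration::seconds(1) {
                println!("{}ms hole ending at {}", gap.num_milliseconds(), first);
            }
        }
        prev = Some(ts);
    }
}
```

Run it with the log on stdin and it prints every gap longer than a second, which makes it easy to line holes up against the controller's timeout timestamps.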

On a real region I can see roughly 1-2 timeouts per day per pageserver (no impact, as it gets retried immediately). These pageservers aren't heavily loaded, and there isn't even enough log noise to see a noticeable "hole" around the time the utilization request times out. These could be something else, of course, e.g. a DNS glitch.

jcsp (Contributor, Author) commented May 20, 2024

What we can do next:

  • We might need some more instrumentation in prod, particularly to detect whether something is blocking the management runtime (where /utilization runs); see the sketch after this list. The kind of work that happens on a real pageserver but not so much in a test environment is timestamp_to_lsn calls, although that particular path is a proper async function rather than something blocking.
  • Try to correlate with other API client complaints, e.g. other calls from the control plane into the pageserver management API that occasionally fail with transport-level timeouts, to look for more clues about timing.
  • We could create a whole separate API server/runtime for heartbeat-like stuff: this is a relatively common approach in systems that need heartbeats to stay reliable even if the system as a whole has an issue, although in our system we really …
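
On the instrumentation point in the first bullet, a minimal sketch of the kind of thing meant, assuming tokio, tracing and tracing-subscriber; runtime_watchdog and its thresholds are hypothetical, not existing pageserver code:

```rust
use std::time::{Duration, Instant};

// Hypothetical watchdog: spawn this onto the runtime you want to observe
// (e.g. the management runtime). If something blocks that runtime's executor,
// the interval tick fires late and we log roughly how long the stall lasted.
async fn runtime_watchdog(period: Duration, slack: Duration) {
    let mut interval = tokio::time::interval(period);
    // After a stall, resume on the normal cadence instead of firing a burst
    // of catch-up ticks, so each measurement stays meaningful.
    interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Delay);
    let mut last = Instant::now();
    loop {
        interval.tick().await;
        let elapsed = last.elapsed();
        if elapsed > period + slack {
            tracing::warn!("runtime stalled: expected ~{:?} between ticks, saw {:?}", period, elapsed);
        }
        last = Instant::now();
    }
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    tracing_subscriber::fmt::init();
    tokio::spawn(runtime_watchdog(Duration::from_millis(100), Duration::from_millis(250)));
    // Let the watchdog start ticking before we stall the executor.
    tokio::time::sleep(Duration::from_millis(50)).await;
    // Demonstrate a stall: block the (single-threaded) executor for a second.
    std::thread::sleep(Duration::from_secs(1));
    // Yield so the watchdog gets polled again and reports the stall.
    tokio::time::sleep(Duration::from_millis(200)).await;
}
```

If a stall like the ones suspected here really does happen on the management runtime, this would put an explicit warning with an approximate duration right where the ~1s log hole currently is.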

@jcsp jcsp added the triaged bugs that were already triaged label May 23, 2024