/utilization endpoint times out surprisingly often #7734

Open
jcsp opened this issue May 13, 2024 · 2 comments
Labels: c/storage/pageserver (Component: storage: pageserver), t/bug (Issue Type: Bug), triaged (bugs that were already triaged)

jcsp (Contributor) commented May 13, 2024

This endpoint is meant to be lightweight enough to serve as a heartbeat, but in practice under load we are seeing it time out (the controller uses a 1000ms timeout) more often than expected.
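
For context, a minimal standalone sketch of what that heartbeat boils down to, assuming reqwest and tokio; the host, port and exact path here are placeholders, and this is illustrative code, not the storage controller's actual implementation:

```rust
use std::time::Duration;

// Hypothetical probe that mimics the heartbeat described above:
// GET /utilization with a 1000ms deadline. The base URL is a placeholder,
// not a real deployment address.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::builder()
        .timeout(Duration::from_millis(1000)) // same 1000ms deadline the controller uses
        .build()?;

    match client
        .get("http://pageserver.example:9898/utilization") // placeholder host/port
        .send()
        .await
    {
        Ok(resp) => println!("heartbeat ok: {}", resp.status()),
        Err(e) if e.is_timeout() => println!("heartbeat timed out after 1000ms"),
        Err(e) => println!("heartbeat failed: {e}"),
    }
    Ok(())
}
```

Whatever briefly stalls the server for longer than that client-side deadline, the controller just sees a timeout.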

@jcsp jcsp added t/bug Issue Type: Bug c/storage/pageserver Component: storage: pageserver labels May 13, 2024
@jcsp jcsp self-assigned this May 13, 2024
jcsp (Contributor, Author) commented May 13, 2024

I can get these to happen in tests under somewhat heavy load, e.g. 256 concurrent timeline creations. But it's not very interesting: we're running hundreds of concurrent initdb processes, and if Linux is timeslicing a core at 10ms/100ms granularity then yeah, we end up blocking and waiting.

FWIW, a normal run of test_storage_controller_many_tenants in release mode on my workstation doesn't generate any /utilization timeouts, so it's at least not ultra-noisy.

I see ~1s holes in the pageserver log around the same time that the controller is reporting an HTTP request timeout.
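
As a rough way to spot those holes, here is a throwaway sketch (not repo code) that assumes each log line begins with an RFC 3339 timestamp; adjust the parsing if the pageserver log format differs:

```rust
use chrono::{DateTime, FixedOffset};
use std::io::{self, BufRead};

// Read a pageserver log on stdin, parse the leading timestamp of each line,
// and report any gap between consecutive lines larger than one second.
// Assumes lines start with an RFC 3339 timestamp (an assumption, not a
// guarantee about the actual log format).
fn main() {
    let mut prev: Option<DateTime<FixedOffset>> = None;
    for line in io::stdin().lock().lines().flatten() {
        let Some(first) = line.split_whitespace().next() else { continue };
        let Ok(ts) = DateTime::parse_from_rfc3339(first) else { continue };
        if let Some(p) = prev {
            let gap = ts - p;
            if gap > chrono::Duration::seconds(1) {
                println!("{}ms hole ending at {}", gap.num_milliseconds(), first);
            }
        }
        prev = Some(ts);
    }
}
```

Run it with the log on stdin and it prints every gap longer than a second, which makes it easy to line holes up against the controller's timeout timestamps.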

On a real region I can see roughly 1-2 timeouts per day per pageserver (no impact, as it gets retried immediately). These pageservers aren't heavily loaded, and there isn't even enough log noise to see a noticeable "hole" around the time the utilization request times out. These could be something else, of course, e.g. a DNS glitch.

jcsp (Contributor, Author) commented May 20, 2024

What we can do next:

  • We might need some more instrumentation in prod, particularly to detect whether something is blocking the management runtime (where /utilization runs); see the sketch after this list. The kind of work that happens on a real pageserver but not so much in a test environment is timestamp_to_lsn calls, although that particular path is a proper async function rather than something blocking.
  • Try to correlate with other API client complaints, e.g. other calls from the control plane into the pageserver management API that occasionally fail with transport-level timeouts, to look for more clues about timing.
  • We could create a whole separate API server/runtime for heartbeat-like stuff: this is a relatively common approach in systems that need heartbeats to stay reliable even if the system as a whole has an issue, although in our system we really …
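
On the instrumentation point in the first bullet, a minimal sketch of the kind of thing meant, assuming tokio, tracing and tracing-subscriber; runtime_watchdog and its thresholds are hypothetical, not existing pageserver code:

```rust
use std::time::{Duration, Instant};

// Hypothetical watchdog: spawn this onto the runtime you want to observe
// (e.g. the management runtime). If something blocks that runtime's executor,
// the interval tick fires late and we log roughly how long the stall lasted.
async fn runtime_watchdog(period: Duration, slack: Duration) {
    let mut interval = tokio::time::interval(period);
    // After a stall, resume on the normal cadence instead of firing a burst
    // of catch-up ticks, so each measurement stays meaningful.
    interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Delay);
    let mut last = Instant::now();
    loop {
        interval.tick().await;
        let elapsed = last.elapsed();
        if elapsed > period + slack {
            tracing::warn!("runtime stalled: expected ~{:?} between ticks, saw {:?}", period, elapsed);
        }
        last = Instant::now();
    }
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    tracing_subscriber::fmt::init();
    tokio::spawn(runtime_watchdog(Duration::from_millis(100), Duration::from_millis(250)));
    // Let the watchdog start ticking before we stall the executor.
    tokio::time::sleep(Duration::from_millis(50)).await;
    // Demonstrate a stall: block the (single-threaded) executor for a second.
    std::thread::sleep(Duration::from_secs(1));
    // Yield so the watchdog gets polled again and reports the stall.
    tokio::time::sleep(Duration::from_millis(200)).await;
}
```

If a stall like the ones suspected here really does happen on the management runtime, this would put an explicit warning with an approximate duration right where the ~1s log hole currently is.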

@jcsp jcsp added the triaged bugs that were already triaged label May 23, 2024