Rewrite repository worker threads using message passing #22191

lberki · 2024-04-30T08:14:36Z

I spent a truly baffling amount on this, but I hope this is now deadlock-free and could be proven as such by a mere mortal.

That said, my confidence is zero in this so before making grand statements, let's see what the presubmits say.

@justinhorvitz

I managed to reproduce some deadlocks during repo fetching with virtual worker threads. One notable trigger was some _other_ repo failing to fetch, which seems to cause Skyframe to try to interrupt other concurrent repo fetches. This _might_ be the cause for a deadlock where we submit a task to the worker executor service, but the task never starts running before it gets cancelled, which causes us to wait forever for a `DONE` signal that never comes. (The worker task puts a `DONE` signal in the queue in a `finally` block -- but we don't even enter the `try`.) I then tried various things to fix this; this PR is an attempt that actually seemed to eliminate the deadlock. Instead of waiting for a `DONE` signal to make sure the worker thread has finished, we now hold on to the executor service, which offers a `close()` method that essentially uninterruptibly waits for any scheduled tasks to terminate, whether or not they have started running. (@justinhorvitz had suggested a similar idea before.) To make sure distinct repo fetches don't interfere with each other, we start a separate worker executor service for each repo fetch instead of making everyone share the same worker executor service. (This is recommended for virtual threads; see https://docs.oracle.com/en/java/javase/21/core/virtual-threads.html#GUID-C0FEE349-D998-4C9D-B032-E01D06BE55F2 for example.) Related: #22003

This makes it possible to simplify the code in a number of ways: * All work (including `setupRepoRoot()` can now be done from the worker thread * The error bubbling hack can now be removed. I surmise that the deadlock occurred because the compute state is closed before error bubbling happens (which otherwise happens only very infrequently). Now that the deadlock is (hopefully) impossible, we don't need that anymore. * Do not use the existence of the worker thread (or lack of it) as state. * Do not use "null" as a signal and denote every different state with a different object.

lberki · 2024-04-30T10:59:04Z

@Wyverald WDYT? This is the best I could do. It's not bad and I can almost keep the state in my head. I still hate Exchanger (but I have no better ideas) but at least now the interplay between the two threads seems to be almost understandable. I also ran the "bazel build --noenable_bzlmod configure_with_bazel_transitive:all in a loop" test (passed) and the test battery did also.

The log messages should probably be removed, but they provide a good insight into what I had to do to get this to work so I thought it'd be useful, at least in the first iteration of this third iteration on your "separate Thread" idea...

Wyverald · 2024-04-30T23:41:52Z

I tried to ingest the logic here but it's honestly extremely tricky (probably not a surprise to you) even with all the comments. I'll try to continue tomorrow, but meanwhile I can't help but wonder if there is a better way to do this ("better" as in more maintainable down the line). Ultimately, what we want isn't that complicated: 1) at any given point, at most one of the worker and the host is active, and they exchange some information when "switching"; 2) if the worker is interrupted at any point for any reason, the host needs to wait for its termination.

I'm tempted to try again with yet another approach: instead of using BlockingQueue#take(), use poll(1s) in a loop instead, and check for liveness of the worker thread. I seem to remember you were not a fan of this approach in initial code reviews, but can't remember what's bad about it -- it essentially allows us to "select" from multiple conditions. Before I spend any time on this, I'd like to hear your thoughts on it. (Either way, I'll continue trying to understand this PR tomorrow...) (EDIT: I ended up using a semaphore which turned out great. See the original PR!)

lberki · 2024-05-03T06:29:22Z

Discarding in favor of #22215 and #22100.

Wyverald and others added 4 commits April 24, 2024 14:04

Failed attempt to directly use a Thread for repo fetching

4a84f30

Fix a deadlock in Starlark repository workers.

36216c8

Use an Exchanger to rendezvous between the worker and Skyframe threads.

72707a9

github-actions bot added team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. awaiting-review PR is awaiting review from an assigned reviewer labels Apr 30, 2024

lberki force-pushed the lberki-repoworker-message-passing branch from ca8f475 to 8d370b2 Compare April 30, 2024 10:10

lberki requested review from Wyverald and coeuvre April 30, 2024 10:55

lberki mentioned this pull request May 3, 2024

Use a Semaphore for signaling in repo fetching #22100

Closed

lberki closed this May 3, 2024

github-actions bot removed the awaiting-review PR is awaiting review from an assigned reviewer label May 3, 2024

lberki deleted the lberki-repoworker-message-passing branch May 3, 2024 06:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rewrite repository worker threads using message passing #22191

Rewrite repository worker threads using message passing #22191

lberki commented Apr 30, 2024

lberki commented Apr 30, 2024

Wyverald commented Apr 30, 2024 •

edited

lberki commented May 3, 2024

Rewrite repository worker threads using message passing #22191

Rewrite repository worker threads using message passing #22191

Conversation

lberki commented Apr 30, 2024

lberki commented Apr 30, 2024

Wyverald commented Apr 30, 2024 • edited

lberki commented May 3, 2024

Wyverald commented Apr 30, 2024 •

edited