Rewrite Starlark repostory workers using an RPC paradigm... #22215

lberki · 2024-05-02T08:13:07Z

...where the worker thread can request things from the host thread, but not the other way round.

This makes it much easier to reason about what's happening between the threads and still allow us to share no state between the two.

@justinhorvitz

I managed to reproduce some deadlocks during repo fetching with virtual worker threads. One notable trigger was some _other_ repo failing to fetch, which seems to cause Skyframe to try to interrupt other concurrent repo fetches. This _might_ be the cause for a deadlock where we submit a task to the worker executor service, but the task never starts running before it gets cancelled, which causes us to wait forever for a `DONE` signal that never comes. (The worker task puts a `DONE` signal in the queue in a `finally` block -- but we don't even enter the `try`.) I then tried various things to fix this; this PR is an attempt that actually seemed to eliminate the deadlock. Instead of waiting for a `DONE` signal to make sure the worker thread has finished, we now hold on to the executor service, which offers a `close()` method that essentially uninterruptibly waits for any scheduled tasks to terminate, whether or not they have started running. (@justinhorvitz had suggested a similar idea before.) To make sure distinct repo fetches don't interfere with each other, we start a separate worker executor service for each repo fetch instead of making everyone share the same worker executor service. (This is recommended for virtual threads; see https://docs.oracle.com/en/java/javase/21/core/virtual-threads.html#GUID-C0FEE349-D998-4C9D-B032-E01D06BE55F2 for example.) Related: #22003

This makes it possible to simplify the code in a number of ways: * All work (including `setupRepoRoot()` can now be done from the worker thread * The error bubbling hack can now be removed. I surmise that the deadlock occurred because the compute state is closed before error bubbling happens (which otherwise happens only very infrequently). Now that the deadlock is (hopefully) impossible, we don't need that anymore. * Do not use the existence of the worker thread (or lack of it) as state. * Do not use "null" as a signal and denote every different state with a different object.

...java/com/google/devtools/build/lib/bazel/repository/starlark/StarlarkRepositoryFunction.java

.../com/google/devtools/build/lib/bazel/repository/starlark/RepoFetchingSkyKeyComputeState.java

This time, using an RPC paradigm where the worker thread can request things from the host thread (but not the other way round).

lberki · 2024-05-03T08:06:36Z

PTAL! I'm becoming less and less sure that this is preferable to #22100 but at least it looks principle now (if complicated)

coeuvre

LGTM! It's up to you and Xudong to decide which solution you want to move forward with.

coeuvre · 2024-05-03T12:51:44Z

.../com/google/devtools/build/lib/bazel/repository/starlark/RepoFetchingSkyKeyComputeState.java

+    Packet<Request> packet = new Packet(seq, request);
+    logMaybe(String.format("LOG %s/worker: sending request %s", repositoryName, packet));
+    requestQueue.put(packet);
+    // This loop is necessary because the worker and host threads can get out of sync:


As discussed offline, I think if we can start a new worker thread in case of RestartDecision(true), so that the worker thread can terminate after 3), this loop can be removed. Anyway, I think this is also good.

lberki · 2024-05-06T06:26:19Z

I think this would work, but @Wyverald decided to go with #22100 instead. I rate the two about equally in terms of complexity and probability that they fix the deadlock for good, so I think that's a reasonable decision.

@Wyverald : If #22100 doesn't work, let's take this pull request out from the closet. In the meantime, I'll close this since I hope it won't be necessary.

Wyverald and others added 5 commits April 24, 2024 14:04

Failed attempt to directly use a Thread for repo fetching

4a84f30

Fix a deadlock in Starlark repository workers.

36216c8

Use an Exchanger to rendezvous between the worker and Skyframe threads.

72707a9

lberki requested review from Wyverald and coeuvre May 2, 2024 08:13

github-actions bot added team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. awaiting-review PR is awaiting review from an assigned reviewer labels May 2, 2024

lberki force-pushed the lberki-repoworker-rpc branch 2 times, most recently from e918fe9 to 14f672b Compare May 2, 2024 12:10

coeuvre reviewed May 2, 2024

View reviewed changes

.../com/google/devtools/build/lib/bazel/repository/starlark/RepoFetchingSkyKeyComputeState.java Outdated Show resolved Hide resolved

lberki force-pushed the lberki-repoworker-rpc branch 2 times, most recently from fac78ab to 0035ddd Compare May 2, 2024 12:26

This was referenced May 2, 2024

Use a Semaphore for signaling in repo fetching #22100

Closed

Rewrite repository worker threads using message passing #22191

Closed

Use a Thread instead of an ExecutorService for Starlark repository workers #22139

Closed

Rewrite (again) repository worker threads.

a4c6aa8

This time, using an RPC paradigm where the worker thread can request things from the host thread (but not the other way round).

lberki force-pushed the lberki-repoworker-rpc branch from 0035ddd to a4c6aa8 Compare May 3, 2024 08:06

coeuvre approved these changes May 3, 2024

View reviewed changes

lberki closed this May 6, 2024

github-actions bot removed the awaiting-review PR is awaiting review from an assigned reviewer label May 6, 2024

Wyverald deleted the lberki-repoworker-rpc branch May 6, 2024 18:45

lberki mentioned this pull request May 17, 2024

[7.2.0-rc1] Bazel crashed due to an internal error: RejectedExecutionException #22393

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rewrite Starlark repostory workers using an RPC paradigm... #22215

Rewrite Starlark repostory workers using an RPC paradigm... #22215

lberki commented May 2, 2024

lberki commented May 3, 2024

coeuvre left a comment

coeuvre May 3, 2024

lberki commented May 6, 2024

Rewrite Starlark repostory workers using an RPC paradigm... #22215

Rewrite Starlark repostory workers using an RPC paradigm... #22215

Conversation

lberki commented May 2, 2024

lberki commented May 3, 2024

coeuvre left a comment

Choose a reason for hiding this comment

coeuvre May 3, 2024

Choose a reason for hiding this comment

lberki commented May 6, 2024