
[still under investigation] loading phase degradation and timeouts in Bazel 7.1.1 #22003

Closed
tyler-french opened this issue Apr 15, 2024 · 5 comments
Labels
team-Core, team-Performance, type: bug, untriaged

Comments

@tyler-french

Description of the bug:

At Uber, we had to revert from Bazel 7.1.1 to 7.0.2 because of a high number of timeouts during the loading phase (cc @linzhp).

These timeouts did not always reproduce, and they seemed to happen mostly in CI environments.

While we continue to investigate this issue, I thought it would be helpful to open a GitHub issue to get any potential ideas.

I noticed the following commits which might be related:

If this issue is not helpful, and no one else is experiencing these degradations, we can close it.

Which category does this issue belong to?

Core, Performance

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

There is no systematic way to reproduce it, but we noticed it occurring while running `bazel query //...` as well as other commands such as `bazel build` on some targets.

Which operating system are you running Bazel on?

Linux (amd64)

What is the output of bazel info release?

7.1.1

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of `git remote get-url origin; git rev-parse HEAD`?

Internal repo

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

github-actions bot added the team-Core and team-Performance labels on Apr 15, 2024
@keith
Member

keith commented Apr 15, 2024

@tjgq
Contributor

tjgq commented Apr 23, 2024

A repro would help us make progress on this. Is anyone able to provide one?

Wyverald added a commit that referenced this issue Apr 23, 2024
I managed to reproduce some deadlocks during repo fetching with virtual worker threads. One notable trigger was some _other_ repo failing to fetch, which seems to cause Skyframe to try to interrupt other concurrent repo fetches. This _might_ be the cause for a deadlock where we submit a task to the worker executor service, but the task never starts running before it gets cancelled, which causes us to wait forever for a `DONE` signal that never comes. (The worker task puts a `DONE` signal in the queue in a `finally` block -- but we don't even enter the `try`.)

I then tried various things to fix this; this PR is an attempt that actually seemed to eliminate the deadlock. Instead of waiting for a `DONE` signal to make sure the worker thread has finished, we now hold on to the executor service, which offers a `close()` method that essentially uninterruptibly waits for any scheduled tasks to terminate, whether or not they have started running. (@justinhorvitz had suggested a similar idea before.) To make sure distinct repo fetches don't interfere with each other, we start a separate worker executor service for each repo fetch instead of making everyone share the same worker executor service. (This is recommended for virtual threads; see https://docs.oracle.com/en/java/javase/21/core/virtual-threads.html#GUID-C0FEE349-D998-4C9D-B032-E01D06BE55F2 for example.)

Related: #22003
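
For readers following along, here is a minimal Java sketch of the hand-off pattern described in the commit message above, with illustrative names rather than Bazel's actual classes. It shows why a task that is cancelled before it starts (or whose interrupted worker cannot complete its blocking `put`) leaves the host thread waiting forever for a `DONE` signal.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.SynchronousQueue;

/** Illustrative reduction of the deadlock-prone hand-off; not Bazel's actual code. */
public class SignalQueueDeadlock {
  enum Signal { DONE, RESTART }

  public static void main(String[] args) throws InterruptedException {
    // Requires Java 21 for virtual threads.
    ExecutorService workerExecutor = Executors.newVirtualThreadPerTaskExecutor();
    SynchronousQueue<Signal> signalQueue = new SynchronousQueue<>();

    // The host thread submits the repo-fetch work to the worker executor.
    Future<?> workerFuture =
        workerExecutor.submit(
            () -> {
              try {
                // ... perform the fetch, handing RESTART signals back as needed ...
              } finally {
                // The worker promises to post DONE when it is finished, but only
                // if the task body actually ran, and only if this blocking put()
                // is not itself interrupted.
                try {
                  signalQueue.put(Signal.DONE);
                } catch (InterruptedException e) {
                  Thread.currentThread().interrupt(); // DONE is never delivered
                }
              }
            });

    // If Skyframe cancels the fetch (e.g. because another repo failed), the task
    // may be cancelled before it ever enters the try block, or the worker may be
    // interrupted mid-put(). Either way no DONE arrives, and the host blocks here.
    workerFuture.cancel(true);
    signalQueue.take(); // <-- potential deadlock
  }
}
```

Both failure modes come down to the same thing: delivering `DONE` requires the worker to successfully execute a blocking, interruptible call, which is exactly what the later fixes remove.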
@tyler-french
Author

tyler-french commented Apr 23, 2024

@tjgq I am pretty sure this is related to #22100.

We have tried disabling the repo-fetching worker (via `--experimental_worker_for_repo_fetching=off`) and have not noticed any timeouts since.

I am still monitoring, and will close this issue once a few more days of data confirm this pattern.

Wyverald added a commit that referenced this issue Apr 24, 2024

Wyverald added a commit that referenced this issue Apr 24, 2024
@keithl-stripe

Why is this closed? #22100 has not been merged yet.

@tyler-french
Author

> Why is this closed? #22100 has not been merged yet.

I was able to resolve the issue with `--experimental_worker_for_repo_fetching=off`.
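
For anyone hitting the same timeouts on 7.1.1, the workaround can be applied per invocation or, if you want it for every command, via a `.bazelrc` entry (the `common` line below assumes your bazelrc setup accepts the `common` pseudo-command, which Bazel 7 supports):

```
# One-off, on the command line:
bazel query //... --experimental_worker_for_repo_fetching=off

# Or for every command, in the workspace .bazelrc:
common --experimental_worker_for_repo_fetching=off
```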

Wyverald added a commit that referenced this issue May 1, 2024
I managed to reproduce some deadlocks during repo fetching with virtual worker threads. One notable trigger was some _other_ repo failing to fetch, which seems to cause Skyframe to try to interrupt other concurrent repo fetches. This _might_ be the cause for a deadlock where we submit a task to the worker executor service, but the task never starts running before it gets cancelled, which causes us to wait forever for a `DONE` signal that never comes. (The worker task puts a `DONE` signal in the queue in a `finally` block -- but we don't even enter the `try`.)

This PR improves the situation in various ways:

1. Instead of using a `SynchronousQueue` for the signal queue, we now use a Semaphore for signaling. Semaphores have the crucial property that releasing a permit (i.e. incrementing the counter) does not block, and thus cannot be interrupted. This means that the worker thread can now reliably send signals to the host thread, even when it's interrupted.

2. Instead of using two signals for `DONE` and `RESTART`, we just use the one semaphore for both signals, and rely on `workerFuture.isDone()` to tell whether the worker has finished or is waiting for a fresh Environment.

3. The above requires another change: instead of signaling `DONE` in a `finally` block, we now use a `ListenableFuture` and signal to the semaphore in the worker future's listener. This makes sure that the signaling is performed _after_ the worker future's status changes. (Note that points 2 & 3 aren't the only way to handle this -- we could alternatively just use two semaphores.)

4. Instead of waiting for a `DONE` signal (or, in the new setup, the signal semaphore) to make sure the worker thread has finished, we now hold on to the executor service, which offers a `close()` method that essentially uninterruptibly waits for any scheduled tasks to terminate, whether or not they have started running. (@justinhorvitz had suggested a similar idea before.) To make sure distinct repo fetches don't interfere with each other, we start a separate worker executor service for each repo fetch instead of making everyone share the same worker executor service. (This is recommended for virtual threads; see https://docs.oracle.com/en/java/javase/21/core/virtual-threads.html#GUID-C0FEE349-D998-4C9D-B032-E01D06BE55F2 for example.)

Related: #22003

Fixes #21712.
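
A rough Java sketch of the signaling scheme described in the commit message above: one semaphore instead of a `DONE`/`RESTART` queue, released from the worker future's listener, with a per-fetch executor that is `close()`d at the end. Names are illustrative, Guava is assumed for `ListenableFuture`, and Java 21 is assumed for virtual threads and `ExecutorService.close()`; this is not Bazel's actual implementation.

```java
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.ListeningExecutorService;
import com.google.common.util.concurrent.MoreExecutors;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

/** Illustrative sketch of semaphore-based worker signaling; not Bazel's actual code. */
public class SemaphoreSignaling {
  public static void main(String[] args) throws InterruptedException {
    // One worker executor per repo fetch, using a virtual thread per task (Java 21).
    ListeningExecutorService workerExecutor =
        MoreExecutors.listeningDecorator(Executors.newVirtualThreadPerTaskExecutor());

    // A single semaphore replaces the DONE/RESTART queue. Releasing a permit never
    // blocks, so the worker can always signal the host, even while being interrupted.
    Semaphore signalSemaphore = new Semaphore(0);

    ListenableFuture<?> workerFuture =
        workerExecutor.submit(
            () -> {
              // ... fetch the repo, releasing the semaphore whenever a fresh
              // Environment (a Skyframe restart) is needed ...
            });

    // Signal completion from the future's listener rather than a finally block:
    // the listener fires after the future's state changes, including when the
    // task is cancelled before it ever started running.
    workerFuture.addListener(signalSemaphore::release, MoreExecutors.directExecutor());

    // Host side: wait for a signal, then use the future's state to distinguish
    // "worker finished" from "worker wants a fresh Environment".
    signalSemaphore.acquire();
    if (workerFuture.isDone()) {
      // Finished: successfully, exceptionally, or via cancellation.
    } else {
      // Restart requested: return from the SkyFunction and come back later.
    }

    // Teardown: close() essentially uninterruptibly waits for all scheduled tasks
    // to terminate, whether or not they ever started running.
    workerExecutor.close();
  }
}
```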
Wyverald added a commit that referenced this issue May 2, 2024
copybara-service bot pushed a commit that referenced this issue May 4, 2024
I managed to reproduce some deadlocks during repo fetching with virtual worker threads. One notable trigger was some _other_ repo failing to fetch, which seems to cause Skyframe to try to interrupt other concurrent repo fetches. This _might_ be the cause for a deadlock where we submit a task to the worker executor service, but the task never starts running before it gets cancelled, which causes us to wait forever for a `DONE` signal that never comes. (The worker task puts a `DONE` signal in the queue in a `finally` block -- but we don't even enter the `try`.)

This PR improves the situation in various ways:

1. Instead of using a `SynchronousQueue` for the signal queue, we now use a Semaphore for signaling. Semaphores have the crucial property that releasing a permit (i.e. incrementing the counter) does not block, and thus cannot be interrupted. This means that the worker thread can now reliably send signals to the host thread, even when it's interrupted.

2. Instead of using two signals for `DONE` and `RESTART`, we just use the one semaphore for both signals, and rely on `workerFuture.isDone()` to tell whether the worker has finished or is waiting for a fresh Environment.

3. Instead of signaling `DONE` in a `finally` block, we now use a `ListenableFuture` and signal to the semaphore in the worker future's listener. This makes sure that the signaling is performed _after_ the worker future's status changes, and safeguards against the case where the submitted task never starts running before it gets cancelled.

4. Instead of waiting for a `DONE` signal (or, in the new setup, the signal semaphore) to make sure the worker thread has finished, we now hold on to the executor service, which offers a `close()` method that essentially uninterruptibly waits for any scheduled tasks to terminate, whether or not they have started running. (@justinhorvitz had suggested a similar idea before.) To make sure distinct repo fetches don't interfere with each other, we start a separate worker executor service for each repo fetch instead of making everyone share the same worker executor service. (This is recommended for virtual threads; see https://docs.oracle.com/en/java/javase/21/core/virtual-threads.html#GUID-C0FEE349-D998-4C9D-B032-E01D06BE55F2 for example.)

And because I now create a separate worker executor service for each repo fetch, it doesn't really make sense to use this for platform threads anymore. So setting `--experimental_worker_for_repo_fetching` to anything but `off` will cause us to use virtual threads.

Related: #22003

Fixes #21712.

Closes #22100.

PiperOrigin-RevId: 630534733
Change-Id: If989bf9cae76abb1579a2b1de896df8e5a63b88d
Wyverald closed this as completed May 6, 2024
Wyverald added a commit that referenced this issue May 6, 2024
Kila2 pushed a commit to Kila2/bazel that referenced this issue May 13, 2024