Sandbox reuse for tests spends a while #22309

Open
DavidZbarsky-at opened this issue May 10, 2024 · 37 comments
Labels
P2 (We'll consider working on this in future; assignee optional) · team-Local-Exec (Issues and PRs for the Execution (Local) team) · type: bug

Comments

@DavidZbarsky-at

Description of the bug:

These tests should have pretty much the same sandbox, but it looks like we are spending 1-2s per test on filtering the existing files.
[profile screenshot]

Would it be possible for the stashed sandbox to store a list of its contents, so we can do this more efficiently?

Which category does this issue belong to?

No response

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

No response

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

release 7.1.0

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.

Related to e9022f6 ?

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

No response

@fmeum
Collaborator

fmeum commented May 10, 2024

@DavidZbarsky-at Could you also test the performance with 7.0.0, which doesn't include e9022f6?

Cc @oquenchil

@oquenchil
Contributor

I actually implemented storing the contents of a stashed sandbox in memory, but it uses noticeably more memory, so I didn't check that in. Regardless of what caused the regression here (which I should fix), I'd want to keep the --reuse_sandbox_directories flag as an alternative that consumes as little memory as possible. For people for whom memory isn't a problem, the sandbox can be placed on tmpfs with --sandbox_base=/dev/shm.

Have you tried the latter? Could you comment on whether it helps you? If you try with tmpfs, you could also pass --noreuse_sandbox_directories and --experimental_sandbox_async_tree_delete_idle_threads=0 at the same time.
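For illustration, the tmpfs experiment suggested above might look like this on the command line (the flags are the ones named in this comment; the //... target pattern is just a placeholder):

# tmpfs sandbox base, sandbox reuse off, async tree deletion threads set to 0
bazel test //... --sandbox_base=/dev/shm --noreuse_sandbox_directories --experimental_sandbox_async_tree_delete_idle_threads=0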

@oquenchil
Contributor

Do you have a repro for what you are seeing in your build that I could use to debug?

@fmeum
Collaborator

fmeum commented May 10, 2024

@bazel-io flag

@bazel-io added the potential release blocker label (flagged by community members using "@bazel-io flag"; should be added to a release blocker milestone) on May 10, 2024
@DavidZbarsky-at
Author

@DavidZbarsky-at Could you also test the performance with 7.0.0, which doesn't include e9022f6?

On 7.0.0 I was seeing around 600-900ms of sandbox setup overhead. Note that we are using a tmpfs sandbox base; without that it was around 20-30s per test.

[profile screenshot]

I actually implemented storing the contents of a stashed sandbox in memory, but it uses noticeably more memory, so I didn't check that in.

Perhaps we could store a manifest file on disk? That would mean a single sequential read instead of a filesystem walk, and I think the read + comparison can be streamed if the manifest is sorted.
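As a rough sketch of that idea (the file names here are hypothetical), two sorted manifests could be diffed in a single streamed pass, e.g. with comm:

# stash_manifest.txt: sorted listing of the stashed sandbox's contents
# wanted_inputs.txt:  sorted listing of the inputs the next action needs
comm -23 stash_manifest.txt wanted_inputs.txt   # only in the stash  -> delete from the stash
comm -13 stash_manifest.txt wanted_inputs.txt   # only in the inputs -> create/link into the sandbox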

Regardless of what caused the regression here (which I should fix), I'd want to keep the --reuse_sandbox_directories flag as an alternative that consumes as little memory as possible. For people for whom memory isn't a problem, the sandbox can be placed on tmpfs with --sandbox_base=/dev/shm.

Have you tried the latter? Could you comment on whether it helps you? If you try with tmpfs, you could also pass --noreuse_sandbox_directories and --experimental_sandbox_async_tree_delete_idle_threads=0 at the same time.

I was hoping sandbox reuse would allow a further speedup on top of the tmpfs sandbox base (even with tmpfs, syscalls aren't free); it feels like the worker should be able to diff a list of 60K items fairly quickly :)

So to summarize, here is the set of timings I am seeing:

  • Bazel 7.1.0, sandbox reuse + tmpfs: 1-2s overheads
  • Bazel 7.1.0, no sandbox reuse + tmpfs: 1-1.5s overheads
  • Bazel 7.0.0, no sandbox reuse + tmpfs: 0.5-1s overheads

I can try to come up with a repro if that would be helpful?

@keertk
Member

keertk commented May 10, 2024

@bazel-io fork 7.2.0

@bazel-io removed the potential release blocker label on May 10, 2024
@keertk
Member

keertk commented May 10, 2024

@fmeum is this a hard blocker for 7.2?

@fmeum
Collaborator

fmeum commented May 10, 2024

@keertk Based on the slowdown reported in #22309 (comment) I would say yes, but I'll leave the decision to @oquenchil.

@oquenchil
Contributor

I can try to come up with a repro if that would be helpful?

Yes please. I don't understand why this would have regressed from 7.0.0 to 7.1.0 even if the culprit were e9022f6 (I still have to verify that). A repro would help me understand what's going on.

Regarding blocking 7.2.0: rather than blocking the release on a forward fix, I'd be fine with the culprit being reverted once it is found.

@oquenchil
Contributor

@DavidZbarsky-at can you provide the repro that shows the regression from 7.0.0 to 7.1.0?

@oquenchil added the team-Local-Exec (Issues and PRs for the Execution (Local) team) and P2 (We'll consider working on this in future; assignee optional) labels and removed the team-Performance and untriaged labels on May 14, 2024
@DavidZbarsky-at
Author

DavidZbarsky-at commented May 14, 2024

Check out https://github.com/DavidZbarsky-at/nodejs-repro
Repro with CC=false bazel test -k :all --cache_test_results=no

It looks like on Linux both 7.0.0 and 7.1.0 take around 700-900ms in sandbox setup with the tmpfs sandbox base, though I did see some actions taking 2s+ on 7.1.0.
On darwin I saw around 28s on 7.0.0 and 7s on 7.1.0, so even though there's a big improvement there (thanks!), there's still a ways to go for sandbox reuse to work well on our Macs.

BTW, on darwin I run with common --sandbox_base=/Volumes/RAMDisk, which I created with:

# ram://33554432 is a count of 512-byte sectors, i.e. roughly a 16 GiB RAM disk
diskutil erasevolume APFS 'RAMDisk' `hdiutil attach -nobrowse -nomount ram://33554432`
echo 'common --sandbox_base=/Volumes/RAMDisk' > ~/.bazelrc

But given the timings I am not sure it's actually helping :)

Some profiles I collected if you're curious:
darwin_profile_7.0.0.gz
darwin_profile_7.1.0.gz
linux_profile_7.0.0_no_reuse.gz
linux_profile_7.0.0.gz
linux_profile_7.1.0_no_reuse.gz
linux_profile_7.1.0.gz

@oquenchil
Contributor

Hi David, so am I understanding correctly that you aren't seeing a regression in 7.1.0? I will certainly use your example to improve this further but we need to clarify if 7.1.0 introduced a regression with respect to 7.0.0.

@dzbarsky

We chatted over Slack: we experience a 1x-1.5x regression in sandbox creation time, but it is offset by the enablement of async sandbox deletion. 7.1.0 is faster for us than 7.0.0 with non-async sandbox deletion, so it's not a regression overall.

@oquenchil
Contributor

Hi @DavidZbarsky-at, here is the code for the in-memory stashes: https://github.com/oquenchil/bazel/tree/in_memory_reuse_sandbox_directory_stashes

You can build with USE_BAZEL_VERSION=7.1.1 bazelisk build //src:bazel

If you have time, please give it a try on Linux and macOS and let me know if you see an improvement.

@DavidZbarsky-at
Author

Ack, I am moving this week and will be mostly offline, but I'll give it a shot next week!

@Wyverald
Member

We chatted over Slack: we experience a 1x-1.5x regression in sandbox creation time, but it is offset by the enablement of async sandbox deletion. 7.1.0 is faster for us than 7.0.0 with non-async sandbox deletion, so it's not a regression overall.

Given this, I'm closing the 7.2.0 release blocker issue (#22319).

@DavidZbarsky-at
Author

@oquenchil I had a chance to give it a quick spin: on my repro, it takes the sandbox creation times from 7s to 200ms, which is amazing! Unfortunately I had issues with a genrule not being able to write files, and even when I switched it to write_file, I saw an issue with the default test.xml generation action, so it's a bit hard for me to run this in our more realistic setup. But this looks very promising, thank you!

@oquenchil
Contributor

Glad to hear there was an improvement. Let's fix the errors you are seeing then. Did you see these errors with the repro from https://github.com/DavidZbarsky-at/nodejs-repro? If you did, would you mind telling me the bazel commands you ran? I didn't see these errors myself.

If it wasn't with that repro, would it be possible to create an example with the errors you are seeing?

Thanks!

@DavidZbarsky-at
Author

Yes, it was the master branch of that repo, and it's the same build command I posted earlier. I wonder if my self-built bazel is weird somehow; I've never tried building it myself before. I'll try on a clean branch of bazel and see if it exhibits the same brokenness.

@oquenchil
Contributor

Alright, thanks. I will see if I can get it to reproduce. I think it's unlikely that your custom bazel would introduce any differences.

@oquenchil
Contributor

I forgot to ask, did you see these errors on Linux or macOS?

@DavidZbarsky-at
Author

It was on macOS.

@DavidZbarsky-at
Author

Here's the first error I saw:

touch: bazel-out/darwin_arm64-fastbuild/bin/out_37: Operation not permitted
ERROR: /Users/david.zbarsky/nodejs-repro/BUILD.bazel:61:12: Executing genrule //:gen_out_43 failed: (Exit 1): bash failed: error executing Genrule command (from target //:gen_out_43) /bin/bash -c 'source external/bazel_tools/tools/genrule/genrule-setup.sh; touch bazel-out/darwin_arm64-fastbuild/bin/out_43'

I retried using the HEAD commit you branched from, and saw the same, so I don't think your PR breaks anything for me:

(env-18.16.0) david.zbarsky@JF6FQ9PXP9 nodejs-repro % /Users/david.zbarsky/bazel/bazel-bin/src/bazel test -k //:all --cache_test_results=no --profile=/tmp/profile.gz
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
WARNING: Option 'experimental_action_cache_store_output_metadata' is deprecated: Use --action_cache_store_output_metadata instead
WARNING: Option 'experimental_action_cache_store_output_metadata' is deprecated: Use --action_cache_store_output_metadata instead
ERROR: /Users/david.zbarsky/nodejs-repro/BUILD.bazel:61:12: Executing genrule //:gen_out_49 failed: (Exit 1): bash failed: error executing Genrule command (from target //:gen_out_49) /bin/bash -c 'source external/bazel_tools/tools/genrule/genrule-setup.sh; touch bazel-out/darwin_arm64-fastbuild/bin/out_49'

@oquenchil
Contributor

Hi David, I tried the same code on macOS, both with what I uploaded and with the commit before, and I can't get it to fail; it always works.

Since I can't reproduce this, would you mind running a bisection yourself with bazelisk?

Assuming it works with Bazel 7.1.0, you can do the following:

CC=false bazelisk --bisect=release-7.1.0..d831214a6d3ed7ae6b4c14ce4735f023b1da6e3c test -k :all --cache_test_results=no

I haven't tried the bisection command myself; from the docs I believe it can take a release branch name and a commit for the end of the range, but if that doesn't work, the last commit for 7.1.0 is 9b39ccaa33069c9f5688bef477abcd75e4378f04.
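In that case, the same command with the branch name replaced by that commit would look like this (just the substitution described above):

CC=false bazelisk --bisect=9b39ccaa33069c9f5688bef477abcd75e4378f04..d831214a6d3ed7ae6b4c14ce4735f023b1da6e3c test -k :all --cache_test_results=no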

Thank you.

@DavidZbarsky-at
Author

--- Bisect Result

first bad commit is https://github.com/bazelbuild/bazel/commit/3fddc7f38ace43981d839ed4558b8a457caf41fb

cc @fmeum

@fmeum
Collaborator

fmeum commented May 25, 2024

@DavidZbarsky-at Could you test b0ed4ca? Pretty sure that fixes it.

@DavidZbarsky-at
Author

Yep, that commit fixes it, and the one before it (eebf5f97e33fa4abad128c029e5d254783d805ca) is broken. I think we should be all good now; I'm looking forward to getting these patches in!

@oquenchil
Contributor

Glad to hear. Will get it checked in this week. Do you need this checked in before you can confirm the performance benefits in a more realistic setup than the repro?

@DavidZbarsky-at
Author

@oquenchil Could you rebase your PR on top of main (or at least the commit @fmeum linked)? Then I'll be able to test it out. I tried to rebase it myself, but it doesn't apply cleanly.

@oquenchil
Contributor

It's on top of main now. I guarded the new code path behind a flag, so you will have to pass --experimental_inmemory_sandbox_stashes.
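For example (combining the repro command from the earlier comments with the new flag; the path to the custom-built binary is a placeholder for wherever you built it):

CC=false /path/to/bazel/bazel-bin/src/bazel test -k //:all --cache_test_results=no --experimental_inmemory_sandbox_stashes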

@DavidZbarsky-at
Author

I tried with that flag but am seeing the following:

FATAL: bazel crashed due to an internal error. Printing stack trace:
java.lang.RuntimeException: Unrecoverable error while evaluating node 'UnshareableActionLookupData{actionLookupKey=ConfiguredTargetKey{label=//client/core:_lib/timing.test.tsx, config=BuildConfigurationKey[b265fd2d9ba179f8e3b8860b151f420e31b1d820dd94bac605c5a46666478121]}, actionIndex=5}' (requested by nodes 'TestCompletionKey{configuredTargetKey=ConfiguredTargetKey{label=//client/core:_lib/timing.test.tsx, config=BuildConfigurationKey[b265fd2d9ba179f8e3b8860b151f420e31b1d820dd94bac605c5a46666478121]}, topLevelArtifactContext=com.google.devtools.build.lib.analysis.TopLevelArtifactContext@90904c3b, exclusiveTesting=false}')
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:557)
	at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:426)
	at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
Caused by: java.lang.IllegalStateException
	at com.google.common.base.Preconditions.checkState(Preconditions.java:496)
	at com.google.devtools.build.lib.sandbox.SandboxStash.updateStashContentsAfterRunfilesMove(SandboxStash.java:561)
	at com.google.devtools.build.lib.sandbox.SandboxStash.takeStashedSandboxInternal(SandboxStash.java:153)
	at com.google.devtools.build.lib.sandbox.SandboxStash.takeStashedSandbox(SandboxStash.java:112)
	at com.google.devtools.build.lib.sandbox.SymlinkedSandboxedSpawn.filterInputsAndDirsToCreate(SymlinkedSandboxedSpawn.java:91)
	at com.google.devtools.build.lib.sandbox.AbstractContainerizingSandboxedSpawn.createFileSystem(AbstractContainerizingSandboxedSpawn.java:135)
	at com.google.devtools.build.lib.sandbox.DarwinSandboxedSpawnRunner$1.createFileSystem(DarwinSandboxedSpawnRunner.java:270)
	at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.runSpawn(AbstractSandboxSpawnRunner.java:146)
	at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.exec(AbstractSandboxSpawnRunner.java:113)
	at com.google.devtools.build.lib.sandbox.SandboxModule$SandboxFallbackSpawnRunner.exec(SandboxModule.java:474)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:159)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:118)
	at com.google.devtools.build.lib.exec.SpawnStrategyResolver.exec(SpawnStrategyResolver.java:45)

I retried and got a similar-looking crash on a different node.

@DavidZbarsky-at
Author

If you want to add some logging, I'm happy to try again.

@oquenchil
Contributor

Hi David, sorry, and thanks for trying it. I don't understand why the check gets triggered. I added some logging that will show the two paths; that will help me understand why it is legitimate for them to be different, and then I will simply remove the precondition check. So please try again whenever you can. Thanks again.

@DavidZbarsky-at
Author

DavidZbarsky-at commented May 28, 2024

No problem at all, happy to iterate on this.

Caused by: java.lang.IllegalStateException: _main/bazel-out/darwin_arm64-fastbuild/bin/client/core/_core_test_typecheck_test.sh.runfiles : _main/bazel-out/darwin_arm64-fastbuild/bin/client/core/_models/timeline_view_metadata.test.tsx_/_models/timeline_view_metadata.test.tsx.runfiles
	at com.google.common.base.Preconditions.checkState(Preconditions.java:512)
	at com.google.devtools.build.lib.sandbox.SandboxStash.updateStashContentsAfterRunfilesMove(SandboxStash.java:561)

Note the mismatch: test.sh vs test.tsx.

@oquenchil
Contributor

Now it should work. It was a silly mistake where I didn't take into account the package paths for the runfiles directory when recycling the stash contents.

@DavidZbarsky-at
Author

Nice, the patch works great now. Check this out: without the flag, 15s sandbox reuse times:
[profile screenshot]

With the flag, 300ms sandbox reuse times:
[profile screenshot]

@oquenchil
Contributor

Wonderful news, thanks! I will write a test now and send it for review.
