Sandbox reuse for tests spends a while #22309
@DavidZbarsky-at Could you also test the performance with 7.0.0, which doesn't include e9022f6? Cc @oquenchil
I actually implemented storing the contents of a stashed sandbox in memory, but it uses noticeably more memory, so I didn't check that in. Regardless of what caused the regression here (which I should fix), I'd want to keep the […]. Have you tried the latter? Could you comment on whether this helps you? If you try with tmpfs, you could try passing […] at the same time.
Do you have a repro for what you are seeing in your build that I could use to debug?
@bazel-io flag
On 7.0.0 I was seeing around 600-900ms sandbox setup overhead. Note that we are using tmpfs sandbox base; without that it was around 20-30s per test.
Perhaps we can store a manifest file on disk? That would mean a single read vs a filesystem walk. I think you can stream the read + comparison if it's sorted?
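The streamed comparison of sorted manifests described here can be sketched with standard tools: `comm` does exactly this kind of single-pass diff over two sorted lists. A minimal illustration (file names and paths below are hypothetical, not Bazel's actual manifest format):

```shell
# Two sorted "manifest" files listing sandbox contents (paths are illustrative).
printf 'lib/a.js\nlib/b.js\nlib/c.js\n' > stash.manifest
printf 'lib/a.js\nlib/c.js\nlib/d.js\n' > new.manifest

# Entries only in the stash: stale files the reused sandbox must delete.
comm -23 stash.manifest new.manifest   # -> lib/b.js

# Entries only in the new manifest: files that still need to be created.
comm -13 stash.manifest new.manifest   # -> lib/d.js
```

Because both inputs are sorted, the merge runs in a single pass over each list, so diffing tens of thousands of entries avoids any per-file stat calls or a filesystem walk.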
I was hoping sandbox reuse would allow a further speedup on top of the tmpfs sandbox base (even with tmpfs, syscalls aren't free); it feels like the worker should be able to diff a list of 60K items fairly quickly :) So to summarize, here is the set of timings I am seeing:
I can try to come up with a repro if that would be helpful?
@bazel-io fork 7.2.0
@fmeum is this a hard blocker for 7.2?
@keertk Based on the slowdown reported in #22309 (comment) I would say yes, but I'll leave the decision to @oquenchil.
Yes please. I don't understand why this would have regressed from 7.0.0 to 7.1.0 even if the culprit were e9022f6 (still have to verify). A repro would help in understanding what's going on. Regarding blocking 7.2.0, I'd be fine with reverting the culprit once found rather than blocking the release on a forward fix.
@DavidZbarsky-at can you provide the repro that shows the regression from 7.0.0 to 7.1.0?
Check out https://github.com/DavidZbarsky-at/nodejs-repro. It looks like on Linux both 7.0.0 and 7.1.0 take around 700-900ms in sandbox setup with the tmpfs sandbox base, though I did see some actions taking 2s+ on 7.1.0. BTW on darwin I run with […]
But given the timings I am not sure it's actually helping :) Some profiles I collected if you're curious:
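For context, the tmpfs sandbox base mentioned above is typically enabled with flags along these lines (values illustrative; /dev/shm is a tmpfs mount on most Linux distributions, while macOS needs a separately mounted RAM disk):

```shell
# Keep sandbox scratch trees in memory and reuse stashed sandbox directories.
bazel test //... \
  --sandbox_base=/dev/shm \
  --reuse_sandbox_directories
```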
Hi David, so am I understanding correctly that you aren't seeing a regression in 7.1.0? I will certainly use your example to improve this further, but we need to clarify if 7.1.0 introduced a regression with respect to 7.0.0.
We chatted over Slack: we experience a 1x-1.5x regression in sandbox creation time, but it is offset by the enablement of async sandbox deletion. 7.1.0 is faster for us than 7.0.0 with non-async sandbox deletion, so it's not a regression overall.
Hi @DavidZbarsky-at, here I have the code for the in-memory stashes: https://github.com/oquenchil/bazel/tree/in_memory_reuse_sandbox_directory_stashes. You can build with […]. Please, if you have time, give it a try on Linux and macOS and let me know if you see an improvement.
Ack, I am moving this week and will be mostly offline, but I'll give it a shot next week!
Given this, I'm closing the 7.2.0 release blocker issue (#22319).
@oquenchil I had a chance to give it a quick spin - on my repro, it takes the sandbox creation times from 7s to 200ms, this is amazing! Unfortunately I had issues with genrule not being able to write files, and even when I fixed that to […]
Glad to hear there was an improvement. Let's fix the errors you are seeing then. Did you see these errors with the repro from https://github.com/DavidZbarsky-at/nodejs-repro? If you did, would you mind telling me the bazel commands you ran? I didn't see these errors myself. If it wasn't with that repro, would it be possible to create an example with the errors you are seeing? Thanks!
Yes, it was the master branch of that repo, and it's the same build command I posted earlier. I wonder if my self-built bazel is weird somehow - I've never tried building it myself before. I'll try on a clean branch of bazel and see if it exhibits the same brokenness.
Alright, thanks. I will see if I can get it to reproduce. I think it's unlikely that your custom bazel would introduce any differences.
I forgot to ask, did you see these errors on Linux or macOS?
It was on macOS
Here's the first error I saw:
I retried using the HEAD commit you branched from, and saw the same, so I don't think your PR breaks anything for me:
Hi David, I tried with the same code on macOS, with what I uploaded and the commit before, and I can't get it to fail; it always works. Since I can't reproduce this, would you mind running a bisection yourself with bazelisk? Assuming it works with bazel […]
I haven't tried the bisection command; from the docs I believe it can take a release branch name and a commit for the end of the range, but if it doesn't work, the last commit for […]. Thank you.
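For reference, Bazelisk's bisect mode takes a good..bad range of commits (or release versions) and binary-searches the commits in between by running the given command at each candidate (the target label below is hypothetical):

```shell
# Binary-search Bazel commits between a known-good and a known-bad point.
bazelisk --bisect=<good commit>..<bad commit> build //some:target
```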
cc @fmeum
@DavidZbarsky-at Could you test b0ed4ca? Pretty sure that fixes it.
Yep, that commit fixes it, and the one before it ([…]).
Glad to hear. Will get it checked in this week. Do you need this checked in before you can confirm the performance benefits in a more realistic setup than the repro?
@oquenchil Could you rebase your PR on top of main (or at least the commit @fmeum linked)? Then I'll be able to test it out. I tried to rebase it myself but it doesn't apply cleanly.
It's on top of main now. I guarded the new code path behind a flag, so you will have to pass […].
I tried with that flag but am seeing the following:
I retried and got a similar-looking crash on a different node. |
If you want to add some logging, I'm happy to try again.
Hi David, sorry, and thanks for trying it. I don't understand why it gets triggered. I added some logging that will show the two paths; that will help me understand whether it is legitimate for them to be different, in which case I will simply remove the precondition check. So please try it whenever you can. Thanks again.
No problem at all, happy to iterate on this.
note the test.sh vs test.tsx |
Now it should work. It was a silly mistake where I didn't take into account the package paths for the runfiles directory when recycling the stash contents.
Wonderful news, thanks! I will write a test now and send it for review.
Description of the bug:
These tests should have pretty much the same sandbox, but it looks like we are spending 1-2s per test on filtering the existing files.
Would it be possible for the stashed sandbox to store a list of its contents, so we can do this more efficiently?
Which category does this issue belong to?
No response
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
No response
Which operating system are you running Bazel on?
Linux
What is the output of bazel info release?
release 7.1.0
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse HEAD?
No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
Related to e9022f6?
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
No response