
Implement automatic garbage collection for the disk cache #5139

Open
buchgr opened this issue May 2, 2018 · 68 comments
Labels: P2 We'll consider working on this in future. (Assignee optional) · team-Remote-Exec Issues and PRs for the Execution (Remote) team · type: feature request

@buchgr (Contributor) commented May 2, 2018

Break out from #4870.

Bazel can use a local directory as a remote cache via the --disk_cache flag.
We want it to also be able to automatically clean the cache once a size threshold
has been reached. It probably makes sense to evict entries based on least-recently-used
(LRU) semantics.
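
For illustration only, here is a minimal sketch of what such LRU-style trimming could look like, assuming the cache is a flat directory of content-addressed files and using the modification time as the recency signal (both are assumptions for the example, not a statement about Bazel's actual layout or implementation):

  // Hypothetical sketch: trim a --disk_cache directory to a size threshold by
  // deleting the least-recently-modified files first. Not Bazel code.
  import java.io.IOException;
  import java.io.UncheckedIOException;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.nio.file.Paths;
  import java.nio.file.attribute.FileTime;
  import java.util.Comparator;
  import java.util.List;
  import java.util.stream.Collectors;
  import java.util.stream.Stream;

  public class DiskCacheTrimmer {
    /** Deletes least-recently-used files until the cache is at or below maxBytes. */
    static void trimToSize(Path cacheRoot, long maxBytes) throws IOException {
      List<Path> files;
      try (Stream<Path> walk = Files.walk(cacheRoot)) {
        files = walk.filter(Files::isRegularFile)
            // Oldest modification time first; mtime stands in for "least recently used".
            .sorted(Comparator.comparing(DiskCacheTrimmer::lastModified))
            .collect(Collectors.toList());
      }
      long total = 0;
      for (Path f : files) {
        total += Files.size(f);
      }
      // Delete from the oldest end until the total drops below the threshold.
      for (Path f : files) {
        if (total <= maxBytes) {
          break;
        }
        long size = Files.size(f);
        Files.deleteIfExists(f);
        total -= size;
      }
    }

    private static FileTime lastModified(Path p) {
      try {
        return Files.getLastModifiedTime(p);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }

    public static void main(String[] args) throws IOException {
      trimToSize(Paths.get(args[0]), Long.parseLong(args[1]));
    }
  }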

@RNabel would you want to work on this?

@davido (Contributor) commented May 2, 2018

I will look into implementing this, unless someone else is faster than me.

@RNabel (Contributor) commented May 2, 2018

I don't have time to work on this right now. @davido, if you don't get around to working on this in the next 2-3 weeks, I'm happy to pick this up.

lucamilanesio pushed a commit to GerritCodeReview/gerrit that referenced this issue Jun 4, 2018
During the migration from Buck to Bazel we lost action cache activation by
default. For one, a local action cache wasn't implemented in Bazel; for
another, there was no option to specify the HOME directory. I fixed both
problems, and starting with Bazel 0.14.0 both features are included in the
released Bazel version: [1], [2].

There is still one unimplemented option: limiting the cache directory to a
maximum size: [3]. But for now the advantage of activating the caches by
default far outweighs the disadvantage of unbounded growth of the cache
directory beyond an imaginary maximum size of, say, 10 GB. In the meantime we
add a warning to watch the size of the cache directory and to periodically
clean it:

  $ rm -rf ~/.gerritcodereview/bazel-cache/cas/*

[1] https://bazel-review.googlesource.com/#/c/bazel/+/16810
[2] bazelbuild/bazel#4852
[3] bazelbuild/bazel#5139

Change-Id: I42e8f6fb9770a5976751ffef286c0fe80b75cf93
@daghub commented Sep 11, 2018

Hi, I would also very much like to see this feature implemented! @davido , @RNabel did you get anywhere with your experiments?

@RNabel (Contributor) commented Sep 11, 2018

Not finished, but I had an initial stab: RNabel/bazel@baseline-0.16.1...RNabel:feature/5139-implement-disk-cache-size (this is mostly plumbing and figuring out where to put the logic; it definitely doesn't work yet).

I figured the simplest solution is an LRU relying on the file system for access times and modification times. Unfortunately, access times are not available on Windows through Bazel's file system abstraction. One alternative would be a simple database, but that feels like overkill here. @davido, what do you think is the best solution here? I'm also happy to write up a brief design doc for discussion.
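
For what it's worth, outside Bazel's file system abstraction the JDK's BasicFileAttributes exposes both timestamps (how reliably the access time is updated depends on the OS and mount options). A minimal sketch of a "best available recency" helper, taking whichever of atime/mtime is newer; the class name and the fallback policy are assumptions for the example:

  import java.io.IOException;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.nio.file.Paths;
  import java.nio.file.attribute.BasicFileAttributes;
  import java.nio.file.attribute.FileTime;

  public class CacheEntryRecency {
    /** Returns the best available "last used" timestamp for a cache entry. */
    static FileTime lastUsed(Path entry) throws IOException {
      BasicFileAttributes attrs = Files.readAttributes(entry, BasicFileAttributes.class);
      FileTime atime = attrs.lastAccessTime();
      FileTime mtime = attrs.lastModifiedTime();
      // On filesystems mounted with noatime, or where access times are not
      // surfaced at all, atime can be stale, so take the newer of the two.
      return atime.compareTo(mtime) >= 0 ? atime : mtime;
    }

    public static void main(String[] args) throws IOException {
      System.out.println(lastUsed(Paths.get(args[0])));
    }
  }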

@buchgr (Contributor, Author) commented Sep 11, 2018

What do you guys think about just running a local proxy service that has this functionality already implemented? For example: https://github.com/Asana/bazels3cache or https://github.com/buchgr/bazel-remote? One could then point Bazel to it using --remote_http_cache=http://localhost:XXX. We could even think about Bazel automatically launching such a service if it is not running already.

@ittaiz (Member) commented Sep 11, 2018 via email

@aehlig (Contributor) commented Sep 11, 2018

I think @aehlig solved this problem for the repository cache. Maybe you can borrow his implementation here as well.

@ittaiz, what solution are you talking about? What we have so far for the repository cache is that the file gets touched on every cache hit (see e0d8035), so that deleting the oldest files would be a cleanup; the latter, however, is not yet implemented, for lack of time.
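
To make the touch-on-hit idea concrete, a minimal sketch assuming plain java.nio (not the actual repository cache code from e0d8035):

  import java.io.IOException;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.nio.file.attribute.FileTime;
  import java.time.Instant;

  public class TouchOnHit {
    /** Marks a cache entry as recently used by bumping its modification time,
     *  so a later cleanup can simply delete the entries with the oldest mtimes. */
    static void markUsed(Path entry) throws IOException {
      Files.setLastModifiedTime(entry, FileTime.from(Instant.now()));
    }
  }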

For the repository cache, it is also a slightly different story, as cleanup should always be manual; upstream might have disappeared, so the cache might be the last copy of the archive available to the user, and we don't want to remove that on the fly.

@buchgr (Contributor, Author) commented Sep 11, 2018

outsourcing it isn’t the right direction

I would be interested to learn more about why you think so.

@ittaiz (Member) commented Sep 11, 2018 via email

@buchgr (Contributor, Author) commented Sep 13, 2018

@ittaiz
the disk cache is indeed a leaky abstraction that was mainly added because
it was easy to do so. I agree that if Bazel should have a disk cache in the long
term, then it should also support read/write through to a remote cache and
garbage collection.

However, I am not convinced that Bazel should have a disk cache built in but
instead this functionality could also be handled by another program running
locally. So I am trying to better understand why this should be part of Bazel.
Please note that there are no immediate plans to remove it and we will not do
so without a design doc of an alternative. I am mainly interested in kicking off
a discussion.

@buchgr buchgr self-assigned this Sep 13, 2018
@ittaiz (Member) commented Sep 13, 2018 via email

@buchgr (Contributor, Author) commented Sep 14, 2018

I think that users don’t want to operate many different tools and servers locally.

I partly agree. I'd argue that in many companies that would change, as you would typically have an IT department configuring workstations and laptops.

The main disadvantage I see is that it sounds like you’re offering a cleaner design at the user’s expense.

I think that also depends. I'd say that if one only wants to use the local disk cache, then I agree that providing two flags is as frictionless as it gets. However, I think it's possible that most disk cache users will also want to do remote caching/execution, and for them this might not be noteworthy additional work.

So I think there are two possible future scenarios for the disk cache:

  1. Add garbage collection to the disk cache and be done with it.
  2. Add garbage collection, remote read fallback, remote write and async remote writes.

I think 1) makes sense if we think that the disk cache will be a standalone feature that a lot of people will find useful on its own, and if so I think it's worth the effort to implement this in Bazel. For 2) I am not so sure, as I can see several challenges that might be better solved in a separate process:

  • Async remote writes are the idea that Bazel writes blobs to the disk cache and then asynchronously (to the build) writes them to the remote cache, thereby removing the upload time from the build's critical path. This is difficult to implement in Bazel, partly because there are no guarantees about the lifetime of the server process and partly because of lots of edge cases (a rough sketch of the idea follows this list).
  • We might want to move authentication for remote caching/execution out of Bazel in the long term. We currently support Google Cloud authentication, we are about to add AWS, and if we are successful I think it's likely that we will need to add many more in the future; these authentication SDKs are quite large and increase the binary size. So we might end up with a separate proxy process anyway.
  • It's unconventional and potentially insecure that one has to pass authentication flags and secrets to Bazel itself. It seems to me that a separate process running as a different user that hides the authentication secrets from the rest of the system using OS security mechanisms is a better idea.
  • Once we implement a virtual remote filesystem in Bazel (planned for Q4), Bazel will no longer need to download cached artifacts, and the combination of a local disk cache and remote cache might become less attractive because downloads should no longer be a bottleneck (if it works out as expected).
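
To make the async-write bullet above more concrete, here is a rough, hypothetical sketch; RemoteUploader and all names are invented for illustration, and this is not Bazel's actual cache code:

  import java.io.IOException;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;

  public class AsyncCacheWriter implements AutoCloseable {
    /** Invented interface standing in for whatever performs the remote upload. */
    interface RemoteUploader {
      void upload(String digest, byte[] contents) throws IOException;
    }

    private final Path diskCacheRoot;
    private final RemoteUploader uploader;
    private final ExecutorService uploadPool = Executors.newFixedThreadPool(4);

    AsyncCacheWriter(Path diskCacheRoot, RemoteUploader uploader) {
      this.diskCacheRoot = diskCacheRoot;
      this.uploader = uploader;
    }

    /** Writes the blob to the disk cache on the critical path, then schedules
     *  the remote upload in the background. */
    void put(String digest, byte[] contents) throws IOException {
      Path local = diskCacheRoot.resolve(digest);
      Files.createDirectories(local.getParent());
      Files.write(local, contents);
      uploadPool.submit(() -> {
        try {
          uploader.upload(digest, contents);
        } catch (IOException e) {
          // A real implementation has to decide how to retry or surface this;
          // these are the edge cases mentioned above.
          System.err.println("async upload failed for " + digest + ": " + e);
        }
      });
    }

    @Override
    public void close() throws InterruptedException {
      // The server-lifetime caveat shows up here: pending uploads are lost if
      // the process exits before the pool drains.
      uploadPool.shutdown();
      uploadPool.awaitTermination(1, TimeUnit.MINUTES);
    }
  }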

So a standard local caching proxy that runs as a separate process, can be operated independently, and/or can be launched automatically by Bazel for improved usability might be an idea worth thinking about.

@buchgr buchgr added type: feature request team-Remote-Exec Issues and PRs for the Execution (Remote) team P2 We'll consider working on this in future. (Assignee optional) and removed category: http caching labels Jan 16, 2019
@bayareabear commented:

Is there any plan to roll out the "virtual remote filesystem" soon? I am interested to learn more about it and can help if needed. We are hitting a network speed bottleneck.

@buchgr (Contributor, Author) commented Jan 24, 2019

yep, please follow #6862

@thekyz commented Feb 11, 2020

Any plans for implementing the max size feature or a garbage collector for the local cache?

@brentleyjones (Contributor) commented:

This is a much-needed feature in order to use Remote Builds without the Bytes, since naively cleaning up the disk cache results in build failures.

@tjgq (Contributor) commented Jan 2, 2024

Any updates on this?

I'm working on this now. I expect this feature to ship in 7.1 or (more likely) 7.2.

@nikhilkalige (Contributor) commented:

I recently made a change in the DiskCache class to write all the files that Bazel uses to a pipe, and then have a separate process clean them up; that was working well for my use case.
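
For reference, a hypothetical sketch of the Bazel-side half of such an approach (the class, the log destination, and the wiring are illustrative assumptions, not the actual change):

  import java.io.IOException;
  import java.nio.charset.StandardCharsets;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.nio.file.StandardOpenOption;

  public class CacheAccessLogger {
    private final Path accessLog; // e.g. a named pipe or a plain append-only file

    CacheAccessLogger(Path accessLog) {
      this.accessLog = accessLog;
    }

    /** Records that a cache entry was read or written, one path per line, so an
     *  external cleaner can decide which entries are still in use. */
    synchronized void recordAccess(Path cacheEntry) throws IOException {
      Files.write(
          accessLog,
          (cacheEntry.toString() + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
          StandardOpenOption.CREATE,
          StandardOpenOption.APPEND);
    }
  }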

@tjgq tjgq changed the title local caching: implement --disk_cache_size=GiB Implement automatic garbage collection for the disk cache Jan 17, 2024
@tjgq (Contributor) commented Jan 22, 2024

I've published a design doc for disk cache garbage collection at https://docs.google.com/document/d/16aGm4u9EgW199M1WjjbVbVCJSfa8RApWPcKnZYnVbrI. Comments are welcome!

@tjgq (Contributor) commented Mar 13, 2024

The design doc at https://docs.google.com/document/d/16aGm4u9EgW199M1WjjbVbVCJSfa8RApWPcKnZYnVbrI/edit has been significantly reworked. The most notable change is that we're switching to an "online" garbage collection strategy, in response to feedback that "offline" collection would not satisfy the need to keep the cache under the target size at all times, including during builds.
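
As a toy illustration of the difference (not the design from the doc): with an online strategy, eviction is part of every write, so the cache can never exceed the limit even mid-build. All names and data structures below are invented for the example:

  import java.util.Iterator;
  import java.util.LinkedHashMap;
  import java.util.Map;

  public class OnlineLruCache {
    private final long maxBytes;
    private long usedBytes = 0;
    // accessOrder = true: iteration runs from least- to most-recently used.
    private final LinkedHashMap<String, byte[]> entries =
        new LinkedHashMap<>(16, 0.75f, true);

    public OnlineLruCache(long maxBytes) {
      this.maxBytes = maxBytes;
    }

    /** Stores a blob and immediately evicts old entries if the limit is exceeded. */
    public synchronized void put(String digest, byte[] contents) {
      byte[] previous = entries.put(digest, contents);
      usedBytes += contents.length - (previous == null ? 0 : previous.length);
      evictIfNeeded();
    }

    /** Reads a blob; the access also refreshes its recency. */
    public synchronized byte[] get(String digest) {
      return entries.get(digest);
    }

    private void evictIfNeeded() {
      Iterator<Map.Entry<String, byte[]>> it = entries.entrySet().iterator();
      while (usedBytes > maxBytes && it.hasNext()) {
        Map.Entry<String, byte[]> oldest = it.next();
        usedBytes -= oldest.getValue().length;
        it.remove();
      }
    }
  }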

Although no code has been submitted yet, I've been working on a prototype for this new strategy and I'm still expecting to get it productionized and submitted in time for the 7.2.0 release.

@dkashyn-sfdc commented:

@tjgq are there any updates on this? It is still not listed as a 7.2.0 deliverable according to https://github.com/bazelbuild/bazel/milestone/68

copybara-service bot pushed a commit that referenced this issue Apr 25, 2024
This will be used by the implementation of garbage collection for the disk cache, as discussed in #5139 and the linked design doc. I judge this to be preferred over https://github.com/xerial/sqlite-jdbc for the following reasons:

1. It's a much smaller dependency.
2. The JDBC API is too generic and becomes awkward to use when dealing with the peculiarities of SQLite.
3. We can (more easily) compile it from source for all host platforms, including the BSDs.

PiperOrigin-RevId: 628046749
Change-Id: I17bd0547876df460f48af24944d3f7327069375f
@meisterT meisterT added this to the 7.2.0 release blockers milestone Apr 26, 2024
@meisterT (Member) commented:

@dkashyn-sfdc I added it to the milestone just now; there is some progress and we expect to land a first usable version by the 7.2 release at the latest.

@iancha1992 iancha1992 removed this from the 7.2.0 release blockers milestone Apr 26, 2024
@iancha1992 (Member) commented:

@bazel-io fork 7.2.0

tjgq added a commit to tjgq/bazel that referenced this issue Apr 26, 2024
github-merge-queue bot pushed a commit that referenced this issue Apr 26, 2024
@Wyverald (Member) commented:

How far are we from getting this done? rc1 is scheduled for next Monday, but judging by the urgency and remaining work, we can push it out a bit.

Kila2 pushed a commit to Kila2/bazel that referenced this issue May 13, 2024
@tjgq (Contributor) commented May 13, 2024

Unfortunately, I ran into some difficulties and this is not ready yet. I'm aiming to build up to a minimally useful implementation within the next few days.

If this FR is the only reason we would delay rc1, it would be fine to get it out today, under the (not so unreasonable?) assumption that there will be an rc2 that the remaining changes can still make it into.

@peaceiris commented:

Here is my workaround.

find /path/to/bazel-cache -type f -amin +1440 -delete 2>/dev/null || true

The command searches for files in the /path/to/bazel-cache directory that have not been accessed in the last 1440 minutes (24 hours) and deletes them.

@dkashyn-sfdc commented:

It won't help you to "trim to size" if you need to fit the cache below a certain threshold of disk space.

@ceejatec commented:

@peaceiris I don't know for certain, but our experience strongly suggests that any process that deletes files directly from the cache like that is doomed to cause strange failures sooner or later. It may depend on specifically what tasks Bazel runs; we've found that code coverage jobs are especially brittle.

@peaceiris commented:

Yes, I think that's right. So we're looking forward to this feature being implemented in Bazel.

@Wyverald (Member) commented:

Unfortunately, we won't have enough time for this in 7.2.0; postponing to 7.3.0.
