
[WIP] Remote caching support #3971

Open · grossag wants to merge 4 commits into master

Conversation

Contributor

@grossag grossag commented Jul 13, 2021

This change introduces remote caching support to SCons. SCons currently supports caching built objects in directories, but remote caching adds support for sharing built objects across multiple machines. For example, at VMware our official builds push to the cache and all developers fetch from it.

The end result is that developers only build what they have changed locally. All other cacheable build actions are fetched from the remote cache server, dramatically speeding up builds.

One piece I introduced as part of this change is a new scheduler, a class named ParallelV2, which is used if remote caching is enabled or if a developer passes --use-scheduler-v2. Whereas the existing parallel scheduler waits on jobs when the job queue is full, the new scheduler attempts to continue scanning for new jobs whenever possible. In addition, it supports remote caching by draining the "pending remote cache fetches" queue just as it drains the "pending jobs" queue. I believe this scheduler is an alternative to #3386 that should have the same effect of keeping the number of in-flight jobs as high as possible. I introduced it as a separate class for now, but it could replace the Parallel class if we wanted.
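
A minimal sketch of that scheduling difference, with `results_queue` and `handle_result` as illustrative stand-ins rather than names from this PR or from SCons:

```python
import queue

# Existing Parallel-style leader: once the job queue is full it blocks here
# until some job thread reports a result.
#     result = results_queue.get()          # blocking wait
#
# ParallelV2-style leader (sketch): take whatever results are already
# available without blocking, then go back to scanning the Taskmaster.
def drain(results_queue, handle_result):
    while True:
        try:
            result = results_queue.get_nowait()   # never blocks
        except queue.Empty:
            return
        handle_result(result)
```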

The new RemoteCache class owns the job of fetching from and pushing to a Bazel remote cache server, or any other similar server that supports /ac/ and /cas/ GET and PUT requests using SHA-256 file names. See https://github.com/buchgr/bazel-remote for more details on the server. This class uses urllib3 for network requests (see the acknowledgement below that this violates SCons guidelines). I chose it because it has good support for concurrency across threads using its ConnectionPool class. If remote caching is requested and urllib3 is not present, SCons raises an exception. If remote caching is not requested, urllib3 is not required.
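
As a rough illustration of that protocol (this is not the PR's RemoteCache class; the host, port, and helper names are placeholders), pushing and fetching a blob with urllib3 could look like:

```python
import hashlib
from typing import Optional

import urllib3

# One pool shared by all threads; maxsize plays the role of --remote-cache-connections.
pool = urllib3.HTTPConnectionPool("cache.example.com", port=8080, maxsize=100)

def cas_push(data: bytes) -> str:
    """PUT a blob into content-addressable storage; its name is its SHA-256."""
    digest = hashlib.sha256(data).hexdigest()
    resp = pool.request("PUT", "/cas/" + digest, body=data)
    if resp.status not in (200, 201):
        raise RuntimeError("cache push failed: HTTP %d" % resp.status)
    return digest

def cas_fetch(digest: str) -> Optional[bytes]:
    """GET a blob by its SHA-256 digest; None signals a cache miss."""
    resp = pool.request("GET", "/cas/" + digest)
    return resp.data if resp.status == 200 else None
```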

The change to the scheduler is needed because cache fetches must be asynchronous due to network latency. If a request can take 200ms, we can't hang the main thread while waiting on it. We also can't do the requests inside job threads (like CacheDir does), because we want more active network requests than job threads, given that network requests are bound by network I/O, not disk I/O or CPU.
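
A minimal sketch of that separation, assuming a urllib3 pool like the one above; the queue names echo the description in this PR, but the worker code itself is illustrative:

```python
import queue
import threading

import urllib3

pool = urllib3.HTTPConnectionPool("cache.example.com", port=8080, maxsize=100)
fetch_request_queue = queue.Queue()    # (task, digest) pairs from the scheduler
fetch_response_queue = queue.Queue()   # (task, data-or-None) back to the scheduler

def fetch_worker():
    """Blocks only on network I/O, so these threads can outnumber job threads."""
    while True:
        task, digest = fetch_request_queue.get()
        resp = pool.request("GET", "/cas/" + digest)
        fetch_response_queue.put((task, resp.data if resp.status == 200 else None))

# Many more fetch threads than -j job threads, because each is I/O bound.
for _ in range(100):
    threading.Thread(target=fetch_worker, daemon=True).start()
```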

Given the aforementioned change to the scheduler, I don't see a way to make this into a Tool, which was a possibility mentioned on Discord a while back. That is, I wasn't able to find a good way to make this self-contained enough to be a Tool.

As part of implementing this functionality, the following new parameters are introduced:

  1. --remote-cache-fetch-enabled: Enables fetch of build output from the server
  2. --remote-cache-push-enabled: Enables push of build output to the server
  3. --remote-cache-url: Required if fetch or push is enabled
  4. --remote-cache-connections: Connection count (defaults to 100)

I understand that this patch violates the SCons guidelines in two main ways:

  1. The new functionality requires urllib3, while the guidelines say that we must not require any packages other than pywin32. I don't believe that I can redo this change without urllib3, so the [WIP] tag is about vetting the approach with the team to see whether this dependency is a fatal flaw.
  2. There is no documentation for the new functionality. If the approach is accepted, I would resolve this as part of removing the [WIP] tag.

Contributor Checklist:

  • I have created a new test or updated the unit tests to cover the new/changed functionality.
  • I have updated CHANGES.txt (and read the README.rst)
  • I have updated the appropriate documentation

Review thread on the following hunk:

```python
try:
    if ((remote_cache and remote_cache.fetch_enabled) or
            use_scheduler_v2):
        self.job = ParallelV2(taskmaster, num, stack_size, remote_cache)
```
Contributor

@dmoody256 dmoody256 Jul 21, 2021


Could this be isolated into a tool and duck-typed in when the tool is loaded? Then SCons would not be forced to take on the urllib3 dependency. Is the dependency needed if you are just using the scheduler aspect?

Something similar was done with the ninja tool: https://github.com/SCons/scons/blob/master/SCons/Tool/ninja/__init__.py#L382

It's a bit hacky, but these internal classes don't really have any other exposure.
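
For reference, the rough shape of that pattern (illustrative only, not the ninja tool's actual code; the module path and the imported ParallelV2 location are assumptions):

```python
# Hypothetical SCons/Tool/remote_cache/__init__.py sketch: the urllib3
# dependency only bites when this tool is actually loaded.
import SCons.Errors

def generate(env):
    try:
        import urllib3  # noqa: F401  -- required only by this tool
    except ImportError as e:
        raise SCons.Errors.UserError(
            "the remote_cache tool requires the urllib3 package") from e

    from remote_cache_impl import ParallelV2  # hypothetical module holding the PR's class
    import SCons.Job                          # scheduler module at the time of this PR
    SCons.Job.Parallel = ParallelV2           # duck-type the new scheduler in

def exists(env):
    return True
```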

Contributor Author


FWIW urllib3 isn't strictly required by SCons with my change; it is still only required if you enable remote caching. Otherwise, you can still run SCons without urllib3.

I am honestly not sure about the idea of putting the new scheduler in a tool. It may also depend on whether people want this ParallelV2 to replace Parallel as a new-and-improved scheduler.

@luzpaz

luzpaz commented Jun 11, 2023

Any traction on this?

@bdbaddog
Contributor

> Any traction on this?

Not yet. It's on the list, just requires a big chunk of time to review and likely modify before being able to pull into the repo.

@grossag
Contributor Author

grossag commented Jun 12, 2023

@bdbaddog I am going to hold off on doing any future updates or merges until you think there is a shot at finalizing a design and landing this. Let me know if you want me to start doing any of that, but I am fine holding off or even closing this if the team doesn't want it.

I think the biggest reason to not include remote caching in SCons is that SCons has no dependency enforcement functionality. We implemented dependency enforcement internally at VMware but it was done in such a VMware-specific way that I don't have any way to upstream it without significant support from the SCons team. I am open to finding a way to upstream it by providing the code we have written in VMware-specific layers, but it would take a decent amount of effort from VMware+SCons maintainers to adapt and integrate. That said, the benefit of this logic is that you can run SCons in a mode where it actually fails compilation if (1) a build action reads from a file that is not in its dependencies or side effects or (2) a build action writes to a file that is not in its targets or side effects. Right now SCons can easily be wrong about its dependencies.

Remote caching without dependency enforcement is risky: if SCons is missing a dependency (e.g. the scanner didn't find a .h file) and that dependency changes, we will continue to get remote cache hits when they should be cache misses. So in that example, if you change the .h file SCons doesn't know about, you'll still get cache hits when you shouldn't. This could cause a lot of confusion because the SCons build output is different from what you expect.

Notably, this is also the case with the existing SCons "cache dir" functionality, which is why we never used the cache dir functionality internally.
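
Purely to illustrate the kind of check being described (none of this exists in SCons; the traced read/write sets would have to come from a platform-specific tracer):

```python
def enforce(declared_inputs, declared_targets, side_effects, reads, writes):
    """Fail a build action that touched files outside its declared sets."""
    undeclared_reads = set(reads) - set(declared_inputs) - set(side_effects)
    undeclared_writes = set(writes) - set(declared_targets) - set(side_effects)
    if undeclared_reads or undeclared_writes:
        raise RuntimeError(
            "undeclared reads: %s; undeclared writes: %s"
            % (sorted(undeclared_reads), sorted(undeclared_writes)))
```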

@bdbaddog
Contributor

> @bdbaddog I am going to hold off on doing any future updates or merges until you think there is a shot at finalizing a design and landing this. […]

Dependency enforcement hasn't been an issue for most users, or at least not one that anyone has let the project know about.
In general if the scanner's missing things, it should get fixed.

That said, for some environments it's worth the cost to do so. (VMware being one of those environments, I'm guessing.. ;)

Does your dependency enforcement implementation work on Windows, Linux, and Mac, or only some of those? Does it require admin-type access on some of those platforms?

Probably worth moving any discussion of dependency enforcement to a discussion or the users mailing list, as IMHO it's not really a required part of this remote caching functionality.

Ideally you'd push whatever updates you have to this functionality to this PR.
Spending time reviewing outdated code would be a waste.

As an aside, it's likely the new parallel scheduler will become the default in the near future.

Realistically reviewing this code will take a solid day or two, ideally with access to you @grossag to answer questions. Any chance we can get you to join our Discord server for such consultation?

@grossag
Contributor Author

grossag commented Jul 12, 2023

Sorry for the delay in responding. I've been on the Discord for a while, although I don't pay real close attention. I'm on Discord as @ adamgross.

For the dependency enforcement, it depends on the platform:

  1. Windows uses tracker.exe from Visual Studio to wrap command outputs.
  2. Linux uses a kernel driver which will hopefully be open sourced at some point.
  3. Mac uses a wrapper script which invokes dtrace; the wrapper script needs a sudoers entry to work properly.

Dependency enforcement would be really hard to upstream, so I am fine not focusing on that. But you said "In general if the scanner's missing things, it should get fixed.", and the issue is that it's nearly impossible to figure out whether the scanner is really missing things without any guarantees from SCons.

I am open to updating the pull request if you think this is mostly unblocked and worth proceeding on. But I vaguely recall from our last conversation that you wanted to wait until some of the Task logic was refactored before this could be considered. I am also unsure whether it's a dealbreaker that this uses urllib3. I just don't want to burn a day updating this PR if it's not worth proceeding on. Let me know how you'd like me to proceed and I'll do my best to accommodate, thanks.

@bdbaddog
Contributor

bdbaddog commented Aug 3, 2023

What do we need to be able to test against a remote caching server?
How do we set one up, etc.?

@grossag
Contributor Author

grossag commented Mar 10, 2024

> What do we need to be able to test against a remote caching server?
> How do we set one up, etc.?

Sorry I never responded to your question, @bdbaddog. This PR is designed to use https://github.com/buchgr/bazel-remote with no additional modifications to that project. That repo's README has a "Docker" section which describes how to install and run it. It's effectively content-addressable storage, where the ac/ folder stores the action metadata and the cas/ folder stores the binaries. ac/ is addressed by a unique action signature and cas/ is addressed by the SHA-256 hash of the binary.
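
To make that layout concrete, a sketch under assumptions: the real ac/ entry is a serialized protobuf from the Bazel remote-caching protocol, so `parse_output_digests` is a hypothetical stand-in, and the host and port are placeholders.

```python
import urllib3

pool = urllib3.HTTPConnectionPool("localhost", port=8080)

def fetch_action_outputs(action_signature, parse_output_digests):
    """Look up an action in ac/, then pull each of its output blobs from cas/."""
    ac = pool.request("GET", "/ac/" + action_signature)
    if ac.status != 200:
        return None                               # action-level cache miss
    outputs = []
    for digest in parse_output_digests(ac.data):  # SHA-256 digests of the outputs
        blob = pool.request("GET", "/cas/" + digest)
        if blob.status != 200:
            return None                           # missing blob: treat as a miss
        outputs.append(blob.data)
    return outputs
```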

@bdbaddog
Contributor

> Sorry I never responded to your question, @bdbaddog. This PR is designed to use https://github.com/buchgr/bazel-remote with no additional modifications to that project. […]

Thanks! Any chance you'd be willing to resolve the conflicts in this PR?
I'd also like to see if there's a way to use Bazel's remote execution protocol with SCons as well.
Seems like leveraging all the work other projects do would make sense for SCons if it's workable.

@grossag
Contributor Author

grossag commented Mar 14, 2024

@bdbaddog : I was able to resolve all conflicts other than the big one: integrating the remote cache functionality into the NewParallel class. It makes sense for me to drop the new scheduler I implemented and fold its functionality into NewParallel, but the leader/follower pattern is throwing me off.

When remote caching is enabled, any time a task is pulled off the todo list (the `task = self.taskmaster.next_task()` line in SCons/Taskmaster/Job.py), we first need to dispatch it to the RemoteCache for evaluation as to whether it can be fulfilled with a cache hit. This is different from the existing CacheDir class, which can use a synchronous disk call.

In my PR this was implemented using a local variable, fetch_response_queue. Tasks are sent to the RemoteCache class and then, when they are confirmed as cache hits or misses, they end up as results in fetch_response_queue. The rest of the implementation was about managing two queues: fetch_response_queue (RemoteCache responses) and self.tp.resultsQueue (executions of tasks that were cache misses).

Can you let me know if you have any thoughts on how best to integrate asynchronous RemoteCache requests into the NewParallel scheduler?
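
One possible shape for that integration, as a sketch rather than a concrete proposal for NewParallel; `fetch_response_queue` and the results queue come from the description above, while `run_job` and `finish_task` are illustrative stand-ins:

```python
import queue

def leader_poll(fetch_response_queue, results_queue, run_job, finish_task):
    """One non-blocking pass by the leader thread over both queues."""
    # Completed remote-cache lookups: hits finish immediately, misses become jobs.
    while True:
        try:
            task, hit = fetch_response_queue.get_nowait()
        except queue.Empty:
            break
        if hit:
            finish_task(task)   # satisfied from the remote cache
        else:
            run_job(task)       # cache miss: hand it to a job thread
    # Results from job threads (the cache misses that were actually built).
    while True:
        try:
            finish_task(results_queue.get_nowait())
        except queue.Empty:
            break
```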

@bdbaddog
Contributor

@acmorrow - perhaps you can assist @grossag with NewParallel?

@acmorrow
Contributor

@bdbaddog and @grossag - Yes, definitely, but it'll be next week.
