---
created: 2021-02-10
last updated: 2021-04-28
status: To be reviewed
reviewers:
  - coeuvre
  - philwo
  - alexjski
  - janakdr
title: "Remote Output Service: place bazel-out/ on a FUSE file system"
authors:
  - EdSchouten
---

Abstract

This document describes an extension to Bazel, allowing it to host its bazel-out/ directory on a FUSE file system. The goal of this change is to reduce the amount of network traffic that Bazel generates when remote execution is used. The end result is similar to what's offered by "Remote Builds without the Bytes", with the difference that outputs remain accessible.

Background

Bazel can use the Remote Execution protocol to offload the execution of build actions to a remote build cluster. When executing actions remotely, Bazel performs the following three tasks successively:

  1. Uploading inputs: Bazel uploads individual files and Directory, Command and Action messages into the Content Addressable Storage (CAS).
  2. Execution: Bazel requests that the build cluster execute the Action that was uploaded into the CAS. The build cluster returns an ActionResult.
  3. Downloading outputs: Bazel downloads all files referenced by the ActionResult. In case of directory outputs, Bazel also downloads all files referenced by Tree objects referenced by the ActionResult.

In the common case, the first two tasks consume little bandwidth. By using an RPC method named FindMissingBlobs(), Bazel can determine which objects already exist in the CAS, allowing it to skip unnecessary uploads. Assuming the retention rate of the remote build cluster is adequate, almost all network bandwidth generated by Bazel is therefore caused by the third task. In 2018, Jakob Buchgraber measured that when building Bazel itself using remote execution, at least 95.4% of network bandwidth was caused by the downloading of outputs.
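
For reference, the relevant part of REv2 looks as follows (abridged from the ContentAddressableStorage definitions in remote_execution.proto): Bazel sends the digests of the objects it intends to upload, and the server replies with the subset that is actually missing.

service ContentAddressableStorage {
  // Determine which of the referenced blobs are absent from the CAS, so
  // that only those need to be uploaded.
  rpc FindMissingBlobs(FindMissingBlobsRequest)
      returns (FindMissingBlobsResponse);
}

message FindMissingBlobsRequest {
  // The remote execution instance to query.
  string instance_name = 1;
  // Digests of the blobs the client intends to upload.
  repeated Digest blob_digests = 2;
}

message FindMissingBlobsResponse {
  // The subset of blob_digests that is not present in the CAS.
  repeated Digest missing_blob_digests = 2;
}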

The reasons Bazel downloads output files are as follows:

  • To allow the user to inspect and use the results of a build, e.g. to run the software that was built.
  • Bazel supports mixed local and remote execution. Targets may either be annotated explicitly to specify where they are run (using the "no-remote" tag), or features such as the dynamic spawn scheduler may be used to let Bazel automatically decide where actions are run. If a locally executing action depends on files that were built remotely, Bazel needs to download those files to satisfy the execution requirements of the locally executing action.
  • To serve as Bazel's bookkeeping across server restarts. A fresh Bazel server in an existing workspace can load its action cache from disk and then skip executing actions if the output files on disk match the records in the action cache.
  • To guarantee forward progress of the build, even if objects were to disappear from the remote CAS. By having the files present locally, Bazel can reupload them if they were to disappear.

Based on these measurements, Jakob added a series of command line flags (--remote_download_minimal, --remote_download_toplevel, etc.) that permit users to skip the downloading step when possible. When --remote_download_minimal is enabled, outputs are only downloaded if subsequent actions depend on them, or if bazel run is invoked and the output is an execution dependency. During every invocation, Bazel constructs an additional ActionInputMap that stores the metadata of all files that are only present remotely, allowing input roots to be constructed without having the files present locally. Various remote CAS implementations have been improved to guarantee the retention of recently accessed objects, thereby ensuring forward progress.
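
Concretely, the metadata kept for a remote-only file is little more than its path and its REv2 Digest, which Bazel already receives through the OutputFile entries of the ActionResult (abridged below); the contents themselves remain in the remote CAS.

message Digest {
  // Lowercase hexadecimal hash of the contents (SHA-256 by default).
  string hash = 1;
  // Size of the contents in bytes.
  int64 size_bytes = 2;
}

message OutputFile {
  // Path of the output file, relative to the action's working directory.
  string path = 1;
  // Digest through which the contents can be fetched from the CAS.
  Digest digest = 2;
  bool is_executable = 4;
  // Remaining fields omitted.
}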

Though --remote_download_minimal has helped many users of Bazel's remote execution to scale, some inherent downsides remain:

  • Bazel's memory usage has increased significantly. The ActionInputMap that gets created during a build can easily consume multiple gigabytes of memory for a sufficiently large project.
  • Incremental builds after a Bazel server restart are significantly slower. Because the ActionInputMap does not outlive the Bazel server, a new server has no output metadata, and thus has to regenerate all outputs from scratch via a clean build.
  • Users now need to make a conscious choice whether they want to download output files or not. This makes it harder to do ad hoc exploration.

Proposal

In a nutshell, the proposal is to optionally let Bazel create a bazel-out/ directory with the same layout as a plain build (without --remote_download_minimal) would generate, but to delay the download of remote files until their contents are actually read (either by subsequent no-remote actions, or when the user accesses bazel-out/ manually). This removes the need for the additional ActionInputMap and once again gives the user the ability to explore build outputs on demand.

One challenge with this approach is that it relies on special operating system features to create such lazy-loading files. It can't be built on top of the plain POSIX API. Unfortunately, no standards seem to exist in this space:

  • Linux provides FUSE, which can be used to create a mountpoint for which all operations are forwarded through a character device to a userspace process.
  • For macOS there is macFUSE (also known as OSXFUSE). It implements a slightly older version of the FUSE protocol, with minor macOS-specific protocol extensions in place. In terms of robustness and performance, it fares worse than Linux's FUSE implementation. Though still hosted on GitHub, this project is no longer Open Source. The last Open Source version no longer runs on modern versions of macOS.
  • Windows provides an API called Projected File System (ProjFS) that makes it possible to instantiate files underneath a virtualization root whose contents are backed by a promise.
  • On operating systems that don't offer the features above, but do support networked file systems (e.g., NFS, SMB), it may be desirable to let the system mount a virtual volume that is served by a local userspace process.

Some of these APIs also require administrative privileges to work properly. Though FUSE can be used on most Linux systems without special privileges through the setuid fusermount utility, elevated privileges are still required to make it work inside a Docker container.

Because of these limitations, it makes sense to let a daemon other than Bazel manage these directories. To manage the lifecycle of these bazel-out/ directories and to request the creation of lazy-loading files, Bazel may communicate with this daemon over gRPC. This approach has a couple of advantages:

  • It's a proven strategy. Google already has such a system internally for Blaze. Their daemon is called objfsd and uses FUSE.
  • By having a well-defined and succinct protocol, others can easily design their own daemons that use custom protocols or have custom storage policies (e.g., snapshotting and preserving the results of builds). These daemons may have a release cadence that is independent of Bazel.
  • A single daemon may be run on a system, caching build results for multiple users and multiple projects. The lifetime of the daemon may be managed separately from the Bazel server, which may be useful for shared CI setups.

The question then becomes what the protocol between Bazel and the daemon should look like. Google already has a schema for this internally. Unfortunately, that protocol is not a perfect match:

  • It doesn't support REv2 specific concepts such as instance names, user-configurable hashing functions, etc.
  • It makes no use of REv2 Protobuf messages, even though there are many that seem useful in literal form (Digest, OutputFile, OutputDirectory, OutputSymlink).
  • The semantics around preserving files seem different. Google's internal protocol supports explicitly marking files that need to be retained, while REv2 provides no such mechanism. Instead, clients are assumed to call FindMissingBlobs() to instruct a build cluster to keep files present, at least for the remainder of a build.

Because of that, Bazel PR #12823 provides an experimental implementation using a custom gRPC protocol that looks as follows:

service RemoteOutputService {
  // Remove a bazel-out/ directory.
  rpc Clean(CleanRequest) returns (google.protobuf.Empty);

  // Indicate that a build is starting or has completed. May create a
  // bazel-out/ if none exists for the current workspace.
  rpc StartBuild(StartBuildRequest) returns (StartBuildResponse);
  rpc FinalizeBuild(FinalizeBuildRequest) returns (google.protobuf.Empty);

  // Create one or more lazy-loading files or directories inside a
  // bazel-out/ directory, backed by objects in the CAS.
  rpc BatchCreate(BatchCreateRequest) returns (google.protobuf.Empty);
  // Reobtain the metadata of files stored in the bazel-out/ directory.
  rpc BatchStat(BatchStatRequest) returns (BatchStatResponse);
}

This protocol is closely modeled on the API of Bazel's OutputService class. The Protobuf file in the PR contains more complete documentation. An example server-side implementation has been published as part of the Buildbarn project.
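
As an illustration of how REv2 messages can be embedded in literal form, a request for creating lazy-loading outputs could look roughly like the sketch below. All message and field names in this sketch are made up for illustration; the authoritative definitions are those in the Protobuf file referenced above.

// Hypothetical sketch only; the actual message definitions are part of the
// Protobuf file in the PR and may differ.
import "build/bazel/remote/execution/v2/remote_execution.proto";

message BatchCreateRequest {
  // Identifies the bazel-out/ directory in which the outputs should be
  // created. (Hypothetical field.)
  string output_base_id = 1;
  // REv2 instance name, so that the daemon knows which CAS backend the
  // digests below refer to. (Hypothetical field.)
  string instance_name = 2;
  // Outputs to materialize lazily, expressed with the REv2 messages that
  // Bazel already received as part of the ActionResult.
  repeated build.bazel.remote.execution.v2.OutputFile files = 3;
  repeated build.bazel.remote.execution.v2.OutputSymlink symlinks = 4;
  repeated build.bazel.remote.execution.v2.OutputDirectory directories = 5;
}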

Alternatives considered

In the months before this proposal was written, various experiments were performed:

  • An attempt was made to add an integrated FUSE file system to Bazel itself. This project was eventually abandoned due to the limitations mentioned previously (i.e., the inability to run Bazel without elevated privileges).
  • Letting Bazel manage the bazel-out/ directory itself, emitting symbolic links to a FUSE file system that provides a flat view of the CAS (PR #11703 and PR #11622). Though this worked fine from Bazel's point of view, the additional symlink indirection confused many build actions and completely broke dynamic linking.
  • Letting Bazel write its contents into a tmpfs-like FUSE daemon, where files may be hardlinked from a directory that provides a flat view of the CAS. This worked, but performance was poor due to the large number of context switches involved. The gRPC-based solution solves this by supporting batched requests.

Backward-compatibility

The goal is to implement this feature in such a way that local execution, plain remote execution, remote execution with --remote_download_minimal, the use of a local disk cache, etc. all remain functional.

Future work

The initial goal of this proposal is to let the Remote Output Service offer a file system to Bazel that is fully writable, but simply augmented to support the creation of lazy-loading files. In the future it may be of interest to allow this file system to be read-only, requiring all changes to be made by Bazel through gRPC. This reduces the need for tracking changes in the file system, or scanning the file system when doing incremental builds. The following things will need to be kept in mind when implementing this:

  • Not all changes to Bazel's output directory are made through the RemoteOutputService interface. Additional wrapping of Bazel's FileSystem class will be needed to prevent Bazel from making direct writes.
  • Making the output path read-only makes it impossible to run actions with sandboxing disabled. This may be acceptable for many users, though.

Such a change could be decomposed into two separate parts:

  1. Adding a gRPC method that Bazel can use to write files into the file system, as opposed to writing files directly. The protocol as currently defined only allows the creation of files that are backed by a remote CAS.
  2. Extending StartBuild() to allow Bazel to request the creation of a read-only output directory.
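
As a rough illustration only (not part of the protocol in the PR), these two parts could take a shape along the following lines, with all message names and field numbers below being hypothetical:

// 1. A way for Bazel to hand file contents to the daemon directly, instead
//    of writing to the file system itself. (Hypothetical sketch.)
message BatchCreateLocalFilesRequest {
  string output_base_id = 1;
  message LocalFile {
    // Path relative to the bazel-out/ directory.
    string path = 1;
    // Contents to write; small files only, larger ones may need streaming.
    bytes contents = 2;
    bool is_executable = 3;
  }
  repeated LocalFile files = 2;
}

// 2. An extension to StartBuildRequest that lets Bazel ask for a read-only
//    output directory, so that all mutations have to go through the
//    RemoteOutputService. (Hypothetical field.)
message StartBuildRequest {
  // ... existing fields ...
  bool read_only_output_path = 16;
}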