Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Cgroups] Extended cgroup support #21322

Closed

Conversation

AlessandroPatti
Copy link
Contributor

@AlessandroPatti AlessandroPatti commented Feb 13, 2024

This extends the cgroup support in bazel to the cpu controller. Limits can be specified with the flag --experimental_sandbox_limits=<resource>=<limit>. Additionally, it is possible to get finer grained resource limits with --experimental_sandbox_enforce_resources=<regex>, which will cause all actions whose mnemonic matches the regex to use their allocated resource as limits, particularly useful for tests where resource allocation is controlled by the resources:*:* tags.

@AlessandroPatti
Copy link
Contributor Author

Opened this as a draft to gather feedback before I go ahead and rebase/push tests. We've been using this change for a while now and have a couple more follow prs relative to resource utilization collection, which I'll open if this gets enough traction. cc @larsrc-google and @zhengwei143 who have worked in this area in the past.

@fmeum
Copy link
Collaborator

fmeum commented Feb 19, 2024

I learned that Lars is no longer working in this area, so cc @wilwell as well.

@@ -234,7 +234,8 @@ static void ParseCommandLine(unique_ptr<vector<char *>> args) {
break;
case 'C':
ValidateIsAbsolutePath(optarg, args->front(), static_cast<char>(c));
opt.cgroups_dir.assign(optarg);
opt.cgroups_dirs.emplace_back(optarg);
opt.writable_files.emplace_back(optarg);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to make the cgroups dir writable? We wouldn't want spawns to be able to write to the cgroups dir, circumventing any existing cgroup limits placed on the spawn by Bazel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh, I believe this comes from the original implementation, I wasn't sure why it was added to the writable dirs so I kept it around. Will remove as I rebase.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, this was removed recently in da8c080 which clamps down on the write access to the cgroup dir (previously the spawn could actually modify its own limits).

// We cannot leave the cgroups around and delete them only when we delete the sandboxes
// because linux has a hard limit of 65535 memory controllers.
// Ref. https://github.com/torvalds/linux/blob/58d4e450a490d5f02183f6834c12550ba26d3b47/include/linux/memcontrol.h#L69
cgroups.remove(context.getId()).ifPresent(VirtualCGroup::delete);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's interesting, I don't think we've encountered this in our internal builds presumably because we've not hit this limit for the number of sandboxed spawns created during a build. (I think this is fine though).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, took some time for this one to manifest.

reporter.handle(Event.warn("Found non-writable cgroup v1 at " + cgroup));
continue;
}
cgroup = cgroup.resolve("bazel_" + ProcessHandle.current().pid() + ".slice");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I'm wrong - I believe this comes from systemd with the concepts of scopes and slices. From what I checked, I don't think its mandatory for either v1 / v2 to use systemd; but we don't we don't do a good job in CgroupsInfo distinguishing this atm either (the current implementation assumes systemd for v2 and without for v1).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this comes from systemd. In general the name can be arbitrary, regardless of it being using systemd or not, so I kept the syntax. I don't have a strong opinion about that, so let me know if you have any preference

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally I think we should follow convention to the best of our ability - I think we can infer this from Mount.path() == "/sys/fs/cgroup" (for systemd),

return !getPath().resolve("cgroup.controllers").toFile().exists();
}

static <T extends Controller> T getDefault(Class<T> clazz) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to implement a simple NO-OP class to handle the case when cgroups are not available.

See:

public static class InvalidCgroupsInfo extends CgroupsInfo {

Reason being, we do want some control in the code when working with controllers where cgroups are not available (either due to permissions or setup) to handle this gracefully - e.g. ignore the flag entirely with a warning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I was considering adding a flag for the user to decide (e.g. --cgroup_fail_if_unavailable), but can be discussed/added separately

@zhengwei143
Copy link
Contributor

Overall I do like many aspects in the structuring of this code w.r.t cgroups, but I would like this to eventually be upstreamed into CgroupsInfo and CgroupsInfo{V1,V2}. That'll involve some restructuring of the existing code because the current assumption is that we only care about the memory controller.

I appreciate the effort here in modelling the classes to reflect the different implementations in hierarchy between V1 and V2. Some overall directions to work towards if you're attempting to upstream this into CgroupsInfo (you've already done so in most cases):

  • A single CgroupsInfo should encapsulate all controllers regardless of the implementation (v1 or v2), even if they have different paths.
  • Try to separate as much v1 / v2 cgroups creation logic as possible for future maintainability.
  • We want to be able to handle failures gracefully, ignore errors with a warning; this is quite susceptible to permissions issues, e.g. if the Bazel process were to start up in a root (user) cgroup, we won't have the proper write permissions to create any sub-cgroups. Since this is mainly build performance related, I would rather not break the build due to such issues.

Left some side-comments throughout the code just to start a discussion first (don't need to spend time working on them atm).

@@ -359,15 +362,32 @@ public ImmutableSet<Path> getInaccessiblePaths(FileSystem fs) {
public boolean sandboxHermeticTmp;

@Option(
name = "experimental_sandbox_memory_limit_mb",
defaultValue = "0",
name = "experimental_sandbox_limits",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of introducing a separate mechanism for sandboxed spawns, I wonder if it's better to unify this with the specified resource_set of the sandboxed action. Then just use --experimental_sandbox_enforce_resources_regexp to control for which actions to enforce this on.

Is there a reason to have this as a separate mechanism?

cc @wilwell.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple of reasons why I find this flag very useful:

  • You cannot override the resource set of arbitrary actions, only tests (see Consistent resource management across actions and resource types #19679) and for most actions the default resource set is underestimated
  • I found it particularly useful to pass --experimental_sandbox_enforce_resources_regexp=DO_NOT_MATCH --experimental_sandbox_limits=cpu=<n> <target> to test <target> with a different limit other then the one specified at the target level. Without this flag, I'd have to go and change the BUILD file, which makes this harder to iterate on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You cannot override the resource set of arbitrary actions

By this I assume you mean actions outside of your control - e.g. in build rules not owned by your project / default starlark resources.

for most actions the default resource set is underestimated

While I don't think fast testing / iteration is a good reason to introduce this (whatever value specified here should be relatively stable without users having to tinker frequently), I can see a legitimate case for this in the event that resources are underestimated. Since the ResourceManager makes reservations purely off of estimated values (and kind-of makes a claim that the action will have some minimum amount of resources1), this serves as a mechanism to upper bound resources.


1 Not entirely true since Bazel assumes it can use all the resources available to the machine and doesn't actively monitor actually available resources.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add this as a separate flag that overrides --experimental_sandbox_memory_limit_mb (and document that behavior) so we don't make this a breaking change. Internally I don't believe we use this flag, but I'm not sure about external users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor Author

@AlessandroPatti AlessandroPatti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • A single CgroupsInfo should encapsulate all controllers regardless of the implementation (v1 or v2), even if they have different paths.

Yes, this is opaque to all the users of VirtualCgroup, which see all the controller as if they come from the same (virtual) cgroup :)

  • Try to separate as much v1 / v2 cgroups creation logic as possible for future maintainability.

I think this is already the case, with the exception of the VirtualCgroup class, which has to expose both v1 and v2 cgroups transparently so has to deal a bit with their differences. I can try and break out some if/else, but I don't think it will be possible to break it out much more than that.

  • We want to be able to handle failures gracefully, ignore errors with a warning; this is quite susceptible to permissions issues, e.g. if the Bazel process were to start up in a root (user) cgroup, we won't have the proper write permissions to create any sub-cgroups. Since this is mainly build performance related, I would rather not break the build due to such issues.

Makes sense. I've also faced similar challenges in setting up the right enviroment for this feature to work. I've been considering adding a new startup option that will have the client start the server in the right cgroup (something on the lines of bazel --cgroup-parent /foo/bar` build ...), so that you just need to worry about ensuring that user-writable cgroup exists. WDYT?

Left some side-comments throughout the code just to start a discussion first (don't need to spend time working on them atm).

Thank you! I'll try rebasing and updating the code to match the current state of the cgroup feature to make the conversation easier!

// We cannot leave the cgroups around and delete them only when we delete the sandboxes
// because linux has a hard limit of 65535 memory controllers.
// Ref. https://github.com/torvalds/linux/blob/58d4e450a490d5f02183f6834c12550ba26d3b47/include/linux/memcontrol.h#L69
cgroups.remove(context.getId()).ifPresent(VirtualCGroup::delete);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, took some time for this one to manifest.

@@ -359,15 +362,32 @@ public ImmutableSet<Path> getInaccessiblePaths(FileSystem fs) {
public boolean sandboxHermeticTmp;

@Option(
name = "experimental_sandbox_memory_limit_mb",
defaultValue = "0",
name = "experimental_sandbox_limits",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple of reasons why I find this flag very useful:

  • You cannot override the resource set of arbitrary actions, only tests (see Consistent resource management across actions and resource types #19679) and for most actions the default resource set is underestimated
  • I found it particularly useful to pass --experimental_sandbox_enforce_resources_regexp=DO_NOT_MATCH --experimental_sandbox_limits=cpu=<n> <target> to test <target> with a different limit other then the one specified at the target level. Without this flag, I'd have to go and change the BUILD file, which makes this harder to iterate on.

reporter.handle(Event.warn("Found non-writable cgroup v1 at " + cgroup));
continue;
}
cgroup = cgroup.resolve("bazel_" + ProcessHandle.current().pid() + ".slice");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this comes from systemd. In general the name can be arbitrary, regardless of it being using systemd or not, so I kept the syntax. I don't have a strong opinion about that, so let me know if you have any preference

@@ -234,7 +234,8 @@ static void ParseCommandLine(unique_ptr<vector<char *>> args) {
break;
case 'C':
ValidateIsAbsolutePath(optarg, args->front(), static_cast<char>(c));
opt.cgroups_dir.assign(optarg);
opt.cgroups_dirs.emplace_back(optarg);
opt.writable_files.emplace_back(optarg);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh, I believe this comes from the original implementation, I wasn't sure why it was added to the writable dirs so I kept it around. Will remove as I rebase.

return !getPath().resolve("cgroup.controllers").toFile().exists();
}

static <T extends Controller> T getDefault(Class<T> clazz) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I was considering adding a flag for the user to decide (e.g. --cgroup_fail_if_unavailable), but can be discussed/added separately

@zhengwei143
Copy link
Contributor

Yes, this is opaque to all the users of VirtualCgroup, which see all the controller as if they come from the same (virtual) cgroup :)

That's what I love about your solution :) Really abstracts away the nitty gritty of cgroups from the API.

I think this is already the case, with the exception of the VirtualCgroup class, which has to expose both v1 and v2 cgroups transparently so has to deal a bit with their differences. I can try and break out some if/else, but I don't think it will be possible to break it out much more than that.

W.r.t cgroups creation I think that CgroupsInfo{V1,V2} handles that well. TBH, we don't need the complexity of modelling it after a generic tree, the structure of cgroups hierarchy should be pretty much fixed without callers being able to do anything crazy with it.

I think moving forward, we could work the concept of Controller(s) into each CgroupInfo instance. How does that sound?

Makes sense. I've also faced similar challenges in setting up the right enviroment for this feature to work. I've been considering adding a new startup option that will have the client start the server in the right cgroup (something on the lines of bazel --cgroup-parent /foo/bar` build ...), so that you just need to worry about ensuring that user-writable cgroup exists. WDYT?

I've actually been wanting to do that as well! One idea was using systemd-run --scope --user, but I haven't figured out the case where systemd isnt being used yet.

@AlessandroPatti AlessandroPatti force-pushed the apatti/cgroups branch 8 times, most recently from 37fda84 to c668e0e Compare February 22, 2024 09:25
@AlessandroPatti
Copy link
Contributor Author

I think this is already the case, with the exception of the VirtualCgroup class, which has to expose both v1 and v2 cgroups transparently so has to deal a bit with their differences. I can try and break out some if/else, but I don't think it will be possible to break it out much more than that.

W.r.t cgroups creation I think that CgroupsInfo{V1,V2} handles that well. TBH, we don't need the complexity of modelling it after a generic tree, the structure of cgroups hierarchy should be pretty much fixed without callers being able to do anything crazy with it.

I tried to get it working with CgroupsInfo{V1,V2} but the problem is that this already specialises the cgroup for one or the other version. When dealing with multiple controllers, we could have an hybrid system, where some controllers are attached to the v2 hierarchy and other are instead v1, which would not fit this model. It works fine instead with a single VirtualCgroup class that defer the differences to the Controller interface and its implementations.
I finally ended up replacing CgroupInfo{,V1,V2} with VirtualCgroup{,Factory} completely and extend it to cover the current logic because much easier than the other way around. I moved more logic into the controllers to handle the difference in v1 and v2, so the only part that still mixed the two version is the creation of the virtual cgroup (iterate over mounts and hierarchies and create the controllers), which I don't think can be improved further.

@AlessandroPatti AlessandroPatti marked this pull request as ready for review February 22, 2024 19:42
@github-actions github-actions bot added team-Performance Issues for Performance teams team-Local-Exec Issues and PRs for the Execution (Local) team awaiting-review PR is awaiting review from an assigned reviewer labels Feb 22, 2024
@AlessandroPatti AlessandroPatti force-pushed the apatti/cgroups branch 3 times, most recently from 7e45094 to dbf3b51 Compare February 27, 2024 08:33
@zhengwei143
Copy link
Contributor

Will review this next week - have more pressing work upgrading Bazel's Java language level to 21 (which will be very soon!).

Will also need to get this tested internally since we are relying on cgroups features in production but this refactor completely revamps the implementation - so it might take a while. Also considering whether we need to keep the original implementation and guard this behind a flag, will make a decision after I have a closer look at the code.

@@ -306,4 +306,7 @@ public enum WorkerProtocolFormat {
* segment from the paths of all inputs and outputs.
*/
public static final String SUPPORTS_PATH_MAPPING = "supports-path-mapping";

/** Disables cgroups for a sandbox spawn */
public static final String NO_SUPPORTS_CGROUPS = "no-supports-cgroups";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the use case for having a per-spawn disabling of cgroups? (Not saying that it isn't a good idea, I'm just interested to know).

Can we split this out into another PR? It is diverging from the original PR of adding a new cgroups implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the use case for having a per-spawn disabling of cgroups? (Not saying that it isn't a good idea, I'm just interested to know).

Gives a bit more control over where this feature is applied. This is not only per-spawn, but also per-action (rules owners can apply this in the rule implementation).

Can we split this out into another PR? It is diverging from the original PR of adding a new cgroups implementation.

Done

@@ -35,15 +36,20 @@ public static CgroupsInfoCollector instance() {
return instance;
}

public ResourceSnapshot collectResourceUsage(Map<Long, CgroupsInfo> pidToCgroups, Clock clock) {
public ResourceSnapshot collectResourceUsage(Map<Long, VirtualCGroup> pidToCgroups, Clock clock) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace VirtualCgroup with some interface that implements methods to retrieve the memory usage from a cgroup object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not remove these files first and introduce a flag to flip between the two cgroups implementations. We have internal prod systems relying on the current code and I would rather flip the behavior with a flag.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -359,15 +362,32 @@ public ImmutableSet<Path> getInaccessiblePaths(FileSystem fs) {
public boolean sandboxHermeticTmp;

@Option(
name = "experimental_sandbox_memory_limit_mb",
defaultValue = "0",
name = "experimental_sandbox_limits",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add this as a separate flag that overrides --experimental_sandbox_memory_limit_mb (and document that behavior) so we don't make this a breaking change. Internally I don't believe we use this flag, but I'm not sure about external users.

converter = RegexPatternConverter.class,
help =
"If true, actions whose mnemonic matches the input regex"
+ " will have their resources request enforced as limits, ovverriding"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/ovverriding/overriding/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


static public VirtualCGroup NULL = new AutoValue_VirtualCGroup(null, null, ImmutableSet.of());

public static VirtualCGroup create() throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: rename this to something like createRootCgroup so that we can distinguish what its actually creating? Also, use package private access since this is the exposed API here should be getInstance. (same for the methods below).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

this.paths().stream().map(Path::toFile).filter(File::exists).forEach(File::delete);
}

public VirtualCGroup child(String name) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/child/createChild/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


public interface Controller {
default boolean isLegacy() throws IOException {
return !getPath().resolve("cgroup.controllers").toFile().exists();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to make a filesystem call here everytime we want to check this method? Since the controllers are already specialized into Legacy and Unified, can we just override this there with true and false?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call out, updated

@@ -176,7 +184,7 @@ public void buildStarting(BuildStartingEvent event) {
}

// Override the flag value if we can't actually use cgroups so that we at least fallback to ps.
boolean useCgroupsOnLinux = options.useCgroupsOnLinux && CgroupsInfo.isSupported();
boolean useCgroupsOnLinux = options.useCgroupsOnLinux && rootCgroup.memory() != null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Let's use some method like hasMemoryController() instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will need to change anyways if we want to start collecting cpu usage as well

Copy link
Contributor

@zhengwei143 zhengwei143 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, with a few more minor changes.

I will be on PTO for the next 2-3 weeks, so will be unable to test internally and merge these changes until then. If this is urgent, perhaps @wilwell can help to merge this.

@zhengwei143
Copy link
Contributor

(I'm working on cleaning this up internally now).

@AlessandroPatti
Copy link
Contributor Author

@zhengwei143 do you have an update?

@zhengwei143
Copy link
Contributor

I managed to clean most things up and am still going through internal review (will need to do more refactoring), that has stalled due to other more urgent work but I hope to pick it up soon. Sorry this has taken so long...

@zhengwei143
Copy link
Contributor

Update: Will push this through at the start of next week. I was working on replacing nio.file.Path with our internal vfs.Path but encountered some issues that I needed to work through. In the end, we decided it would be fine to submit with this current implementation (along with some clean-ups) and have the nio.file.Path -> vfs.Path migration come in a follow-up PR (just a heads up for you).

@github-actions github-actions bot removed the awaiting-review PR is awaiting review from an assigned reviewer label May 27, 2024
@AlessandroPatti
Copy link
Contributor Author

Thank you @zhengwei143!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-Local-Exec Issues and PRs for the Execution (Local) team team-Performance Issues for Performance teams
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants