
runtime: Improve vCPU allocation for the VMMs #7623

Conversation

@fidencio (Member) commented Aug 11, 2023

First of all, this is a controversial piece, and I know that.

In this commit we're taking a less greedy approach regarding the
amount of vCPUs we allocate for the VMM, which will be advantageous
mainly when using the static_sandbox_resource_mgmt feature, which is
used by the confidential guests.

The current approach we have basically does:

  • Gets the amount of vCPUs set in the config (an integer)
  • Gets the amount of vCPUs set as limit (an integer)
  • Sums those up
  • Starts / Updates the VMM to use that total amount of vCPUs

The fact we're dealing with integers is logical, as we cannot request
500m vCPUs from the VMMs. However, in several cases it leads us to
waste one vCPU.

Let's take an example where we know the VMM requires 500m vCPUs to
run, and the workload sets 250m vCPUs as a resource limit.

In that case, we'd do:

  • Gets the amount of vCPUs set in the config: 1
  • Gets the amount of vCPUs set as limit: ceil(0.25)
  • 1 + ceil(0.25) = 1 + 1 = 2 vCPUs
  • Starts / Updates the VMM to use 2 vCPUs

With the logic changed here, we keep everything as a float until just
before we start / update the VMM. So, the flow described above becomes
(a Go sketch of both flows follows the list):

  • Gets the amount of vCPUs set in the config: 0.5
  • Gets the amount of vCPUs set as limit: 0.25
  • ceil(0.5 + 0.25) = 1 vCPU
  • Starts / Updates the VMM to use 1 vCPU
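
A minimal Go sketch of both flows under this example (the function names `vcpusOld` and `vcpusNew` are illustrative, not from the actual patch):

```go
package main

import (
	"fmt"
	"math"
)

// vcpusOld mirrors the current flow: the config value is already an
// integer, and the limit is rounded up on its own before the sum.
func vcpusOld(configVCPUs uint32, limitCPUs float64) uint32 {
	return configVCPUs + uint32(math.Ceil(limitCPUs))
}

// vcpusNew mirrors the changed flow: keep everything as a float and
// round up only once, just before starting / updating the VMM.
func vcpusNew(configCPUs, limitCPUs float64) uint32 {
	return uint32(math.Ceil(configCPUs + limitCPUs))
}

func main() {
	fmt.Println(vcpusOld(1, 0.25))   // 2 vCPUs, one of them wasted
	fmt.Println(vcpusNew(0.5, 0.25)) // 1 vCPU is enough
}
```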

The way I've written this patch introduces zero regressions, as
the default values set are still the same, and those will only be
changed for the TEE use cases (although I can see firecracker, or any
other user of static_sandbox_resource_mgmt=true, taking advantage of
this).

There's, though, an implicit assumption in this patch that we'd need to
make explicit: that default_vcpus / default_memory is the amount of
vCPUs / memory required by the VMM, and absolutely nothing else. Also,
the amount set there should be reflected in the podOverhead for the
specific runtime class.

One other possible approach, which I am not much in favour of taking
as I think it's less clear, is that we could actually take the
podOverhead amount, subtract it from default_vcpus (treating the
result as a float), then add what the user set as limit (as a float),
and finally ceil the result. It could work, but IMHO this is less
clear, and less explicit about what we're actually doing and how
default_vcpus / default_memory should be used.

Fixes: #6909

@katacontainersbot added the size/large (Task of significant size) label on Aug 11, 2023
@fidencio (Member Author)

/test

@dcmiddle (Contributor) left a comment

LGTM.
For my review I searched for areas where there might be an integer-to-float conversion problem, but couldn't find any.
For example, it looks like the toml parser [1] will convert automatically for our existing configs, e.g. default_vcpus = 1 [2].

[1] BurntSushi/toml#325

[2] https://github.com/kata-containers/kata-containers/blob/main/src/runtime/config/configuration-qemu-tdx.toml.in#L104
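
A quick illustration of the behaviour [1] refers to — a sketch, assuming the automatic int-to-float conversion discussed there; the `hypervisor` struct is a stand-in, not the real runtime config type:

```go
package main

import (
	"fmt"

	"github.com/BurntSushi/toml"
)

// hypervisor is a stand-in for the runtime's config struct, with the
// vCPU count now held as a float.
type hypervisor struct {
	DefaultVCPUs float32 `toml:"default_vcpus"`
}

func main() {
	var h hypervisor
	// Existing configs write an integer; the parser is expected to
	// convert it to the float field automatically (see [1] above).
	if _, err := toml.Decode("default_vcpus = 1", &h); err != nil {
		panic(err)
	}
	fmt.Println(h.DefaultVCPUs) // 1
}
```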

@ms-mahuber

Thank you, Fabiano, this patch looks very good. One thought: there is a good amount of uint32(math.Ceil(... occurrences in the PR. How about turning this logic into a single function that is accessible from the relevant files? That function would take a float of vCPU shares as a parameter and convert the input to an integer for the VMM to use. While the function is still super simple and nothing really changes, it makes the change/logic a bit easier to grasp. This way we can also add a line of documentation explaining that we get CPU shares from the config and from limits (from the Kubernetes layer) and turn those into an integer, because the VMM/HV uses absolute numbers and not fractions of CPU shares like in the Kubernetes world.
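
A sketch of the single function being suggested here (the name `RoundUpNumVCPUs` is hypothetical):

```go
package oci

import "math"

// RoundUpNumVCPUs converts fractional CPU shares (as expressed in the
// config and at the Kubernetes layer) into the whole number of vCPUs
// the VMM/HV needs, rounding up so the workload never receives less
// CPU than it requested.
func RoundUpNumVCPUs(cpuShares float64) uint32 {
	return uint32(math.Ceil(cpuShares))
}
```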

@jongwu (Contributor) commented Aug 17, 2023

/test-arm-dragonball

@bergwolf (Member) left a comment

Makes sense, thanks!

@bergwolf (Member)

@fidencio Thanks for raising this! I wonder if it is possible to add an E2E test to validate the new allocation algorithm?

@fidencio (Member Author)

> @fidencio Thanks for raising this! I wonder if it is possible to add an E2E test to validate the new allocation algorithm?

I definitely can add an e2e test here, @bergwolf, and I will go ahead and do that while addressing the other comments raised.

@gkurz (Member) commented Aug 29, 2023

Hi @fidencio !

Using your example:

  • Gets the amount of vCPUs set in the config: 1
  • Gets the amount of vCPUs set as limit: ceil(0.25)
  • 1 + ceil(0.25) = 1 + 1 = 2 vCPUs
  • Starts / Updates the VMM to use 2 vCPUs

What about keeping milli vCPUs all the way down to the VMM start/update?

  • Gets the amount of milli vCPUs set in the config: 500
  • Gets the amount of milli vCPUs set as limit: 250
  • CalculateVCpusFromMilliCpus(500 + 250) = (500 + 250 + 999) / 1000 = 1
  • Starts / Updates the VMM to use 1 vCPU

This would allow us to keep the int32 fields and avoid all the castings. Makes sense?
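
A sketch of this in Go (the helper name comes from the comment above; the integer ceiling trick adds 999 before dividing):

```go
package vcpu

// CalculateVCpusFromMilliCpus rounds a milli-vCPU total up to whole
// vCPUs using integer arithmetic only, so no floats or casts are needed.
func CalculateVCpusFromMilliCpus(milli int32) int32 {
	return (milli + 999) / 1000
}
```

With the example above, CalculateVCpusFromMilliCpus(500 + 250) computes (750 + 999) / 1000 = 1.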

@fidencio (Member Author)

> This would allow us to keep the int32 fields and avoid all the castings. Makes sense?

One problem that I have with this approach is that we'd need to change the semantics of what's exposed in the configuration file, I think.

By this I mean a user would have to write 500 to mean 0.5 CPUs, and the "default 1" would have to change. While I'm not much against those changes, I think this is a big one that could get users confused.

@gkurz (Member) commented Aug 30, 2023

> > This would allow us to keep the int32 fields and avoid all the castings. Makes sense?
>
> One problem that I have with this approach is that we'd need to change the semantics of what's exposed in the configuration file, I think.
>
> By this I mean a user would have to write 500 to mean 0.5 CPUs, and the "default 1" would have to change. While I'm not much against those changes, I think this is a big one that could get users confused.

As discussed privately on Slack, it seems that this PR is trying to address two separate issues:

  1. allow the user to specify fractions of CPU in the config file
  2. improve the accuracy of vCPU allocation

Turning the int32 into a float32 would address (1).

My proposal is to fix (2) by translating the float32 from the config into milli CPUs as an int32, and only using the latter for all the rest of the computation. This would improve precision without having to rely on float64 and Ceil() everywhere.
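
A sketch of that translation step (the helper name `configToMilliCPUs` is hypothetical):

```go
package vcpu

import "math"

// configToMilliCPUs converts the float32 read from the config file
// into milli vCPUs once, at the boundary; everything downstream can
// then stay in int32. Rounding guards against float noise (e.g.
// 0.1*1000 landing a hair below 100).
func configToMilliCPUs(cpus float32) int32 {
	return int32(math.Round(float64(cpus) * 1000))
}
```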

c3d added a commit to c3d/kata-containers that referenced this pull request Aug 30, 2023
The float32 representation is sufficient to reliably represent a
number of vCPUs for all the cases we will care about in the
foreseeable future: we can compute more than 2^20 vCPUs and the
precision of the computation is below 1 part per million.

A few more changes were made to Fabiano's original approach to avoid
having to do the same rounding operation everywhere. The `NumVCPUs`
field was replaced with a `float32` equivalent, `NumVCPUsF`, and a
helper function, `NumVCPUs`, can be used to generate a `uint32`.

There was a concern about having a multiplicity of casts, but in
practice, most of the casts were really from `uint32` as returned by
`NumVCPUs()` to some other integer type, typically `int` or `int32`.
Doing this work exposed a latent issue in the previous commit, where
some pre-existing casts would have rounded the number of vCPUs
[down instead of up][1].

[1]: kata-containers#7623 (comment)

Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
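
A sketch of the shape this describes; `NumVCPUsF` and the `NumVCPUs` helper are named in the commit message, while the surrounding struct is trimmed down for illustration:

```go
package vcpu

import "math"

// HypervisorConfig is reduced to the one field relevant here.
type HypervisorConfig struct {
	// NumVCPUsF carries the vCPU count as a float32, so fractional
	// defaults such as 0.5 survive until rounding time.
	NumVCPUsF float32
}

// NumVCPUs performs the single rounding-up operation, producing the
// uint32 the VMM actually consumes.
func (c HypervisorConfig) NumVCPUs() uint32 {
	return uint32(math.Ceil(float64(c.NumVCPUsF)))
}
```
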
@fidencio (Member Author)

Folks, I'm taking some time today to go through Greg's suggestion and hopefully have this closed / merged. :-)

@fidencio force-pushed the topic/runtime-improve-vcpu-allocation-on-host-side branch from 41bf243 to 28634ca on November 8, 2023 10:53
@fidencio (Member Author) commented Nov 8, 2023

I've rebased this on top of main, squashed @c3d's changes into my PR, added @c3d's Signed-off-by, and I'll write tests for it.
Please, do not review before I have the tests written.

@fidencio force-pushed the topic/runtime-improve-vcpu-allocation-on-host-side branch from 28634ca to ddeb9b1 on November 8, 2023 16:57
@fidencio requested a review from a team as a code owner on November 8, 2023 16:57
@fidencio (Member Author) commented Nov 8, 2023

When writing tests for this I noticed that we need to have something like #8404 present, so I just went ahead and opened a PR for that.

Once that one gets merged I will update this one with tests.

@fidencio force-pushed the topic/runtime-improve-vcpu-allocation-on-host-side branch 2 times, most recently from 2b0619a to 6d1eb8a on November 10, 2023 11:09
We don't have to do this since we're relying on the
`static_sandbox_resource_mgmt` feature, which gives us the correct
amount of memory and CPUs to be allocated.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
@fidencio force-pushed the topic/runtime-improve-vcpu-allocation-on-host-side branch 4 times, most recently from bc5e34a to 63bbcd3 on November 10, 2023 13:12
@fidencio (Member Author)

/test

@fidencio force-pushed the topic/runtime-improve-vcpu-allocation-on-host-side branch from c020e58 to 8fa500c on November 10, 2023 15:24
src/runtime/pkg/oci/utils.go (outdated review thread, resolved)
@c3d (Member) left a comment

Agree with @gkurz that there is a problem with the squashing. Other than that, looks good. Suggested a minor change.

src/runtime/pkg/oci/utils_test.go (outdated review thread, resolved)
src/runtime/pkg/oci/utils.go (outdated review thread, resolved)
With the change done in the last commit, instead of calculating milli
cpus we're actually converting the CPUs to a fractional number, a float.

Let's update the function name (and associated vars) to represent that
change.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
As we've done some changes in the VMM vcpu allocation, let's introduce
basic tests to make sure that we're getting the expected behaviour.

The test consists of checking 3 scenarios:
* default_vcpus = 0 | no limits set
  * this should allocate 1 vcpu
* default_vcpus = 0.75 | limits set to 0.25
  * this should allocate 1 vcpu
* default_vcpus = 0.75 | limits set to 1.2
  * this should allocate 2 vcpus

The tests are very basic, but they do ensure we're rounding things up
to what the new logic is supposed to do (a sketch of the idea follows
this commit message).

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
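
A sketch of these scenarios as a table-driven Go test. The `vcpus` helper is illustrative, encoding the assumption from the first scenario that a sandbox always gets at least one vCPU; the PR's real tests live in utils_test.go:

```go
package vcpu

import (
	"math"
	"testing"
)

// vcpus models the rounding behaviour under test: sum the default and
// the limit as floats, allocate at least one vCPU, and ceil once.
func vcpus(defaultVCPUs, limit float64) uint32 {
	total := defaultVCPUs + limit
	if total == 0 {
		total = 1
	}
	return uint32(math.Ceil(total))
}

func TestVCPUAllocation(t *testing.T) {
	tests := []struct {
		defaultVCPUs, limit float64
		want                uint32
	}{
		{0, 0, 1},       // default_vcpus = 0, no limits set
		{0.75, 0.25, 1}, // sums to exactly 1
		{0.75, 1.2, 2},  // 1.95 rounds up to 2
	}
	for _, tt := range tests {
		if got := vcpus(tt.defaultVCPUs, tt.limit); got != tt.want {
			t.Errorf("vcpus(%v, %v) = %d, want %d",
				tt.defaultVCPUs, tt.limit, got, tt.want)
		}
	}
}
```
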
@fidencio fidencio force-pushed the topic/runtime-improve-vcpu-allocation-on-host-side branch from 8fa500c to 849253e Compare November 10, 2023 17:26
@fidencio (Member Author)

@bergwolf, I've removed your approval as I'd like you to at least take a look at the last commit, the one adding the tests, as you wanted them added.

@gkurz (Member) left a comment

Code changes LGTM. Thanks @fidencio !

I don't have cycles to review the tests commit though.

@fidencio (Member Author)

/test

@@ -54,11 +54,6 @@ func validateHypervisorConfig(conf *HypervisorConfig) error {
conf.DefaultMaxVCPUs = defaultMaxVCPUs
}

if conf.ConfidentialGuest && conf.NumVCPUs != conf.DefaultMaxVCPUs {

Interesting addition to the series.

if err != nil || float64Value < 0 {
return fmt.Errorf(errAnnotationPositiveNumericKey, a.key)
}
if float64Value > math.MaxFloat32 {

Thanks for this one.
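
For context, a self-contained sketch of the validation pattern visible in this diff; names other than `float64Value` and `math.MaxFloat32` are simplified stand-ins for the real annotation plumbing:

```go
package annotations

import (
	"fmt"
	"math"
	"strconv"
)

// parseVCPUAnnotation sketches the pattern: parse the annotation as a
// float64, reject negative values, then reject anything that would
// overflow the float32 used internally for vCPU counts.
func parseVCPUAnnotation(key, value string) (float32, error) {
	float64Value, err := strconv.ParseFloat(value, 64)
	if err != nil || float64Value < 0 {
		return 0, fmt.Errorf("annotation %q must be a positive numeric value", key)
	}
	if float64Value > math.MaxFloat32 {
		return 0, fmt.Errorf("annotation %q does not fit in a float32", key)
	}
	return float32(float64Value), nil
}
```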

@c3d (Member) left a comment

Looks good to me. Thanks @fidencio !

load "${BATS_TEST_DIRNAME}/tests_common.sh"

setup() {
[ "${KATA_HYPERVISOR}" == "dragonball" ] && \

We need something more than dragonball to identify runtime-rs, as it now supports clh, and qemu/fc are on the way.


We can expand it later when other vmms are actually being tested.

@fidencio (Member Author)

Perfect, and this will end up covering more than just this test.
Thanks for the review, @bergwolf!

@bergwolf (Member) left a comment

lgtm!

@fidencio merged commit fd9b6d6 into kata-containers:main on Nov 14, 2023
138 of 142 checks passed
Labels: ok-to-test, size/large (Task of significant size)

Successfully merging this pull request may close these issues:

Resource allocation: Potential over-allocation of vCPUs in kata-shim leading to UVM bring-up failure

8 participants