
runtime: Improve vCPU allocation for the VMMs #7623

Conversation

@fidencio (Member) commented Aug 11, 2023

First of all, this is a controversial piece, and I know that.

In this commit we're taking a less greedy approach regarding the
amount of vCPUs we allocate for the VMM, which will be advantageous
mainly when using the static_sandbox_resource_mgmt feature, which is
used by the confidential guests.

The current approach we have basically does:

  • Gets the amount of vCPUs set in the config (an integer)
  • Gets the amount of vCPUs set as limit (an integer)
  • Sums those up
  • Starts / Updates the VMM to use that total amount of vCPUs

The fact we're dealing with integers is logical, as we cannot request
500m vCPUs from the VMMs. However, in several cases it leads us to
waste one vCPU.

Let's take an example where we know the VMM requires 500m vCPUs to
run, and the workload sets 250m vCPUs as a resource limit.

In that case, we'd do:

  • Gets the amount of vCPUs set in the config: 1
  • Gets the amount of vCPUs set as limit: ceil(0.25)
  • 1 + ceil(0.25) = 1 + 1 = 2 vCPUs
  • Starts / Updates the VMM to use 2 vCPUs

With the logic changed here, we keep everything as a float until just
before we start / update the VMM. So, the flow described above becomes
(a Go sketch of both flows follows the list):

  • Gets the amount of vCPUs set in the config: 0.5
  • Gets the amount of vCPUs set as limit: 0.25
  • ceil(0.5 + 0.25) = 1 vCPU
  • Starts / Updates the VMM to use 1 vCPU
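
A minimal Go sketch of both flows under this example (the function names `vcpusOld` and `vcpusNew` are illustrative, not from the actual patch):

```go
package main

import (
	"fmt"
	"math"
)

// vcpusOld mirrors the current flow: the config value is already an
// integer, and the limit is rounded up on its own before the sum.
func vcpusOld(configVCPUs uint32, limitCPUs float64) uint32 {
	return configVCPUs + uint32(math.Ceil(limitCPUs))
}

// vcpusNew mirrors the changed flow: keep everything as a float and
// round up only once, just before starting / updating the VMM.
func vcpusNew(configCPUs, limitCPUs float64) uint32 {
	return uint32(math.Ceil(configCPUs + limitCPUs))
}

func main() {
	fmt.Println(vcpusOld(1, 0.25))   // 2 vCPUs, one of them wasted
	fmt.Println(vcpusNew(0.5, 0.25)) // 1 vCPU is enough
}
```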

The way I've written this patch introduces zero regressions, as
the default values set are still the same, and those will only be
changed for the TEE use cases (although I can see firecracker, or any
other user of static_sandbox_resource_mgmt=true, taking advantage of
this).

There's, though, an implicit assumption in this patch that we'd need to
make explicit: that default_vcpus / default_memory is the amount of
vCPUs / memory required by the VMM, and absolutely nothing else. Also,
the amount set there should be reflected in the podOverhead for the
specific runtime class.

One other possible approach, which I am not much in favour of taking
as I think it's less clear, is that we could actually take the
podOverhead amount, subtract it from default_vcpus (treating the
result as a float), then add what the user set as limit (as a float),
and finally ceil the result. It could work, but IMHO this is less
clear, and less explicit about what we're actually doing and how
default_vcpus / default_memory should be used.

Fixes: #6909

@katacontainersbot added the size/large (Task of significant size) label on Aug 11, 2023
@fidencio (Member Author)

/test

@dcmiddle (Contributor) left a comment

LGTM.
For my review I searched for areas where there might be an integer-to-float conversion problem, but couldn't find any.
For example, it looks like the toml parser [1] will convert automatically for our existing configs, e.g. default_vcpus = 1 [2].

[1] BurntSushi/toml#325

[2] https://github.com/kata-containers/kata-containers/blob/main/src/runtime/config/configuration-qemu-tdx.toml.in#L104
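
A quick illustration of the behaviour [1] refers to — a sketch, assuming the automatic int-to-float conversion discussed there; the `hypervisor` struct is a stand-in, not the real runtime config type:

```go
package main

import (
	"fmt"

	"github.com/BurntSushi/toml"
)

// hypervisor is a stand-in for the runtime's config struct, with the
// vCPU count now held as a float.
type hypervisor struct {
	DefaultVCPUs float32 `toml:"default_vcpus"`
}

func main() {
	var h hypervisor
	// Existing configs write an integer; the parser is expected to
	// convert it to the float field automatically (see [1] above).
	if _, err := toml.Decode("default_vcpus = 1", &h); err != nil {
		panic(err)
	}
	fmt.Println(h.DefaultVCPUs) // 1
}
```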

@ms-mahuber

Thank you, Fabiano, this patch looks very good. One thought: there is a good amount of uint32(math.Ceil(... occurrences in the PR. How about turning this logic into a single function that is accessible from the relevant files? That function would take a float of vCPU shares as a parameter and convert the input to an integer for the VMM to use. While the function is still super simple and nothing really changes, it makes the change/logic a bit easier to grasp. This way we can also add a line of documentation explaining that we get CPU shares from the config and from limits (from the Kubernetes layer) and turn those into an integer, because the VMM/HV uses absolute numbers and not fractions of CPU shares like in the Kubernetes world.
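
A sketch of the single function being suggested here (the name `RoundUpNumVCPUs` is hypothetical):

```go
package oci

import "math"

// RoundUpNumVCPUs converts fractional CPU shares (as expressed in the
// config and at the Kubernetes layer) into the whole number of vCPUs
// the VMM/HV needs, rounding up so the workload never receives less
// CPU than it requested.
func RoundUpNumVCPUs(cpuShares float64) uint32 {
	return uint32(math.Ceil(cpuShares))
}
```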

@jongwu (Contributor) commented Aug 17, 2023

/test-arm-dragonball

@bergwolf (Member) left a comment

Makes sense, thanks!

@bergwolf (Member)

@fidencio Thanks for raising this! I wonder if it is possible to add an E2E test to validate the new allocation algorithm?

@fidencio (Member Author)

> @fidencio Thanks for raising this! I wonder if it is possible to add an E2E test to validate the new allocation algorithm?

I definitely can add an e2e test here, @bergwolf, and I will go ahead and do that while addressing the other comments raised.

@gkurz (Member) commented Aug 29, 2023

Hi @fidencio !

Using your example:

  • Gets the amount of vCPUs set in the config: 1
  • Gets the amount of vCPUs set as limit: ceil(0.25)
  • 1 + ceil(0.25) = 1 + 1 = 2 vCPUs
  • Starts / Updates the VMM to use 2 vCPUs

What about keeping milli vCPUs all the way down to the VMM start/update?

  • Gets the amount of milli vCPUs set in the config: 500
  • Gets the amount of milli vCPUs set as limit: 250
  • CalculateVCpusFromMilliCpus(500 + 250) = (500 + 250 + 999) / 1000 = 1
  • Starts / Updates the VMM to use 1 vCPU

This would allow us to keep the int32 fields and avoid all the castings. Makes sense?
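
A sketch of this in Go (the helper name comes from the comment above; the integer ceiling trick adds 999 before dividing):

```go
package vcpu

// CalculateVCpusFromMilliCpus rounds a milli-vCPU total up to whole
// vCPUs using integer arithmetic only, so no floats or casts are needed.
func CalculateVCpusFromMilliCpus(milli int32) int32 {
	return (milli + 999) / 1000
}
```

With the example above, CalculateVCpusFromMilliCpus(500 + 250) computes (750 + 999) / 1000 = 1.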

@fidencio (Member Author)

> This would allow us to keep the int32 fields and avoid all the castings. Makes sense?

One problem that I have with this approach is that we'd need to change the semantics of what's exposed in the configuration file, I think.

By this I mean a user would have to write 500 to mean 0.5 CPUs, and the "default 1" would have to change. While I'm not much against those changes, I think this is a big one that could get users confused.

@gkurz (Member) commented Aug 30, 2023

> > This would allow us to keep the int32 fields and avoid all the castings. Makes sense?
>
> One problem that I have with this approach is that we'd need to change the semantics of what's exposed in the configuration file, I think.
>
> By this I mean a user would have to write 500 to mean 0.5 CPUs, and the "default 1" would have to change. While I'm not much against those changes, I think this is a big one that could get users confused.

As discussed privately on Slack, it seems that this PR is trying to address two separate issues:

  1. allow the user to specify fractions of CPU in the config file
  2. improve the accuracy of vCPU allocation

Turning the int32 into a float32 would address (1).

My proposal is to fix (2) by translating the float32 from the config into milli CPUs as an int32, and only using the latter for all the rest of the computation. This would improve precision without having to rely on float64 and Ceil() everywhere.
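
A sketch of that translation step (the helper name `configToMilliCPUs` is hypothetical):

```go
package vcpu

import "math"

// configToMilliCPUs converts the float32 read from the config file
// into milli vCPUs once, at the boundary; everything downstream can
// then stay in int32. Rounding guards against float noise (e.g.
// 0.1*1000 landing a hair below 100).
func configToMilliCPUs(cpus float32) int32 {
	return int32(math.Round(float64(cpus) * 1000))
}
```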

c3d added a commit to c3d/kata-containers that referenced this pull request Aug 30, 2023
The float32 representation is sufficient to reliably represent a
number of vCPUs for all the cases we will care about in the
foreseeable future: we can compute more than 2^20 vCPUs and the
precision of the computation is below 1 part per million.

A few more changes were made to Fabiano's original approach to avoid
having to do the same rounding operation everywhere. The `NumVCPUs`
field was replaced with a `float32` equivalent, `NumVCPUsF`, and a
helper function, `NumVCPUs`, can be used to generate a `uint32`.

There was a concern about having a multiplicity of casts, but in
practice, most of the casts were really from `uint32` as returned by
`NumVCPUs()` to some other integer type, typically `int` or `int32`.
Doing this work exposed a latent issue in the previous commit, where
some pre-existing casts would have rounded the number of vCPUs
[down instead of up][1].

[1]: kata-containers#7623 (comment)

Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
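
A sketch of the shape this describes; `NumVCPUsF` and the `NumVCPUs` helper are named in the commit message, while the surrounding struct is trimmed down for illustration:

```go
package vcpu

import "math"

// HypervisorConfig is reduced to the one field relevant here.
type HypervisorConfig struct {
	// NumVCPUsF carries the vCPU count as a float32, so fractional
	// defaults such as 0.5 survive until rounding time.
	NumVCPUsF float32
}

// NumVCPUs performs the single rounding-up operation, producing the
// uint32 the VMM actually consumes.
func (c HypervisorConfig) NumVCPUs() uint32 {
	return uint32(math.Ceil(float64(c.NumVCPUsF)))
}
```
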
@fidencio (Member Author)

Folks, I'm taking some time today to go through Greg's suggestion and hopefully have this closed / merged. :-)

@fidencio force-pushed the topic/runtime-improve-vcpu-allocation-on-host-side branch from 41bf243 to 28634ca on November 8, 2023 10:53
@fidencio (Member Author) commented Nov 8, 2023

I've rebased this on top of main, squashed @c3d's changes into my PR, added @c3d's Signed-off-by, and I'll write tests for it.
Please, do not review before I have the tests written.

@fidencio force-pushed the topic/runtime-improve-vcpu-allocation-on-host-side branch from 28634ca to ddeb9b1 on November 8, 2023 16:57
@fidencio requested a review from a team as a code owner on November 8, 2023 16:57
@fidencio (Member Author) commented Nov 8, 2023

When writing tests for this I noticed that we need to have something like #8404 present, so I just went ahead and opened a PR for that.

Once that one gets merged I will update this one with tests.

@fidencio force-pushed the topic/runtime-improve-vcpu-allocation-on-host-side branch 2 times, most recently from 2b0619a to 6d1eb8a on November 10, 2023 11:09
We don't have to do this since we're relying on the
`static_sandbox_resource_mgmt` feature, which gives us the correct
amount of memory and CPUs to be allocated.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
@fidencio force-pushed the topic/runtime-improve-vcpu-allocation-on-host-side branch 4 times, most recently from bc5e34a to 63bbcd3 on November 10, 2023 13:12
@fidencio (Member Author)

/test

@fidencio force-pushed the topic/runtime-improve-vcpu-allocation-on-host-side branch from c020e58 to 8fa500c on November 10, 2023 15:24
src/runtime/pkg/oci/utils.go (outdated review thread, resolved)
@c3d (Member) left a comment

Agree with @gkurz that there is a problem with the squashing. Other than that, looks good. Suggested a minor change.

src/runtime/pkg/oci/utils_test.go (outdated review thread, resolved)
src/runtime/pkg/oci/utils.go (outdated review thread, resolved)
With the change done in the last commit, instead of calculating milli
cpus we're actually converting the CPUs to a fractional number, a float.

Let's update the function name (and associated vars) to represent that
change.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
As we've done some changes in the VMM vcpu allocation, let's introduce
basic tests to make sure that we're getting the expected behaviour.

The test consists of checking 3 scenarios:
* default_vcpus = 0 | no limits set
  * this should allocate 1 vcpu
* default_vcpus = 0.75 | limits set to 0.25
  * this should allocate 1 vcpu
* default_vcpus = 0.75 | limits set to 1.2
  * this should allocate 2 vcpus

The tests are very basic, but they do ensure we're rounding things up
to what the new logic is supposed to do (a sketch of the idea follows
this commit message).

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
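
A sketch of these scenarios as a table-driven Go test. The `vcpus` helper is illustrative, encoding the assumption from the first scenario that a sandbox always gets at least one vCPU; the PR's real tests live in utils_test.go:

```go
package vcpu

import (
	"math"
	"testing"
)

// vcpus models the rounding behaviour under test: sum the default and
// the limit as floats, allocate at least one vCPU, and ceil once.
func vcpus(defaultVCPUs, limit float64) uint32 {
	total := defaultVCPUs + limit
	if total == 0 {
		total = 1
	}
	return uint32(math.Ceil(total))
}

func TestVCPUAllocation(t *testing.T) {
	tests := []struct {
		defaultVCPUs, limit float64
		want                uint32
	}{
		{0, 0, 1},       // default_vcpus = 0, no limits set
		{0.75, 0.25, 1}, // sums to exactly 1
		{0.75, 1.2, 2},  // 1.95 rounds up to 2
	}
	for _, tt := range tests {
		if got := vcpus(tt.defaultVCPUs, tt.limit); got != tt.want {
			t.Errorf("vcpus(%v, %v) = %d, want %d",
				tt.defaultVCPUs, tt.limit, got, tt.want)
		}
	}
}
```
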
@fidencio fidencio force-pushed the topic/runtime-improve-vcpu-allocation-on-host-side branch from 8fa500c to 849253e Compare November 10, 2023 17:26
@fidencio (Member Author)

@bergwolf, I've removed your approval as I'd like you to at least take a look at the last commit, the one adding the tests, as you wanted them added.

@gkurz (Member) left a comment

Code changes LGTM. Thanks @fidencio !

I don't have cycles to review the tests commit though.

@fidencio (Member Author)

/test

@@ -54,11 +54,6 @@ func validateHypervisorConfig(conf *HypervisorConfig) error {
conf.DefaultMaxVCPUs = defaultMaxVCPUs
}

if conf.ConfidentialGuest && conf.NumVCPUs != conf.DefaultMaxVCPUs {

Interesting addition to the series.

if err != nil || float64Value < 0 {
return fmt.Errorf(errAnnotationPositiveNumericKey, a.key)
}
if float64Value > math.MaxFloat32 {

Thanks for this one.
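
For context, a self-contained sketch of the validation pattern visible in this diff; names other than `float64Value` and `math.MaxFloat32` are simplified stand-ins for the real annotation plumbing:

```go
package annotations

import (
	"fmt"
	"math"
	"strconv"
)

// parseVCPUAnnotation sketches the pattern: parse the annotation as a
// float64, reject negative values, then reject anything that would
// overflow the float32 used internally for vCPU counts.
func parseVCPUAnnotation(key, value string) (float32, error) {
	float64Value, err := strconv.ParseFloat(value, 64)
	if err != nil || float64Value < 0 {
		return 0, fmt.Errorf("annotation %q must be a positive numeric value", key)
	}
	if float64Value > math.MaxFloat32 {
		return 0, fmt.Errorf("annotation %q does not fit in a float32", key)
	}
	return float32(float64Value), nil
}
```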

@c3d (Member) left a comment

Looks good to me. Thanks @fidencio !

load "${BATS_TEST_DIRNAME}/tests_common.sh"

setup() {
[ "${KATA_HYPERVISOR}" == "dragonball" ] && \

We need something more than dragonball to identify runtime-rs, as it now supports clh, and qemu/fc are on the way.


We can expand it later when other vmms are actually being tested.

@fidencio (Member Author)

Perfect, and this will end up covering more than just this test.
Thanks for the review, @bergwolf!

@bergwolf (Member) left a comment

lgtm!

@fidencio merged commit fd9b6d6 into kata-containers:main on Nov 14, 2023
138 of 142 checks passed
Labels: ok-to-test, size/large (Task of significant size)

Successfully merging this pull request may close these issues:

Resource allocation: Potential over-allocation of vCPUs in kata-shim leading to UVM bring-up failure

8 participants