runtime: Improve vCPU allocation for the VMMs #7623
Conversation
/test
LGTM.
For my review I searched for areas where there might be an integer to float conversion problem but couldn't find any.
For example, it looks like the toml parser [1] will convert automatically for our existing configs, e.g. default_vcpus = 1 [2].
Thank you Fabiano, this patch looks very good. One thought: there is a good amount of …
/test-arm-dragonball
Makes sense, thanks!
@fidencio Thanks for raising this! I wonder if it is possible to add an E2E test to validate the new allocation algorithm?
Hi @fidencio! Using your example:
What about keeping milli vCPUs all the way down to the VMM start/update?
This would allow us to keep the int32 fields and avoid all the casting. Makes sense?
One problem that I have with this approach is that we would need to change the semantics of what's exposed in the configuration file, I think. By this I mean a user would have to express 500 as 0.5 CPUs, and the "default 1" would have to change. While I'm not much against those changes, I think this is a big one that could get users confused.
As discussed privately on Slack, it seems that this PR is trying to address two separate issues:
Turning the int32 into a float32 would allow us to address (1). My proposal is to fix (2) by translating the float32 from the config into milli CPUs as an int32 and to only use the latter for all the rest of the computation. This would improve precision without having to rely on float64 and …
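The milli-vCPU idea can be sketched as follows; the helper names and rounding choices here are my own illustration (assumptions, not code from the PR), using Kubernetes-style milli units:

```go
package main

import (
	"fmt"
	"math"
)

// milliVCPUs converts a fractional vCPU count from the config (e.g. 0.5)
// into milli-vCPUs, the integer unit Kubernetes uses for CPU requests and
// limits. Rounding to the nearest milli keeps float noise out of the sum.
func milliVCPUs(vcpus float32) int32 {
	return int32(math.Round(float64(vcpus) * 1000))
}

// vmmVCPUs rounds a milli-vCPU total up to whole vCPUs, since a VMM
// cannot be started with a fraction of a vCPU.
func vmmVCPUs(milli int32) int32 {
	return (milli + 999) / 1000
}

func main() {
	config := milliVCPUs(0.5)             // vCPUs reserved for the VMM in the config
	limit := milliVCPUs(0.25)             // workload resource limit
	fmt.Println(vmmVCPUs(config + limit)) // 750m rounds up to 1 vCPU
}
```

All intermediate arithmetic stays in int32, so precision is exact down to the milli-vCPU with no float64 anywhere.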
The float32 representation is sufficient to reliably represent a number of vCPUs for all the cases we will care about in the foreseeable future: we can compute more than 2^20 vCPUs and the precision of the computation is below 1 part per million.

A few more changes were made to Fabiano's original approach to avoid having to do the same rounding operation everywhere. The `NumVCPUs` field was replaced with a `float32` equivalent, `NumVCPUsF`, and a helper function, `NumVCPUs()`, can be used to generate a `uint32`.

There was a concern about having a multiplicity of casts, but in practice, most of the casts were really from the `uint32` returned by `NumVCPUs()` to some other integer type, typically `int` or `int32`.

Doing this work exposed a latent issue in the previous commit, where some pre-existing casts would have rounded the number of vCPUs [down instead of up][1].

[1]: kata-containers#7623 (comment)

Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
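A minimal sketch of the `NumVCPUsF` field and its rounding helper: the struct is trimmed to the one relevant field, and the helper body is my assumption of the obvious ceiling implementation, not the commit's exact code.

```go
package main

import (
	"fmt"
	"math"
)

// HypervisorConfig is trimmed to the one field relevant here, mirroring
// the commit's float32 NumVCPUsF.
type HypervisorConfig struct {
	NumVCPUsF float32
}

// NumVCPUs rounds the fractional vCPU count up to a whole number of
// vCPUs, so one helper performs the rounding instead of scattering the
// same ceil-and-cast operation across the code base.
func (c *HypervisorConfig) NumVCPUs() uint32 {
	return uint32(math.Ceil(float64(c.NumVCPUsF)))
}

func main() {
	c := HypervisorConfig{NumVCPUsF: 0.75}
	fmt.Println(c.NumVCPUs()) // 0.75 vCPUs still needs 1 whole vCPU
}
```

Rounding up rather than down matters here: truncating casts would silently under-allocate, which is exactly the latent issue the commit message mentions.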
Folks, I'm taking some time today to go through Greg's suggestion and hopefully have it closed / merged. :-)
When writing tests for this I noticed that we need to have something like #8404 present, so I just went ahead and opened a PR for that. Once that one gets merged I will update this one with tests.
We don't have to do this since we're relying on the `static_sandbox_resource_mgmt` feature, which gives us the correct amount of memory and CPUs to be allocated.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
/test
Agree with @gkurz that there is a problem with the squashing. Other than that, looks good. Suggested a minor change.
First of all, this is a controversial piece, and I know that.

In this commit we're trying to take a less greedy approach regarding the amount of vCPUs we allocate for the VMM, which will be advantageous mainly when using the `static_sandbox_resource_mgmt` feature, which is used by the confidential guests.

The current approach we have basically does:
* Gets the amount of vCPUs set in the config (an integer)
* Gets the amount of vCPUs set as limit (an integer)
* Sums those up
* Starts / updates the VMM to use that total amount of vCPUs

The fact we're dealing with integers is logical, as we cannot request 500m vCPUs from the VMMs. However, it leads us, in several cases, to waste one vCPU.

Let's take the example where we know the VMM requires 500m vCPUs to be running, and the workload sets 250m vCPUs as a resource limit. In that case, we'd do:
* Gets the amount of vCPUs set in the config: 1
* Gets the amount of vCPUs set as limit: ceil(0.25)
* 1 + ceil(0.25) = 1 + 1 = 2 vCPUs
* Starts / updates the VMM to use 2 vCPUs

With the logic changed here, what we're doing is considering everything as a float until just before we start / update the VMM. So, the flow described above would be:
* Gets the amount of vCPUs set in the config: 0.5
* Gets the amount of vCPUs set as limit: 0.25
* ceil(0.5 + 0.25) = 1 vCPU
* Starts / updates the VMM to use 1 vCPU

In the way I've written this patch we introduce zero regressions, as the default values set are still the same, and those will only be changed for the TEE use cases (although I can see firecracker, or any other user of `static_sandbox_resource_mgmt=true`, taking advantage of this).

There's, though, an implicit assumption in this patch that we'd need to make explicit, and that's that the default_vcpus / default_memory is the amount of vCPUs / memory required by the VMM, and absolutely nothing else. Also, the amount set there should be reflected in the podOverhead for the specific runtime class.
One other possible approach, which I am not that much in favour of taking as I think it's **less clear**, is that we could actually get the podOverhead amount, subtract it from the default_vcpus (treating the result as a float), then sum up what the user set as limit (as a float), and finally ceil the result. It could work, but IMHO this is **less clear** and **less explicit** about what we're actually doing and how the default_vcpus / default_memory should be used.

Fixes: kata-containers#6909

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
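The difference between the two flows in the commit message can be condensed into a worked sketch; the function names here are illustrative, not the runtime's actual API:

```go
package main

import (
	"fmt"
	"math"
)

// oldAlloc mirrors the integer flow: each term is already rounded up
// before the sum, so fractional vCPUs are wasted.
func oldAlloc(defaultVCPUs, limit float64) uint32 {
	return uint32(math.Ceil(defaultVCPUs) + math.Ceil(limit))
}

// newAlloc mirrors the float flow: keep fractions all the way and do a
// single ceil just before starting / updating the VMM.
func newAlloc(defaultVCPUs, limit float64) uint32 {
	return uint32(math.Ceil(defaultVCPUs + limit))
}

func main() {
	// VMM needs 500m vCPUs, workload limit is 250m vCPUs.
	fmt.Println(oldAlloc(1, 0.25))   // 1 + ceil(0.25) = 2 vCPUs
	fmt.Println(newAlloc(0.5, 0.25)) // ceil(0.75)     = 1 vCPU
}
```

The saving comes from rounding once at the end instead of once per term, which can never make the result larger than the old scheme's.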
With the change done in the last commit, instead of calculating milli CPUs, we're actually converting the CPUs to a fractional number, a float. Let's update the function name (and associated vars) to represent that change.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
As we've done some changes in the VMM vCPU allocation, let's introduce basic tests to make sure that we're getting the expected behaviour.

The test consists of checking 3 scenarios:
* default_vcpus = 0 | no limits set
  * this should allocate 1 vCPU
* default_vcpus = 0.75 | limits set to 0.25
  * this should allocate 1 vCPU
* default_vcpus = 0.75 | limits set to 1.2
  * this should allocate 2 vCPUs

The tests are very basic, but they do ensure we're rounding things up to what the new logic is supposed to do.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
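The three scenarios translate into simple assertions. This sketch uses a stand-in `allocVCPUs` (my assumption of the rounding rule, including a floor of one vCPU) rather than the actual runtime code:

```go
package main

import (
	"fmt"
	"math"
)

// allocVCPUs is a stand-in for the runtime's allocation logic: sum the
// fractional default and limit, round up once, and never go below 1 vCPU.
func allocVCPUs(defaultVCPUs, limit float64) uint32 {
	n := uint32(math.Ceil(defaultVCPUs + limit))
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	fmt.Println(allocVCPUs(0, 0))       // default 0, no limits -> 1 vCPU
	fmt.Println(allocVCPUs(0.75, 0.25)) // ceil(1.0)            -> 1 vCPU
	fmt.Println(allocVCPUs(0.75, 1.2))  // ceil(1.95)           -> 2 vCPUs
}
```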
@bergwolf, I've removed your approval as I'd like you to at least take a look at the last commit, the one adding the tests, as you wanted to have them added.
Code changes LGTM. Thanks @fidencio !
I don't have cycles to review the tests commit though.
/test
@@ -54,11 +54,6 @@ func validateHypervisorConfig(conf *HypervisorConfig) error {
	conf.DefaultMaxVCPUs = defaultMaxVCPUs
}

if conf.ConfidentialGuest && conf.NumVCPUs != conf.DefaultMaxVCPUs {
Interesting addition to the series.
if err != nil || float64Value < 0 {
	return fmt.Errorf(errAnnotationPositiveNumericKey, a.key)
}
if float64Value > math.MaxFloat32 {
Thanks for this one.
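The hunk above validates a float annotation value. A self-contained sketch of that pattern might look like this; the function name and error strings are assumptions, only the two checks come from the diff:

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// parsePositiveFloat32 applies the two checks from the hunk: the value
// must parse as a non-negative number, and it must fit in a float32.
func parsePositiveFloat32(key, value string) (float32, error) {
	f, err := strconv.ParseFloat(value, 64)
	if err != nil || f < 0 {
		return 0, fmt.Errorf("annotation %q must be a positive number", key)
	}
	if f > math.MaxFloat32 {
		return 0, fmt.Errorf("annotation %q does not fit in a float32", key)
	}
	return float32(f), nil
}

func main() {
	v, err := parsePositiveFloat32("default_vcpus", "0.75")
	fmt.Println(v, err)
}
```

Parsing as float64 first and range-checking against `math.MaxFloat32` avoids the silent `+Inf` that a direct 32-bit parse overflow would produce.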
Looks good to me. Thanks @fidencio !
load "${BATS_TEST_DIRNAME}/tests_common.sh"

setup() {
	[ "${KATA_HYPERVISOR}" == "dragonball" ] && \
We need something more than dragonball to identify runtime-rs, as it now supports clh, and qemu/fc are on the way.
We can expand it later when other vmms are actually being tested.
Perfect, and this will end up covering more than just this test.
Thanks for the review, @bergwolf!
lgtm!