KEP-4622: New TopologyManager Policy: max-allowable-numa-nodes #4624

Open
wants to merge 1 commit into
base: master
from

cyclinder

  • One-line PR description: New TopologyManager Policy: max-allowable-numa-nodes

@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label May 8, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 8, 2024
@cyclinder cyclinder marked this pull request as draft May 8, 2024 08:22
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 8, 2024
@cyclinder
Author

Hi @klueska @ffromani, here is a draft KEP. I would appreciate it if you could review it!

We expect no non-infra related flakes in the last month as a GA graduation criteria.
-->

TBD
Author

Do we also need an e2e test here?

Contributor

@ffromani ffromani May 8, 2024

For beta level it is strongly encouraged, if not actually required; I need to double-check.

@cyclinder cyclinder marked this pull request as ready for review May 8, 2024 08:34
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 8, 2024
@k8s-ci-robot k8s-ci-robot requested a review from mrunalp May 8, 2024 08:34
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 8, 2024
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 8, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cyclinder
Once this PR has been reviewed and has the lgtm label, please assign johnbelamaric for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the area/enhancements Issues or PRs related to the Enhancements subproject label May 8, 2024
@@ -45,7 +45,7 @@ cd "${ROOT}"
RES=0
echo "Checking spelling..."
ERROR_LOG="${TMP_DIR}/errors.log"
-git ls-files | grep -v vendor | xargs misspell > "${ERROR_LOG}"
+git ls-files | grep -v vendor | xargs misspell -i $(grep -v '#' hack/.spelling_ignorewords) > "${ERROR_LOG}"
Contributor

I don't think this change belongs in this PR.

Author

Do we need to open another PR for it?

Contributor

Possibly, but I don't get why we need this change in this PR.

Author

This is to fix a CI failure: my name (cyclinder) did not pass the misspell check, so I made these changes. I am not sure whether I need to open a new PR for it, so I put it in this PR for now. I can open another PR once that is confirmed.

Contributor

The thing is, usernames should not be spell-checked in the first place :\ I'll have a look.

Contributor

OK, that's the failure: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/enhancements/4624/pull-enhancements-verify/1788125378728955904. I strongly believe it should be a separate PR. Pending conversation, I think kep.yaml should not be spell-checked in the first place.

Author

I think this expands the scope a bit. Checking kep.yaml is OK, but the check should not include the author name.

Contributor

It's fair also to exclude usernames from spell-checking kep.yaml. Even in this case it should be a separate PR though.

Author

Okay, maybe it's hard to make misspell exclude usernames when spell-checking kep.yaml :(
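
For reference: the header of the ignore file in this PR describes the `-i` value as comma separated, while the `grep -v '#'` substitution in the diff emits one word per line and is unquoted, so as far as shell word splitting goes it only passes a single `-i` argument while the file holds one entry. A minimal sketch of joining the entries first, assuming a hypothetical helper named `build_ignore_list` (not part of this PR) and a made-up word list:

```shell
#!/usr/bin/env bash
# Sketch only: build_ignore_list is a hypothetical helper, not part of the PR.

# Print the non-comment, non-blank lines of a word list as one
# comma-separated string.
build_ignore_list() {
  local file="$1"
  grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "${file}" | paste -sd, -
}

# Demo with a temporary word list (contents are made up for illustration).
demo_file="$(mktemp)"
printf '%s\n' '# comment line' 'cyclinder' 'fooa' > "${demo_file}"
build_ignore_list "${demo_file}"
rm -f "${demo_file}"
```

Under these assumptions the check could be invoked as `misspell -i "$(build_ignore_list hack/.spelling_ignorewords)"`, quoted so that the whole list reaches `-i` as a single argument.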

@@ -0,0 +1,2 @@
+# misspell ignore the following corrections, comma separated: fooa,boob
+cyclinder
Contributor

bad rebase?

Author

Sorry, I don't understand what you mean. Could you explain more?

Contributor

I'm wondering why we need this change at all for this KEP.

Author

My name (cyclinder) did not pass the misspell check, so I made these changes. I am not sure whether I need to open a new PR for it, so I put it in this PR for now. I can open another PR once that is confirmed.

@@ -0,0 +1,39 @@
title: New TopologyManager Policy which configure the value of maxAllowableNUMANodes
Contributor

This whole file looks fine to me.

Author

thanks for the review!

@bart0sh bart0sh added this to Needs Reviewer in SIG Node PR Triage May 8, 2024
@cyclinder
Author

Hi @klueska, do you have any comments on these files?

Comment on lines 1 to 23
<!--
**Note:** When your KEP is complete, all of these comment blocks should be removed.

To get started with this template:

- [ ] **Pick a hosting SIG.**
Make sure that the problem space is something the SIG is interested in taking
up. KEPs should not be checked in without a sponsoring SIG.
- [ ] **Create an issue in kubernetes/enhancements**
When filing an enhancement tracking issue, please make sure to complete all
fields in that template. One of the fields asks for a link to the KEP. You
can leave that blank until this KEP is filed, and then go back to the
enhancement and add the link.
- [ ] **Make a copy of this template directory.**
Copy this template into the owning SIG's directory and name it
`NNNN-short-descriptive-title`, where `NNNN` is the issue number (with no
leading-zero padding) assigned to your enhancement above.
- [ ] **Fill out as much of the kep.yaml file as you can.**
At minimum, you should fill in the "Title", "Authors", "Owning-sig",
"Status", and date-related fields.
- [ ] **Fill out this file as best you can.**
At minimum, you should fill in the "Summary" and "Motivation" sections.
These should be easy if you've preflighted the idea of the KEP with the
Contributor

Feel free to remove tutorial comments like the one I'm partially quoting once you have filled in the relevant section.

Author

Thanks, updating it now.

know that this has succeeded?
-->
- Introduce a new TopologyManager Policy Option called `max-allowable-numa-nodes`.
- Improve the topology manager to remove the state explosion.
Contributor

Is this a goal of this KEP? I think it should be a non-goal: we want to let users configure this limit, but we do not aim to change the topology manager's internal logic to do its computations more efficiently. Or do we? That would be a very significant scope increase.

Author

I think you are right, this should be a non-goal here. Updating it now.
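
To make the "state explosion" concern concrete: assuming the candidate NUMA affinities are the non-empty subsets of the node's NUMA nodes (an assumption made for illustration, not a statement about the kubelet's exact internals), their count is 2^n - 1 and roughly doubles with each added NUMA node. A small sketch with a hypothetical `count_numa_masks` helper:

```shell
#!/usr/bin/env bash
# Illustration only: count_numa_masks is a hypothetical helper.
# Assumption: one candidate affinity per non-empty subset of NUMA nodes.
count_numa_masks() {
  local numa_nodes="$1"
  echo $(( (1 << numa_nodes) - 1 ))
}

# Show how quickly the number of candidates grows.
for n in 2 4 8 16; do
  printf '%2d NUMA nodes -> %d candidate masks\n' "${n}" "$(count_numa_masks "${n}")"
done
```

Under this assumption, 8 NUMA nodes give 255 candidate masks and 16 give 65535, which is why raising the limit is left as an explicit user choice rather than a change to the enumeration itself.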

and make progress.
-->

- This proposal does not aim to modify the existing TopologyManager Policies. It focuses solely on introducing a new policy for spreading the max allowable numa nodes.
Contributor

suggested change:
"a new policy for spreading the max allowable numa nodes" -> "a new policy option to let users configure the maximum supported number of NUMA nodes"

Author

thanks!

[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

## Summary
Contributor

In general, throughout this document: NUMA is an acronym, so it should be written in uppercase (i.e. neither "numa" nor "Numa").

bogged down.
-->

#### Story 1
Contributor

We can add the story from the issue. I think I can probably find another user story; let me get back to you on this.

Author

thanks! It's a little hard for me to find a user story.

implementation difficulties, etc.).
-->

N/A
Contributor

Perhaps we can abuse metrics to report the configured value? Let's hear from other reviewers.

Author

/cc @klueska


[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
-->
No
Contributor

Uhm, right, I think we don't have SLIs/SLOs about pod admission time in the kubelet. Or do we?

Author

As far as I know we don't have this, but it would be better for other reviewers to confirm.

Are there any tests that were run/should be run to understand performance characteristics better
and validate the declared limits?
-->
No
Contributor

I'm not sure how "state explosion" would lead to more memory being used by the kubelet, if at all. However, this should not cause "resource exhaustion", and we can defer this to the GA graduation.

@k8s-ci-robot k8s-ci-robot requested a review from klueska May 16, 2024 05:49
…AllowableNUMANodes

Signed-off-by: cyclinder <kuocyclinder@gmail.com>
@klueska
Contributor

klueska commented May 16, 2024

I've added this to the tracking sheet for 1.31:
https://docs.google.com/document/d/1U10J0WwgWXkdYrqWGGvO8iH2HKeerQAlygnqgDgWv4E/edit

@ffromani please let me know when this is ready for me to review

Labels
area/enhancements Issues or PRs related to the Enhancements subproject cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
Status: Needs Reviewer
SIG Node PR Triage
Needs Reviewer

4 participants