
Skipping MCAD CPU Preemption Test #696

Open
wants to merge 29 commits into base: main

Conversation

Fiona-Waters
Contributor

Skipping the MCAD CPU Preemption Test, which is failing intermittently on PRs, so that we can get some outstanding PRs merged.
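
A minimal sketch of how such a skip might look in a Ginkgo-based e2e test (the exact change in this PR is not quoted in the thread, so the test name and skip message below are illustrative):

It("MCAD CPU Preemption Test", func() {
	// Hypothetical skip: disable the flaky case until the intermittent failure is understood.
	Skip("Skipping MCAD CPU Preemption Test: failing intermittently on PRs")

	// ... original test body would remain below the Skip call ...
})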


openshift-ci bot commented Dec 7, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign anishasthana for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ronensc

ronensc commented Dec 7, 2023

For future reference, the root cause analysis of the test's failure has been conducted by @dgrove-oss, and it can be found here:
#691 (comment)

@Fiona-Waters
Contributor Author

For future reference, the root cause analysis of the test's failure has been conducted by @dgrove-oss, and it can be found here: #691 (comment)

Thanks @ronensc, that's good to know!

@dgrove-oss

I don't think it's worth backporting, but I did redo these test cases for mcad v2 to be robust against different cluster sizes in project-codeflare/mcad#83

@Fiona-Waters
Contributor Author

More investigation is required as to why these tests are failing. Closing this PR.

@asm582 removed the request for review from metalcycling on December 17, 2023 at 20:42
Contributor

@KPostOffice left a comment


Looks good. I like the move to more generic tests. One question.

//aw := createDeploymentAWwith550CPU(context, appendRandomString("aw-deployment-2-550cpu"))
cap := getClusterCapacitycontext(context)
resource := cpuDemand(cap, 0.275).String()
aw := createGenericDeploymentCustomPodResourcesWithCPUAW(
Contributor

What happens if the cluster has many smaller nodes, resulting in a high total capacity but an inability to schedule AppWrappers because they do not fit on the individual nodes? Do we care about that at all in this test case?

Member

From a test case perspective, the cluster is assumed to have homogeneous nodes, and the test requests deployments that fit on a node in the cluster in the CPU dimension.
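
A minimal sketch of what a cpuDemand-style helper could look like under that homogeneity assumption (the real implementation is not quoted in this thread; the MilliCPU field and import paths below are assumptions):

import (
	"k8s.io/apimachinery/pkg/api/resource"

	clusterstateapi "github.com/project-codeflare/multi-cluster-app-dispatcher/pkg/controller/clusterstate/api"
)

// cpuDemand returns a CPU quantity equal to the given fraction of the total
// cluster CPU capacity, e.g. cpuDemand(cap, 0.275) requests 27.5% of the cluster.
func cpuDemand(capacity *clusterstateapi.Resource, fraction float64) *resource.Quantity {
	// Assumes clusterstateapi.Resource tracks CPU in millicores via MilliCPU.
	milli := int64(float64(capacity.MilliCPU) * fraction)
	return resource.NewMilliQuantity(milli, resource.DecimalSI)
}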

Contributor Author

@Fiona-Waters left a comment


This looks great, so happy to move forward with this improvement. Just a couple of small comments.

@@ -793,6 +795,36 @@ func createDeploymentAWwith550CPU(context *context, name string) *arbv1.AppWrapp
return appwrapper
}

func getClusterCapacitycontext(context *context) *clusterstateapi.Resource {
capacity := clusterstateapi.EmptyResource()
nodes, _ := context.kubeclient.CoreV1().Nodes().List(context.ctx, metav1.ListOptions{})
Contributor Author

We should handle the error here.
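
A hedged sketch of one way to handle it, assuming Gomega's Expect is available in this test file (as the suggestion further down implies):

nodes, err := context.kubeclient.CoreV1().Nodes().List(context.ctx, metav1.ListOptions{})
// Fail the test immediately if the node list cannot be retrieved, rather than
// silently computing capacity from an empty list.
Expect(err).NotTo(HaveOccurred())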

podList, err := context.kubeclient.CoreV1().Pods("").List(context.ctx, metav1.ListOptions{FieldSelector: labelSelector})
// TODO: when no pods are listed, do we send entire node capacity as available
// this will cause false positive dispatch.
if err != nil {
Contributor Author

Should the error be caught like this instead?

Suggested change
if err != nil {
Expect(err).NotTo(HaveOccurred())
