Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add go vet to the Makefile and fix its errors #691

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

ronensc
Copy link

@ronensc ronensc commented Nov 15, 2023

Issue link

What changes have been made

  • Added go vet to the Makefile
  • Fixed its errors

Verification steps

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

Copy link

openshift-ci bot commented Nov 15, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign asm582 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ronensc
Copy link
Author

ronensc commented Nov 28, 2023

@anishasthana I think MCAD-CI is flaky as I don't think its failure is related to the changes in this PR. Could you please try rerunning it?

@ronensc
Copy link
Author

ronensc commented Nov 28, 2023

@anishasthana
Copy link
Member

Hmm @asm582 are you aware of any recent issues with CI?

@dgrove-oss
Copy link

I observed when porting the test suite to MCAD v2 that many of the tests only work as expected if the capacity of the cluster is calculated to be about 2 CPUs. If the cluster has more capacity, then the tests that expect certain AppWrappers to not be runnable or to be preempted because higher-priority AppWrappers will have taken all the available capacity fail. The failing test here is one of those. Looking at the logs, I see that the controller thinks the available capacity is about 6 CPUs

 The available capacity to dispatch appwrapper is cpu 5950.00, memory 32085303296.00, GPU 16 and time took to calculate is 8.462585ms

In the v2 test suite, I am redoing all of these tests to be based on a % of total CPU capacity instead of using hardwired resource values to avoid this fragility.

@asm582
Copy link
Member

asm582 commented Nov 29, 2023

Thank you for the PR. I am curious, if we are doing %-based CPU calculation then are we diverging from the way Scheduler does accounting? and would that be ok?

@asm582
Copy link
Member

asm582 commented Nov 29, 2023

Hmm @asm582 are you aware of any recent issues with CI?

I usually set 2 CPUs and 8 GB RAM as resource setting in podman or docker desktop and they would run to completion locally. I am not sure if we changed resource requirement in the CI system

@dgrove-oss
Copy link

To be clear, tests like this are what are fragile in my opinion: https://github.com/project-codeflare/multi-cluster-app-dispatcher/blob/main/test/e2e/queue.go#L97-L125 because they are using hardwired CPU requests (not CPU requests that are calculated dynamically by the test as a fraction of the available capacity on the actual cluster when it is run).

Here it creates an AppWrapper with a 1100 CPU request (2 pods at 550 each) and then assumes that if it creates a 854 CPU request (2 pods at 427 each..the function name is misleading) it won't fit. If docker is limited to 2 CPUs, then the worker nodes have 4000 CPU capacity, Kubernetes + the mcad operator take around 2100 CPU and the math works as expected. The slightest change in available resource (or in the resources used by the system pods) and the test doesn't work as expected.

@ronensc
Copy link
Author

ronensc commented Dec 4, 2023

@dgrove-oss thanks for the detailed explanation of the root cause of the CI failure!
I noticed that these fragile tests were commented out temporarily in mcad v2. Should we adopt a similar approach here? Can this PR be approved even though the CI is currently failing?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants