
Add poweroff hardware cleanup step after Tinkerbell E2E tests #8140

Open
wants to merge 3 commits into main
Conversation

cxbrowne1207
Member

@cxbrowne1207 cxbrowne1207 commented May 10, 2024

Issue #, if available:

Description of changes:
This PR ensures that the Tinkerbell hardware used by a test is powered off after a run. A few changes were made as a result:

  • Changed --cleanup-vms to --cleanup-resources to make it more generic (because Tinkerbell does not use VMs).
  • Changed the E2E test Provider.CleanupVMs method to Provider.CleanupResources.
  • Moved powerOffHardware out of ValidateHardwareDecommissioned so it does not run twice at the end of existing tests.
  • Implemented the E2E test Tinkerbell Provider.CleanupResources method to power off hardware machines (sketched below).
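
For reference, a minimal sketch of the shape the renamed provider hook could take. The Tinkerbell struct field and the delegation to internal/test/cleanup are assumptions for illustration, not the exact diff:

```go
package e2e

import (
	"context"

	"github.com/aws/eks-anywhere/internal/test/cleanup"
)

// Tinkerbell stands in for the E2E test provider here; the field name is
// an assumption for illustration.
type Tinkerbell struct {
	inventoryCsvFilePath string
}

// CleanupResources powers down the hardware used by a test run by
// delegating to the shared cleanup helper.
func (t *Tinkerbell) CleanupResources(_ string) error {
	return cleanup.TinkerbellTestResources(context.Background(), t.inventoryCsvFilePath, true /* ignoreErrors */)
}
```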

Testing (if applicable):

Documentation added/planned (if applicable):

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@eks-distro-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from cxbrowne1207. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@eks-distro-bot eks-distro-bot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) May 10, 2024
@cxbrowne1207 cxbrowne1207 changed the title Add Tinkerbell cleanup step after E2E tests Add poweroff hardware cleanup step after Tinkerbell E2E tests May 10, 2024
@cxbrowne1207 cxbrowne1207 changed the title Add poweroff hardware cleanup step after Tinkerbell E2E tests [WIP] Add poweroff hardware cleanup step after Tinkerbell E2E tests May 10, 2024

codecov bot commented May 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.39%. Comparing base (a65c62d) to head (af8985f).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8140      +/-   ##
==========================================
- Coverage   73.40%   73.39%   -0.01%     
==========================================
  Files         577      577              
  Lines       35900    35900              
==========================================
- Hits        26352    26350       -2     
- Misses       7882     7883       +1     
- Partials     1666     1667       +1     


@@ -0,0 +1,17 @@
package test
Member

I can understand why this might make sense to go here, but if we ever want to get to a point where we can make our e2e test tool a completely different Go module, remove the spaghetti dependency tree, and drastically reduce the imports of our actual product code, eksctl-anywhere, we should not continue to add to this package tree. I recommend that we try to stick with the package tree that starts with the top-level test directory/package. The internal/test/cleanup package would be OK for now, but all of that cleanup code should ideally live in the top-level test directory/package tree as well.

Also, I see that this function probably went here so that it can be consumed in two locations. Duplicating this code where it's used, as opposed to creating a new dependency, might be a good thing here. This is not a lot of code, and to me, taking on the extra code over the dependency makes sense. "A little copying is better than a little dependency." - https://go-proverbs.github.io/

Member Author

Thanks for calling this out.

Yeah, it went here so that it could be consumed in two locations. That said, I agree with the principle that a little copying is better than a little dependency, especially since it's not a lot of code. I will go with duplicating this rather than introducing the dependency.

@@ -189,3 +195,73 @@ func NutanixTestResources(clusterName, endpoint, port string, insecure, ignoreEr
}
return nil
}

// TinkerbellTestResources cleans up machines by powering them down.
func TinkerbellTestResources(ctx context.Context, inventoryCSVFilePath string, ignoreErrors bool) error {
Member

You moved other code from "vms" to "machines", but this function is named for "resources". Are resources different from machines?

Member Author

In this specific instance, TinkerbellTestResources is intended to handle machines. I named it this way to align with the existing naming convention for these cleanups. The term "resources" was used more broadly to encompass potential future expansion where we might manage other types of test resources, not just machines. However, I see how this could lead to confusion given our current usage.

I could rename this to TinkerbellTestMachines, and we could always expand around it if needed. However, the --cleanup-vms flag originally seemed to be just for cleaning up VMs, but it has since evolved to clean up more for other providers, for example, cleaning up duplicate networks for CloudStack. So, I think the better call is to change the "machines" vernacular to "resources". I'll make that change (a sketch of the resulting shape is below).
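
For context, a minimal sketch of this function under the "resources" naming, assuming hypothetical readMachinesFromCSV and powerOffHardware helpers in place of the repo's real hardware parsing and bmclib calls:

```go
// TinkerbellTestResources cleans up test hardware by powering each
// listed machine down.
func TinkerbellTestResources(ctx context.Context, inventoryCSVFilePath string, ignoreErrors bool) error {
	// Hypothetical helper: parse BMC IP/user/pass rows from the inventory CSV.
	machines, err := readMachinesFromCSV(inventoryCSVFilePath)
	if err != nil {
		return fmt.Errorf("reading hardware inventory %s: %v", inventoryCSVFilePath, err)
	}
	for _, machine := range machines {
		// Power down each machine; when ignoreErrors is set, failures are
		// swallowed so one bad BMC does not abort the rest of the cleanup.
		if err := powerOffHardware(ctx, machine); err != nil && !ignoreErrors {
			return err
		}
	}
	return nil
}
```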

return nil
}

if _, err := bmcClient.SetPowerState(ctx, "off"); err != nil {
Member

We talked internally about the importance of machines being off (because of duplicate DHCP servers running) before each test run. The current model, which I know this PR is mostly just moving around rather than adding new code to, doesn't give us any guarantee that the hardware is actually powered off. Do we want to address that kind of guarantee in this PR?

Member Author

You're right; I agree that this might be a good opportunity to introduce a mechanism to verify that the hardware is indeed powered off before tests commence.

I believe we could add some retries on failure to address this (sketched below). Even if the retries fail, we could raise some kind of alert that requires manual intervention. We currently don't fail the test in the cleanup stage, though failing there would be the most direct way to guarantee the power-off and raise the alarm about machines that are still on. I'd love to hear your thoughts on that and whether it should change.

That said, I would like to keep this PR focused on the current task and address the guarantee issue in another. Let me know if that works for you.
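
To make the idea concrete, here is a rough sketch of retry-and-verify; the powerClient interface just mirrors the bmclib-style calls in this diff, and the attempt count and backoff are placeholders:

```go
package cleanup

import (
	"context"
	"fmt"
	"strings"
	"time"
)

// powerClient mirrors the bmclib-style calls used in this PR; it is
// defined here only to keep the sketch self-contained.
type powerClient interface {
	SetPowerState(ctx context.Context, state string) (bool, error)
	GetPowerState(ctx context.Context) (string, error)
}

// ensurePoweredOff re-issues the power-off and confirms the resulting
// state, backing off between attempts. This is the retry-and-verify idea
// from the discussion, not code from this PR.
func ensurePoweredOff(ctx context.Context, bmc powerClient, attempts int) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		if i > 0 {
			time.Sleep(5 * time.Second) // back off before retrying
		}
		if _, err := bmc.SetPowerState(ctx, "off"); err != nil {
			lastErr = err
			continue
		}
		state, err := bmc.GetPowerState(ctx)
		if err != nil {
			lastErr = err
			continue
		}
		if strings.EqualFold(state, "off") {
			return nil
		}
		lastErr = fmt.Errorf("unexpected power state %q", state)
	}
	return fmt.Errorf("hardware not confirmed off after %d attempts: %v", attempts, lastErr)
}
```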

return nil
// CleanupMachines cleans up the Tinkerbell machines by simply powering them down.
func (t *Tinkerbell) CleanupMachines(_ string) error {
ctx, done := context.WithTimeout(context.Background(), 2*time.Minute)
Member

This context appears to be shared between all bmclib calls to all machines in each test run. This means that all machines have a combined budget of 2 minutes for every GetPowerState and SetPowerState call that runs. That might not be enough time; BMCs can be notoriously slow. Also, if any single call takes the full duration, all subsequent calls will fail because the context will have been canceled/timed out. I recommend adding the timeout in powerOffHardware.

Member Author

Ah, thanks, good catch. I'll move it there as suggested.
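
For clarity, a sketch of that fix with the timeout scoped inside powerOffHardware, so one slow BMC consumes only its own budget instead of the whole run's; powerClient is the same illustrative interface as in the sketch above:

```go
// powerOffHardware powers a single machine down, with a per-machine
// timeout so a slow call cannot cancel the context for later machines.
func powerOffHardware(ctx context.Context, bmc powerClient) error {
	ctx, done := context.WithTimeout(ctx, 2*time.Minute) // per-machine budget
	defer done()
	if _, err := bmc.SetPowerState(ctx, "off"); err != nil {
		return fmt.Errorf("powering off hardware: %v", err)
	}
	return nil
}
```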

@cxbrowne1207 cxbrowne1207 changed the title [WIP] Add poweroff hardware cleanup step after Tinkerbell E2E tests Add poweroff hardware cleanup step after Tinkerbell E2E tests May 15, 2024