
Add poweroff hardware cleanup step after Tinkerbell E2E tests #8140

Open
wants to merge 3 commits into main
Conversation

cxbrowne1207
Member

@cxbrowne1207 cxbrowne1207 commented May 10, 2024

Issue #, if available:

Description of changes:
This PR ensures that the Tinkerbell hardware used by a test is powered off after a run. A few changes were made as a result:

  • Changed --cleanup-vms to --cleanup-resources to make it more generic (because Tinkerbell does not use VMs).
  • Changed the E2E test Provider.CleanupVMs method to Provider.CleanupResources.
  • Moved powerOffHardware out of ValidateHardwareDecommissioned so it does not run twice at the end of existing tests.
  • Implemented the E2E test Tinkerbell Provider.CleanupResources method to power off hardware machines (sketched below).
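
For reference, a minimal sketch of the shape the renamed provider hook could take. The Tinkerbell struct field and the delegation to internal/test/cleanup are assumptions for illustration, not the exact diff:

```go
package e2e

import (
	"context"

	"github.com/aws/eks-anywhere/internal/test/cleanup"
)

// Tinkerbell stands in for the E2E test provider here; the field name is
// an assumption for illustration.
type Tinkerbell struct {
	inventoryCsvFilePath string
}

// CleanupResources powers down the hardware used by a test run by
// delegating to the shared cleanup helper.
func (t *Tinkerbell) CleanupResources(_ string) error {
	return cleanup.TinkerbellTestResources(context.Background(), t.inventoryCsvFilePath, true /* ignoreErrors */)
}
```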

Testing (if applicable):

Documentation added/planned (if applicable):

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@eks-distro-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from cxbrowne1207. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@eks-distro-bot eks-distro-bot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) May 10, 2024
@cxbrowne1207 cxbrowne1207 changed the title Add Tinkerbell cleanup step after E2E tests Add poweroff hardware cleanup step after Tinkerbell E2E tests May 10, 2024
@cxbrowne1207 cxbrowne1207 changed the title Add poweroff hardware cleanup step after Tinkerbell E2E tests [WIP] Add poweroff hardware cleanup step after Tinkerbell E2E tests May 10, 2024

codecov bot commented May 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.39%. Comparing base (a65c62d) to head (af8985f).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8140      +/-   ##
==========================================
- Coverage   73.40%   73.39%   -0.01%     
==========================================
  Files         577      577              
  Lines       35900    35900              
==========================================
- Hits        26352    26350       -2     
- Misses       7882     7883       +1     
- Partials     1666     1667       +1     


@@ -0,0 +1,17 @@
package test
Member

I can understand why this might make sense to go here, but if we ever want to get to a point where we can make our e2e test tool a completely different Go module, remove the spaghetti dependency tree, and drastically reduce the imports of our actual product code, eksctl-anywhere, we should not continue to add to this package tree. I recommend that we try to stick with the package tree that starts with the top-level test directory/package. The internal/test/cleanup package would be OK for now, but all of that cleanup code should ideally live in the top-level test directory/package tree as well.

Also, I see that this function probably went here so that it can be consumed in two locations. Duplicating this code where it's used, as opposed to creating a new dependency, might be a good thing here. This is not a lot of code, and to me, taking on the extra code over the dependency makes sense. "A little copying is better than a little dependency." - https://go-proverbs.github.io/

Member Author

Thanks for calling this out.

Yeah, it went here so that it could be consumed in two locations. That said, I agree with the principle that a little copying is better than a little dependency, especially since it's not a lot of code. I will go with duplicating this rather than introducing the dependency.

@@ -189,3 +195,73 @@ func NutanixTestResources(clusterName, endpoint, port string, insecure, ignoreEr
}
return nil
}

// TinkerbellTestResources cleans up machines by powering them down.
func TinkerbellTestResources(ctx context.Context, inventoryCSVFilePath string, ignoreErrors bool) error {
Member

You moved other code from "vms" to "machines", but this function is named for "resources". Are resources different from machines?

Member Author

In this specific instance, TinkerbellTestResources is intended to handle machines. I named it this way to align with the existing naming convention for these cleanups. The term "resources" was used more broadly to encompass potential future expansion where we might manage other types of test resources, not just machines. However, I see how this could lead to confusion given our current usage.

I could rename this to TinkerbellTestMachines, and we could always expand around it if needed. However, the --cleanup-vms flag originally seemed to be just for cleaning up VMs, but it has since evolved to clean up more for other providers, for example, cleaning up duplicate networks for CloudStack. So, I think the better call is to change the "machines" vernacular to "resources". I'll make that change (a sketch of the resulting shape is below).
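
For context, a minimal sketch of this function under the "resources" naming, assuming hypothetical readMachinesFromCSV and powerOffHardware helpers in place of the repo's real hardware parsing and bmclib calls:

```go
// TinkerbellTestResources cleans up test hardware by powering each
// listed machine down.
func TinkerbellTestResources(ctx context.Context, inventoryCSVFilePath string, ignoreErrors bool) error {
	// Hypothetical helper: parse BMC IP/user/pass rows from the inventory CSV.
	machines, err := readMachinesFromCSV(inventoryCSVFilePath)
	if err != nil {
		return fmt.Errorf("reading hardware inventory %s: %v", inventoryCSVFilePath, err)
	}
	for _, machine := range machines {
		// Power down each machine; when ignoreErrors is set, failures are
		// swallowed so one bad BMC does not abort the rest of the cleanup.
		if err := powerOffHardware(ctx, machine); err != nil && !ignoreErrors {
			return err
		}
	}
	return nil
}
```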

return nil
}

if _, err := bmcClient.SetPowerState(ctx, "off"); err != nil {
Member

We talked internally about the importance of machines being off (because of duplicate DHCP servers running) before each test run. The current model, which I know this PR is mostly just moving around rather than adding new code to, doesn't give us any guarantee that the hardware is actually powered off. Do we want to address that kind of guarantee in this PR?

Member Author

You're right; I agree that this might be a good opportunity to introduce a mechanism to verify that the hardware is indeed powered off before tests commence.

I believe we could add some retries on failure to address this (sketched below). Even if the retries fail, we could raise some kind of alert that requires manual intervention. We currently don't fail the test in the cleanup stage, though failing there would be the most direct way to guarantee the power-off and raise the alarm about machines that are still on. I'd love to hear your thoughts on that and whether it should change.

That said, I would like to keep this PR focused on the current task and address the guarantee issue in another. Let me know if that works for you.
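
To make the idea concrete, here is a rough sketch of retry-and-verify; the powerClient interface just mirrors the bmclib-style calls in this diff, and the attempt count and backoff are placeholders:

```go
package cleanup

import (
	"context"
	"fmt"
	"strings"
	"time"
)

// powerClient mirrors the bmclib-style calls used in this PR; it is
// defined here only to keep the sketch self-contained.
type powerClient interface {
	SetPowerState(ctx context.Context, state string) (bool, error)
	GetPowerState(ctx context.Context) (string, error)
}

// ensurePoweredOff re-issues the power-off and confirms the resulting
// state, backing off between attempts. This is the retry-and-verify idea
// from the discussion, not code from this PR.
func ensurePoweredOff(ctx context.Context, bmc powerClient, attempts int) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		if i > 0 {
			time.Sleep(5 * time.Second) // back off before retrying
		}
		if _, err := bmc.SetPowerState(ctx, "off"); err != nil {
			lastErr = err
			continue
		}
		state, err := bmc.GetPowerState(ctx)
		if err != nil {
			lastErr = err
			continue
		}
		if strings.EqualFold(state, "off") {
			return nil
		}
		lastErr = fmt.Errorf("unexpected power state %q", state)
	}
	return fmt.Errorf("hardware not confirmed off after %d attempts: %v", attempts, lastErr)
}
```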

return nil
// CleanupMachines cleans up the Tinkerbell machines by simply powering them down.
func (t *Tinkerbell) CleanupMachines(_ string) error {
ctx, done := context.WithTimeout(context.Background(), 2*time.Minute)
Member

This context appears to be shared between all bmclib calls to all machines in each test run. This means that all machines have a combined budget of 2 minutes for every GetPowerState and SetPowerState call that runs. That might not be enough time; BMCs can be notoriously slow. Also, if any single call takes the full duration, all subsequent calls will fail because the context will have been canceled/timed out. I recommend adding the timeout in powerOffHardware.

Member Author

Ah, thanks, good catch. I'll move it there as suggested.
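
For clarity, a sketch of that fix with the timeout scoped inside powerOffHardware, so one slow BMC consumes only its own budget instead of the whole run's; powerClient is the same illustrative interface as in the sketch above:

```go
// powerOffHardware powers a single machine down, with a per-machine
// timeout so a slow call cannot cancel the context for later machines.
func powerOffHardware(ctx context.Context, bmc powerClient) error {
	ctx, done := context.WithTimeout(ctx, 2*time.Minute) // per-machine budget
	defer done()
	if _, err := bmc.SetPowerState(ctx, "off"); err != nil {
		return fmt.Errorf("powering off hardware: %v", err)
	}
	return nil
}
```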

@cxbrowne1207 cxbrowne1207 changed the title [WIP] Add poweroff hardware cleanup step after Tinkerbell E2E tests Add poweroff hardware cleanup step after Tinkerbell E2E tests May 15, 2024