Add poweroff hardware cleanup step after Tinkerbell E2E tests #8140
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from 500a6aa to 84f9079
Codecov Report: All modified and coverable lines are covered by tests ✅

@@            Coverage Diff             @@
##             main    #8140      +/-   ##
==========================================
- Coverage   73.40%   73.39%   -0.01%
==========================================
  Files         577      577
  Lines       35900    35900
==========================================
- Hits        26352    26350       -2
- Misses       7882     7883       +1
- Partials     1666     1667       +1

View the full report in Codecov by Sentry.
internal/test/bmc.go (Outdated)
@@ -0,0 +1,17 @@
package test
I can understand why this might make sense to go here, but if we are ever to get to a point where we can make our e2e test tool a completely different Go module, remove the spaghetti dependency tree, and drastically reduce the imports of our actual product code (eksctl-anywhere), we should not continue to add to this package tree. I recommend that we try to stick with the package tree that starts with the top-level test directory/package. The internal/test/cleanup package would be OK for now, but all of that cleanup code should ideally live in the top-level test directory/package tree as well.

Also, I see that this function probably went here so that it can be consumed in 2 locations. Duplicating this code where it's used, as opposed to creating a new dependency, might be a good thing here. This is not a lot of code, and to me, taking on the extra code instead of the dependency makes sense. "A little copying is better than a little dependency." (https://go-proverbs.github.io/)
Thanks for calling this out.
Yeah, it went here so that it can be consumed in 2 locations. Now, I agree with the principle that a little copying is better than a little dependency, especially since it's not a lot of code. I will go with duplicating this rather than introducing the dependency.
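To illustrate the trade-off being discussed, here is a rough, hypothetical stand-in for a helper of this size (bmcConfig and newBMCClient are invented names, not the actual contents of internal/test/bmc.go): copying something this small into each of the two consumers costs little compared with a new shared internal/test dependency.

```go
package main

import "fmt"

// bmcConfig and newBMCClient are hypothetical stand-ins for the ~17-line
// helper under discussion. The point of the sketch: a helper this small can
// simply be copied into each package that needs it rather than becoming a
// shared dependency in the internal/test package tree.
type bmcConfig struct {
	Host, User, Pass string
}

type bmcClient struct{ cfg bmcConfig }

// newBMCClient validates the minimal config and returns a client value.
func newBMCClient(cfg bmcConfig) (*bmcClient, error) {
	if cfg.Host == "" {
		return nil, fmt.Errorf("bmc host is required")
	}
	return &bmcClient{cfg: cfg}, nil
}

func main() {
	c, err := newBMCClient(bmcConfig{Host: "10.0.0.5", User: "admin", Pass: "secret"})
	fmt.Println(c != nil, err)
}
```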
internal/test/cleanup/cleanup.go (Outdated)
@@ -189,3 +195,73 @@ func NutanixTestResources(clusterName, endpoint, port string, insecure, ignoreEr
	}
	return nil
}

// TinkerbellTestResources cleans up machines by powering them down.
func TinkerbellTestResources(ctx context.Context, inventoryCSVFilePath string, ignoreErrors bool) error {
You moved other code from vms to machines. This function is defined for resources. Are resources different than machines?
In this specific instance, TinkerbellTestResources is intended to handle machines. I named it this way to align with the existing naming of these cleanups. The term "resources" was used more broadly to encompass potential future expansions where we might manage other types of test resources, not just machines. However, I see how this could lead to confusion given our current usage.

I could rename this to TinkerbellTestMachines and we could always expand around it if needed. However, while the --cleanup-vms flag originally seemed as if it was just for cleaning up VMs, it has since evolved to clean up more for other providers, for example cleaning up duplicate networks for CloudStack. So, I think the better call would be to change the "machines" vernacular to "resources". I'll make that change.
	return nil
}

if _, err := bmcClient.SetPowerState(ctx, "off"); err != nil {
We talked internally about the importance of machines being off (because of duplicate DHCP servers running) before each test run. The current model, and I know this PR is mostly just moving code around rather than new code you've created, doesn't provide us any guarantee that the hardware is actually powered off. Do we want to address this kind of guarantee in this PR?
You're right, and I agree that this might be a good opportunity to introduce a mechanism to verify that the hardware is indeed powered off before tests commence.

I believe we could add some retries on failure to address this. Even then, if the retries fail, we could raise some kind of alert that requires manual intervention. We currently don't fail the test in the cleanup stage, although doing so is, I think, the most accurate way to guarantee this and to raise an alarm about machines not being powered off. I'd love to hear your thoughts on that and whether it should change.

That said, I would like to keep this PR focused on the current task and maybe address the guarantee issue in another. Let me know if that works for you.
test/framework/tinkerbell.go (Outdated)

	return nil
// CleanupMachines runs a cleanup of the Tinkerbell machines, which simply powers them down.
func (t *Tinkerbell) CleanupMachines(_ string) error {
	ctx, done := context.WithTimeout(context.Background(), 2*time.Minute)
This context appears to be shared between all bmclib calls to all machines in each test run. This means that all machines have a combined budget of 2 minutes for every GetPowerState and SetPowerState call that runs. This might not be enough time; BMCs can be notoriously slow. Also, if any single call takes the full length of the duration, then all subsequent calls will fail, as the context will have been canceled/timed out. I recommend adding the timeout in powerOffHardware.
Ah, thanks, good catch. I'll move it there as suggested.
fix linting errors
Force-pushed from 84f9079 to 3163645
Force-pushed from 3163645 to 85b7f79
Force-pushed from 85b7f79 to af8985f
Issue #, if available:
Description of changes:
This PR makes changes to ensure that Tinkerbell hardware for a test is powered off after a run. A few changes were made as a result:

- Renamed the --cleanup-vms flag to --cleanup-resources to make it more generic (because Tinkerbell does not use VMs).
- Renamed the Provider.CleanupVMs method to Provider.CleanupResources.
- Moved powerOffHardware out of ValidateHardwareDecommissioned so it does not run twice at the end of existing tests.
- Implemented the Tinkerbell Provider.CleanupResources method to power off hardware machines.

Testing (if applicable):
Documentation added/planned (if applicable):
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
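The provider-interface rename at the heart of this PR could be sketched roughly as follows; Provider and tinkerbell below are trimmed, hypothetical stand-ins, not the actual eks-anywhere framework types.

```go
package main

import "fmt"

// Provider is a trimmed-down, hypothetical version of the e2e framework's
// provider interface after the rename: CleanupVMs becomes CleanupResources,
// since Tinkerbell cleans up bare-metal machines rather than VMs.
type Provider interface {
	Name() string
	CleanupResources(clusterName string) error
}

// tinkerbell satisfies CleanupResources by powering machines off; the body
// here is a stub standing in for the real power-off logic.
type tinkerbell struct{}

func (t tinkerbell) Name() string { return "tinkerbell" }

func (t tinkerbell) CleanupResources(clusterName string) error {
	fmt.Printf("powering off hardware for cluster %s\n", clusterName)
	return nil
}

func main() {
	var p Provider = tinkerbell{}
	if err := p.CleanupResources("e2e-test"); err != nil {
		fmt.Println("cleanup failed:", err)
	}
}
```

Keeping the method generic lets other providers hang non-VM cleanup (such as CloudStack's duplicate-network removal) off the same hook.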