
azurerm_kubernetes_cluster, azurerm_kubernetes_cluster_node_pool - Adding node_soak_duration_in_minutes and drain_timeout_in_minutes to the node pool upgrade_settings block #26022

Closed
wants to merge 34 commits

Conversation

bitcloud
Contributor

bitcloud commented May 19, 2024

Community Note

  • Please vote on this PR by adding a 👍 reaction to the original PR to help the community and maintainers prioritize for review
  • Please do not leave "+1" or "me too" comments, they generate extra noise for PR followers and do not help prioritize for review

Description

Based on the great PR #24491 from @aristosvo, we added the new parameters node_soak_duration_in_minutes and drain_timeout_in_minutes, and updated and fixed the test cases.
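
For illustration, a rough sketch of the expand side of this change (not the exact code in this PR; the field names on agentpools.AgentPoolUpgradeSettings are assumed from the 2023-09-02-preview containerservice SDK mentioned in the changelog below, where the new properties are *int64 pointers):

	// Sketch only: map the upgrade_settings block onto the SDK model.
	// Zero values are treated as "unset" here for simplicity.
	func expandAgentPoolUpgradeSettings(input []interface{}) *agentpools.AgentPoolUpgradeSettings {
		settings := &agentpools.AgentPoolUpgradeSettings{}
		if len(input) == 0 || input[0] == nil {
			return settings
		}

		raw := input[0].(map[string]interface{})

		if maxSurge, ok := raw["max_surge"].(string); ok && maxSurge != "" {
			settings.MaxSurge = &maxSurge
		}

		if v, ok := raw["drain_timeout_in_minutes"].(int); ok && v != 0 {
			drainTimeout := int64(v)
			settings.DrainTimeoutInMinutes = &drainTimeout
		}

		if v, ok := raw["node_soak_duration_in_minutes"].(int); ok && v != 0 {
			nodeSoak := int64(v)
			settings.NodeSoakDurationInMinutes = &nodeSoak
		}

		return settings
	}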

PR Checklist

  • I have followed the guidelines in our Contributing Documentation.
  • I have checked to ensure there aren't other open Pull Requests for the same update/change.
  • I have checked if my changes close any open issues. If so please include appropriate closing keywords below.
  • I have updated/added Documentation as required written in a helpful and kind way to assist users that may be unfamiliar with the resource / data source.
  • I have used a meaningful PR title to help maintainers and other users understand this change and help prevent duplicate work.
    For example: “resource_name_here - description of change e.g. adding property new_property_name_here”

Changes to existing Resource / Data Source

  • I have added an explanation of what my changes do and why I'd like you to include them (This may be covered by linking to an issue above, but may benefit from additional explanation).
  • I have written new tests for my resource or datasource changes & updated any relevant documentation.
  • I have successfully run tests with my changes locally. If not, please provide details on testing challenges that prevented you running the tests.
  • (For changes that include a state migration only). I have manually tested the migration path between relevant versions of the provider.

Testing

  • My submission includes Test coverage as described in the Contribution Guide and the tests pass. (if this is not possible for any reason, please include details of why you did or could not add test coverage)

Change Log

Below please provide what should go into the changelog (if anything) conforming to the Changelog Format documented here.

  • azurerm_kubernetes_cluster, azurerm_kubernetes_cluster_node_pool - adding node_soak_duration_in_minutes and drain_timeout_in_minutes to the node pool upgrade_settings block
  • update github.com/hashicorp/go-azure-sdk/resource-manager/containerservice to version 2023-09-02-preview

This is a (please select all that apply):

  • Bug Fix
  • New Feature (ie adding a service, resource, or data source)
  • Enhancement
  • Breaking Change

Related Issue(s)

Fixes: #13066
Builds on top of: #24491

Note

If this PR changes meaningfully during the course of review please update the title and description as required.

Member

@stephybun left a comment

Thanks for this PR @bitcloud! Would you mind rebasing this? The version upgrade for AKS just went into main. Once that's done I'll continue with my review.

@itay-grudev
Contributor

itay-grudev commented May 22, 2024

@stephybun I rebased the branch, resolved all the conflicts, ran the tests, and verified that drain and soak work correctly with the provider in development mode.

I think it's ready.

Tell us if you need any changes and we'll add them ASAP.

Member

@stephybun left a comment

Thanks @bitcloud.

I made a first pass through this. Could you resolve the review comments? Once that's done we can run the tests and see if we're good to go or require more changes.

CHANGELOG.md Outdated
@@ -1,5 +1,10 @@
## 3.105.0 (Unreleased)

ENHANCEMENTS:
Member

Updating the changelog in PRs is a frequent source of conflicts since it gets updated manually after merging PRs, so we should remove all changes to this file.

Comment on lines 1468 to 1482
+	values := make(map[string]interface{})
 
 	if input != nil && input.MaxSurge != nil {
-		maxSurge = *input.MaxSurge
+		values["max_surge"] = *input.MaxSurge
 	}
 
-	if maxSurge == "" {
-		return []interface{}{}
+	if input != nil && input.DrainTimeoutInMinutes != nil {
+		values["drain_timeout_in_minutes"] = *input.DrainTimeoutInMinutes
 	}
 
-	return []interface{}{
-		map[string]interface{}{
-			"max_surge": maxSurge,
-		},
+	if input != nil && input.DrainTimeoutInMinutes != nil {
+		values["node_soak_duration_in_minutes"] = *input.NodeSoakDurationInMinutes
 	}
 
+	return []interface{}{values}
Member

We usually nil check the input first and return early if it's empty. GH won't allow me to use the suggestions feature on this snippet, but could we update this to:

	if input == nil {
		return []interface{}{}
	}

	values := make(map[string]interface{})
	
	if input.MaxSurge != nil {
		values["max_surge"] = *input.MaxSurge
	}

	if input.DrainTimeoutInMinutes != nil {
		values["drain_timeout_in_minutes"] = *input.DrainTimeoutInMinutes
	}

	if input.NodeSoakDurationInMinutes != nil {
		values["node_soak_duration_in_minutes"] = *input.NodeSoakDurationInMinutes
	}

	return []interface{}{values}

Comment on lines 1140 to 1143
"drain_timeout_in_minutes": {
Type: pluginsdk.TypeInt,
Optional: true,
},
Member

According to the docs it looks like this has a default value of 30 minutes, so we should set that here as well. Is there also some validation that we could add? e.g. the value must be between 0 and 60 minutes?
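
A minimal sketch of what that could look like, assuming validation.IntBetween from the plugin SDK's helper/validation package is available in this file:

	"drain_timeout_in_minutes": {
		Type:         pluginsdk.TypeInt,
		Optional:     true,
		Default:      30,
		ValidateFunc: validation.IntBetween(0, 60),
	},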

Comment on lines 1144 to 1147
"node_soak_duration_in_minutes": {
Type: pluginsdk.TypeInt,
Optional: true,
},
Member

Same here, docs indicate this can be set between 0 and 30 minutes, so we should add that as validation here.
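
Under the same assumptions as the sketch above, that could look like:

	"node_soak_duration_in_minutes": {
		Type:         pluginsdk.TypeInt,
		Optional:     true,
		Default:      0,
		ValidateFunc: validation.IntBetween(0, 30),
	},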

Comment on lines 1163 to 1170
"drain_timeout_in_minutes": {
Type: pluginsdk.TypeInt,
Optional: true,
},
"node_soak_duration_in_minutes": {
Type: pluginsdk.TypeInt,
Optional: true,
},
Member

Same here, we should apply defaults and validation where applicable.

@@ -945,6 +945,10 @@ A `http_proxy_config` block supports the following:

A `upgrade_settings` block supports the following:

* `drain_timeout_in_minutes` - (Optional) The amount of time in minutes to wait on eviction of pods and graceful termination per node. This eviction wait time honors waiting on pod disruption budgets. If this time is exceeded, the upgrade fails.

* `node_soak_duration_in_minutes` - (Optional) The amount of time in minutes to wait after draining a node and before reimaging it and moving on to next node. If not specified, the default is 0 minutes.
Member

Suggested change
* `node_soak_duration_in_minutes` - (Optional) The amount of time in minutes to wait after draining a node and before reimaging it and moving on to next node. If not specified, the default is 0 minutes.
* `node_soak_duration_in_minutes` - (Optional) The amount of time in minutes to wait after draining a node and before reimaging and moving on to next node. Defaults to `0`.

@@ -308,6 +308,10 @@ A `sysctl_config` block supports the following:

A `upgrade_settings` block supports the following:

* `drain_timeout_in_minutes` - (Optional) The amount of time in minutes to wait on eviction of pods and graceful termination per node. This eviction wait time honors waiting on pod disruption budgets. If this time is exceeded, the upgrade fails.

* `node_soak_duration_in_minutes` - (Optional) The amount of time in minutes to wait after draining a node and before reimaging it and moving on to next node. If not specified, the default is 0 minutes.
Member

Suggested change
* `node_soak_duration_in_minutes` - (Optional) The amount of time in minutes to wait after draining a node and before reimaging it and moving on to next node. If not specified, the default is 0 minutes.
* `node_soak_duration_in_minutes` - (Optional) The amount of time in minutes to wait after draining a node and before reimaging and moving on to next node. Defaults to `0`.

@@ -945,6 +945,10 @@ A `http_proxy_config` block supports the following:

A `upgrade_settings` block supports the following:

* `drain_timeout_in_minutes` - (Optional) The amount of time in minutes to wait on eviction of pods and graceful termination per node. This eviction wait time honors waiting on pod disruption budgets. If this time is exceeded, the upgrade fails.
Member

This eviction wait time honors waiting on pod disruption budgets

I'm unsure if this is correct; in the context of deleting the cluster we actually ignore PDBs.

Contributor

The text is more about upgrades. I would intuitively expect that it is ignored for cluster deletion. I can amend the description, but I think that is understood and expected anyway.

Member

I haven't done much with PDBs so it wasn't obvious to me. I think to avoid any potential confusion it'd be good to specify that, e.g.

Suggested change
* `drain_timeout_in_minutes` - (Optional) The amount of time in minutes to wait on eviction of pods and graceful termination per node. This eviction wait time honors waiting on pod disruption budgets. If this time is exceeded, the upgrade fails.
* `drain_timeout_in_minutes` - (Optional) The amount of time in minutes to wait on eviction of pods and graceful termination per node. This eviction wait time honors pod disruption budgets for upgrades. If this time is exceeded, the upgrade fails.

Comment on lines 185 to 189

* `drain_timeout_in_minutes` - The amount of time in minutes to wait on eviction of pods and graceful termination per node. This eviction wait time honors waiting on pod disruption budgets. If this time is exceeded, the upgrade fails.

* `node_soak_duration_in_minutes` - The amount of time in minutes to wait after draining a node and before reimaging it and moving on to next node. If not specified, the default is 0 minutes.

Member

Looks like there is a redundant space here, and we also don't need to specify any defaults or possible values for properties in the data source docs.

Suggested change
* `drain_timeout_in_minutes` - The amount of time in minutes to wait on eviction of pods and graceful termination per node. This eviction wait time honors waiting on pod disruption budgets. If this time is exceeded, the upgrade fails.
* `node_soak_duration_in_minutes` - The amount of time in minutes to wait after draining a node and before reimaging it and moving on to next node. If not specified, the default is 0 minutes.
* `drain_timeout_in_minutes` - The amount of time in minutes to wait on eviction of pods and graceful termination per node. This eviction wait time honors waiting on pod disruption budgets. If this time is exceeded, the upgrade fails.
* `node_soak_duration_in_minutes` - The amount of time in minutes to wait after draining a node and before reimaging it and moving on to next node.

@@ -94,6 +94,10 @@ In addition to the Arguments listed above - the following Attributes are exporte

A `upgrade_settings` block exports the following:

* `drain_timeout_in_minutes` - The amount of time in minutes to wait on eviction of pods and graceful termination per node. This eviction wait time honors waiting on pod disruption budgets. If this time is exceeded, the upgrade fails.

* `node_soak_duration_in_minutes` - The amount of time in minutes to wait after draining a node and before reimaging it and moving on to next node. If not specified, the default is 0 minutes.
Member

Suggested change
* `node_soak_duration_in_minutes` - The amount of time in minutes to wait after draining a node and before reimaging it and moving on to next node. If not specified, the default is 0 minutes.
* `node_soak_duration_in_minutes` - The amount of time in minutes to wait after draining a node and before reimaging it and moving on to next node.

@itay-grudev
Contributor

@stephybun I added all of the requested changes.

I am unsure about one of your documentation comments. Could you give me more guidance if you'd like it changed.

I trimmed the tests, but I left the corner cases with 0 values.

@stephybun
Member

@itay-grudev

I am unsure about one of your documentation comments. Could you give me more guidance if you'd like it changed.

Sure, could you specify which?

@itay-grudev
Contributor

@itay-grudev

I am unsure about one of your documentation comments. Could you give me more guidance if you'd like it changed.

Sure, could you specify which?

Never mind. You already replied to my comment 20 minutes ago. I just applied your proposal.

Are you ok with the tests like this? I combined two of the tests into one, but I feel like the 0s and empty configuration should remain.

Member

@stephybun left a comment

Thanks @bitcloud @itay-grudev

Apologies if some of my review comments were unclear. I've tried to clarify and be more specific in my suggestions this time around. I also missed that the state migration schema was updated in my initial review; this needs to be removed. Please let me know if you have any further questions. Once these comments are resolved I'll run the tests and we'll see if any further changes are required.

Comment on lines 282 to 285
	Config: r.upgradeSettingsConfig(data, map[string]interface{}{
		"max_surge":                "2",
		"drain_timeout_in_minutes": 35,
	}),
Member

These test steps could still be condensed further; I've added what the test steps should look like below:

data.ResourceTest(t, r, []acceptance.TestStep{
	{
		Config: r.upgradeSettingsConfig(data, map[string]interface{}{
			"max_surge":                     "2%",
			"drain_timeout_in_minutes":      35,
			"node_soak_duration_in_minutes": 18,
		}),
		Check: acceptance.ComposeTestCheckFunc(
			check.That(data.ResourceName).ExistsInAzure(r),
			check.That(data.ResourceName).Key("default_node_pool.0.upgrade_settings.#").HasValue("1"),
			check.That(data.ResourceName).Key("default_node_pool.0.upgrade_settings.0.max_surge").HasValue("2%"),
			check.That(data.ResourceName).Key("default_node_pool.0.upgrade_settings.0.drain_timeout_in_minutes").HasValue("35"),
			check.That(data.ResourceName).Key("default_node_pool.0.upgrade_settings.0.node_soak_duration_in_minutes").HasValue("18"),
		),
	},
	data.ImportStep(),
	{
		Config: r.upgradeSettingsConfig(data, map[string]interface{}{
			"max_surge":                     "10%",
			"drain_timeout_in_minutes":      0,
			"node_soak_duration_in_minutes": 0,
		}),
		Check: acceptance.ComposeTestCheckFunc(
			check.That(data.ResourceName).ExistsInAzure(r),
			check.That(data.ResourceName).Key("default_node_pool.0.upgrade_settings.#").HasValue("1"),
			check.That(data.ResourceName).Key("default_node_pool.0.upgrade_settings.0.max_surge").HasValue("10%"),
			check.That(data.ResourceName).Key("default_node_pool.0.upgrade_settings.0.drain_timeout_in_minutes").HasValue("0"),
			check.That(data.ResourceName).Key("default_node_pool.0.upgrade_settings.0.node_soak_duration_in_minutes").HasValue("0"),
		),
	},
	data.ImportStep(),
	{
		Config: r.upgradeSettingsConfig(data, map[string]interface{}{
			"max_surge": "10%",
		}),
		Check: acceptance.ComposeTestCheckFunc(
			check.That(data.ResourceName).ExistsInAzure(r),
			check.That(data.ResourceName).Key("default_node_pool.0.upgrade_settings.#").HasValue("1"),
			check.That(data.ResourceName).Key("default_node_pool.0.upgrade_settings.0.max_surge").HasValue("10%"),
			check.That(data.ResourceName).Key("default_node_pool.0.upgrade_settings.0.drain_timeout_in_minutes").DoesNotExist(),
			check.That(data.ResourceName).Key("default_node_pool.0.upgrade_settings.0.node_soak_duration_in_minutes").DoesNotExist(),
		),
	},
	data.ImportStep(),
})

This will:

  1. Test the create path for drain_timeout_in_minutes and node_soak_duration_in_minutes
  2. Test the update path for drain_timeout_in_minutes and node_soak_duration_in_minutes
  3. Test the update path for drain_timeout_in_minutes and node_soak_duration_in_minutes by removing them from the config

As a note, max_surge needs to be specified at the moment since the API changed the default (this will be defaulted in the provider in 4.0). We get a plan diff if it isn't specified.

itay-grudev and others added 3 commits May 23, 2024 16:12

Co-authored-by: stephybun <steph@hashicorp.com>
github-actions bot added size/L and removed size/XL labels May 23, 2024
@itay-grudev
Contributor

@stephybun Done. I'm sorry I took so long. I was waiting for the tests to finish.

@itay-grudev
Contributor

@stephybun

:-) 

Thanks for the help.

@itay-grudev
Contributor

@stephybun Can we do anything to help you and accelerate getting this into main?

@stephybun
Member

stephybun commented May 27, 2024

@itay-grudev we have some test failures due to the way the max_surge property is handled in AKS at the moment, which will change in the upcoming major release of the provider. These new properties may also require some special handling until 4.0. Would it be alright with you if I pushed some changes to your branch to get this into a state where we can merge it?

@itay-grudev
Contributor

@stephybun Yeah, I don't mind of course. Ping me if I can assist in any way.

@stephybun
Member

Apologies for needing to intervene here @bitcloud / @itay-grudev.

The AKS API can be tricky at times, and while looking into this I found that the drain_timeout_in_minutes property cannot be set to 0 or removed/reset back to its default after it's been specified. This requires a CustomizeDiff to force recreation of the resource if a user attempts to unset this property, since it otherwise results in a permanent plan diff. This also meant the tests could be greatly simplified.
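
For illustration, a minimal sketch of such a guard using the plugin SDK's customdiff helpers (an assumption on my part, not necessarily the exact approach taken in the follow-up PR):

	// Assumed imports: "context" and
	// "github.com/hashicorp/terraform-plugin-sdk/v2/helper/customdiff".
	// Force recreation when drain_timeout_in_minutes is reset to 0 after
	// having been set, since the API does not allow unsetting it in place.
	CustomizeDiff: customdiff.ForceNewIfChange(
		"default_node_pool.0.upgrade_settings.0.drain_timeout_in_minutes",
		func(ctx context.Context, old, new, meta interface{}) bool {
			return old.(int) != 0 && new.(int) == 0
		},
	),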

I unfortunately did not have the necessary permissions to push those updates to your branch, and given the large number of changes made, I've opened a separate PR that supersedes this one. I did my best to maintain the authorship of the original work in #26137. I'm going to close this PR now in favour of #26137. The tests are running for it and I'll have to defer review to another member of the team, but please subscribe to #26137 for updates.

Thanks again for all your effort on this!

stephybun closed this May 29, 2024
Development

Successfully merging this pull request may close these issues.

Support for AKS node pool cordonDrainTimeoutInMinutes parameter