
fix(magnum): enable upgrades with cluster_template_id changes #1598

Merged (2 commits) on Sep 11, 2023

Conversation

@mnaser (Collaborator) commented Jul 21, 2023

This should hopefully stop the cluster from being recreated and instead trigger an upgrade when the cluster_template_id is changed.
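Concretely, the idea is to drop ForceNew from the cluster_template_id attribute and handle a change in the resource's Update path by issuing a Magnum upgrade instead of destroying the cluster. Below is a minimal sketch of that shape, not the actual diff; upgradeCluster is a hypothetical stand-in for the real Magnum upgrade call, and the surrounding resource wiring is omitted:

```go
package openstack

import (
	"fmt"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// clusterTemplateIDSchema sketches the attribute without ForceNew, so changing
// it no longer forces the cluster to be destroyed and recreated.
func clusterTemplateIDSchema() *schema.Schema {
	return &schema.Schema{
		Type:     schema.TypeString,
		Required: true,
		// ForceNew: true  <-- removed; changes are now handled in Update.
	}
}

// upgradeCluster is a hypothetical helper standing in for the real Magnum
// cluster "upgrade" action (exposed by the gophercloud container-infra client).
func upgradeCluster(clusterID, templateID string) error { return nil }

// resourceClusterUpdateSketch shows the Update-side handling: if the template
// changed, trigger a rolling upgrade to the new template instead of a recreate.
func resourceClusterUpdateSketch(d *schema.ResourceData) error {
	if d.HasChange("cluster_template_id") {
		newTemplate := d.Get("cluster_template_id").(string)
		if err := upgradeCluster(d.Id(), newTemplate); err != nil {
			return fmt.Errorf("error upgrading cluster %s to template %s: %w", d.Id(), newTemplate, err)
		}
	}
	return nil
}
```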

@nikParasyr (Member)

@mnaser has this been tested on your env? We don't have CI for Magnum, unfortunately.

@mnaser (Collaborator, Author) commented Jul 21, 2023

Hi @nikParasyr -- I'm working with a few people here who will be testing and reporting how it works for them. :)

@schlakob

Hi @mnaser, we tested the change and got this error from the API:

(screenshot of the API error)

@mnaser (Collaborator, Author) commented Jul 24, 2023

Oh! I think we have to bump the API microversion for Magnum to support this operation.

https://docs.openstack.org/magnum/latest/contributor/api-microversion.html

We need to request a newer microversion so that Magnum accepts the upgrade call.

@schlakob do you think you might have some time to look at this? Maybe @nikParasyr ran into a similar issue with this.

@schlakob

Just to understand it correctly: you mean we need to bump this version in our OpenStack infrastructure, am I right?

@mnaser (Collaborator, Author) commented Jul 24, 2023

Just to understand it correctly: you mean we need to bump this version in our OpenStack infrastructure, am I right?

Nope! Inside the Terraform provider!
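For context: the microversion is set on the gophercloud service client the provider uses when talking to Magnum, so this is a small change inside the provider rather than anything on the OpenStack deployment. A minimal sketch follows; the exact value shown ("1.8") is an assumption for illustration, not necessarily what this PR ended up using:

```go
package openstack

import "github.com/gophercloud/gophercloud"

// withUpgradeMicroversion requests a newer Magnum API microversion on the
// container-infra service client so the cluster "upgrade" action is accepted.
// The value "1.8" is an assumption; check the Magnum microversion history for
// the version that actually introduced cluster upgrades.
func withUpgradeMicroversion(client *gophercloud.ServiceClient) *gophercloud.ServiceClient {
	client.Microversion = "1.8"
	return client
}
```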

@mnaser (Collaborator, Author) commented Jul 24, 2023

@schlakob can you try again with the updated commit?

@schlakob

Yes I will try it this afternoon.

@schlakob

@mnaser - I ran a quick test with the recent changes and these are my observations:

  • Generally speaking: Terraform worked this time and it took about 1:30 min to modify the cluster.
  • The only strange thing was that the actual replacement of the nodes only started after Terraform had finished successfully.
  • When I run Terraform again, it wants to upgrade again, since the template apparently did not change in the state.

To validate this I am currently recreating the environment and will test again.

@mnaser (Collaborator, Author) commented Jul 24, 2023

@mnaser - I ran a quick test with the recent changes and these are my observations:

  • Generally speaking: Terraform worked this time and it took about 1:30 min to modify the cluster.

That's a good start!

  • The only strange thing was that the actual replacement of the nodes only started after Terraform had finished successfully.

OK, this may actually be a bug inside the Cluster API driver for Magnum. I filed vexxhost/magnum-cluster-api#176

  • When I run Terraform again, it wants to upgrade again, since the template apparently did not change in the state.

Can you confirm whether the cluster template has changed in the API itself? (I.e., is it a bug with the Terraform state not being updated, or is it the cluster template in Magnum that was not updated?)

To validate this I am currently recreating the environment and will test again.

Excellent.

@schlakob

So I ran another test and observed about the same. Regarding the unchanged template: it does not change in Magnum either. So on the OpenStack side it still shows the old template name and ID, but the cluster is in status "UPDATE_COMPLETE".

One additional note: the cluster I tested with has 1 control plane node and 2 workers. It updates 1 worker, then 1 min later the control plane, and then about 5 min later the second worker. After 10 minutes, all nodes are on the new version. So far so good, but about 3 min after that, the first worker was being recreated again.
But this looks like a Magnum issue; I just wanted to mention it. I will observe whether this happens every time or if it was just this run.

@mnaser (Collaborator, Author) commented Jul 25, 2023

So I ran another test and observed about the same. Regarding the unchanged template: it does not change in Magnum either. So on the OpenStack side it still shows the old template name and ID, but the cluster is in status "UPDATE_COMPLETE".

Can you do me a favour and try doing the upgrade via the CLI and see if the template changes? If it doesn't, I think you might have found another Cluster API driver bug here.

The extra rollout does seem to be incorrect; I wonder if that's health checks failing because of the single control plane node.

@schlakob

Another test with the CLI:

  • The CLI instantly says the request was sent successfully and "UPDATE_COMPLETE" is the new status, but the upgrade only starts about 2 min after that message.
  • Same issue with the template not updating, so it looks like it is a Magnum bug.
  • Another strange thing is that this time the control plane is in a weird state: the old control plane node still exists in the cluster, but the new control plane node has been created. (see screenshot)

(screenshot of the cluster nodes showing both the old and new control plane)

@nikParasyr (Member)

One additional note: the cluster I tested with has 1 control plane node and 2 workers. It updates 1 worker, then 1 min later the control plane, and then about 5 min later the second worker

Maybe this is related? kubernetes-sigs/cluster-api#8628 (fix released in CAPI 1.5.0)

In any case, these seem to be more cluster-api/magnum-cluster-api-driver related issues.

@mnaser let me know when the CAPI (magnum-capi) issues are fixed so I can review the terraform-provider-openstack changes. Thanks

@mnaser (Collaborator, Author) commented Jul 26, 2023

One additional note: the cluster I tested with has 1 control plane node and 2 workers. It updates 1 worker, then 1 min later the control plane, and then about 5 min later the second worker

Maybe this is related? kubernetes-sigs/cluster-api#8628 (fix released in CAPI 1.5.0)

This could be it! We are still running 1.4, I believe. We can roll out 1.5.

In any case, these seem to be more cluster-api/magnum-cluster-api-driver related issues.

Correct, but I'd like to be able to properly validate that a successful upgrade happens.

@mnaser let me know when the CAPI (magnum-capi) issues are fixed so I can review the terraform-provider-openstack changes. Thanks

Will do. We are almost there.

@pawcykca (Contributor) commented Aug 1, 2023

I have tested these changes on the Magnum Ussuri and Wallaby versions - they work fine.
All acceptance tests passed successfully (two of them need a small update).
It would be useful to add acceptance tests for this scenario.
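One possible shape for such a test, sketched with the terraform-plugin-sdk acceptance-test framework: apply the cluster with one template, then re-apply with a different cluster_template_id and check that the attribute is updated in place. testAccPreCheck, testAccProviders, testAccClusterWithTemplate and the template UUIDs are assumed stand-ins for the provider's existing test helpers and fixtures:

```go
package openstack

import (
	"testing"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource"
)

// TestAccContainerInfraClusterV1_templateUpgrade is a sketch: step 1 creates the
// cluster with one template, step 2 switches cluster_template_id and expects an
// in-place upgrade (no destroy/create) with the new value reflected in state.
func TestAccContainerInfraClusterV1_templateUpgrade(t *testing.T) {
	resource.Test(t, resource.TestCase{
		PreCheck:  func() { testAccPreCheck(t) }, // assumed existing helper
		Providers: testAccProviders,              // assumed existing provider map
		Steps: []resource.TestStep{
			{
				Config: testAccClusterWithTemplate("old-template-uuid"), // assumed config fixture
				Check: resource.TestCheckResourceAttr(
					"openstack_containerinfra_cluster_v1.cluster_1",
					"cluster_template_id", "old-template-uuid"),
			},
			{
				// Same cluster, new template: should upgrade, not recreate.
				Config: testAccClusterWithTemplate("new-template-uuid"),
				Check: resource.TestCheckResourceAttr(
					"openstack_containerinfra_cluster_v1.cluster_1",
					"cluster_template_id", "new-template-uuid"),
			},
		},
	})
}
```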

@mnaser (Collaborator, Author) commented Aug 3, 2023

@pawcykca: thank you so much for testing this. How do you suggest writing acceptance tests for this?

@mnaser (Collaborator, Author) commented Sep 8, 2023

@nikParasyr I've tested this locally a few times, as have others. Are we good to merge this as is, or what do you suggest?

@nikParasyr (Member) left a comment


@mnaser sorry for the delay, but I was off for some personal stuff.

Since this has been tested, I can approve it. (I need to find time to really figure out the CI for Magnum.)
Thanks for the work.
