
fix(magnum): enable upgrades with cluster_template_id changes #1598

Merged (2 commits) on Sep 11, 2023

Conversation

@mnaser (Collaborator) commented Jul 21, 2023

This should hopefully stop the cluster from being recreated and instead trigger an upgrade when the cluster_template_id is changed.
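Concretely, the idea is to drop ForceNew from the cluster_template_id attribute and handle a change in the resource's Update path by issuing a Magnum upgrade instead of destroying the cluster. Below is a minimal sketch of that shape, not the actual diff; upgradeCluster is a hypothetical stand-in for the real Magnum upgrade call, and the surrounding resource wiring is omitted:

```go
package openstack

import (
	"fmt"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// clusterTemplateIDSchema sketches the attribute without ForceNew, so changing
// it no longer forces the cluster to be destroyed and recreated.
func clusterTemplateIDSchema() *schema.Schema {
	return &schema.Schema{
		Type:     schema.TypeString,
		Required: true,
		// ForceNew: true  <-- removed; changes are now handled in Update.
	}
}

// upgradeCluster is a hypothetical helper standing in for the real Magnum
// cluster "upgrade" action (exposed by the gophercloud container-infra client).
func upgradeCluster(clusterID, templateID string) error { return nil }

// resourceClusterUpdateSketch shows the Update-side handling: if the template
// changed, trigger a rolling upgrade to the new template instead of a recreate.
func resourceClusterUpdateSketch(d *schema.ResourceData) error {
	if d.HasChange("cluster_template_id") {
		newTemplate := d.Get("cluster_template_id").(string)
		if err := upgradeCluster(d.Id(), newTemplate); err != nil {
			return fmt.Errorf("error upgrading cluster %s to template %s: %w", d.Id(), newTemplate, err)
		}
	}
	return nil
}
```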

@nikParasyr (Member)

@mnaser has this been tested on your env? We don't have CI for Magnum, unfortunately.

@mnaser (Collaborator, Author) commented Jul 21, 2023

Hi @nikParasyr -- I'm working with a few people here who will be testing and reporting how it works for them. :)

@schlakob

Hi @mnaser, we tested the change and got this error from the API:

(screenshot of the API error)

@mnaser (Collaborator, Author) commented Jul 24, 2023

Oh! I think we have to bump the API microversion for Magnum to support this operation.

https://docs.openstack.org/magnum/latest/contributor/api-microversion.html

We need to request a newer microversion so that Magnum accepts the upgrade call.

@schlakob do you think you might have some time to look at this? Maybe @nikParasyr ran into a similar issue with this.

@schlakob

Just to understand it correctly: you mean we need to bump this version in our OpenStack infrastructure, am I right?

@mnaser (Collaborator, Author) commented Jul 24, 2023

Just to understand it correctly: you mean we need to bump this version in our OpenStack infrastructure, am I right?

Nope! Inside the Terraform provider!
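For context: the microversion is set on the gophercloud service client the provider uses when talking to Magnum, so this is a small change inside the provider rather than anything on the OpenStack deployment. A minimal sketch follows; the exact value shown ("1.8") is an assumption for illustration, not necessarily what this PR ended up using:

```go
package openstack

import "github.com/gophercloud/gophercloud"

// withUpgradeMicroversion requests a newer Magnum API microversion on the
// container-infra service client so the cluster "upgrade" action is accepted.
// The value "1.8" is an assumption; check the Magnum microversion history for
// the version that actually introduced cluster upgrades.
func withUpgradeMicroversion(client *gophercloud.ServiceClient) *gophercloud.ServiceClient {
	client.Microversion = "1.8"
	return client
}
```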

@mnaser (Collaborator, Author) commented Jul 24, 2023

@schlakob can you try again with the updated commit?

@schlakob

Yes I will try it this afternoon.

@schlakob

@mnaser - I ran a quick test with the recent changes and these are my observations:

  • Generally speaking: Terraform worked this time and it took about 1:30 min to modify the cluster.
  • The only strange thing was that the actual replacement of the nodes only started after Terraform had finished successfully.
  • When I run Terraform again, it wants to upgrade again, since the template apparently did not change in the state.

To validate this I am currently recreating the environment and will test again.

@mnaser (Collaborator, Author) commented Jul 24, 2023

@mnaser - I ran a quick test with the recent changes and these are my observations:

  • Generally speaking: Terraform worked this time and it took about 1:30 min to modify the cluster.

That's a good start!

  • The only strange thing was that the actual replacement of the nodes only started after Terraform had finished successfully.

OK, this may actually be a bug inside the Cluster API driver for Magnum. I filed vexxhost/magnum-cluster-api#176

  • When I run Terraform again, it wants to upgrade again, since the template apparently did not change in the state.

Can you confirm whether the cluster template has changed in the API itself? (I.e., is it a bug with the Terraform state not being updated, or is it the cluster template in Magnum that was not updated?)

To validate this I am currently recreating the environment and will test again.

Excellent.

@schlakob

So I ran another test and observed about the same. Regarding the unchanged template: it does not change in Magnum either. So on the OpenStack side it still shows the old template name and ID, but the cluster is in status "UPDATE_COMPLETE".

One additional note: the cluster I tested with has 1 control plane node and 2 workers. It updates 1 worker, then 1 min later the control plane, and then about 5 min later the second worker. After 10 minutes, all nodes are on the new version. So far so good, but about 3 min after that, the first worker was being recreated again.
But this looks like a Magnum issue; I just wanted to mention it. I will observe whether this happens every time or if it was just this run.

@mnaser (Collaborator, Author) commented Jul 25, 2023

So I ran another test and observed about the same. Regarding the unchanged template: it does not change in Magnum either. So on the OpenStack side it still shows the old template name and ID, but the cluster is in status "UPDATE_COMPLETE".

Can you do me a favour and try doing the upgrade via the CLI and see if the template changes? If it doesn't, I think you might have found another Cluster API driver bug here.

The extra rollout does seem to be incorrect; I wonder if that's health checks failing because of the single control plane node.

@schlakob

Another test with the CLI:

  • The CLI instantly says the request was sent successfully and "UPDATE_COMPLETE" is the new status, but the upgrade only starts about 2 min after that message.
  • Same issue with the template not updating, so it looks like it is a Magnum bug.
  • Another strange thing is that this time the control plane is in a weird state: the old control plane node still exists in the cluster, but the new control plane node has been created. (see screenshot)

(screenshot of the cluster nodes showing both the old and new control plane)

@nikParasyr (Member)

One additional note: the cluster I tested with has 1 control plane node and 2 workers. It updates 1 worker, then 1 min later the control plane, and then about 5 min later the second worker

Maybe this is related? kubernetes-sigs/cluster-api#8628 (fix released in CAPI 1.5.0)

In any case, these seem to be more cluster-api/magnum-cluster-api-driver related issues.

@mnaser let me know when the CAPI (magnum-capi) issues are fixed so I can review the terraform-provider-openstack changes. Thanks

@mnaser (Collaborator, Author) commented Jul 26, 2023

One additional note: the cluster I tested with has 1 control plane node and 2 workers. It updates 1 worker, then 1 min later the control plane, and then about 5 min later the second worker

Maybe this is related? kubernetes-sigs/cluster-api#8628 (fix released in CAPI 1.5.0)

This could be it! We are still running 1.4, I believe. We can roll out 1.5.

In any case, these seem to be more cluster-api/magnum-cluster-api-driver related issues.

Correct, but I'd like to be able to properly validate that a successful upgrade happens.

@mnaser let me know when the CAPI (magnum-capi) issues are fixed so I can review the terraform-provider-openstack changes. Thanks

Will do. We are almost there.

@pawcykca (Contributor) commented Aug 1, 2023

I have tested these changes on the Magnum Ussuri and Wallaby versions - they work fine.
All acceptance tests passed successfully (two of them need a small update).
It would be useful to add acceptance tests for this scenario.
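One possible shape for such a test, sketched with the terraform-plugin-sdk acceptance-test framework: apply the cluster with one template, then re-apply with a different cluster_template_id and check that the attribute is updated in place. testAccPreCheck, testAccProviders, testAccClusterWithTemplate and the template UUIDs are assumed stand-ins for the provider's existing test helpers and fixtures:

```go
package openstack

import (
	"testing"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource"
)

// TestAccContainerInfraClusterV1_templateUpgrade is a sketch: step 1 creates the
// cluster with one template, step 2 switches cluster_template_id and expects an
// in-place upgrade (no destroy/create) with the new value reflected in state.
func TestAccContainerInfraClusterV1_templateUpgrade(t *testing.T) {
	resource.Test(t, resource.TestCase{
		PreCheck:  func() { testAccPreCheck(t) }, // assumed existing helper
		Providers: testAccProviders,              // assumed existing provider map
		Steps: []resource.TestStep{
			{
				Config: testAccClusterWithTemplate("old-template-uuid"), // assumed config fixture
				Check: resource.TestCheckResourceAttr(
					"openstack_containerinfra_cluster_v1.cluster_1",
					"cluster_template_id", "old-template-uuid"),
			},
			{
				// Same cluster, new template: should upgrade, not recreate.
				Config: testAccClusterWithTemplate("new-template-uuid"),
				Check: resource.TestCheckResourceAttr(
					"openstack_containerinfra_cluster_v1.cluster_1",
					"cluster_template_id", "new-template-uuid"),
			},
		},
	})
}
```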

@mnaser (Collaborator, Author) commented Aug 3, 2023

@pawcykca: thank you so much for testing this. How do you suggest writing acceptance tests for this?

@mnaser (Collaborator, Author) commented Sep 8, 2023

@nikParasyr I've tested this locally a few times, as have others. Are we good to merge this as is, or what do you suggest?

@nikParasyr (Member) left a comment


@mnaser sorry for the delay, but I was off for some personal stuff.

Since this has been tested, I can approve it. (I need to find time to really figure out the CI for Magnum.)
Thanks for the work.
