eksctl anywhere upgrade cluster doesn't use the extra hardware from hardware.csv #7818

Open
ygao-armada opened this issue Mar 10, 2024 · 2 comments

ygao-armada commented Mar 10, 2024

What happened:
I created an EKS Anywhere cluster with 1 CP node, running Kubernetes 1.26.

Then I tried to upgrade it to 1.27, using a new hardware file that lists 2 new CP nodes plus the existing CP node.

The cluster config file change is:

diff eksa-mgmt02-md-cluster.yaml eksa-mgmt02-md-cluster-sc.yaml
25c25
<   kubernetesVersion: "1.26"
---
>   kubernetesVersion: "1.27"
36c36
<   osImageURL: "https://<image store>.blob.core.windows.net/ubuntu-2004-efi/ubuntu-2004-efi-eksa-kube-v1.26.7.gz"
---
>   osImageURL: "https://<image store>.blob.core.windows.net/ubuntu-2004-efi/ubuntu-2004-efi-eksa-kube-v1.27.11.gz"
70c70
<           IMG_URL: https://<image store>.blob.core.windows.net/ubuntu-2004-efi/ubuntu-2004-efi-eksa-kube-v1.26.7.gz
---
>           IMG_URL: https://<image store>.blob.core.windows.net/ubuntu-2004-efi/ubuntu-2004-efi-eksa-kube-v1.27.11.gz

The hardware change is:

diff hardware-mgmt02.csv hardware-mgmt02-cp3.csv 
2a3,4
> eksa-control-02,...,type=cp,/dev/sda
> eksa-control-03,...,type=cp,/dev/sda
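Before running the upgrade, it can help to confirm how many rows in the hardware file carry each label. The following is a minimal sketch, not part of the original report: it assumes the CSV has a header row with a `labels` column and that multiple labels in one cell are separated by `|`. The function name `count_by_label` and the sample rows are hypothetical and stand in for the real file, whose other columns are elided above.

```python
import csv
import io
from collections import Counter


def count_by_label(csv_text, labels_column="labels"):
    """Count hardware rows per label (for example 'type=cp') in a hardware CSV."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: several labels in one cell are separated by '|'.
        for label in (row.get(labels_column) or "").split("|"):
            label = label.strip()
            if label:
                counts[label] += 1
    return counts


# Made-up rows for illustration only; a real hardware file has more columns.
sample = """hostname,labels,disk
host-a,type=cp,/dev/sda
host-b,type=cp,/dev/sda
host-c,type=worker,/dev/sda
"""

print(count_by_label(sample))
```

This only counts rows in the file. It does not show which machines the cluster already considers in use, so a matching count here does not guarantee that the upgrade sees spare hardware.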

The command I used:
eksctl anywhere upgrade cluster -f eksa-mgmt02-md-cluster-sc.yaml --hardware-csv hardware-mgmt02-cp3.csv --no-timeouts -v 9

The upgrade gets stuck, repeating the following errors:

...
2024-04-15T07:16:38.920Z	V6	Executing command	{"cmd": "/usr/bin/docker exec -i eksa_1713165386699246825 kubectl get --ignore-not-found -o json --kubeconfig /home/armada/eksa/mgmt02/mgmt02/mgmt02-eks-a-cluster.kubeconfig Cluster.v1alpha1.anywhere.eks.amazonaws.com --namespace default mgmt02"}
2024-04-15T07:16:39.030Z	V9	Cluster generation and observedGeneration	{"Generation": 2, "ObservedGeneration": 2}
2024-04-15T07:16:39.031Z	V5	Error happened during retry	{"error": "cluster has an error: hardware validation failure: for rolling upgrade, minimum hardware count not met for selector '{\"type\":\"cp\"}': have 0, require 1", "retries": 1}
2024-04-15T07:16:39.031Z	V5	Sleeping before next retry	{"time": "1s"}
...
2024-04-15T07:32:06.179Z	V6	Executing command	{"cmd": "/usr/bin/docker exec -i eksa_1713165386699246825 kubectl get --ignore-not-found -o json --kubeconfig /home/armada/eksa/mgmt02/mgmt02/mgmt02-eks-a-cluster.kubeconfig Cluster.v1alpha1.anywhere.eks.amazonaws.com --namespace default mgmt02"}
2024-04-15T07:32:06.285Z	V9	Cluster generation and observedGeneration	{"Generation": 2, "ObservedGeneration": 2}
2024-04-15T07:32:06.285Z	V5	Error happened during retry	{"error": "cluster has an error: hardware validation failure: for rolling upgrade, minimum hardware count not met for selector '{\"type\":\"cp\"}': have 0, require 1", "retries": 832}
2024-04-15T07:32:06.286Z	V5	Sleeping before next retry	{"time": "1s"}
...
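Since the same error repeats for hundreds of retries, a small parser can pull the key numbers out of a saved log. The following is a sketch that assumes each log line is tab-separated (timestamp, verbosity level, message, JSON fields), which is how the excerpt above appears to be laid out. The names `parse_line` and `hardware_shortfall` are hypothetical helpers, not part of `eksctl anywhere`.

```python
import json
import re

# Matches the tail of the hardware validation error seen in the log excerpt.
HW_ERROR = re.compile(
    r"selector '(?P<selector>.*?)': have (?P<have>\d+), require (?P<require>\d+)"
)


def parse_line(line):
    """Split one log line into (timestamp, level, message, fields) or None."""
    parts = line.rstrip("\n").split("\t", 3)
    if len(parts) < 3:
        return None
    fields = {}
    if len(parts) == 4:
        try:
            fields = json.loads(parts[3])
        except json.JSONDecodeError:
            fields = {}
    if not isinstance(fields, dict):
        fields = {}
    return parts[0], parts[1], parts[2], fields


def hardware_shortfall(line):
    """Return selector, have, require, and retries for a hardware count error."""
    parsed = parse_line(line)
    if parsed is None:
        return None
    fields = parsed[3]
    match = HW_ERROR.search(str(fields.get("error", "")))
    if match is None:
        return None
    return {
        "selector": json.loads(match["selector"]),
        "have": int(match["have"]),
        "require": int(match["require"]),
        "retries": fields.get("retries"),
    }


# Rebuild one line from the excerpt above, joined with tabs.
example = "\t".join([
    "2024-04-15T07:16:39.031Z",
    "V5",
    "Error happened during retry",
    json.dumps({
        "error": "cluster has an error: hardware validation failure: "
                 "for rolling upgrade, minimum hardware count not met for "
                 "selector '{\"type\":\"cp\"}': have 0, require 1",
        "retries": 1,
    }),
])

print(hardware_shortfall(example))
```

Running this over the whole log makes it easy to check whether the `have` value ever changes between retries.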

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • EKS Anywhere Release: v0.18.7
  • EKS Distro Release: 1.26/1.27

@mitalipaygude (Member)

At first glance, this looks like a combined rolling and scaling upgrade request. That means 2 extra hardware machines were expected for the upgrade: 1 for the scale-up and 1 spare for the rolling upgrade from 1.26 to 1.27. My guess is that the extra hardware you added to the CSV was used for the new CP node, so EKS Anywhere didn't have enough left for the rolling upgrade.
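The arithmetic in this comment can be written out as a short sketch. It mirrors only the reasoning stated here (one machine per added node, plus one spare when the Kubernetes version changes) and is not EKS Anywhere's actual validation logic. The function `extra_hardware_needed` and its parameters are hypothetical.

```python
def extra_hardware_needed(current_nodes, desired_nodes, version_changed,
                          spares_for_rollout=1):
    """Estimate idle machines needed, following the reasoning in the comment."""
    # One idle machine for every node added by the scale-up.
    scale_up = max(desired_nodes - current_nodes, 0)
    # Assumption: a version change rolls nodes one at a time, needing one spare.
    rollout = spares_for_rollout if version_changed else 0
    return scale_up + rollout


# Scale-up by one node combined with a version upgrade, as the comment describes.
print(extra_hardware_needed(current_nodes=1, desired_nodes=2, version_changed=True))
```

With no change in node count and only a version change, the same function yields a single spare, which matches the one-spare requirement quoted from the documentation later in the thread.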

@ygao-armada (Author)

@mitalipaygude I just tried again with 1 more machine (1 existing host + 2 idle hosts), and the same issue shows up:

2024-04-14T08:59:52.814Z	V6	Executing command	{"cmd": "/usr/bin/docker exec -i eksa_1713084293614893292 kubectl get --ignore-not-found -o json --kubeconfig /home/armada/eksa/mgmt02/mgmt02/mgmt02-eks-a-cluster.kubeconfig Cluster.v1alpha1.anywhere.eks.amazonaws.com --namespace default mgmt02"}
2024-04-14T08:59:52.934Z	V9	Cluster generation and observedGeneration	{"Generation": 3, "ObservedGeneration": 3}
2024-04-14T08:59:52.934Z	V5	Error happened during retry	{"error": "cluster has an error: hardware validation failure: for rolling upgrade, minimum hardware count not met for selector '{\"type\":\"cp\"}': have 0, require 1", "retries": 794}
2024-04-14T08:59:52.934Z	V5	Sleeping before next retry	{"time": "1s"}

At the same time, according to https://anywhere.eks.amazonaws.com/docs/clustermgmt/cluster-upgrades/baremetal-upgrades/:

EKS Anywhere upgrades on Bare Metal require at least one spare hardware server for control plane upgrade and one for each worker node group upgrade.

So 1 spare machine should be enough.
