Hetzner arm nodes not joining cluster consistently #16491

Open
MTRNord opened this issue Apr 24, 2024 · 6 comments
Labels
kind/support Categorizes issue or PR as a support question.

Comments

@MTRNord

MTRNord commented Apr 24, 2024

/kind bug

1. What kops version are you running? The command kops version, will display
this information.

Client version: 1.29.0-beta.1 (git-v1.29.0-beta.1-154-g87a0483ca3)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.6

3. What cloud provider are you using?
Hetzner

4. What commands did you run? What is the simplest way to reproduce this issue?
kops create cluster --name=cluster-example.k8s.local --ssh-public-key=~/.ssh/id_ed25519.pub --cloud=hetzner --zones=hel1 --networking=cilium --network-cidr=10.10.0.0/16 --node-count=2 --control-plane-count=3 --control-plane-zones=hel1,fsn1 --node-size=cax21 --control-plane-size cax11

5. What happened after the commands executed?
All nodes and resources are created; however, validation fails. One node only joined after 3 recreations. The other one doesn't join at all:

⬢ [fedora-toolbox:39] ❯ kops validate cluster --wait 10m
I0424 21:34:40.295844 1619707 featureflag.go:168] FeatureFlag "Scaleway"=true
Validating cluster midnightthoughts.k8s.local

INSTANCE GROUPS
NAME			ROLE		MACHINETYPE	MIN	MAX	SUBNETS
control-plane-fsn1-1	ControlPlane	cax11		1	1	fsn1
control-plane-hel1-1	ControlPlane	cax11		1	1	hel1
control-plane-hel1-2	ControlPlane	cax11		1	1	hel1
nodes-hel1		Node		cax21		2	2	hel1

NODE STATUS
NAME					ROLE		READY
control-plane-fsn1-1-4c0c2fca48e4d3ea	control-plane	True
control-plane-hel1-1-4d7606e4b08b2273	control-plane	True
control-plane-hel1-2-67426331b523d69c	control-plane	True
nodes-hel1-32dbe2a7d622155d		node		True

VALIDATION ERRORS
KIND	NAME		MESSAGE
Machine	46484806	machine "46484806" has not yet joined cluster

Validation Failed

6. What did you expect to happen?

All nodes join

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2024-04-24T19:07:04Z"
  generation: 2
  name: cluster-example.k8s.local
spec:
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  certManager:
    enabled: true
  channel: stable
  cloudProvider: hetzner
  configBase: scw://kops-cluster-example/cluster-example.k8s.local
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: control-plane-hel1-1
      name: hel1-1
    - instanceGroup: control-plane-fsn1-1
      name: fsn1-1
    - instanceGroup: control-plane-hel1-2
      name: hel1-2
    manager:
      backupRetentionDays: 90
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: control-plane-hel1-1
      name: hel1-1
    - instanceGroup: control-plane-fsn1-1
      name: fsn1-1
    - instanceGroup: control-plane-hel1-2
      name: hel1-2
    manager:
      backupRetentionDays: 90
    memoryRequest: 100Mi
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeDNS:
    nodeLocalDNS:
      cpuRequest: 25m
      enabled: true
      memoryRequest: 5Mi
    provider: CoreDNS
  kubeProxy:
    enabled: false
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - 0.0.0.0/0
  - ::/0
  kubernetesVersion: 1.28.6
  metricsServer:
    enabled: true
  networkCIDR: 10.10.0.0/16
  networking:
    cilium:
      enableNodePort: true
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  - ::/0
  sshKeyName: username@example.com
  subnets:
  - name: fsn1
    type: Public
    zone: fsn1
  - name: hel1
    type: Public
    zone: hel1
  topology:
    dns:
      type: None

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-04-24T19:07:05Z"
  labels:
    kops.k8s.io/cluster: cluster-example.k8s.local
  name: control-plane-fsn1-1
spec:
  image: ubuntu-22.04
  machineType: cax11
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - fsn1

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-04-24T19:07:05Z"
  labels:
    kops.k8s.io/cluster: cluster-example.k8s.local
  name: control-plane-hel1-1
spec:
  image: ubuntu-22.04
  machineType: cax11
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - hel1

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-04-24T19:07:05Z"
  labels:
    kops.k8s.io/cluster: cluster-example.k8s.local
  name: control-plane-hel1-2
spec:
  image: ubuntu-22.04
  machineType: cax11
  maxSize: 1
  minSize: 1
  role: Master
  subnets:
  - hel1

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-04-24T19:07:06Z"
  labels:
    kops.k8s.io/cluster: cluster-example.k8s.local
  name: nodes-hel1
spec:
  image: ubuntu-22.04
  machineType: cax21
  maxSize: 2
  minSize: 2
  role: Node
  subnets:
  - hel1

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know?

Additionally, the SSH key does not seem to get applied. Trying to SSH in only yields a password prompt; the SSH key is not accepted.

This was tried well beyond the 10m mark.

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 24, 2024
@MTRNord
Author

MTRNord commented Apr 24, 2024

Seems like after the 4th full delete and recreate it works again. I wonder if this is related to #15806.

@hakman
Member

hakman commented Apr 25, 2024

Seems like after the 4th full delete and recreate it works again. I wonder if this is related to #15806.

Only one way to find out. Please check the kops-configuration logs on failed nodes.
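
For anyone following along, here is a rough sketch of how such logs could be narrowed down to warnings and errors. It assumes the log text has already been captured from the failed node (for example from the journal of the kops-configuration unit) and that nodeup uses klog-style prefixes such as `W0427 21:29:16.747301`, as in the log lines posted later in this thread. The name `problem_lines` and the sample text are made up for illustration.

```python
import re

# klog-style header: severity letter (I/W/E/F), MMDD, then a timestamp.
# Assumption: nodeup log lines carry this header after the journal prefix.
KLOG_HEADER = re.compile(r"\b([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d+)")

def problem_lines(journal_text):
    """Return only the lines logged at warning, error, or fatal severity."""
    hits = []
    for line in journal_text.splitlines():
        match = KLOG_HEADER.search(line)
        if match and match.group(1) in "WEF":
            hits.append(line)
    return hits

# Illustrative sample, shortened from the kind of output nodeup produces.
sample = """\
Apr 27 21:29:16 node nodeup[1091]: I0427 21:29:16.539684    1091 http.go:82] Downloading "https://example.invalid/file.tgz"
Apr 27 21:29:16 node nodeup[1091]: W0427 21:29:16.747301    1091 assetstore.go:251] error downloading url: HTTP 403
"""

for line in problem_lines(sample):
    print(line)
```

This only filters by severity; the informational lines around a warning often carry useful context too, so it is a starting point rather than a replacement for reading the full log.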
Also, I did not test the --zones=hel1,fsn part so not sure if it works with 2 regions.

@MTRNord
Author

MTRNord commented Apr 27, 2024

I will have a look when it fails again. It took me some time to realise that the user to connect as is root, not ubuntu, on the Hetzner instances.

Also, I did not test the --zones=hel1,fsn part so not sure if it works with 2 regions.

For the nodes it fails with a hard error, but for the control plane it seems to work just fine. No errors or issues as far as I have been able to tell. All servers spawn and Kubernetes reports that everything is happy. I have had no workload on the cluster so far, though, so it might have bugs I haven't seen yet. But I am doubtful that there are any.

@MTRNord
Author

MTRNord commented Apr 27, 2024

This time it's a control-plane node. It seems to fail on this:

Apr 27 21:29:16 control-plane-fsn1-5c11fa08140f0e98 nodeup[1091]: I0427 21:29:16.539628    1091 files.go:136] Hash did not match for "/var/cache/nodeup/sha256:525e2b62ba92a1b6f3dc9612449a84aa61652e680f7ebf4eff579795fe464b57_cni-plugins-linux-arm64-v1_2_0_tgz": actual=sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 vs expected=sha256:525e2b62ba92a1b6f3dc9612449a84aa61652e680f7ebf4eff579795fe464b57
Apr 27 21:29:16 control-plane-fsn1-5c11fa08140f0e98 nodeup[1091]: I0427 21:29:16.539684    1091 http.go:82] Downloading "https://storage.googleapis.com/k8s-artifacts-cni/release/v1.2.0/cni-plugins-linux-arm64-v1.2.0.tgz"
Apr 27 21:29:16 control-plane-fsn1-5c11fa08140f0e98 nodeup[1091]: W0427 21:29:16.747301    1091 assetstore.go:251] error downloading url "https://storage.googleapis.com/k8s-artifacts-cni/release/v1.2.0/cni-plugins-linux-arm64-v1.2.0.tgz": error response from "https://storage.googleapis.com/k8s-artifacts-cni/release/v1.2.0/cni-plugins-linux-arm64-v1.2.0.tgz": HTTP 403
Apr 27 21:29:16 control-plane-fsn1-5c11fa08140f0e98 nodeup[1091]: W0427 21:29:16.747362    1091 main.go:133] got error running nodeup (will retry in 30s): error adding asset "525e2b62ba92a1b6f3dc9612449a84aa61652e680f7ebf4eff579795fe464b57@https://storage.googleapis.com/k8s-artifacts-cni/release/v1.2.0/cni-plugins-linux-arm64-v1.2.0.tgz": error response from "https://storage.googleapis.com/k8s-artifacts-cni/release/v1.2.0/cni-plugins-linux-arm64-v1.2.0.tgz": HTTP 403

as the server responds with <?xml version='1.0' encoding='UTF-8'?><Error><Code>AccessDenied</Code><Message>Access denied.</Message><Details>We're sorry, but this service is not available in your location</Details></Error>, which means this is the same bug as #15806: Hetzner has IPs which MaxMind sadly recognises as being in Iran despite them not being there. (I dealt with this before with Docker and it was a huge hassle to get them to update the IP. It took me multiple explanations to get that fixed.)
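
One detail in the log above is worth spelling out: the `actual` digest, `e3b0c442...b855`, is the SHA-256 of zero bytes of input. That suggests the cached file was empty rather than corrupted, which would fit a download that never delivered any content. A short Python check (variable names are illustrative):

```python
import hashlib

# Digest reported as "actual" in the nodeup log line.
actual_from_log = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"

# SHA-256 of an empty byte string, i.e. of a zero-length file.
empty_digest = hashlib.sha256(b"").hexdigest()

print(empty_digest == actual_from_log)
```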

TLDR: As a workaround, deleting the server and updating to reinit the rest might be easiest here.
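
Before deleting a server, it may help to confirm that the node's IP is actually affected. Below is a minimal sketch, assuming Python 3 is available on the node and using only the standard library. `fetch_status` and `looks_geo_blocked` are hypothetical helper names, and the body check is a heuristic based on the error message quoted above, not a documented contract of the storage service.

```python
import urllib.error
import urllib.request

# Asset URL taken from the nodeup log in this thread.
ASSET_URL = ("https://storage.googleapis.com/k8s-artifacts-cni/release/"
             "v1.2.0/cni-plugins-linux-arm64-v1.2.0.tgz")

def fetch_status(url, timeout=10):
    """Return (HTTP status, first 1 KiB of the body as text) for a GET request."""
    request = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read(1024).decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions but still carry a body.
        return err.code, err.read(1024).decode("utf-8", "replace")

def looks_geo_blocked(status, body):
    """Heuristic: HTTP 403 plus the location message seen in this thread."""
    return status == 403 and "not available in your location" in body
```

Running `looks_geo_blocked(*fetch_status(ASSET_URL))` on the affected node would be expected to return `True` if the block is in place; any other 403 or a network error would need a closer look.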

@hakman
Member

hakman commented Apr 28, 2024

TLDR: As a workaround, deleting the server and updating to reinit the rest might be easiest here.

That is pretty much the path of least resistance.
You may also want to take a look at another issue for some suggestions #16466 (comment).

@hakman hakman added kind/support Categorizes issue or PR as a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels Apr 28, 2024
@rehashedsalt

Be sure to get in touch with Hetzner via a support ticket if you get bitten by a blocked IP. The best odds we have of them no longer being blackholed by Google are if Hetzner reaches out to Google to see what the deal is.
