
rook-ceph-osd-prepare job pod name generation fails on kubernetes 1.22.4 #9294

Closed
oguzdag opened this issue Dec 2, 2021 · 8 comments · Fixed by #9312


oguzdag commented Dec 2, 2021

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:

When we try to initialize rook-ceph, the osd-prepare job fails to create its pod with the error below:

Error creating: Pod "rook-ceph-osd-prepare-ip-10-8-35-17.eu-west-2.compute.--1-f88dv" is invalid: [metadata.generateName: Invalid value: "rook-ceph-osd-prepare-ip-10-8-35-17.eu-west-2.compute.--1-": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), metadata.name: Invalid value: "rook-ceph-osd-prepare-ip-10-8-35-17.eu-west-2.compute.--1-f88dv": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')]

Expected behavior:

It should create the osd-prepare pods.

How to reproduce it (minimal and precise):

On AWS, install Kubernetes 1.22.4 and the rook-ceph operator. If your node name has a certain length (such that truncating the generated pod name ends on a '.'), it fails.

Here are some examples:

nodename: ip-10-8-34-49.eu-west-2.compute.internal
pod name: rook-ceph-osd-prepare-ip-10-8-34-49.eu-west-2.compute.--1-hqvjs (k8s 1.22.4)
result: fails

nodename: ip-10-8-33-209.eu-west-2.compute.internal
pod name: rook-ceph-osd-prepare-ip-10-8-33-209.eu-west-2.compute.intsjlrb (k8s 1.21.5)
result: success

nodename: ip-10-8-34-9.eu-west-2.compute.internal
pod name: rook-ceph-osd-prepare-ip-10-8-34-9.eu-west-2.compute.i--1-tcg4v (k8s 1.22.4)
result: success

nodename: ip-10-8-35-17.eu-west-2.compute.internal
pod name: rook-ceph-osd-prepare-ip-10-8-35-17.eu-west-2.compute.--1-b8d5x (k8s 1.22.4)
result: fails

So as you can see, Kubernetes takes the job name, truncates it at 54 characters, and appends a suffix ending in a random alphanumeric string. On 1.22.4 that suffix starts with a dash, so if the truncated job name ends with '.', the generated pod name violates RFC 1123 (a label cannot start with '-') and pod creation fails.

It doesn't happen on 1.21.x because there the appended suffix consists only of alphanumeric characters (no dashes).
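The failure mode described above can be sketched in Go. This is illustrative only, not Rook's or Kubernetes' actual code: truncatedGenerateName is a hypothetical helper, the 54-character cut and the "--1-" suffix shape are taken from the examples in this issue, and the regex mirrors the one quoted in the error message.

```go
package main

import (
	"fmt"
	"regexp"
)

// rfc1123Subdomain mirrors the validation regex quoted in the error message.
var rfc1123Subdomain = regexp.MustCompile(
	`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)

// truncatedGenerateName is a hypothetical helper reproducing the observed
// behavior: the job name is cut at 54 characters and a suffix is appended.
// On K8s 1.22.4 the observed suffix starts with "--1-", so if the truncated
// name ends with '.', the pod name contains a label starting with '-',
// which is not a valid RFC 1123 subdomain.
func truncatedGenerateName(jobName string) string {
	if len(jobName) > 54 {
		jobName = jobName[:54]
	}
	// "--1-" plus a fixed stand-in for the 5 random characters.
	return jobName + "--1-f88dv"
}

func main() {
	for _, node := range []string{
		"ip-10-8-35-17.eu-west-2.compute.internal", // truncation ends on '.'
		"ip-10-8-34-9.eu-west-2.compute.internal",  // one char shorter: OK
	} {
		name := truncatedGenerateName("rook-ceph-osd-prepare-" + node)
		fmt.Printf("%s valid=%v\n", name, rfc1123Subdomain.MatchString(name))
	}
}
```

Running this shows the first node producing an invalid name and the second a valid one, matching the table of examples above.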

File(s) to submit:

  • Cluster CR (custom resource), typically called cluster.yaml, if necessary
  • Operator's logs, if necessary
  • Crashing pod(s) logs, if necessary

To get logs, use kubectl -n <namespace> logs <pod name>
When pasting logs, always surround them with backticks or use the insert-code button from the GitHub UI.
Read the GitHub documentation if you need help.

Environment:

  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Cloud provider or hardware configuration:
  • Rook version (use rook version inside of a Rook Pod):
  • Storage backend version (e.g. for ceph do ceph -v):
  • Kubernetes version (use kubectl version):
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift):
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
@oguzdag oguzdag added the bug label Dec 2, 2021

travisn commented Dec 2, 2021

I am able to repro if the 32nd character of the hostname is a '.'. In minikube this is simple to set up by relabeling the node (the name repeats ten letters so it is easy to see that the 32nd character is a '.'):

kubectl label node minikube kubernetes.io/hostname=abcdefghijabcdefghijabcdefghija.foo --overwrite=true

Currently Rook shortens the job name using a hash if the total job name would be longer than 63 characters. Now it appears we need to reduce that limit further, to no more than 53 characters, so the truncated name leaves room for the longer pod-name suffix.

@travisn travisn added this to To do in v1.8 via automation Dec 2, 2021
@travisn travisn added this to To do in v1.7 via automation Dec 3, 2021
@travisn travisn moved this from To do to Blocking Release in v1.7 Dec 3, 2021
@travisn travisn moved this from To do to Blocking Release in v1.8 Dec 3, 2021

ibotty commented Dec 3, 2021

Is there a workaround? Re-deploying the node is not an option.


travisn commented Dec 3, 2021

This issue was introduced by the job pod name generation changes in K8s 1.22 and is resolved by kubernetes/kubernetes#105676. I'm following up to see whether that fix can be backported to 1.22 (currently it appears to land only in 1.23), and will look for a workaround in Rook in the meantime.


travisn commented Dec 3, 2021

We will get a fix out for this early next week with #9312. The only workaround I can think of is if you can rename the kubernetes.io/hostname label on the k8s node, which likely isn't an option.

v1.7 automation moved this from Blocking Release to Done Dec 7, 2021
v1.8 automation moved this from Blocking Release to Done Dec 7, 2021

ibotty commented Dec 7, 2021

🎆 Now we only need a bugfix release. Thank you for the prompt response!


travisn commented Dec 7, 2021

Truncate job name of the osd prepare job further to avoid pod generation failure on K8s 1.22 #9312

v1.7.10 may not be until next week. In the meantime, could you test an interim 1.7 build and confirm that it is working for you? The image is: rook/ceph:v1.7.9-5.ga56f8a1. Thanks!


ibotty commented Dec 7, 2021

I figured I'd check whether any images had been built automatically, and I found rook/ceph:v1.7.9-5.ga56f8a1 myself. It starts the job just fine. I still have to look into why it isn't deploying OSDs on PVCs, but this bug is fixed. Thank you for your prompt reply!


oguzdag commented Dec 9, 2021

Hi, thanks for the fix. We tested with rook/ceph:v1.7.9-5.ga56f8a1 and the problem is gone; we can now run osd-prepare on any node. Thanks again!
