Errors inside Operator Pod #14011

Open · pomland-94 opened this issue Apr 2, 2024 · 23 comments · Labels: bug

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:
Errors inside the Operator pod.

Expected behavior:
No recurring errors inside the Operator pod.

Inside the Operator pod I get lots of errors every second:

2024-04-02 16:53:32.659044 I | ceph-cluster-controller: reconciling ceph cluster in namespace "rook-ceph"
2024-04-02 16:53:32.669172 I | ceph-spec: parsing mon endpoints: c=10.233.56.172:6789,a=10.233.9.37:6789,b=10.233.15.105:6789
2024-04-02 16:53:32.717526 I | ceph-spec: detecting the ceph image version for image quay.io/ceph/ceph:v18.2.2...
2024-04-02 16:53:32.729139 I | op-k8sutil: Retrying 20 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:53:34.413442 I | ceph-spec: parsing mon endpoints: c=10.233.56.172:6789,a=10.233.9.37:6789,b=10.233.15.105:6789
2024-04-02 16:53:34.440048 I | ceph-csi: successfully created csi config map "rook-ceph-csi-config"
2024-04-02 16:53:34.454109 I | ceph-csi: Kubernetes version is 1.28
2024-04-02 16:53:34.469170 I | ceph-csi: detecting the ceph csi image version for image "quay.io/cephcsi/cephcsi:v3.10.2"
2024-04-02 16:53:34.482537 I | op-k8sutil: Retrying 20 more times every 2 seconds for ConfigMap rook-ceph-csi-detect-version to be deleted
2024-04-02 16:53:34.857433 I | op-k8sutil: Retrying 19 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:53:36.489466 I | op-k8sutil: Retrying 19 more times every 2 seconds for ConfigMap rook-ceph-csi-detect-version to be deleted
2024-04-02 16:53:36.863317 I | op-k8sutil: Retrying 18 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:53:38.494976 I | op-k8sutil: Retrying 18 more times every 2 seconds for ConfigMap rook-ceph-csi-detect-version to be deleted
2024-04-02 16:53:38.870371 I | op-k8sutil: Retrying 17 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:53:40.500589 I | op-k8sutil: Retrying 17 more times every 2 seconds for ConfigMap rook-ceph-csi-detect-version to be deleted
2024-04-02 16:53:40.877128 I | op-k8sutil: Retrying 16 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:53:42.506535 I | op-k8sutil: Retrying 16 more times every 2 seconds for ConfigMap rook-ceph-csi-detect-version to be deleted
2024-04-02 16:53:42.883310 I | op-k8sutil: Retrying 15 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:53:44.513748 I | op-k8sutil: Retrying 15 more times every 2 seconds for ConfigMap rook-ceph-csi-detect-version to be deleted
2024-04-02 16:53:44.891585 I | op-k8sutil: Retrying 14 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:53:46.522098 I | op-k8sutil: Retrying 14 more times every 2 seconds for ConfigMap rook-ceph-csi-detect-version to be deleted
2024-04-02 16:53:46.897111 I | op-k8sutil: Retrying 13 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:53:48.529525 I | op-k8sutil: Retrying 13 more times every 2 seconds for ConfigMap rook-ceph-csi-detect-version to be deleted
2024-04-02 16:53:48.914032 I | op-k8sutil: Retrying 12 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:53:50.535326 I | op-k8sutil: Retrying 12 more times every 2 seconds for ConfigMap rook-ceph-csi-detect-version to be deleted
2024-04-02 16:53:50.919119 I | op-k8sutil: Retrying 11 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:53:52.541171 I | op-k8sutil: Retrying 11 more times every 2 seconds for ConfigMap rook-ceph-csi-detect-version to be deleted
2024-04-02 16:53:52.924943 I | op-k8sutil: Retrying 10 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:53:54.547809 I | op-k8sutil: Retrying 10 more times every 2 seconds for ConfigMap rook-ceph-csi-detect-version to be deleted
2024-04-02 16:53:54.929736 I | op-k8sutil: Retrying 9 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:53:56.555860 I | op-k8sutil: Retrying 9 more times every 2 seconds for ConfigMap rook-ceph-csi-detect-version to be deleted
2024-04-02 16:53:56.935528 I | op-k8sutil: Retrying 8 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:53:58.562312 I | op-k8sutil: Retrying 8 more times every 2 seconds for ConfigMap rook-ceph-csi-detect-version to be deleted
2024-04-02 16:53:58.943313 I | op-k8sutil: Retrying 7 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:00.568506 I | op-k8sutil: Retrying 7 more times every 2 seconds for ConfigMap rook-ceph-csi-detect-version to be deleted
2024-04-02 16:54:00.947542 I | op-k8sutil: Retrying 6 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:02.381091 W | op-mon: failed to check mon health. skipping mon health check since cluster details are not initialized: cluster fsid is empty
2024-04-02 16:54:02.573903 I | op-k8sutil: Retrying 6 more times every 2 seconds for ConfigMap rook-ceph-csi-detect-version to be deleted
2024-04-02 16:54:02.953034 I | op-k8sutil: Retrying 5 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:04.579274 I | op-k8sutil: Retrying 5 more times every 2 seconds for ConfigMap rook-ceph-csi-detect-version to be deleted
2024-04-02 16:54:04.959124 I | op-k8sutil: Retrying 4 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:06.584083 I | op-k8sutil: Retrying 4 more times every 2 seconds for ConfigMap rook-ceph-csi-detect-version to be deleted
2024-04-02 16:54:06.966232 I | op-k8sutil: Retrying 3 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:08.590552 I | op-k8sutil: Retrying 3 more times every 2 seconds for ConfigMap rook-ceph-csi-detect-version to be deleted
2024-04-02 16:54:08.973765 I | op-k8sutil: Retrying 2 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:10.595762 I | op-k8sutil: Retrying 2 more times every 2 seconds for ConfigMap rook-ceph-csi-detect-version to be deleted
2024-04-02 16:54:10.980891 I | op-k8sutil: Retrying 1 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:12.608445 I | op-k8sutil: Retrying 1 more times every 2 seconds for ConfigMap rook-ceph-csi-detect-version to be deleted
2024-04-02 16:54:12.986408 I | op-k8sutil: Retrying 0 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:13.009340 E | ceph-cluster-controller: failed to reconcile CephCluster "rook-ceph/rook-ceph". failed to reconcile cluster "rook-ceph": failed to configure local ceph cluster: failed the ceph version check: failed to complete ceph version job: failed to run CmdReporter rook-ceph-detect-version successfully. failed to delete existing results ConfigMap rook-ceph-detect-version. failed to delete ConfigMap rook-ceph-detect-version. gave up waiting after 20 retries every 2ns seconds. <nil>
2024-04-02 16:54:13.260479 I | ceph-cluster-controller: reconciling ceph cluster in namespace "rook-ceph"
2024-04-02 16:54:13.269707 I | ceph-spec: parsing mon endpoints: c=10.233.56.172:6789,a=10.233.9.37:6789,b=10.233.15.105:6789
2024-04-02 16:54:13.313642 I | ceph-spec: detecting the ceph image version for image quay.io/ceph/ceph:v18.2.2...
2024-04-02 16:54:13.325862 I | op-k8sutil: Retrying 20 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:14.614986 I | op-k8sutil: Retrying 0 more times every 2 seconds for ConfigMap rook-ceph-csi-detect-version to be deleted
2024-04-02 16:54:14.615069 E | ceph-csi: failed to reconcile failed to configure ceph csi: invalid csi version: failed to complete ceph CSI version job: failed to run CmdReporter rook-ceph-csi-detect-version successfully. failed to delete existing results ConfigMap rook-ceph-csi-detect-version. failed to delete ConfigMap rook-ceph-csi-detect-version. gave up waiting after 20 retries every 2ns seconds. <nil>
2024-04-02 16:54:15.333160 I | op-k8sutil: Retrying 19 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:17.340337 I | op-k8sutil: Retrying 18 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:19.346484 I | op-k8sutil: Retrying 17 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:21.357132 I | op-k8sutil: Retrying 16 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:23.363094 I | op-k8sutil: Retrying 15 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:25.368121 I | op-k8sutil: Retrying 14 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:27.374638 I | op-k8sutil: Retrying 13 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:29.381994 I | op-k8sutil: Retrying 12 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:31.388744 I | op-k8sutil: Retrying 11 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:33.395158 I | op-k8sutil: Retrying 10 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:35.401195 I | op-k8sutil: Retrying 9 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:37.405840 I | op-k8sutil: Retrying 8 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:39.412930 I | op-k8sutil: Retrying 7 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:41.422608 I | op-k8sutil: Retrying 6 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:43.428241 I | op-k8sutil: Retrying 5 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:45.434968 I | op-k8sutil: Retrying 4 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:47.381297 W | op-mon: failed to check mon health. skipping mon health check since cluster details are not initialized: cluster fsid is empty
2024-04-02 16:54:47.441551 I | op-k8sutil: Retrying 3 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:49.447974 I | op-k8sutil: Retrying 2 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:51.453672 I | op-k8sutil: Retrying 1 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:53.460848 I | op-k8sutil: Retrying 0 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:54:53.479991 E | ceph-cluster-controller: failed to reconcile CephCluster "rook-ceph/rook-ceph". failed to reconcile cluster "rook-ceph": failed to configure local ceph cluster: failed the ceph version check: failed to complete ceph version job: failed to run CmdReporter rook-ceph-detect-version successfully. failed to delete existing results ConfigMap rook-ceph-detect-version. failed to delete ConfigMap rook-ceph-detect-version. gave up waiting after 20 retries every 2ns seconds. <nil>
2024-04-02 16:55:17.931496 I | ceph-cluster-controller: reconciling ceph cluster in namespace "rook-ceph"
2024-04-02 16:55:17.940030 I | ceph-spec: parsing mon endpoints: c=10.233.56.172:6789,a=10.233.9.37:6789,b=10.233.15.105:6789
2024-04-02 16:55:17.988806 I | ceph-spec: detecting the ceph image version for image quay.io/ceph/ceph:v18.2.2...
2024-04-02 16:55:18.001062 I | op-k8sutil: Retrying 20 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:55:20.008582 I | op-k8sutil: Retrying 19 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:55:22.014816 I | op-k8sutil: Retrying 18 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:55:24.020794 I | op-k8sutil: Retrying 17 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:55:26.027128 I | op-k8sutil: Retrying 16 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:55:28.033210 I | op-k8sutil: Retrying 15 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:55:30.039506 I | op-k8sutil: Retrying 14 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:55:32.045575 I | op-k8sutil: Retrying 13 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:55:32.381854 W | op-mon: failed to check mon health. skipping mon health check since cluster details are not initialized: cluster fsid is empty
2024-04-02 16:55:34.052073 I | op-k8sutil: Retrying 12 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:55:36.058810 I | op-k8sutil: Retrying 11 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:55:38.064947 I | op-k8sutil: Retrying 10 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:55:40.070091 I | op-k8sutil: Retrying 9 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:55:42.076613 I | op-k8sutil: Retrying 8 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:55:44.082009 I | op-k8sutil: Retrying 7 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:55:46.087353 I | op-k8sutil: Retrying 6 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:55:48.092959 I | op-k8sutil: Retrying 5 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:55:50.098485 I | op-k8sutil: Retrying 4 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:55:52.104935 I | op-k8sutil: Retrying 3 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:55:54.111029 I | op-k8sutil: Retrying 2 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:55:56.117623 I | op-k8sutil: Retrying 1 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:55:58.122826 I | op-k8sutil: Retrying 0 more times every 2 seconds for ConfigMap rook-ceph-detect-version to be deleted
2024-04-02 16:55:58.142788 E | ceph-cluster-controller: failed to reconcile CephCluster "rook-ceph/rook-ceph". failed to reconcile cluster "rook-ceph": failed to configure local ceph cluster: failed the ceph version check: failed to complete ceph version job: failed to run CmdReporter rook-ceph-detect-version successfully. failed to delete existing results ConfigMap rook-ceph-detect-version. failed to delete ConfigMap rook-ceph-detect-version. gave up waiting after 20 retries every 2ns seconds. <nil>
2024-04-02 16:56:17.382850 W | op-mon: failed to check mon health. skipping mon health check since cluster details are not initialized: cluster fsid is empty
2024-04-02 16:57:02.383047 W | op-mon: failed to check mon health. skipping mon health check since cluster details are not initialized: cluster fsid is empty
2024-04-02 16:57:47.383832 W | op-mon: failed to check mon health. skipping mon health check since cluster details are not initialized: cluster fsid is empty

How to reproduce it (minimal and precise):

  • Kubernetes v1.28.6
  • OS Debian 12 Bookworm
  • Rook 1.13.7

Installed Rook with these commands:

$ git clone --single-branch --branch v1.13.7 https://github.com/rook/rook.git
$ cd rook/deploy/examples
$ kubectl create -f crds.yaml -f common.yaml -f operator.yaml
$ kubectl create -f cluster.yaml
  • Cluster CR (custom resource), typically called cluster.yaml, if necessary
#################################################################################################################
# Define the settings for the rook-ceph cluster with common settings for a production cluster.
# All nodes with available raw devices will be used for the Ceph cluster. At least three nodes are required
# in this example. See the documentation for more details on storage settings available.

# For example, to create the cluster:
#   kubectl create -f crds.yaml -f common.yaml -f operator.yaml
#   kubectl create -f cluster.yaml
#################################################################################################################

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph # namespace:cluster
spec:
  cephVersion:
    # The container image used to launch the Ceph daemon pods (mon, mgr, osd, mds, rgw).
    # v17 is Quincy, v18 is Reef.
    # RECOMMENDATION: In production, use a specific version tag instead of the general v17 flag, which pulls the latest release and could result in different
    # versions running within the cluster. See tags available at https://hub.docker.com/r/ceph/ceph/tags/.
    # If you want to be more precise, you can always use a timestamp tag such as quay.io/ceph/ceph:v18.2.2-20240311
    # This tag might not contain a new Ceph version, just security fixes from the underlying operating system, which will reduce vulnerabilities
    image: quay.io/ceph/ceph:v18.2.2
    # Whether to allow unsupported versions of Ceph. Currently `quincy` and `reef` are supported.
    # Future versions such as `squid` (v19) would require this to be set to `true`.
    # Do not set to true in production.
    allowUnsupported: false
  # The path on the host where configuration files will be persisted. Must be specified.
  # Important: if you reinstall the cluster, make sure you delete this directory from each host or else the mons will fail to start on the new cluster.
  # In Minikube, the '/data' directory is configured to persist across reboots. Use "/data/rook" in Minikube environment.
  dataDirHostPath: /var/lib/rook
  # Whether or not upgrade should continue even if a check fails
  # This means Ceph's status could be degraded and we don't recommend upgrading but you might decide otherwise
  # Use at your OWN risk
  # To understand Rook's upgrade process of Ceph, read https://rook.io/docs/rook/latest/ceph-upgrade.html#ceph-version-upgrades
  skipUpgradeChecks: false
  # Whether or not continue if PGs are not clean during an upgrade
  continueUpgradeAfterChecksEvenIfNotHealthy: false
  # WaitTimeoutForHealthyOSDInMinutes defines the time (in minutes) the operator would wait before an OSD can be stopped for upgrade or restart.
  # If the timeout exceeds and OSD is not ok to stop, then the operator would skip upgrade for the current OSD and proceed with the next one
  # if `continueUpgradeAfterChecksEvenIfNotHealthy` is `false`. If `continueUpgradeAfterChecksEvenIfNotHealthy` is `true`, then operator would
  # continue with the upgrade of an OSD even if its not ok to stop after the timeout. This timeout won't be applied if `skipUpgradeChecks` is `true`.
  # The default wait timeout is 10 minutes.
  waitTimeoutForHealthyOSDInMinutes: 10
  mon:
    # Set the number of mons to be started. Generally recommended to be 3.
    # For highest availability, an odd number of mons should be specified.
    count: 3
    # The mons should be on unique nodes. For production, at least 3 nodes are recommended for this reason.
    # Mons should only be allowed on the same node for test environments where data loss is acceptable.
    allowMultiplePerNode: false
  mgr:
    # When higher availability of the mgr is needed, increase the count to 2.
    # In that case, one mgr will be active and one in standby. When Ceph updates which
    # mgr is active, Rook will update the mgr services to match the active mgr.
    count: 2
    allowMultiplePerNode: false
    modules:
      # List of modules to optionally enable or disable.
      # Note the "dashboard" and "monitoring" modules are already configured by other settings in the cluster CR.
      - name: rook
        enabled: true
  # enable the ceph dashboard for viewing cluster status
  dashboard:
    enabled: true
    # serve the dashboard under a subpath (useful when you are accessing the dashboard via a reverse proxy)
    # urlPrefix: /ceph-dashboard
    # serve the dashboard at the given port.
    # port: 8443
    # serve the dashboard using SSL
    ssl: true
    # The url of the Prometheus instance
    # prometheusEndpoint: <protocol>://<prometheus-host>:<port>
    # Whether SSL should be verified if the Prometheus server is using https
    # prometheusEndpointSSLVerify: false
  # enable prometheus alerting for cluster
  monitoring:
    # requires Prometheus to be pre-installed
    enabled: false
    # Whether to disable the metrics reported by Ceph. If false, the prometheus mgr module and Ceph exporter are enabled.
    # If true, the prometheus mgr module and Ceph exporter are both disabled. Default is false.
    metricsDisabled: false
  network:
    connections:
      # Whether to encrypt the data in transit across the wire to prevent eavesdropping the data on the network.
      # The default is false. When encryption is enabled, all communication between clients and Ceph daemons, or between Ceph daemons will be encrypted.
      # When encryption is not enabled, clients still establish a strong initial authentication and data integrity is still validated with a crc check.
      # IMPORTANT: Encryption requires the 5.11 kernel for the latest nbd and cephfs drivers. Alternatively for testing only,
      # you can set the "mounter: rbd-nbd" in the rbd storage class, or "mounter: fuse" in the cephfs storage class.
      # The nbd and fuse drivers are *not* recommended in production since restarting the csi driver pod will disconnect the volumes.
      encryption:
        enabled: false
      # Whether to compress the data in transit across the wire. The default is false.
      # See the kernel requirements above for encryption.
      compression:
        enabled: false
      # Whether to require communication over msgr2. If true, the msgr v1 port (6789) will be disabled
      # and clients will be required to connect to the Ceph cluster with the v2 port (3300).
      # Requires a kernel that supports msgr v2 (kernel 5.11 or CentOS 8.4 or newer).
      requireMsgr2: false
    # enable host networking
    #provider: host
    # enable the Multus network provider
    #provider: multus
    #selectors:
    #  The selector keys are required to be `public` and `cluster`.
    #  Based on the configuration, the operator will do the following:
    #    1. if only the `public` selector key is specified both public_network and cluster_network Ceph settings will listen on that interface
    #    2. if both `public` and `cluster` selector keys are specified the first one will point to 'public_network' flag and the second one to 'cluster_network'
    #
    #  In order to work, each selector value must match a NetworkAttachmentDefinition object in Multus
    #
    #  public: public-conf --> NetworkAttachmentDefinition object name in Multus
    #  cluster: cluster-conf --> NetworkAttachmentDefinition object name in Multus
    # Provide internet protocol version. IPv6, IPv4 or empty string are valid options. Empty string would mean IPv4
    #ipFamily: "IPv6"
    # Ceph daemons to listen on both IPv4 and Ipv6 networks
    #dualStack: false
    # Enable multiClusterService to export the mon and OSD services to peer cluster.
    # This is useful to support RBD mirroring between two clusters having overlapping CIDRs.
    # Ensure that peer clusters are connected using an MCS API compatible application, like Globalnet Submariner.
    #multiClusterService:
    #  enabled: false

  # enable the crash collector for ceph daemon crash collection
  crashCollector:
    disable: false
    # Uncomment daysToRetain to prune ceph crash entries older than the
    # specified number of days.
    #daysToRetain: 30
  # enable log collector, daemons will log on files and rotate
  logCollector:
    enabled: true
    periodicity: daily # one of: hourly, daily, weekly, monthly
    maxLogSize: 500M # SUFFIX may be 'M' or 'G'. Must be at least 1M.
  # automate [data cleanup process](https://github.com/rook/rook/blob/master/Documentation/Storage-Configuration/ceph-teardown.md#delete-the-data-on-hosts) in cluster destruction.
  cleanupPolicy:
    # Since cluster cleanup is destructive to data, confirmation is required.
    # To destroy all Rook data on hosts during uninstall, confirmation must be set to "yes-really-destroy-data".
    # This value should only be set when the cluster is about to be deleted. After the confirmation is set,
    # Rook will immediately stop configuring the cluster and only wait for the delete command.
    # If the empty string is set, Rook will not destroy any data on hosts during uninstall.
    confirmation: ""
    # sanitizeDisks represents settings for sanitizing OSD disks on cluster deletion
    sanitizeDisks:
      # method indicates if the entire disk should be sanitized or simply ceph's metadata
      # in both cases, re-install is possible
      # possible choices are 'complete' or 'quick' (default)
      method: quick
      # dataSource indicate where to get random bytes from to write on the disk
      # possible choices are 'zero' (default) or 'random'
      # using random sources will consume entropy from the system and will take much more time than the zero source
      dataSource: zero
      # iteration overwrite N times instead of the default (1)
      # takes an integer value
      iteration: 1
    # allowUninstallWithVolumes defines how the uninstall should be performed
    # If set to true, cephCluster deletion does not wait for the PVs to be deleted.
    allowUninstallWithVolumes: false
  # To control where various services will be scheduled by kubernetes, use the placement configuration sections below.
  # The example under 'all' would have all services scheduled on kubernetes nodes labeled with 'role=storage-node' and
  # tolerate taints with a key of 'storage-node'.
  # placement:
  #   all:
  #     nodeAffinity:
  #       requiredDuringSchedulingIgnoredDuringExecution:
  #         nodeSelectorTerms:
  #         - matchExpressions:
  #           - key: role
  #             operator: In
  #             values:
  #             - storage-node
  #     podAffinity:
  #     podAntiAffinity:
  #     topologySpreadConstraints:
  #     tolerations:
  #     - key: storage-node
  #       operator: Exists
  # The above placement information can also be specified for mon, osd, and mgr components
  #   mon:
  # Monitor deployments may contain an anti-affinity rule for avoiding monitor
  # collocation on the same node. This is a required rule when host network is used
  # or when AllowMultiplePerNode is false. Otherwise this anti-affinity rule is a
  # preferred rule with weight: 50.
  #   osd:
  #    prepareosd:
  #    mgr:
  #    cleanup:
  annotations:
  #   all:
  #   mon:
  #   osd:
  #   cleanup:
  #   prepareosd:
  # clusterMetadata annotations will be applied to only `rook-ceph-mon-endpoints` configmap and the `rook-ceph-mon` and `rook-ceph-admin-keyring` secrets.
  # And clusterMetadata annotations will not be merged with `all` annotations.
  #    clusterMetadata:
  #       kubed.appscode.com/sync: "true"
  # If no mgr annotations are set, prometheus scrape annotations will be set by default.
  #   mgr:
  labels:
  #   all:
  #   mon:
  #   osd:
  #   cleanup:
  #   mgr:
  #   prepareosd:
  # These labels are applied to ceph-exporter servicemonitor only
  #   exporter:
  # monitoring is a list of key-value pairs. It is injected into all the monitoring resources created by operator.
  # These labels can be passed as LabelSelector to Prometheus
  #   monitoring:
  #   crashcollector:
  resources:
  #The requests and limits set here, allow the mgr pod to use half of one CPU core and 1 gigabyte of memory
  #   mgr:
  #     limits:
  #       memory: "1024Mi"
  #     requests:
  #       cpu: "500m"
  #       memory: "1024Mi"
  # The above example requests/limits can also be added to the other components
  #   mon:
  #   osd:
  # For OSD it also is a possible to specify requests/limits based on device class
  #   osd-hdd:
  #   osd-ssd:
  #   osd-nvme:
  #   prepareosd:
  #   mgr-sidecar:
  #   crashcollector:
  #   logcollector:
  #   cleanup:
  #   exporter:
  # The option to automatically remove OSDs that are out and are safe to destroy.
  removeOSDsIfOutAndSafeToRemove: false
  priorityClassNames:
    #all: rook-ceph-default-priority-class
    mon: system-node-critical
    osd: system-node-critical
    mgr: system-cluster-critical
    #crashcollector: rook-ceph-crashcollector-priority-class
  storage: # cluster level storage configuration and selection
    useAllNodes: true
    useAllDevices: true
    #deviceFilter:
    config:
      # crushRoot: "custom-root" # specify a non-default root label for the CRUSH map
      # metadataDevice: "md0" # specify a non-rotational storage so ceph-volume will use it as block db device of bluestore.
      # databaseSizeMB: "1024" # uncomment if the disks are smaller than 100 GB
      # osdsPerDevice: "1" # this value can be overridden at the node or device level
      # encryptedDevice: "true" # the default value for this option is "false"
    # Individual nodes and their config can be specified as well, but 'useAllNodes' above must be set to false. Then, only the named
    # nodes below will be used as storage resources.  Each node's 'name' field should match their 'kubernetes.io/hostname' label.
    # nodes:
    #   - name: "172.17.4.201"
    #     devices: # specific devices to use for storage can be specified for each node
    #       - name: "sdb"
    #       - name: "nvme01" # multiple osds can be created on high performance devices
    #         config:
    #           osdsPerDevice: "5"
    #       - name: "/dev/disk/by-id/ata-ST4000DM004-XXXX" # devices can be specified using full udev paths
    #     config: # configuration can be specified at the node level which overrides the cluster level config
    #   - name: "172.17.4.301"
    #     deviceFilter: "^sd."
    # when onlyApplyOSDPlacement is false, will merge both placement.All() and placement.osd
    onlyApplyOSDPlacement: false
    # Time for which an OSD pod will sleep before restarting, if it stopped due to flapping
    # flappingRestartIntervalHours: 24
  # The section for configuring management of daemon disruptions during upgrade or fencing.
  disruptionManagement:
    # If true, the operator will create and manage PodDisruptionBudgets for OSD, Mon, RGW, and MDS daemons. OSD PDBs are managed dynamically
    # via the strategy outlined in the [design](https://github.com/rook/rook/blob/master/design/ceph/ceph-managed-disruptionbudgets.md). The operator will
    # block eviction of OSDs by default and unblock them safely when drains are detected.
    managePodBudgets: true
    # A duration in minutes that determines how long an entire failureDomain like `region/zone/host` will be held in `noout` (in addition to the
    # default DOWN/OUT interval) when it is draining. This is only relevant when  `managePodBudgets` is `true`. The default value is `30` minutes.
    osdMaintenanceTimeout: 30
    # A duration in minutes that the operator will wait for the placement groups to become healthy (active+clean) after a drain was completed and OSDs came back up.
    # Operator will continue with the next drain if the timeout exceeds. It only works if `managePodBudgets` is `true`.
    # No values or 0 means that the operator will wait until the placement groups are healthy before unblocking the next drain.
    pgHealthCheckTimeout: 0

  # csi defines CSI Driver settings applied per cluster.
  csi:
    readAffinity:
      # Enable read affinity to enable clients to optimize reads from an OSD in the same topology.
      # Enabling the read affinity may cause the OSDs to consume some extra memory.
      # For more details see this doc:
      # https://rook.io/docs/rook/latest/Storage-Configuration/Ceph-CSI/ceph-csi-drivers/#enable-read-affinity-for-rbd-volumes
      enabled: false

    # cephfs driver specific settings.
    cephfs:
      # Set CephFS Kernel mount options to use https://docs.ceph.com/en/latest/man/8/mount.ceph/#options.
      # kernelMountOptions: ""
      # Set CephFS Fuse mount options to use https://docs.ceph.com/en/quincy/man/8/ceph-fuse/#options.
      # fuseMountOptions: ""

  # healthChecks
  # Valid values for daemons are 'mon', 'osd', 'status'
  healthCheck:
    daemonHealth:
      mon:
        disabled: false
        interval: 45s
      osd:
        disabled: false
        interval: 60s
      status:
        disabled: false
        interval: 60s
    # Change pod liveness probe timing or threshold values. Works for all mon,mgr,osd daemons.
    livenessProbe:
      mon:
        disabled: false
      mgr:
        disabled: false
      osd:
        disabled: false
    # Change pod startup probe timing or threshold values. Works for all mon,mgr,osd daemons.
    startupProbe:
      mon:
        disabled: false
      mgr:
        disabled: false
      osd:
        disabled: false

Cluster Status to submit:

[root::bastion-01-de-nbg1-dc3]
~: kubectl --namespace rook-ceph exec -it rook-ceph-tools-6dd5d5d5dd-cwg4h -- bash
bash-4.4$ ceph status
  cluster:
    id:     b238034b-7f84-42d2-b8fb-205a8ed5f6b4
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 44m)
    mgr: a(active, since 42m), standbys: b
    mds: 1/1 daemons up, 1 hot standby
    osd: 4 osds: 4 up (since 43m), 4 in (since 43m)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 81 pgs
    objects: 25 objects, 593 KiB
    usage:   110 MiB used, 400 GiB / 400 GiB avail
    pgs:     81 active+clean

  io:
    client:   853 B/s rd, 1 op/s rd, 0 op/s wr

bash-4.4$
pomland-94 added the bug label Apr 2, 2024
@travisn (Member) commented Apr 2, 2024

The cluster looks healthy from the status you shared. A couple of ideas:

  • What configmaps do you see in the rook-ceph namespace? Do you see the ones from the log messages, such as rook-ceph-csi-detect-version, or do they get deleted in the end?
  • Try re-applying common.yaml; perhaps there is an RBAC issue with deleting configmaps (a command sketch follows below).
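For reference, a rough sketch of those checks, assuming the default rook-ceph namespace, the example manifests in deploy/examples, and the operator's default service account name rook-ceph-system:

kubectl --namespace rook-ceph get configmaps
# look for finalizers, ownerReferences, or a stuck deletionTimestamp on the suspect configmap
kubectl --namespace rook-ceph get configmap rook-ceph-csi-detect-version -o yaml
# re-apply the common RBAC resources
kubectl apply -f common.yaml
# confirm the operator's service account is allowed to delete configmaps
kubectl auth can-i delete configmaps --namespace rook-ceph --as=system:serviceaccount:rook-ceph:rook-ceph-system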

@pomland-94 (Author)

The installation of Rook didn't go well; I had to delete the mon-*-canary Pods/Deployments because the mon pods got stuck in the Pending state.

I re-applied common.yaml, but nothing changed: same error.

[root::bastion-01-de-nbg1-dc3]
~: kubectl --namespace rook-ceph get cm
NAME                                  DATA   AGE
ceph-file-controller-detect-version   3      148m
kube-root-ca.crt                      1      155m
local-device-worker-01-de-nbg1-dc3    1      155m
local-device-worker-02-de-nbg1-dc3    1      155m
local-device-worker-03-de-nbg1-dc3    1      155m
local-device-worker-04-de-nbg1-dc3    1      155m
local-device-worker-05-de-nbg1-dc3    1      113m
local-device-worker-06-de-nbg1-dc3    1      113m
rook-ceph-csi-config                  1      154m
rook-ceph-csi-detect-version          3      154m
rook-ceph-csi-mapping-config          1      154m
rook-ceph-detect-version              3      154m
rook-ceph-mon-endpoints               5      154m
rook-ceph-operator-config             33     155m
rook-ceph-pdbstatemap                 2      148m
rook-config-override                  1      154m
[root::bastion-01-de-nbg1-dc3]
~:

@travisn (Member) commented Apr 2, 2024

The ceph status output from your original post included healthy status. Was that from a different cluster?

Can you share kubectl -n rook-ceph describe pod <mon> to show why they are getting stuck?

Also try a complete cleanup according to the cleanup guide, then reinstall (a rough teardown sketch follows).
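For completeness, a minimal teardown sketch, assuming the example manifests and the dataDirHostPath of /var/lib/rook from the cluster.yaml above; the official cleanup guide covers the full procedure, including wiping OSD devices:

kubectl --namespace rook-ceph patch cephcluster rook-ceph --type merge -p '{"spec":{"cleanupPolicy":{"confirmation":"yes-really-destroy-data"}}}'
kubectl --namespace rook-ceph delete cephcluster rook-ceph
kubectl delete -f operator.yaml -f common.yaml -f crds.yaml
# then, on every storage node, remove the persisted state
rm -rf /var/lib/rook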

@pomland-94 (Author)

Now there are no pods left in Pending status; that only happened during the bootstrap process. After I deleted the mon canary deployments, the main mon pods came up.

@pomland-94 (Author) commented Apr 3, 2024

So I tried a reinstall, same behaviour. This is the current state after applying my cluster.yaml:

[root::bastion-01-de-nbg1-dc3]
~/rook/deploy/examples: kubectl --namespace rook-ceph get pods
NAME                                           READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-2mfg7                         2/2     Running     0          27s
csi-cephfsplugin-gvcfz                         2/2     Running     0          27s
csi-cephfsplugin-pjdhw                         2/2     Running     0          27s
csi-cephfsplugin-provisioner-d5f4bb8c4-bksjq   5/5     Running     0          27s
csi-cephfsplugin-provisioner-d5f4bb8c4-t7dr6   5/5     Running     0          27s
csi-cephfsplugin-skzt8                         2/2     Running     0          27s
csi-cephfsplugin-vxh4z                         2/2     Running     0          27s
csi-cephfsplugin-zc4qs                         2/2     Running     0          27s
csi-rbdplugin-2mwnk                            2/2     Running     0          27s
csi-rbdplugin-2z6jx                            2/2     Running     0          27s
csi-rbdplugin-6mtcn                            2/2     Running     0          27s
csi-rbdplugin-mhvrs                            2/2     Running     0          27s
csi-rbdplugin-provisioner-b8f6cd9cf-qwl9v      5/5     Running     0          27s
csi-rbdplugin-provisioner-b8f6cd9cf-zkfcv      5/5     Running     0          27s
csi-rbdplugin-r8tqd                            2/2     Running     0          27s
csi-rbdplugin-v9f24                            2/2     Running     0          27s
rook-ceph-csi-detect-version-46rl8             0/1     Completed   0          45s
rook-ceph-detect-version-6bxs9                 0/1     Completed   0          45s
rook-ceph-mon-a-557c6696f9-qvt5r               0/2     Pending     0          34s
rook-ceph-mon-a-canary-5b8d49c665-nb2xc        2/2     Running     0          38s
rook-ceph-mon-b-canary-6869cf4654-lvm6r        2/2     Running     0          38s
rook-ceph-mon-c-canary-6dcbcbd6d7-fnp45        2/2     Running     0          38s
rook-ceph-operator-757cdc49bb-dbwn7            1/1     Running     0          102s
rook-discover-bf2hb                            1/1     Running     0          96s
rook-discover-hzztl                            1/1     Running     0          96s
rook-discover-kwv62                            1/1     Running     0          96s
rook-discover-lcgrs                            1/1     Running     0          96s
rook-discover-s9n48                            1/1     Running     0          96s
rook-discover-vlkxj                            1/1     Running     0          96s
[root::bastion-01-de-nbg1-dc3]
~/rook/deploy/examples:

And this is what describe says about the Pending pod:

[root::bastion-01-de-nbg1-dc3]
~/rook/deploy/examples: kubectl --namespace rook-ceph describe pod rook-ceph-mon-a-557c6696f9-qvt5r
Name:                 rook-ceph-mon-a-557c6696f9-qvt5r
Namespace:            rook-ceph
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      rook-ceph-default
Node:                 <none>
Labels:               app=rook-ceph-mon
                      app.kubernetes.io/component=cephclusters.ceph.rook.io
                      app.kubernetes.io/created-by=rook-ceph-operator
                      app.kubernetes.io/instance=a
                      app.kubernetes.io/managed-by=rook-ceph-operator
                      app.kubernetes.io/name=ceph-mon
                      app.kubernetes.io/part-of=rook-ceph
                      ceph_daemon_id=a
                      ceph_daemon_type=mon
                      mon=a
                      mon_cluster=rook-ceph
                      pod-template-hash=557c6696f9
                      rook.io/operator-namespace=rook-ceph
                      rook_cluster=rook-ceph
Annotations:          <none>
Status:               Pending
IP:
IPs:                  <none>
Controlled By:        ReplicaSet/rook-ceph-mon-a-557c6696f9
Init Containers:
  chown-container-data-dir:
    Image:      quay.io/ceph/ceph:v18.2.2
    Port:       <none>
    Host Port:  <none>
    Command:
      chown
    Args:
      --verbose
      --recursive
      ceph:ceph
      /var/log/ceph
      /var/lib/ceph/crash
      /run/ceph
      /var/lib/ceph/mon/ceph-a
    Environment:  <none>
    Mounts:
      /etc/ceph from rook-config-override (ro)
      /etc/ceph/keyring-store/ from rook-ceph-mons-keyring (ro)
      /run/ceph from ceph-daemons-sock-dir (rw)
      /var/lib/ceph/crash from rook-ceph-crash (rw)
      /var/lib/ceph/mon/ceph-a from ceph-daemon-data (rw)
      /var/log/ceph from rook-ceph-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ns74g (ro)
  init-mon-fs:
    Image:      quay.io/ceph/ceph:v18.2.2
    Port:       <none>
    Host Port:  <none>
    Command:
      ceph-mon
    Args:
      --fsid=9460fe9e-2a9d-476c-be66-24fffabcfab0
      --keyring=/etc/ceph/keyring-store/keyring
      --default-log-to-stderr=true
      --default-err-to-stderr=true
      --default-mon-cluster-log-to-stderr=true
      --default-log-stderr-prefix=debug
      --default-log-to-file=false
      --default-mon-cluster-log-to-file=false
      --mon-host=$(ROOK_CEPH_MON_HOST)
      --mon-initial-members=$(ROOK_CEPH_MON_INITIAL_MEMBERS)
      --id=a
      --setuser=ceph
      --setgroup=ceph
      --public-addr=10.233.62.5
      --mkfs
    Environment:
      CONTAINER_IMAGE:                quay.io/ceph/ceph:v18.2.2
      POD_NAME:                       rook-ceph-mon-a-557c6696f9-qvt5r (v1:metadata.name)
      POD_NAMESPACE:                  rook-ceph (v1:metadata.namespace)
      NODE_NAME:                       (v1:spec.nodeName)
      POD_MEMORY_LIMIT:               node allocatable (limits.memory)
      POD_MEMORY_REQUEST:             0 (requests.memory)
      POD_CPU_LIMIT:                  node allocatable (limits.cpu)
      POD_CPU_REQUEST:                0 (requests.cpu)
      CEPH_USE_RANDOM_NONCE:          true
      ROOK_MSGR2:                     msgr2_false_encryption_false_compression_false
      ROOK_CEPH_MON_HOST:             <set to the key 'mon_host' in secret 'rook-ceph-config'>             Optional: false
      ROOK_CEPH_MON_INITIAL_MEMBERS:  <set to the key 'mon_initial_members' in secret 'rook-ceph-config'>  Optional: false
    Mounts:
      /etc/ceph from rook-config-override (ro)
      /etc/ceph/keyring-store/ from rook-ceph-mons-keyring (ro)
      /run/ceph from ceph-daemons-sock-dir (rw)
      /var/lib/ceph/crash from rook-ceph-crash (rw)
      /var/lib/ceph/mon/ceph-a from ceph-daemon-data (rw)
      /var/log/ceph from rook-ceph-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ns74g (ro)
Containers:
  mon:
    Image:       quay.io/ceph/ceph:v18.2.2
    Ports:       3300/TCP, 6789/TCP
    Host Ports:  0/TCP, 0/TCP
    Command:
      ceph-mon
    Args:
      --fsid=9460fe9e-2a9d-476c-be66-24fffabcfab0
      --keyring=/etc/ceph/keyring-store/keyring
      --default-log-to-stderr=true
      --default-err-to-stderr=true
      --default-mon-cluster-log-to-stderr=true
      --default-log-stderr-prefix=debug
      --default-log-to-file=false
      --default-mon-cluster-log-to-file=false
      --mon-host=$(ROOK_CEPH_MON_HOST)
      --mon-initial-members=$(ROOK_CEPH_MON_INITIAL_MEMBERS)
      --id=a
      --setuser=ceph
      --setgroup=ceph
      --foreground
      --public-addr=10.233.62.5
      --setuser-match-path=/var/lib/ceph/mon/ceph-a/store.db
      --public-bind-addr=$(ROOK_POD_IP)
    Liveness:  exec [env -i sh -c
outp="$(ceph --admin-daemon /run/ceph/ceph-mon.a.asok mon_status 2>&1)"
rc=$?
if [ $rc -ne 0 ]; then
  echo "ceph daemon health check failed with the following output:"
  echo "$outp" | sed -e 's/^/> /g'
  exit $rc
fi
] delay=10s timeout=5s period=10s #success=1 #failure=3
    Startup:  exec [env -i sh -c
outp="$(ceph --admin-daemon /run/ceph/ceph-mon.a.asok mon_status 2>&1)"
rc=$?
if [ $rc -ne 0 ]; then
  echo "ceph daemon health check failed with the following output:"
  echo "$outp" | sed -e 's/^/> /g'
  exit $rc
fi
] delay=10s timeout=5s period=10s #success=1 #failure=6
    Environment:
      CONTAINER_IMAGE:                quay.io/ceph/ceph:v18.2.2
      POD_NAME:                       rook-ceph-mon-a-557c6696f9-qvt5r (v1:metadata.name)
      POD_NAMESPACE:                  rook-ceph (v1:metadata.namespace)
      NODE_NAME:                       (v1:spec.nodeName)
      POD_MEMORY_LIMIT:               node allocatable (limits.memory)
      POD_MEMORY_REQUEST:             0 (requests.memory)
      POD_CPU_LIMIT:                  node allocatable (limits.cpu)
      POD_CPU_REQUEST:                0 (requests.cpu)
      CEPH_USE_RANDOM_NONCE:          true
      ROOK_MSGR2:                     msgr2_false_encryption_false_compression_false
      ROOK_CEPH_MON_HOST:             <set to the key 'mon_host' in secret 'rook-ceph-config'>             Optional: false
      ROOK_CEPH_MON_INITIAL_MEMBERS:  <set to the key 'mon_initial_members' in secret 'rook-ceph-config'>  Optional: false
      ROOK_POD_IP:                     (v1:status.podIP)
    Mounts:
      /etc/ceph from rook-config-override (ro)
      /etc/ceph/keyring-store/ from rook-ceph-mons-keyring (ro)
      /run/ceph from ceph-daemons-sock-dir (rw)
      /var/lib/ceph/crash from rook-ceph-crash (rw)
      /var/lib/ceph/mon/ceph-a from ceph-daemon-data (rw)
      /var/log/ceph from rook-ceph-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ns74g (ro)
  log-collector:
    Image:      quay.io/ceph/ceph:v18.2.2
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/bash
      -x
      -e
      -m
      -c
 
      CEPH_CLIENT_ID=ceph-mon.a
      PERIODICITY=daily
      LOG_ROTATE_CEPH_FILE=/etc/logrotate.d/ceph
      LOG_MAX_SIZE=500M
      ROTATE=7
 
      # edit the logrotate file to only rotate a specific daemon log
      # otherwise we will logrotate log files without reloading certain daemons
      # this might happen when multiple daemons run on the same machine
      sed -i "s|*.log|$CEPH_CLIENT_ID.log|" "$LOG_ROTATE_CEPH_FILE"
 
      # replace default daily with given user input
      sed --in-place "s/daily/$PERIODICITY/g" "$LOG_ROTATE_CEPH_FILE"
 
      # replace rotate count, default 7 for all ceph daemons other than rbd-mirror
      sed --in-place "s/rotate 7/rotate $ROTATE/g" "$LOG_ROTATE_CEPH_FILE"
 
      if [ "$LOG_MAX_SIZE" != "0" ]; then
        # adding maxsize $LOG_MAX_SIZE at the 4th line of the logrotate config file with 4 spaces to maintain indentation
        sed --in-place "4i \ \ \ \ maxsize $LOG_MAX_SIZE" "$LOG_ROTATE_CEPH_FILE"
      fi
 
      while true; do
        # we don't force the logrorate but we let the logrotate binary handle the rotation based on user's input for periodicity and size
        logrotate --verbose "$LOG_ROTATE_CEPH_FILE"
        sleep 15m
      done
 
    Environment:  <none>
    Mounts:
      /etc/ceph from rook-config-override (ro)
      /run/ceph from ceph-daemons-sock-dir (rw)
      /var/lib/ceph/crash from rook-ceph-crash (rw)
      /var/log/ceph from rook-ceph-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ns74g (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  rook-config-override:
    Type:               Projected (a volume that contains injected data from multiple sources)
    ConfigMapName:      rook-config-override
    ConfigMapOptional:  <nil>
  rook-ceph-mons-keyring:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  rook-ceph-mons-keyring
    Optional:    false
  ceph-daemons-sock-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/rook/exporter
    HostPathType:  DirectoryOrCreate
  rook-ceph-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/rook/rook-ceph/log
    HostPathType:
  rook-ceph-crash:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/rook/rook-ceph/crash
    HostPathType:
  ceph-daemon-data:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/rook/mon-a/data
    HostPathType:
  kube-api-access-ns74g:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/hostname=worker-04-de-nbg1-dc3
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  2m18s  default-scheduler  0/9 nodes are available: 1 node(s) didn't satisfy existing pods anti-affinity rules, 3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 5 node(s) didn't match Pod's node affinity/selector. preemption: 0/9 nodes are available: 1 node(s) didn't satisfy existing pods anti-affinity rules, 8 Preemption is not helpful for scheduling..
  Warning  FailedScheduling  2m8s   default-scheduler  0/9 nodes are available: 1 node(s) didn't satisfy existing pods anti-affinity rules, 3 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 5 node(s) didn't match Pod's node affinity/selector. preemption: 0/9 nodes are available: 1 node(s) didn't satisfy existing pods anti-affinity rules, 8 Preemption is not helpful for scheduling..
[root::bastion-01-de-nbg1-dc3]
~/rook/deploy/examples:

And these are the logs from the pod:

[root::bastion-01-de-nbg1-dc3]
~/rook/deploy/examples: kubectl --namespace rook-ceph logs pod/rook-ceph-mon-a-557c6696f9-qvt5r
Defaulted container "mon" out of: mon, log-collector, chown-container-data-dir (init), init-mon-fs (init)
[root::bastion-01-de-nbg1-dc3]
~/rook/deploy/examples:

When I delete the mon a, b, c canary Deployments, StatefulSets and Pods, the mon pod comes up and the installation succeeds, but the errors inside the Operator pod stay the same.
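For what it's worth, the FailedScheduling events above point at the mon's node selector (kubernetes.io/hostname=worker-04-de-nbg1-dc3) combined with the mon anti-affinity rule rather than at the canaries themselves. A sketch of checks that could narrow this down, using the node and label names from the describe output above:

# is the node the operator picked schedulable, and does it carry any taints?
kubectl get node worker-04-de-nbg1-dc3 -o wide
kubectl describe node worker-04-de-nbg1-dc3 | grep -i taint
# is another mon (or a mon canary) already occupying that node?
kubectl --namespace rook-ceph get pods -l app=rook-ceph-mon -o wide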

@subhamkrai (Contributor)

@pomland-94 the mon a, b, c canary Deployments should be auto-deleted. Did you delete those manually?

@pomland-94 (Author)

Yes, because they were not deleted automatically.

@subhamkrai (Contributor)

I think you can check the canary pod logs to see why they are not being deleted.

@pomland-94 (Author)

These are the logs from the canary pods:

[root::bastion-01-de-nbg1-dc3]
~: kubectl --namespace rook-ceph logs pod/rook-ceph-mon-a-canary-5b8d49c665-nb2xc
Defaulted container "mon" out of: mon, log-collector
[root::bastion-01-de-nbg1-dc3]
~: kubectl --namespace rook-ceph logs pod/rook-ceph-mon-b-canary-6869cf4654-lvm6r
Defaulted container "mon" out of: mon, log-collector
[root::bastion-01-de-nbg1-dc3]
~: kubectl --namespace rook-ceph logs pod/rook-ceph-mon-c-canary-6dcbcbd6d7-fnp45
Defaulted container "mon" out of: mon, log-collector
[root::bastion-01-de-nbg1-dc3]
~:

@BlaineEXE (Member)

This sounds like an issue with the Kubernetes platform to me. I also notice that you didn't fill out the full issue template questions that request environment information. Please add the missing info (a sketch of commands to collect it follows the template):

**Environment**:
* OS (e.g. from /etc/os-release):
* Kernel (e.g. `uname -a`):
* Cloud provider or hardware configuration:
* Rook version (use `rook version` inside of a Rook Pod):
* Storage backend version (e.g. for ceph do `ceph -v`):
* Kubernetes version (use `kubectl version`):
* Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift):
* Storage backend status (e.g. for Ceph use `ceph health` in the [Rook Ceph toolbox](https://rook.io/docs/rook/latest-release/Troubleshooting/ceph-toolbox/#interactive-toolbox)):
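A sketch of commands that could collect most of that information; the toolbox deployment name assumes the example toolbox.yaml has been applied (a rook-ceph-tools pod is visible earlier in this thread):

cat /etc/os-release
uname -a
kubectl version
kubectl --namespace rook-ceph exec deploy/rook-ceph-operator -- rook version
kubectl --namespace rook-ceph exec deploy/rook-ceph-tools -- ceph -v
kubectl --namespace rook-ceph exec deploy/rook-ceph-tools -- ceph health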

@pomland-94 (Author) commented Apr 4, 2024

I filled out some of the info:

Environment:

  • Kubernetes version: v1.28.6
  • OS
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Kernel Linux master-01-de-nbg1-dc3 6.1.0-18-arm64 #1 SMP Debian 6.1.76-1 (2024-02-01) aarch64 GNU/Linux
  • Kubernetes cluster type: On Premise (Installed with Kubespray)
  • Rook 1.13.7

@pomland-94 (Author)

Anyone here with an idea?

@pomland-94 (Author)

Hmm, ok, maybe there is no solution for this? In that case my company will move away from Rook and look for another storage solution with the features of Rook.

pomland-94 closed this as not planned Apr 23, 2024
@travisn (Member) commented Apr 23, 2024

@pomland-94 Did you try any other environments, even minikube, to see if it works for you? These environmental issues are difficult for us to troubleshoot since we cannot reproduce them in our environments. (A minikube test sketch follows below.)
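A minimal sketch of such a minikube test, assuming the kvm2 driver (the --extra-disks flag, which provides a raw disk for the OSD, is not available on every driver) and the same v1.13.7 examples; cluster-test.yaml is the single-node test spec shipped in the examples directory:

minikube start --driver=kvm2 --disk-size=40g --extra-disks=1
git clone --single-branch --branch v1.13.7 https://github.com/rook/rook.git
cd rook/deploy/examples
kubectl create -f crds.yaml -f common.yaml -f operator.yaml
kubectl create -f cluster-test.yaml
kubectl --namespace rook-ceph logs deploy/rook-ceph-operator -f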

@Madhu-1 (Member) commented Apr 24, 2024

rook-ceph-csi-detect-version-46rl8 0/1 Completed 0 45s
rook-ceph-detect-version-6bxs9 0/1 Completed 0 45s

It looks like the configmap is being held by some finalizer and is not allowed to be deleted.

@pomland-94 can you please share the details below (a command sketch follows the list)? Also, please let's not delete any resources manually, because Rook handles deletion of the resources it creates. Make sure you try this on a new machine, or clean up the existing machine as described in the Rook cleanup docs.

  • -oyaml output of the rook-ceph-csi-detect-version configmap
  • logs of the CSI job pod
  • -oyaml output of the job pod
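A sketch of commands that could gather those three items, relying on the job-name label that the CmdReporter jobs carry; if the jobs have already been cleaned up, the label selector will simply return nothing:

kubectl --namespace rook-ceph get configmap rook-ceph-csi-detect-version -o yaml
kubectl --namespace rook-ceph logs job/rook-ceph-csi-detect-version --all-containers
kubectl --namespace rook-ceph get pods -l job-name=rook-ceph-csi-detect-version -o yaml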

@pomland-94 (Author)

These are other logs from the jobs:

[root::bastion-01-de-nbg1-dc3]
~: kubectl --namespace rook-ceph describe jobs/rook-ceph-csi-detect-version
Name:             rook-ceph-csi-detect-version
Namespace:        rook-ceph
Selector:         batch.kubernetes.io/controller-uid=54c42112-562e-4f63-87e9-8fe95521fafc
Labels:           app=rook-ceph-csi-detect-version
                  rook-version=v1.14.2
Annotations:      <none>
Controlled By:    Deployment/rook-ceph-operator
Parallelism:      1
Completions:      1
Completion Mode:  NonIndexed
Start Time:       Tue, 23 Apr 2024 20:11:36 +0200
Pods Statuses:    0 Active (0 Ready) / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app=rook-ceph-csi-detect-version
                    batch.kubernetes.io/controller-uid=54c42112-562e-4f63-87e9-8fe95521fafc
                    batch.kubernetes.io/job-name=rook-ceph-csi-detect-version
                    controller-uid=54c42112-562e-4f63-87e9-8fe95521fafc
                    job-name=rook-ceph-csi-detect-version
                    rook-version=v1.14.2
  Service Account:  rook-ceph-system
  Init Containers:
   init-copy-binaries:
    Image:      rook/ceph:v1.14.2
    Port:       <none>
    Host Port:  <none>
    Command:
      cp
    Args:
      --archive
      --force
      --verbose
      /usr/local/bin/rook
      /rook/copied-binaries
    Environment:  <none>
    Mounts:
      /rook/copied-binaries from rook-copied-binaries (rw)
  Containers:
   cmd-reporter:
    Image:      quay.io/cephcsi/cephcsi:v3.11.0
    Port:       <none>
    Host Port:  <none>
    Command:
      /rook/copied-binaries/rook
    Args:
      cmd-reporter
      --command
      {"cmd":["cephcsi"],"args":["--version"]}
      --config-map-name
      rook-ceph-csi-detect-version
      --namespace
      rook-ceph
    Environment:  <none>
    Mounts:
      /rook/copied-binaries from rook-copied-binaries (rw)
  Volumes:
   rook-copied-binaries:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
Events:         <none>
[root::bastion-01-de-nbg1-dc3]
~:
[root::bastion-01-de-nbg1-dc3]
~: kubectl --namespace rook-ceph describe jobs/rook-ceph-detect-version
Name:             rook-ceph-detect-version
Namespace:        rook-ceph
Selector:         batch.kubernetes.io/controller-uid=dcfa5441-f5fa-443d-b91b-d3455ec471e0
Labels:           app=rook-ceph-detect-version
                  rook-version=v1.14.2
Annotations:      <none>
Controlled By:    CephCluster/rook-ceph
Parallelism:      1
Completions:      1
Completion Mode:  NonIndexed
Start Time:       Tue, 23 Apr 2024 20:11:36 +0200
Pods Statuses:    0 Active (0 Ready) / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app=rook-ceph-detect-version
                    batch.kubernetes.io/controller-uid=dcfa5441-f5fa-443d-b91b-d3455ec471e0
                    batch.kubernetes.io/job-name=rook-ceph-detect-version
                    controller-uid=dcfa5441-f5fa-443d-b91b-d3455ec471e0
                    job-name=rook-ceph-detect-version
                    rook-version=v1.14.2
  Service Account:  rook-ceph-cmd-reporter
  Init Containers:
   init-copy-binaries:
    Image:      rook/ceph:v1.14.2
    Port:       <none>
    Host Port:  <none>
    Command:
      cp
    Args:
      --archive
      --force
      --verbose
      /usr/local/bin/rook
      /rook/copied-binaries
    Environment:  <none>
    Mounts:
      /rook/copied-binaries from rook-copied-binaries (rw)
  Containers:
   cmd-reporter:
    Image:      quay.io/ceph/ceph:v18.2.2
    Port:       <none>
    Host Port:  <none>
    Command:
      /rook/copied-binaries/rook
    Args:
      cmd-reporter
      --command
      {"cmd":["ceph"],"args":["--version"]}
      --config-map-name
      rook-ceph-detect-version
      --namespace
      rook-ceph
    Environment:  <none>
    Mounts:
      /rook/copied-binaries from rook-copied-binaries (rw)
  Volumes:
   rook-copied-binaries:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
Events:         <none>
[root::bastion-01-de-nbg1-dc3]
~:
[root::bastion-01-de-nbg1-dc3]
~: kubectl --namespace rook-ceph logs job/rook-ceph-csi-detect-version -f
Defaulted container "cmd-reporter" out of: cmd-reporter, init-copy-binaries (init)
2024/04/23 18:11:37 maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined
2024-04-23 18:11:37.890788 I | job-reporter-cmd: running command: /usr/local/bin/cephcsi cephcsi --version
Cephcsi Version: v3.11.0
Git Commit: bc24b5eca87626d690a29effa9d7420cc0154a7a
Go Version: go1.21.5
Compiler: gc
Platform: linux/arm64
Kernel: 6.1.0-18-arm64
[root::bastion-01-de-nbg1-dc3]
~: kubectl --namespace rook-ceph logs job/rook-ceph-detect-version -f
Defaulted container "cmd-reporter" out of: cmd-reporter, init-copy-binaries (init)
2024/04/23 18:11:38 maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined
2024-04-23 18:11:38.203899 I | job-reporter-cmd: running command: /usr/bin/ceph ceph --version
ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
[root::bastion-01-de-nbg1-dc3]
~:

@pomland-94 reopened this on Apr 24, 2024
@pomland-94
Author

[root::bastion-01-de-nbg1-dc3]
~: kubectl --namespace rook-ceph get cm/rook-ceph-csi-detect-version -o yaml
apiVersion: v1
data:
  retcode: "0"
  stderr: ""
  stdout: |
    Cephcsi Version: v3.11.0
    Git Commit: bc24b5eca87626d690a29effa9d7420cc0154a7a
    Go Version: go1.21.5
    Compiler: gc
    Platform: linux/arm64
    Kernel: 6.1.0-18-arm64
kind: ConfigMap
metadata:
  creationTimestamp: "2024-04-23T18:11:37Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2024-04-23T18:11:37Z"
  finalizers:
  - foregroundDeletion
  labels:
    app: rook-cmd-reporter
  name: rook-ceph-csi-detect-version
  namespace: rook-ceph
  resourceVersion: "9191"
  uid: ddd3fcc3-bf7d-4a13-b5b9-ec3f9ba37a44
[root::bastion-01-de-nbg1-dc3]
~: kubectl --namespace rook-ceph get cm/rook-ceph-detect-version -o yaml
apiVersion: v1
data:
  retcode: "0"
  stderr: ""
  stdout: |
    ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
kind: ConfigMap
metadata:
  creationTimestamp: "2024-04-23T18:11:38Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2024-04-23T18:11:38Z"
  finalizers:
  - foregroundDeletion
  labels:
    app: rook-cmd-reporter
  name: rook-ceph-detect-version
  namespace: rook-ceph
  resourceVersion: "9270"
  uid: 1031113a-b480-4571-ada8-e9cdb94b1eef
[root::bastion-01-de-nbg1-dc3]
~:

@travisn
Member

travisn commented Apr 24, 2024

The resources all have the finalizer:

  finalizers:
  - foregroundDeletion

Do you have some policy in your cluster that would add this finalizer? Can you disable that policy to see if it fixes the issue?
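
One generic way to look for components that could be enforcing such a policy (illustrative checks only, not pointers to a known cause):

# list admission webhooks that intercept API requests cluster-wide
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations
# look for common policy engines installed via CRDs (the names here are just examples)
kubectl get crd | grep -Ei 'kyverno|gatekeeper|policy'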

@pomland-94
Author

It's a good question. I installed my Kubespray cluster with the following hardening values:

# Hardening
---
## kube-apiserver
authorization_modes: ["Node", "RBAC"]
# AppArmor-based OS
kube_apiserver_feature_gates: ["AppArmor=true"]
kube_apiserver_request_timeout: 120s
kube_apiserver_service_account_lookup: true

# enable kubernetes audit
kubernetes_audit: true
audit_log_path: "/var/log/kube-apiserver-log.json"
audit_policy_file: "{{ kube_config_dir }}/audit-policy/apiserver-audit-policy.yaml"
audit_log_maxage: 30
audit_log_maxbackups: 10
audit_log_maxsize: 100

tls_min_version: VersionTLS12
tls_cipher_suites:
  - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
  - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
  - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305

# enable encryption at rest
kube_encrypt_secret_data: true
kube_encryption_resources: [secrets]
kube_encryption_algorithm: "secretbox"

kube_apiserver_enable_admission_plugins:
  - EventRateLimit
  - AlwaysPullImages
  - ServiceAccount
  - NamespaceLifecycle
  - NodeRestriction
  - LimitRanger
  - ResourceQuota
  - MutatingAdmissionWebhook
  - ValidatingAdmissionWebhook
  - PodNodeSelector
  - PodSecurity
kube_apiserver_admission_control_config_file: true
# EventRateLimit plugin configuration
kube_apiserver_admission_event_rate_limits:
  limit_1:
    type: Namespace
    qps: 50
    burst: 100
    cache_size: 2000
  limit_2:
    type: User
    qps: 50
    burst: 100
kube_profiling: true

## kube-controller-manager
# kube_controller_manager_bind_address: 127.0.0.1
kube_controller_terminated_pod_gc_threshold: 50
# AppArmor-based OS
# kube_controller_feature_gates: ["RotateKubeletServerCertificate=true"]
kube_controller_feature_gates: ["RotateKubeletServerCertificate=true", "AppArmor=true"]

## kube-scheduler
# kube_scheduler_bind_address: 127.0.0.1
# AppArmor-based OS
kube_scheduler_feature_gates: ["AppArmor=true"]

## etcd
etcd_deployment_type: kubeadm

## kubelet
kubelet_authorization_mode_webhook: true
kubelet_authentication_token_webhook: true
kube_read_only_port: 0
kubelet_rotate_server_certificates: false
kubelet_csr_approver_values:
  bypassHostnameCheck: true
  bypassDnsResolution: true
kubelet_protect_kernel_defaults: true
kubelet_event_record_qps: 1
kubelet_rotate_certificates: true
kubelet_streaming_connection_idle_timeout: "5m"
kubelet_make_iptables_util_chains: true
kubelet_feature_gates: ["RotateKubeletServerCertificate=true", "SeccompDefault=true"]
kubelet_seccomp_default: true
kubelet_systemd_hardening: false
# In case you have multiple interfaces in your
# control plane nodes and you want to specify the right
# IP addresses, kubelet_secure_addresses allows you
# to specify the IP from which the kubelet
# will receive the packets.
# kubelet_secure_addresses: "192.168.10.110 192.168.10.111 192.168.10.112"

# additional configurations
kube_owner: root
kube_cert_group: root

# create a default Pod Security Configuration and deny running of insecure pods
# kube_system namespace is exempted by default
kube_pod_security_use_default: true
kube_pod_security_default_enforce: restricted
kube_pod_security_exemptions_namespaces:
  - kube-system
  - calico-apiserver
  - metrics-server
  - rook-ceph
  - prometheus

@travisn
Member

travisn commented Apr 24, 2024

The setting causing the foreground deletion is not obvious in that config. Background deletion is the K8s default, and somehow foreground is now being enforced. The topic on Cascading deletion may have some clues.
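
For reference, the difference can be seen with kubectl; foreground cascading deletion is what makes the API server attach the foregroundDeletion finalizer to the object being deleted (the job name below is only an example):

# background cascade (the Kubernetes default): the object is removed immediately
# and its dependents are garbage-collected afterwards
kubectl --namespace rook-ceph delete job rook-ceph-detect-version
# foreground cascade: the API server first adds the foregroundDeletion finalizer,
# so the object stays in a deleting state until all of its dependents are gone
kubectl --namespace rook-ceph delete job rook-ceph-detect-version --cascade=foreground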

@pomland-94
Author

So I have to create the deployment with the following option?

kubectl apply -f crds.yaml -f common.yaml -f operator.yaml --cascade=foreground

kubectl apply -f cluster.yaml --cascade=foreground

If I'm interpreting that article correctly?

@travisn
Member

travisn commented Apr 24, 2024

No, that won't help. Whatever is enforcing foreground deletion will still apply when the Rook operator creates its resources. You need to find the policy that is causing the foreground deletion and disable it so that the default background policy is restored.
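
Since the Kubespray hardening values above enable API auditing, one possible way to trace where the foreground request comes from is to search the audit log on a control-plane node. A rough sketch, assuming the audit policy captures request bodies for these delete calls and using the log path from the config above:

# find delete requests against the detect-version ConfigMaps and show who sent them
grep detect-version /var/log/kube-apiserver-log.json \
  | grep '"verb":"delete"' \
  | jq '{user: .user.username, agent: .userAgent, options: .requestObject}'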

@pomland-94
Author

I can't find the policy; I searched my whole cluster but can't find anything. I also tried creating the following ClusterRoleBinding:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: default-cascade-delete
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:controller:clusterrole-aggregation-controller
subjects:
- kind: ServiceAccount
  name: rook-ceph-default
  namespace: rook-ceph

but it seems that this does not help.
