Operator error after K8s master token changed #9314

Closed
slow-zhang opened this issue Dec 6, 2021 · 5 comments · Fixed by #9347

  • Bug Report

Expected behavior:
No error.

How to reproduce it (minimal and precise):

  • Deploy a cluster with Helm
  • Change the K8s master's token

Log:

root@lasvai-staging-capos-k8s-master01:~/manifests# kubectl logs rook-ceph-operator-7475fbb785-5g42c -n rook-ceph-admin
2021-12-02 08:21:33.574197 I | op-flags: failed to set flag "logtostderr". no such flag -logtostderr
2021-12-02 08:21:33.574400 I | rookcmd: starting Rook v1.7.5 with arguments '/usr/local/bin/rook ceph operator'
2021-12-02 08:21:33.574408 I | rookcmd: flag values: --csi-cephfs-plugin-template-path=/etc/ceph-csi/cephfs/csi-cephfsplugin.yaml, --csi-cephfs-provisioner-dep-template-path=/etc/ceph-csi/cephfs/csi-cephfsplugin-provisioner-dep.yaml, --csi-rbd-plugin-template-path=/etc/ceph-csi/rbd/csi-rbdplugin.yaml, --csi-rbd-provisioner-dep-template-path=/etc/ceph-csi/rbd/csi-rbdplugin-provisioner-dep.yaml, --enable-machine-disruption-budget=false, --help=false, --kubeconfig=, --log-level=INFO, --operator-image=, --service-account=
2021-12-02 08:21:33.574412 I | cephcmd: starting Rook-Ceph operator
2021-12-02 08:21:33.843956 I | cephcmd: base ceph version inside the rook operator image is "ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)"
2021-12-02 08:21:33.849634 I | op-k8sutil: ROOK_CEPH_COMMANDS_TIMEOUT_SECONDS="15" (env var)
2021-12-02 08:21:33.851843 I | op-k8sutil: ROOK_ENABLE_DISCOVERY_DAEMON="false" (env var)
2021-12-02 08:21:33.854409 I | operator: looking for secret "rook-ceph-admission-controller"
2021-12-02 08:21:33.856280 I | operator: secret "rook-ceph-admission-controller" not found. proceeding without the admission controller
2021-12-02 08:21:33.858798 I | op-k8sutil: ROOK_ENABLE_FLEX_DRIVER="false" (env var)
2021-12-02 08:21:33.858809 I | operator: watching all namespaces for ceph cluster CRs
2021-12-02 08:21:33.859082 I | operator: setting up the controller-runtime manager
I1202 08:21:34.910370       6 request.go:668] Waited for 1.047258479s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/cert-manager.io/v1?timeout=32s
2021-12-02 08:21:39.063150 I | ceph-cluster-controller: successfully started
2021-12-02 08:21:39.063218 I | ceph-cluster-controller: enabling hotplug orchestration
2021-12-02 08:21:39.063237 I | ceph-crashcollector-controller: successfully started
2021-12-02 08:21:39.063301 I | ceph-block-pool-controller: successfully started
2021-12-02 08:21:39.063357 I | ceph-object-store-user-controller: successfully started
2021-12-02 08:21:39.063408 I | ceph-object-realm-controller: successfully started
2021-12-02 08:21:39.063453 I | ceph-object-zonegroup-controller: successfully started
2021-12-02 08:21:39.063497 I | ceph-object-zone-controller: successfully started
2021-12-02 08:21:39.063608 I | ceph-object-controller: successfully started
2021-12-02 08:21:39.063674 I | ceph-file-controller: successfully started
2021-12-02 08:21:39.063727 I | ceph-nfs-controller: successfully started
2021-12-02 08:21:39.063782 I | ceph-rbd-mirror-controller: successfully started
2021-12-02 08:21:39.063836 I | ceph-client-controller: successfully started
2021-12-02 08:21:39.063881 I | ceph-filesystem-mirror-controller: successfully started
2021-12-02 08:21:39.064670 I | operator: starting the controller-runtime manager
2021-12-02 08:21:39.405020 I | clusterdisruption-controller: create event from ceph cluster CR
2021-12-02 08:21:39.706431 I | ceph-cluster-controller: reconciling ceph cluster in namespace "rook-ceph"
2021-12-02 08:21:39.708877 E | ceph-block-pool-controller: failed to reconcile. failed to populate cluster info: not expected to create new cluster info and did not find existing secret
2021-12-02 08:21:39.708903 I | op-k8sutil: ROOK_ENABLE_FLEX_DRIVER="false" (env var)
2021-12-02 08:21:39.709169 E | ceph-file-controller: failed to reconcile failed to populate cluster info: not expected to create new cluster info and did not find existing secret
2021-12-02 08:21:39.709397 E | ceph-object-controller: failed to reconcile CephObjectStore "rook-ceph/ceph-objectstore". failed to populate cluster info: not expected to create new cluster info and did not find existing secret
2021-12-02 08:21:39.709408 I | op-k8sutil: Reporting Event rook-ceph:ceph-objectstore Warning:ReconcileFailed:failed to populate cluster info: not expected to create new cluster info and did not find existing secret
2021-12-02 08:21:39.710289 I | clusterdisruption-controller: deleted all legacy blocking PDBs for osds
2021-12-02 08:21:39.715906 E | ceph-file-controller: failed to reconcile failed to populate cluster info: not expected to create new cluster info and did not find existing secret
2021-12-02 08:21:39.716297 E | ceph-block-pool-controller: failed to reconcile. failed to populate cluster info: not expected to create new cluster info and did not find existing secret
2021-12-02 08:21:39.723559 I | clusterdisruption-controller: deleted all legacy node drain canary pods
2021-12-02 08:21:39.724154 E | ceph-object-controller: failed to reconcile CephObjectStore "rook-ceph/ceph-objectstore". failed to populate cluster info: not expected to create new cluster info and did not find existing secret
2021-12-02 08:21:39.725882 I | ceph-csi: successfully created csi config map "rook-ceph-csi-config"
2021-12-02 08:21:39.804723 E | ceph-file-controller: failed to reconcile failed to populate cluster info: not expected to create new cluster info and did not find existing secret
2021-12-02 08:21:39.909389 E | ceph-block-pool-controller: failed to reconcile. failed to populate cluster info: not expected to create new cluster info and did not find existing secret
2021-12-02 08:21:40.111492 E | ceph-object-controller: failed to reconcile CephObjectStore "rook-ceph/ceph-objectstore". failed to populate cluster info: not expected to create new cluster info and did not find existing secret
2021-12-02 08:21:40.221997 I | clusterdisruption-controller: Ceph "rook-ceph" cluster not ready, cannot check Ceph status yet.
2021-12-02 08:21:40.309740 I | ceph-cluster-controller: clusterInfo not yet found, must be a new cluster
2021-12-02 08:21:40.321066 I | ceph-cluster-controller: enabling ceph mon monitoring goroutine for cluster "rook-ceph"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x78 pc=0x1894b82]

goroutine 992 [running]:
github.com/rook/rook/pkg/operator/ceph/cluster/osd.NewOSDHealthMonitor(0xc0006b3180, 0x0, 0xc001b88700, 0x0, 0xc001590658, 0x0, 0x0, 0x0, 0xc001590660, 0x0, ...)
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/osd/health.go:64 +0xc2
github.com/rook/rook/pkg/operator/ceph/cluster.(*ClusterController).startMonitoringCheck(0xc0001ccbb0, 0xc000b81800, 0x0, 0x21d33d2, 0x3)
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/monitoring.go:120 +0x125
github.com/rook/rook/pkg/operator/ceph/cluster.(*ClusterController).configureCephMonitoring(0xc0001ccbb0, 0xc000b81800, 0x0)
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/monitoring.go:67 +0x5ae
github.com/rook/rook/pkg/operator/ceph/cluster.(*ClusterController).initializeCluster(0xc0001ccbb0, 0xc000b81800, 0x221a96c, 0x28)
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/cluster.go:205 +0x65b
github.com/rook/rook/pkg/operator/ceph/cluster.(*ClusterController).reconcileCephCluster(0xc0001ccbb0, 0xc000a2a900, 0xc00090eea0, 0xc000a2a900, 0x0)
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/controller.go:394 +0x2f0
github.com/rook/rook/pkg/operator/ceph/cluster.(*ReconcileCephCluster).reconcile(0xc00089e420, 0xc0008318e0, 0x9, 0xc0008318c0, 0x9, 0xc000780400, 0x0, 0xc0012ecd80, 0x40e0f8, 0x30)
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/controller.go:272 +0x3e9
github.com/rook/rook/pkg/operator/ceph/cluster.(*ReconcileCephCluster).Reconcile(0xc00089e420, 0x2587018, 0xc00090ec30, 0xc0008318e0, 0x9, 0xc0008318c0, 0x9, 0xc00090ec30, 0xc00003a000, 0x2029de0, ...)
	/home/rook/go/src/github.com/rook/rook/pkg/operator/ceph/cluster/controller.go:227 +0x74
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0007a6b40, 0x2586f70, 0xc000892940, 0x1fc2780, 0xc001710860)
	/home/rook/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.6/pkg/internal/controller/controller.go:298 +0x30d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0007a6b40, 0x2586f70, 0xc000892940, 0x0)
	/home/rook/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.6/pkg/internal/controller/controller.go:253 +0x205
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2(0xc000796360, 0xc0007a6b40, 0x2586f70, 0xc000892940)
	/home/rook/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.6/pkg/internal/controller/controller.go:214 +0x6b
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
	/home/rook/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.6/pkg/internal/controller/controller.go:210 +0x425
slow-zhang added the bug label Dec 6, 2021
travisn (Member) commented Dec 6, 2021

Looks like this is crashing on this line:

logger.Infof("ceph osd status in namespace %q check interval %q", h.clusterInfo.Namespace, checkInterval.Duration.String())

@leseb Do you see how clusterInfo or checkInterval could be nil?
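
For context, this is exactly the failure mode a nil *ClusterInfo would produce on that log line. A minimal, self-contained Go sketch of the pattern, using simplified stand-in types (ClusterInfo and newOSDHealthMonitor here are hypothetical, not Rook's real definitions):

package main

import (
    "fmt"
    "time"
)

// ClusterInfo is a simplified stand-in for Rook's cluster info struct.
type ClusterInfo struct {
    Namespace string
}

// newOSDHealthMonitor mimics the pattern at health.go:64: it logs a field of
// clusterInfo without first checking whether the pointer is nil.
func newOSDHealthMonitor(clusterInfo *ClusterInfo, checkInterval time.Duration) {
    // With clusterInfo == nil, this dereference triggers
    // "panic: runtime error: invalid memory address or nil pointer dereference".
    fmt.Printf("ceph osd status in namespace %q check interval %q\n",
        clusterInfo.Namespace, checkInterval.String())
}

func main() {
    // Passing nil mimics a reconcile that never obtained cluster info
    // ("did not find existing secret") before monitoring was started.
    newOSDHealthMonitor(nil, time.Minute)
}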

leseb added a commit to leseb/rook that referenced this issue Dec 8, 2021
We should always return a nil pointer of clusterInfo if
CreateOrLoadClusterInfo() returns an error.

Closes: rook#9314
Signed-off-by: Sébastien Han <seb@redhat.com>
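
The guard described in that commit message might look roughly like the sketch below. This is only an illustration of the convention under stated assumptions; readMonSecret, loadClusterInfo, and startMonitoringCheck are hypothetical names, not the actual change in #9347:

package main

import (
    "errors"
    "fmt"
)

// ClusterInfo is a simplified stand-in for Rook's cluster info struct.
type ClusterInfo struct {
    Namespace string
}

// readMonSecret is a hypothetical stand-in for reading the mon secret from the
// Kubernetes API; here it always fails, matching the errors in the log above.
func readMonSecret(namespace string) (string, error) {
    return "", errors.New("not expected to create new cluster info and did not find existing secret")
}

// loadClusterInfo follows the convention from the commit message: on any error
// it returns a nil pointer, never a partially populated struct.
func loadClusterInfo(namespace string) (*ClusterInfo, error) {
    if _, err := readMonSecret(namespace); err != nil {
        return nil, fmt.Errorf("failed to populate cluster info: %w", err)
    }
    return &ClusterInfo{Namespace: namespace}, nil
}

// startMonitoringCheck bails out before handing a nil clusterInfo to anything
// like the OSD health monitor, which is where the operator panicked.
func startMonitoringCheck(namespace string) error {
    clusterInfo, err := loadClusterInfo(namespace)
    if err != nil {
        return fmt.Errorf("not starting osd health monitor: %w", err)
    }
    fmt.Println("starting osd health monitor in namespace", clusterInfo.Namespace)
    return nil
}

func main() {
    if err := startMonitoringCheck("rook-ceph"); err != nil {
        fmt.Println(err)
    }
}
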
leseb (Member) commented Dec 8, 2021

I've attempted a patch in #9347, but we need more details, such as the CephCluster CR. @slow-zhang, can you please provide that? Thanks

leseb self-assigned this Dec 8, 2021
slow-zhang (Author) commented Dec 8, 2021

I am not sure what the CR means. Do you mean the Rook Ceph cluster config? The environment is empty now because I redeployed the cluster.

leseb (Member) commented Dec 8, 2021

The CR is the Custom Resource, i.e. the CephCluster you injected, e.g. from https://github.com/rook/rook/blob/master/deploy/examples/cluster.yaml. Can you share it?

slow-zhang (Author) commented

I deployed it with Helm; here is my values file. Most of the parameters are defaults; I only added some taints and a label selector.

operatorNamespace: rook-ceph-admin

# The metadata.name of the CephCluster CR. The default name is the same as the namespace.
# clusterName: rook-ceph

# Ability to override ceph.conf
# configOverride: |
#   [global]
#   mon_allow_pool_delete = true
#   osd_pool_default_size = 3
#   osd_pool_default_min_size = 2

# Installs a debugging toolbox deployment
toolbox:
  enabled: true
  image: rook/ceph:v1.7.5
  tolerations: []
  affinity: {}

monitoring:
  # requires Prometheus to be pre-installed
  # enabling will also create RBAC rules to allow Operator to create ServiceMonitors
  enabled: false # jingzhang: disabled here; a ServiceMonitor is added in the monitoring namespace instead
  rulesNamespaceOverride:

cephFileSystems:
  - name: ceph-filesystem
    # see https://github.com/rook/rook/blob/master/Documentation/ceph-filesystem-crd.md#filesystem-settings for available configuration
    spec:
      metadataPool:
        replicated:
          size: 3
      dataPools:
        - failureDomain: host
          replicated:
            size: 3
      metadataServer:
        activeCount: 1
        activeStandby: true
    storageClass:
      enabled: false


# The imagePullSecrets option allows pulling docker images from a private docker registry. The option is passed to all service accounts.
# imagePullSecrets:
# - name: my-registry-secret

# All values below are taken from the CephCluster CRD
# More information can be found at [Ceph Cluster CRD](/Documentation/ceph-cluster-crd.md)
cephClusterSpec:
  cephVersion:
    # The container image used to launch the Ceph daemon pods (mon, mgr, osd, mds, rgw).
    # v14 is nautilus, v15 is octopus, and v16 is pacific.
    # RECOMMENDATION: In production, use a specific version tag instead of the general v14 flag, which pulls the latest release and could result in different
    # versions running within the cluster. See tags available at https://hub.docker.com/r/ceph/ceph/tags/.
    # If you want to be more precise, you can always use a timestamp tag such as quay.io/ceph/ceph:v15.2.11-20200419
    # This tag might not contain a new Ceph version, just security fixes from the underlying operating system, which will reduce vulnerabilities
    image: quay.io/ceph/ceph:v16.2.6
    # Whether to allow unsupported versions of Ceph. Currently `nautilus` and `octopus` are supported.
    # Future versions such as `pacific` would require this to be set to `true`.
    # Do not set to true in production.
    allowUnsupported: false

  # The path on the host where configuration files will be persisted. Must be specified.
  # Important: if you reinstall the cluster, make sure you delete this directory from each host or else the mons will fail to start on the new cluster.
  # In Minikube, the '/data' directory is configured to persist across reboots. Use "/data/rook" in Minikube environment.
  dataDirHostPath: /var/lib/rook

  # Whether or not upgrade should continue even if a check fails
  # This means Ceph's status could be degraded and we don't recommend upgrading but you might decide otherwise
  # Use at your OWN risk
  # To understand Rook's upgrade process of Ceph, read https://rook.io/docs/rook/master/ceph-upgrade.html#ceph-version-upgrades
  skipUpgradeChecks: false

  # Whether or not continue if PGs are not clean during an upgrade
  continueUpgradeAfterChecksEvenIfNotHealthy: false

  # WaitTimeoutForHealthyOSDInMinutes defines the time (in minutes) the operator would wait before an OSD can be stopped for upgrade or restart.
  # If the timeout is exceeded and the OSD is not ok to stop, the operator skips the upgrade for the current OSD and proceeds with the next one
  # if `continueUpgradeAfterChecksEvenIfNotHealthy` is `false`. If `continueUpgradeAfterChecksEvenIfNotHealthy` is `true`, the operator would
  # continue with the upgrade of an OSD even if it is not ok to stop after the timeout. This timeout won't be applied if `skipUpgradeChecks` is `true`.
  # The default wait timeout is 10 minutes.
  waitTimeoutForHealthyOSDInMinutes: 10

  mon:
    # Set the number of mons to be started. Generally recommended to be 3.
    # For highest availability, an odd number of mons should be specified.
    count: 3
    # The mons should be on unique nodes. For production, at least 3 nodes are recommended for this reason.
    # Mons should only be allowed on the same node for test environments where data loss is acceptable.
    allowMultiplePerNode: false

  mgr:
    # When higher availability of the mgr is needed, increase the count to 2.
    # In that case, one mgr will be active and one in standby. When Ceph updates which
    # mgr is active, Rook will update the mgr services to match the active mgr.
    count: 2
    modules:
      # Several modules should not need to be included in this list. The "dashboard" and "monitoring" modules
      # are already enabled by other settings in the cluster CR.
      - name: pg_autoscaler
        enabled: true

  # enable the ceph dashboard for viewing cluster status
  dashboard:
    enabled: true
    # serve the dashboard under a subpath (useful when you are accessing the dashboard via a reverse proxy)
    # urlPrefix: /ceph-dashboard
    # serve the dashboard at the given port.
    # port: 8443

  # Network configuration, see: https://github.com/rook/rook/blob/master/Documentation/ceph-cluster-crd.md#network-configuration-settings
  network:
  #   # enable host networking
    provider: host
  #   # EXPERIMENTAL: enable the Multus network provider
  #   provider: multus
  #   selectors:
  #     # The selector keys are required to be `public` and `cluster`.
  #     # Based on the configuration, the operator will do the following:
  #     #   1. if only the `public` selector key is specified both public_network and cluster_network Ceph settings will listen on that interface
  #     #   2. if both `public` and `cluster` selector keys are specified the first one will point to 'public_network' flag and the second one to 'cluster_network'
  #     #
  #     # In order to work, each selector value must match a NetworkAttachmentDefinition object in Multus
  #     #
  #     # public: public-conf --> NetworkAttachmentDefinition object name in Multus
  #     # cluster: cluster-conf --> NetworkAttachmentDefinition object name in Multus
  #   # Provide internet protocol version. IPv6, IPv4 or empty string are valid options. Empty string would mean IPv4
  #   ipFamily: "IPv6"
  #   # Ceph daemons to listen on both IPv4 and IPv6 networks
  #   dualStack: false

  # enable the crash collector for ceph daemon crash collection
  crashCollector:
    disable: false
    # Uncomment daysToRetain to prune ceph crash entries older than the
    # specified number of days.
    # daysToRetain: 30

  # enable log collector, daemons will log on files and rotate
  # logCollector:
  #   enabled: true
  #   periodicity: 24h # SUFFIX may be 'h' for hours or 'd' for days.

  # automate [data cleanup process](https://github.com/rook/rook/blob/master/Documentation/ceph-teardown.md#delete-the-data-on-hosts) in cluster destruction.
  cleanupPolicy:
    # Since cluster cleanup is destructive to data, confirmation is required.
    # To destroy all Rook data on hosts during uninstall, confirmation must be set to "yes-really-destroy-data".
    # This value should only be set when the cluster is about to be deleted. After the confirmation is set,
    # Rook will immediately stop configuring the cluster and only wait for the delete command.
    # If the empty string is set, Rook will not destroy any data on hosts during uninstall.
    confirmation: ""
    # sanitizeDisks represents settings for sanitizing OSD disks on cluster deletion
    sanitizeDisks:
      # method indicates if the entire disk should be sanitized or simply ceph's metadata
      # in both cases, re-install is possible
      # possible choices are 'complete' or 'quick' (default)
      method: quick
      # dataSource indicates where to get the random bytes to write on the disk
      # possible choices are 'zero' (default) or 'random'
      # using random sources will consume entropy from the system and will take much more time than the zero source
      dataSource: zero
      # iteration overwrites N times instead of the default (1)
      # takes an integer value
      iteration: 1
    # allowUninstallWithVolumes defines how the uninstall should be performed
    # If set to true, cephCluster deletion does not wait for the PVs to be deleted.
    allowUninstallWithVolumes: false

  # To control where various services will be scheduled by kubernetes, use the placement configuration sections below.
  # The example under 'all' would have all services scheduled on kubernetes nodes labeled with 'role=storage-node' and
  # tolerate taints with a key of 'storage-node'.
  placement:
    all:
      # add label: role=storage-node
      # cmd: kubectl label node <node> role=storage-node
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: role
              operator: In
              values:
              - storage-node
      # podAffinity:
      # podAntiAffinity:
      # topologySpreadConstraints:
      # add taint: node-role.kubernetes.io/ceph=true:NoSchedule
      # cmd: kubectl taint nodes <node> node-role.kubernetes.io/ceph=true:NoSchedule
      tolerations:
      - key: node-role.kubernetes.io/ceph
        operator: Exists
  #   # The above placement information can also be specified for mon, osd, and mgr components
  #   mon:
  #   # Monitor deployments may contain an anti-affinity rule for avoiding monitor
  #   # collocation on the same node. This is a required rule when host network is used
  #   # or when AllowMultiplePerNode is false. Otherwise this anti-affinity rule is a
  #   # preferred rule with weight: 50.
  #   osd:
  #   mgr:
  #   cleanup:

  # annotations:
  #   all:
  #   mon:
  #   osd:
  #   cleanup:
  #   prepareosd:
  #   # If no mgr annotations are set, prometheus scrape annotations will be set by default.
  #   mgr:

  # labels:
  #   all:
  #   mon:
  #   osd:
  #   cleanup:
  #   mgr:
  #   prepareosd:
  #   # monitoring is a list of key-value pairs. It is injected into all the monitoring resources created by operator.
  #   # These labels can be passed as LabelSelector to Prometheus
  #   monitoring:

  # resources:
  #   # The requests and limits set here, allow the mgr pod to use half of one CPU core and 1 gigabyte of memory
  #   mgr:
  #     limits:
  #       cpu: "500m"
  #       memory: "1024Mi"
  #     requests:
  #       cpu: "500m"
  #       memory: "1024Mi"
  #   # The above example requests/limits can also be added to the other components
  #   mon:
  #   osd:
  #   prepareosd:
  #   mgr-sidecar:
  #   crashcollector:
  #   logcollector:
  #   cleanup:

  # The option to automatically remove OSDs that are out and are safe to destroy.
  removeOSDsIfOutAndSafeToRemove: false

  # priority classes to apply to ceph resources
  # priorityClassNames:
  #   all: rook-ceph-default-priority-class
  #   mon: rook-ceph-mon-priority-class
  #   osd: rook-ceph-osd-priority-class
  #   mgr: rook-ceph-mgr-priority-class

  storage: # cluster level storage configuration and selection
    useAllNodes: true
    useAllDevices: true
    # deviceFilter:
    # config:
    #   crushRoot: "custom-root" # specify a non-default root label for the CRUSH map
    #   metadataDevice: "md0" # specify a non-rotational storage so ceph-volume will use it as block db device of bluestore.
    #   databaseSizeMB: "1024" # uncomment if the disks are smaller than 100 GB
    #   journalSizeMB: "1024"  # uncomment if the disks are 20 GB or smaller
    #   osdsPerDevice: "1" # this value can be overridden at the node or device level
    #   encryptedDevice: "true" # the default value for this option is "false"
    # # Individual nodes and their config can be specified as well, but 'useAllNodes' above must be set to false. Then, only the named
    # # nodes below will be used as storage resources.  Each node's 'name' field should match their 'kubernetes.io/hostname' label.
    # nodes:
    #   - name: "vai-k8s-ceph-03-worker-02"
    #     deviceFilter: "^sd[bd]"
    #   - name: "vai-k8s-ceph-03-worker-02"
    #     deviceFilter: "^sd[bd]"
    #   - name: "vai-k8s-ceph-03-worker-03"
    #     deviceFilter: "^sd[bd]"

  # The section for configuring management of daemon disruptions during upgrade or fencing.
  disruptionManagement:
    # If true, the operator will create and manage PodDisruptionBudgets for OSD, Mon, RGW, and MDS daemons. OSD PDBs are managed dynamically
    # via the strategy outlined in the [design](https://github.com/rook/rook/blob/master/design/ceph/ceph-managed-disruptionbudgets.md). The operator will
    # block eviction of OSDs by default and unblock them safely when drains are detected.
    managePodBudgets: true
    # A duration in minutes that determines how long an entire failureDomain like `region/zone/host` will be held in `noout` (in addition to the
    # default DOWN/OUT interval) when it is draining. This is only relevant when  `managePodBudgets` is `true`. The default value is `30` minutes.
    osdMaintenanceTimeout: 30
    # A duration in minutes that the operator will wait for the placement groups to become healthy (active+clean) after a drain was completed and OSDs came back up.
    # The operator will continue with the next drain if the timeout is exceeded. This only works if `managePodBudgets` is `true`.
    # No values or 0 means that the operator will wait until the placement groups are healthy before unblocking the next drain.
    pgHealthCheckTimeout: 0
    # If true, the operator will create and manage MachineDisruptionBudgets to ensure OSDs are only fenced when the cluster is healthy.
    # Only available on OpenShift.
    manageMachineDisruptionBudgets: false
    # Namespace in which to watch for the MachineDisruptionBudgets.
    machineDisruptionBudgetNamespace: openshift-machine-api

  # Configure the healthcheck and liveness probes for ceph pods.
  # Valid values for daemons are 'mon', 'osd', 'status'
  healthCheck:
    daemonHealth:
      mon:
        disabled: false
        interval: 45s
      osd:
        disabled: false
        interval: 60s
      status:
        disabled: false
        interval: 60s
    # Change pod liveness probe, it works for all mon, mgr, and osd pods.
    livenessProbe:
      mon:
        disabled: false
      mgr:
        disabled: false
      osd:
        disabled: false

mergify bot pushed a commit that referenced this issue Dec 8, 2021
We should always return a nil pointer of clusterInfo if
CreateOrLoadClusterInfo() returns an error.

Closes: #9314
Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit fdd243d)
parth-gr pushed a commit to parth-gr/rook that referenced this issue Feb 22, 2022
We should always return a nil pointer of clusterInfo if
CreateOrLoadClusterInfo() returns an error.

Closes: rook#9314
Signed-off-by: Sébastien Han <seb@redhat.com>