Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core: rgw: allow specifying daemon startup probes #9468

Merged
merged 1 commit into from Jan 4, 2022

Conversation

BlaineEXE
Copy link
Member

@BlaineEXE BlaineEXE commented Dec 18, 2021

Allow specifying daemon startup probes where we also allow configuring
liveness probes. Startup probes allow Rook to tolerate when Ceph daemons
occasionally take a long time to start up while not also making
Kubernetes liveness probes slower to detect runtime failures of daemons.

Startup probes are beta in Kubernetes 1.18, so we should not enable
probes by default for earlier Kubernetes versions.

Signed-off-by: Blaine Gardner blaine.gardner@redhat.com

TODO:

  • disable probes for k8s <1.18
    Note: turns out, tests on k8s 1.16 don't fail if resources contain startup probes. I think k8s must silently ignore the config there. Good news for us. We don't have to check the k8s version after all.

Description of your changes:

Which issue is resolved by this Pull Request:
Resolves #9401

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: Add the flag for skipping the build if this is only a documentation change. See here for the flag.
  • Skip Unrelated Tests: Add a flag to run tests for a specific storage provider. See test options.
  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.

Copy link
Member

@leseb leseb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks straightforward.

@BlaineEXE BlaineEXE force-pushed the startup-probes branch 2 times, most recently from 4eab169 to 9c6c8a9 Compare December 21, 2021 20:23
@BlaineEXE
Copy link
Member Author

Liveness probes as observed in a test cluster. These are working as intended for this PR, but I have some comments inline also (<!-- <comment> -->). We have the opportunity to consider changing default values to fail-and-restart daemons more quickly if we so choose.

$  kubectl -n rook-ceph get pod -o json | jq -r '.items[] | (.metadata.name), (.spec.containers[0] | ("startup:"), (.startupProbe), ("liveness:"), (.livenessProbe), ("readiness:"), (.readinessProbe)), ("\n")'

           

           <!-- no probes on any CSI pods -->


csi-cephfsplugin-5hlh4
startup:
null
liveness:
null
readiness:
null


csi-cephfsplugin-provisioner-689686b44-hnpcl
startup:
null
liveness:
null
readiness:
null


csi-rbdplugin-provisioner-5775fb866b-9v8ph
startup:
null
liveness:
null
readiness:
null


csi-rbdplugin-scm67
startup:
null
liveness:
null
readiness:
null


rook-ceph-mgr-a-795bcd9595-6m7qj
startup:
{
  "exec": {
    "command": [
      "env",
      "-i",
      "sh",
      "-c",
      "ceph --admin-daemon /run/ceph/ceph-mgr.a.asok status"
    ]
  },
  "failureThreshold": 6,
  "initialDelaySeconds": 10,
  "periodSeconds": 10,
  "successThreshold": 1,
  "timeoutSeconds": 1
}
liveness:
{
  "exec": {
    "command": [
      "env",
      "-i",
      "sh",
      "-c",
      "ceph --admin-daemon /run/ceph/ceph-mgr.a.asok status"
    ]
  },
  "failureThreshold": 3,
  "initialDelaySeconds": 10,
  "periodSeconds": 10,
  "successThreshold": 1,
  "timeoutSeconds": 1
}
readiness:
null           <!-- mgr doesn't have a readiness probe, but maybe we should have one? -->


rook-ceph-mon-a-75dd59d857-pksvp
startup:
{
  "exec": {
    "command": [
      "env",
      "-i",
      "sh",
      "-c",
      "ceph --admin-daemon /run/ceph/ceph-mon.a.asok mon_status"
    ]
  },
  "failureThreshold": 6,           <!-- most pods have 60 seconds to start up before being failed -->
  "initialDelaySeconds": 10,
  "periodSeconds": 10,
  "successThreshold": 1,
  "timeoutSeconds": 1
}
liveness:
{
  "exec": {
    "command": [
      "env",
      "-i",
      "sh",
      "-c",
      "ceph --admin-daemon /run/ceph/ceph-mon.a.asok mon_status"
    ]
  },
  "failureThreshold": 3,           <!-- liveness probes remain at 30 seconds before being killed due to failures -->
  "initialDelaySeconds": 10,
  "periodSeconds": 10,
  "successThreshold": 1,
  "timeoutSeconds": 1
}
readiness:
null


rook-ceph-operator-b77b97c9c-rf24k
startup:
null
liveness:
null
readiness:
null


rook-ceph-osd-0-794865644-4z9cb
startup:
{
  "exec": {
    "command": [
      "env",
      "-i",
      "sh",
      "-c",
      "ceph --admin-daemon /run/ceph/ceph-osd.0.asok status"
    ]
  },
  "failureThreshold": 9,           <!-- OSDs have 90 seconds to start up now (was effectively 45+30=75 seconds before) -->
  "initialDelaySeconds": 10,
  "periodSeconds": 10,
  "successThreshold": 1,
  "timeoutSeconds": 1
}
liveness:
{
  "exec": {
    "command": [
      "env",
      "-i",
      "sh",
      "-c",
      "ceph --admin-daemon /run/ceph/ceph-osd.0.asok status"
    ]
  },
  "failureThreshold": 3,           <!-- running OSDs can now fail after 30 seconds before being restarted (was whatever the remainder of 45+30=75 seconds) -->
  "initialDelaySeconds": 10,
  "periodSeconds": 10,
  "successThreshold": 1,
  "timeoutSeconds": 1
}
readiness:
null


rook-ceph-osd-1-664796587d-bsz7s
startup:
{
  "exec": {
    "command": [
      "env",
      "-i",
      "sh",
      "-c",
      "ceph --admin-daemon /run/ceph/ceph-osd.1.asok status"
    ]
  },
  "failureThreshold": 9,
  "initialDelaySeconds": 10,
  "periodSeconds": 10,
  "successThreshold": 1,
  "timeoutSeconds": 1
}
liveness:
{
  "exec": {
    "command": [
      "env",
      "-i",
      "sh",
      "-c",
      "ceph --admin-daemon /run/ceph/ceph-osd.1.asok status"
    ]
  },
  "failureThreshold": 3,
  "initialDelaySeconds": 10,
  "periodSeconds": 10,
  "successThreshold": 1,
  "timeoutSeconds": 1
}
readiness:
null


rook-ceph-osd-2-56697574cf-qnqtl
startup:
{
  "exec": {
    "command": [
      "env",
      "-i",
      "sh",
      "-c",
      "ceph --admin-daemon /run/ceph/ceph-osd.2.asok status"
    ]
  },
  "failureThreshold": 9,
  "initialDelaySeconds": 10,
  "periodSeconds": 10,
  "successThreshold": 1,
  "timeoutSeconds": 1
}
liveness:
{
  "exec": {
    "command": [
      "env",
      "-i",
      "sh",
      "-c",
      "ceph --admin-daemon /run/ceph/ceph-osd.2.asok status"
    ]
  },
  "failureThreshold": 3,
  "initialDelaySeconds": 10,
  "periodSeconds": 10,
  "successThreshold": 1,
  "timeoutSeconds": 1
}
readiness:
null


rook-ceph-osd-prepare-minikube--1-lsk6z
startup:
null
liveness:
null
readiness:
null


rook-ceph-rgw-my-store-a-76777dd9df-xk2nd
startup:
{
  "failureThreshold": 18,           <!-- rgws have 3 minutes to start up (was effectively 30s before) -->
  "httpGet": {
    "path": "/swift/healthcheck",
    "port": 8080,
    "scheme": "HTTP"
  },
  "initialDelaySeconds": 10,
  "periodSeconds": 10,
  "successThreshold": 1,
  "timeoutSeconds": 1
}
liveness:
{
  "failureThreshold": 3,           <!-- rgws are killed after 30s (or 3 failures) during runtime, same as before -->
  "initialDelaySeconds": 10,
  "periodSeconds": 10,
  "successThreshold": 1,
  "tcpSocket": {
    "port": 8080
  },
  "timeoutSeconds": 1
}
readiness:
{
  "failureThreshold": 3,
  "httpGet": {
    "path": "/swift/healthcheck",
    "port": 8080,
    "scheme": "HTTP"
  },
  "initialDelaySeconds": 10,
  "periodSeconds": 10,
  "successThreshold": 1,
  "timeoutSeconds": 1
}


rook-ceph-tools-555c879675-d6m2n
startup:
null
liveness:
null
readiness:
null

Allow specifying daemon startup probes where we also allow configuring
liveness probes. Startup probes allow Rook to tolerate when Ceph daemons
occasionally take a long time to start up while not also making
Kubernetes liveness probes slower to detect runtime failures of daemons.

Startup probes are beta in Kubernetes 1.18, so we should not enable
probes by default for earlier Kubernetes versions.

Signed-off-by: Blaine Gardner <blaine.gardner@redhat.com>
Comment on lines -620 to +622
The liveness probe of each daemon can also be controlled via `livenessProbe`, the setting is valid for `mon`, `mgr` and `osd`.
Here is a complete example for both `daemonHealth` and `livenessProbe`:
The liveness probe and startup probe of each daemon can also be controlled via `livenessProbe` and
`startupProbe` respectively. The settings are valid for `mon`, `mgr` and `osd`.
Here is a complete example for both `daemonHealth`, `livenessProbe`, and `startupProbe`:
Copy link
Member Author

@BlaineEXE BlaineEXE Dec 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the code actually allows overriding the probes for mds as well, but this should really come from the CephFilesystem CR. Should we add that in a new PR and remove the mds config from CephCluster?

@leseb

Copy link
Member

@leseb leseb Jan 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, it should be moved out so we can configure it from the CephFilesystem CR, just like the CephObjectStore does. I'm fine with a followup PR. Please open an issue to track this, thanks!

@BlaineEXE BlaineEXE marked this pull request as ready for review December 21, 2021 22:52
@BlaineEXE
Copy link
Member Author

@JensErat noted here (#9283 (comment)) that 90 seconds to start an OSD was working for them. This is the value chosen for my initial implementation here.

@JensErat
Copy link
Contributor

@JensErat noted here (#9283 (comment)) that 90 seconds to start an OSD was working for them. This is the value chosen for my initial implementation here.

Sounds reasonable. If my specific, special setup should ever require larger values, your change still allows to configure higher values. Thank you very much! I somewhat assume this can also be considered closing #9283, at least for some users.

@BlaineEXE
Copy link
Member Author

@JensErat noted here (#9283 (comment)) that 90 seconds to start an OSD was working for them. This is the value chosen for my initial implementation here.

Sounds reasonable. If my specific, special setup should ever require larger values, your change still allows to configure higher values. Thank you very much! I somewhat assume this can also be considered closing #9283, at least for some users.

Good point. I think we should try to solicit feedback from users to see if 9283 is resolved by this fix once it's merged and released.

Comment on lines -620 to +622
The liveness probe of each daemon can also be controlled via `livenessProbe`, the setting is valid for `mon`, `mgr` and `osd`.
Here is a complete example for both `daemonHealth` and `livenessProbe`:
The liveness probe and startup probe of each daemon can also be controlled via `livenessProbe` and
`startupProbe` respectively. The settings are valid for `mon`, `mgr` and `osd`.
Here is a complete example for both `daemonHealth`, `livenessProbe`, and `startupProbe`:
Copy link
Member

@leseb leseb Jan 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, it should be moved out so we can configure it from the CephFilesystem CR, just like the CephObjectStore does. I'm fine with a followup PR. Please open an issue to track this, thanks!

@BlaineEXE BlaineEXE merged commit 5b9ac16 into rook:master Jan 4, 2022
@BlaineEXE BlaineEXE deleted the startup-probes branch January 4, 2022 18:03
mergify bot added a commit that referenced this pull request Jan 4, 2022
core: rgw: allow specifying daemon startup probes (backport #9468)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow defining startupProbes, like we have livenessProbes
3 participants