core: rgw: allow specifying daemon startup probes #9468
Conversation
Overall, looks straightforward.
Force-pushed from 4eab169 to 9c6c8a9 (compare).
Liveness probes as observed in a test cluster. These are working as intended for this PR, but I have some comments inline as well (marked with `<!-- -->`):

$ kubectl -n rook-ceph get pod -o json | jq -r '.items[] | (.metadata.name), (.spec.containers[0] | ("startup:"), (.startupProbe), ("liveness:"), (.livenessProbe), ("readiness:"), (.readinessProbe)), ("\n")'
<!-- no probes on any CSI pods -->
csi-cephfsplugin-5hlh4
startup:
null
liveness:
null
readiness:
null
csi-cephfsplugin-provisioner-689686b44-hnpcl
startup:
null
liveness:
null
readiness:
null
csi-rbdplugin-provisioner-5775fb866b-9v8ph
startup:
null
liveness:
null
readiness:
null
csi-rbdplugin-scm67
startup:
null
liveness:
null
readiness:
null
rook-ceph-mgr-a-795bcd9595-6m7qj
startup:
{
"exec": {
"command": [
"env",
"-i",
"sh",
"-c",
"ceph --admin-daemon /run/ceph/ceph-mgr.a.asok status"
]
},
"failureThreshold": 6,
"initialDelaySeconds": 10,
"periodSeconds": 10,
"successThreshold": 1,
"timeoutSeconds": 1
}
liveness:
{
"exec": {
"command": [
"env",
"-i",
"sh",
"-c",
"ceph --admin-daemon /run/ceph/ceph-mgr.a.asok status"
]
},
"failureThreshold": 3,
"initialDelaySeconds": 10,
"periodSeconds": 10,
"successThreshold": 1,
"timeoutSeconds": 1
}
readiness:
null <!-- mgr doesn't have a readiness probe, but maybe we should have one? -->
rook-ceph-mon-a-75dd59d857-pksvp
startup:
{
"exec": {
"command": [
"env",
"-i",
"sh",
"-c",
"ceph --admin-daemon /run/ceph/ceph-mon.a.asok mon_status"
]
},
"failureThreshold": 6, <!-- most pods have 60 seconds to start up before being failed -->
"initialDelaySeconds": 10,
"periodSeconds": 10,
"successThreshold": 1,
"timeoutSeconds": 1
}
liveness:
{
"exec": {
"command": [
"env",
"-i",
"sh",
"-c",
"ceph --admin-daemon /run/ceph/ceph-mon.a.asok mon_status"
]
},
"failureThreshold": 3, <!-- liveness probes remain at 30 seconds before being killed due to failures -->
"initialDelaySeconds": 10,
"periodSeconds": 10,
"successThreshold": 1,
"timeoutSeconds": 1
}
readiness:
null
rook-ceph-operator-b77b97c9c-rf24k
startup:
null
liveness:
null
readiness:
null
rook-ceph-osd-0-794865644-4z9cb
startup:
{
"exec": {
"command": [
"env",
"-i",
"sh",
"-c",
"ceph --admin-daemon /run/ceph/ceph-osd.0.asok status"
]
},
"failureThreshold": 9, <!-- OSDs have 90 seconds to start up now (was effectively 45+30=75 seconds before) -->
"initialDelaySeconds": 10,
"periodSeconds": 10,
"successThreshold": 1,
"timeoutSeconds": 1
}
liveness:
{
"exec": {
"command": [
"env",
"-i",
"sh",
"-c",
"ceph --admin-daemon /run/ceph/ceph-osd.0.asok status"
]
},
"failureThreshold": 3, <!-- running OSDs can now fail after 30 seconds before being restarted (was whatever the remainder of 45+30=75 seconds) -->
"initialDelaySeconds": 10,
"periodSeconds": 10,
"successThreshold": 1,
"timeoutSeconds": 1
}
readiness:
null
rook-ceph-osd-1-664796587d-bsz7s
startup:
{
"exec": {
"command": [
"env",
"-i",
"sh",
"-c",
"ceph --admin-daemon /run/ceph/ceph-osd.1.asok status"
]
},
"failureThreshold": 9,
"initialDelaySeconds": 10,
"periodSeconds": 10,
"successThreshold": 1,
"timeoutSeconds": 1
}
liveness:
{
"exec": {
"command": [
"env",
"-i",
"sh",
"-c",
"ceph --admin-daemon /run/ceph/ceph-osd.1.asok status"
]
},
"failureThreshold": 3,
"initialDelaySeconds": 10,
"periodSeconds": 10,
"successThreshold": 1,
"timeoutSeconds": 1
}
readiness:
null
rook-ceph-osd-2-56697574cf-qnqtl
startup:
{
"exec": {
"command": [
"env",
"-i",
"sh",
"-c",
"ceph --admin-daemon /run/ceph/ceph-osd.2.asok status"
]
},
"failureThreshold": 9,
"initialDelaySeconds": 10,
"periodSeconds": 10,
"successThreshold": 1,
"timeoutSeconds": 1
}
liveness:
{
"exec": {
"command": [
"env",
"-i",
"sh",
"-c",
"ceph --admin-daemon /run/ceph/ceph-osd.2.asok status"
]
},
"failureThreshold": 3,
"initialDelaySeconds": 10,
"periodSeconds": 10,
"successThreshold": 1,
"timeoutSeconds": 1
}
readiness:
null
rook-ceph-osd-prepare-minikube--1-lsk6z
startup:
null
liveness:
null
readiness:
null
rook-ceph-rgw-my-store-a-76777dd9df-xk2nd
startup:
{
"failureThreshold": 18, <!-- rgws have 3 minutes to start up (was effectively 30s before) -->
"httpGet": {
"path": "/swift/healthcheck",
"port": 8080,
"scheme": "HTTP"
},
"initialDelaySeconds": 10,
"periodSeconds": 10,
"successThreshold": 1,
"timeoutSeconds": 1
}
liveness:
{
"failureThreshold": 3, <!-- rgws are killed after 30s (or 3 failures) during runtime, same as before -->
"initialDelaySeconds": 10,
"periodSeconds": 10,
"successThreshold": 1,
"tcpSocket": {
"port": 8080
},
"timeoutSeconds": 1
}
readiness:
{
"failureThreshold": 3,
"httpGet": {
"path": "/swift/healthcheck",
"port": 8080,
"scheme": "HTTP"
},
"initialDelaySeconds": 10,
"periodSeconds": 10,
"successThreshold": 1,
"timeoutSeconds": 1
}
rook-ceph-tools-555c879675-d6m2n
startup:
null
liveness:
null
readiness:
null
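The inline annotations above all compute the failure window the same way. As a quick illustration (a hypothetical helper of my own, not Rook code), the window before Kubernetes acts on a persistently failing probe is simply `periodSeconds * failureThreshold`:

```python
# Illustrative helper (not part of Rook): seconds of consecutive probe
# failures before Kubernetes considers the probe failed and acts on it
# (fails startup, or kills the container for a liveness probe).
def probe_failure_window(period_seconds: int, failure_threshold: int) -> int:
    return period_seconds * failure_threshold

# Values from the dump above:
assert probe_failure_window(10, 6) == 60    # mon/mgr startup: 60s to start
assert probe_failure_window(10, 9) == 90    # OSD startup: 90s to start
assert probe_failure_window(10, 18) == 180  # rgw startup: 3 minutes
assert probe_failure_window(10, 3) == 30    # liveness: restart after 30s of failures
```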
Allow specifying daemon startup probes where we also allow configuring liveness probes. Startup probes allow Rook to tolerate when Ceph daemons occasionally take a long time to start up, while not also making Kubernetes liveness probes slower to detect runtime failures of daemons. Startup probes are beta in Kubernetes 1.18, so we should not enable probes by default for earlier Kubernetes versions.

Signed-off-by: Blaine Gardner <blaine.gardner@redhat.com>
Force-pushed from 9c6c8a9 to c07d89d (compare).
```diff
-The liveness probe of each daemon can also be controlled via `livenessProbe`, the setting is valid for `mon`, `mgr` and `osd`.
-Here is a complete example for both `daemonHealth` and `livenessProbe`:
+The liveness probe and startup probe of each daemon can also be controlled via `livenessProbe` and
+`startupProbe` respectively. The settings are valid for `mon`, `mgr` and `osd`.
+Here is a complete example for both `daemonHealth`, `livenessProbe`, and `startupProbe`:
```
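To make the docs change concrete, here is a sketch of what a CephCluster `healthCheck` section with all three keys might look like. Values are illustrative, not Rook defaults, and the `probe` override field is assumed from the shape of the existing `livenessProbe` setting:

```yaml
# Illustrative sketch only; values are not Rook defaults.
healthCheck:
  daemonHealth:
    mon:
      disabled: false
      interval: 45s
  livenessProbe:
    mon:
      disabled: false
  startupProbe:
    osd:
      disabled: false
      probe:
        initialDelaySeconds: 10
        periodSeconds: 10
        failureThreshold: 12   # allow up to 120s for slow OSD startup
```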
So the code actually allows overriding the probes for `mds` as well, but this should really come from the CephFilesystem CR. Should we add that in a new PR and remove the `mds` config from CephCluster?
Agreed, it should be moved out so we can configure it from the CephFilesystem CR, just like the CephObjectStore does. I'm fine with a followup PR. Please open an issue to track this, thanks!
@JensErat noted here (#9283 (comment)) that 90 seconds to start an OSD was working for them. This is the value chosen for my initial implementation here.

Sounds reasonable. If my specific, special setup should ever require larger values, your change still allows configuring higher values. Thank you very much! I assume this can also be considered to close #9283, at least for some users.

Good point. I think we should try to solicit feedback from users to see if #9283 is resolved by this fix once it's merged and released.
core: rgw: allow specifying daemon startup probes (backport #9468)
Allow specifying daemon startup probes where we also allow configuring
liveness probes. Startup probes allow Rook to tolerate when Ceph daemons
occasionally take a long time to start up while not also making
Kubernetes liveness probes slower to detect runtime failures of daemons.
Startup probes are beta in Kubernetes 1.18, so we should not enable
probes by default for earlier Kubernetes versions.
Signed-off-by: Blaine Gardner <blaine.gardner@redhat.com>
Note: it turns out that tests on k8s 1.16 don't fail if resources contain startup probes. I think k8s must silently ignore the config there. Good news for us: we don't have to check the k8s version after all.
Which issue is resolved by this Pull Request:
Resolves #9401
Checklist:
- `make codegen` has been run to update object specifications, if necessary.