One Ceph node reboot made the whole rook-ceph cluster inaccessible #13995
Comments
@akash123-eng What was the ceph status during that time, and how many OSDs were running on that node? Also, Rook 1.10.8 is very old; try to upgrade.
@subhamkrai ceph status was in a warning state, as mentioned above, with the warnings below. Only 44 PGs were in active+clean+stale state; all other PGs were active+clean.
There were some OSDs running on that node, but they were less than 10% of our total OSDs.
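For reference, a minimal sketch of how the PG and OSD state can be captured while the node is down, assuming the standard rook-ceph-tools toolbox deployment in the rook-ceph namespace:

```bash
# Run Ceph CLI commands through the rook-ceph-tools pod.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detail

# List PGs stuck in a stale state and the OSDs that are currently down.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph pg dump_stuck stale
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree down
```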
@akash123-eng The Rook operator logs may help, but again, 1.10 is an old version. Also, it is very unlikely to hit this scenario when only one node out of 10+ is down. In about a week we'll have the 1.14 release, so I'd suggest planning an upgrade to get new fixes and features.
@subhamkrai We are planning to upgrade to 1.12, but it will take some time. What in particular can I search for in the operator logs?
@akash123-eng Please check:
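Illustrative only (not necessarily the checks intended above): a minimal sketch for pulling and filtering the operator logs, assuming the default rook-ceph namespace and deployment name:

```bash
# Pull recent Rook operator logs (default deployment name rook-ceph-operator).
kubectl -n rook-ceph logs deploy/rook-ceph-operator --since=24h > operator.log

# Look for reconcile failures, mon failover activity, and quorum errors.
grep -iE "error|failed|failover|quorum" operator.log | less

# Check where the mon and OSD pods ended up after the reboot.
kubectl -n rook-ceph get pods -o wide | grep -E 'rook-ceph-(mon|osd)'
```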
@subhamkrai @travisn
@travisn With the information provided above, can you please help us find the root cause of this issue and how to avoid it in the future?
Please share the full output of these two commands, to confirm the cluster can handle the loss of a node:
@travisn OK, I will share it tomorrow. But as mentioned above, we are using the rack failureDomain, and the server that got rebooted without draining was in a rack that also had another server whose OSDs were running properly during that time. So we suspect that Rook was not able to identify that the server got rebooted and kept sending requests to it, increasing slow ops up to a certain point and then stopping all operations completely.
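For reference, the failure domain configured on the pool CRs can be confirmed like this (a sketch assuming the standard Rook CRDs; pool names will differ):

```bash
# Show the failureDomain set on each CephBlockPool CR.
kubectl -n rook-ceph get cephblockpool \
  -o custom-columns=NAME:.metadata.name,FAILUREDOMAIN:.spec.failureDomain

# For CephFilesystem, the failure domain lives under the metadata/data pool specs.
kubectl -n rook-ceph get cephfilesystem -o yaml | grep -i failureDomain
```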
The issue would be at the data layer (ceph). Rook doesn't detect when nodes go down. And if Ceph is getting stuck, I suspect an issue with the CRUSH map or pool configuration, thus the request for those details. |
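For illustration (these may not be the exact two commands requested above), the CRUSH layout and pool replication settings can be inspected from the toolbox like this:

```bash
# Show how OSDs are laid out across hosts and racks.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree

# Show each pool's size, min_size, and crush_rule.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd pool ls detail

# Dump the CRUSH rules to confirm the failure domain (e.g. rack) per pool.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd crush rule dump
```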
Hi Team,
We are using rook-ceph with Rook operator 1.10.8 and Ceph 17.2.5, deployed on Kubernetes 1.25.9.
We have 10+ nodes, with CephFS and an RBD block pool in a multi-replica setup.
Yesterday we faced a very strange issue.
One of the rook-ceph nodes, which hosted some OSD pods and one mon pod, was rebooted without draining due to an issue.
As a result, the mon pod on that node went down and was rescheduled on another node; the other two mon pods, which were on different nodes, kept running, while the OSD pods from that node went into Pending state.
Ideally, this should only have reduced data redundancy, but our whole Ceph cluster became inaccessible: no reads or writes were happening on rook-ceph PVCs, and there were no Prometheus metrics during that period.
Once the rebooted node came back healthy, the issue was resolved, but this should not have made the whole Ceph cluster inaccessible, as less than 10% of PGs were active+stale, all other PGs were active+clean, and all OSDs except those on that node were up.
During this issue, slow ops started to increase, but at some point they plateaued, neither increasing nor decreasing.
ceph status was showing:
Can you please let us know what might have caused this and how to prevent it?
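For reference, a hedged sketch of first-pass checks for this kind of incident: confirming mon quorum and tracing slow ops back to specific OSDs. It assumes the rook-ceph-tools toolbox and the per-OSD deployments Rook creates; osd.3 below is a hypothetical example.

```bash
# Confirm the remaining monitors still form quorum.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph quorum_status -f json-pretty
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph mon stat

# Health detail names the OSDs reporting slow ops.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detail | grep -i slow

# Inspect in-flight and historic slow ops on a suspect OSD via its admin socket
# (osd.3 is hypothetical; run the command inside that OSD's own deployment).
kubectl -n rook-ceph exec deploy/rook-ceph-osd-3 -- ceph daemon osd.3 dump_ops_in_flight
kubectl -n rook-ceph exec deploy/rook-ceph-osd-3 -- ceph daemon osd.3 dump_historic_slow_ops
```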