One ceph node reboot caused whole rook-ceph cluster inaccessible #13995

Open
akash123-eng opened this issue Mar 29, 2024 · 12 comments

akash123-eng commented Mar 29, 2024

Hi Team,

We are using rook-ceph with Rook operator 1.10.8 and Ceph 17.2.5, deployed on Kubernetes 1.25.9.
We have 10+ nodes, with CephFS and RBD block pools in a multi-replica setup.
Yesterday we faced a very strange issue.

One of the rook-ceph nodes, which hosted OSD pods and one mon pod, was rebooted without draining due to some issue.
The mon pod on that node went down and was rescheduled onto another node, and the other two mon pods on different nodes kept running, but the OSD pods from the rebooted node went into Pending state.
Ideally this should only have reduced data redundancy, but our whole Ceph cluster became inaccessible: no reads or writes were happening on rook-ceph PVCs, and there were no Prometheus metrics during that time.

Once the rebooted node came back healthy the issue was resolved, but this should not have made the whole Ceph cluster inaccessible: less than 10% of the PGs were active+stale, all other PGs were active+clean, and all OSDs except the ones on that node were up.

During the issue, slow ops started to increase, but at some point the count stopped changing, neither increasing nor decreasing.
ceph status was showing:

2 MDSs report slow metadata IOs
2 MDSs report slow requests
Reduced data availability: 44 pgs stale
1817 slow ops, oldest one blocked for 4234 sec

Can you please let us know what might have caused this and how to prevent it?

@subhamkrai (Contributor)

@akash123-eng What was the ceph status during that time, and how many OSDs were running on that node?

Also, Rook 1.10.8 is very old; please try to upgrade.
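
For reference, a quick way to answer both questions (a rough sketch, assuming the default rook-ceph namespace and the standard Rook pod labels):

# Ceph's view: OSDs grouped by rack/host, with their up/down state
ceph osd tree

# Kubernetes' view: which OSD pods were scheduled on the rebooted node
kubectl -n rook-ceph get pods -l app=rook-ceph-osd -o wide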

@akash123-eng (Author)

@subhamkrai ceph status was in a warning state, as mentioned above, with the warnings below. Only 44 PGs were in active+clean+stale state; all other PGs were active+clean.

2 MDSs report slow metadata IOs
2 MDSs report slow requests
Reduced data availability: 44 pgs stale
1817 slow ops, oldest one blocked for 4234 sec

On that node there were some OSDs running, but they accounted for less than 10% of our total OSDs.
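
If it helps narrow things down, the stale PGs can be mapped back to their pools with something like the following (a rough sketch using standard Ceph CLI commands, run from the toolbox):

# Stuck/stale PGs; the number before the dot in each PG id is the pool id
ceph pg dump_stuck stale

# Map pool ids to pool names
ceph osd lspools

# Full health detail, including which daemons the slow ops are blocked on
ceph health detail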

@subhamkrai (Contributor)

@akash123-eng The Rook operator logs may help (see the sketch below for a way to narrow them down), but again, 1.10 is an old version. Also, it is very unlikely to hit this scenario when only one node out of 10+ is down.

In about a week we'll have the 1.14 release, so I'd suggest planning an upgrade to get the new fixes and features.
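
To keep the amount of log data manageable, something like this is usually enough to start with (a rough sketch, assuming the default namespace and deployment names):

# Operator-side errors/warnings around the incident window
kubectl -n rook-ceph logs deploy/rook-ceph-operator --since=24h | grep -iE "error|warn|fail"

# The I/O stall itself is best confirmed from Ceph, e.g. via the toolbox pod
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status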

@akash123-eng (Author)

@subhamkrai We are planning to upgrade to 1.12, but it will take some time.
For now we want to identify the root cause of the issue so that it does not reoccur.
Can you please help with that?
We cannot post all the operator logs here due to some constraints.

What in particular should I search for in the operator logs?

travisn (Member) commented Mar 29, 2024

@akash123-eng Please check:

  • failureDomain: host in all of your pools, filesystems, and object stores. If the failureDomain is instead osd, you could easily hit this issue when multiple OSDs go down, but if the failureDomain is host, only one replica of the data is affected when one node goes down.
  • Replica size is 3 in all pools, filesystems, and object stores?
  • Mons are on separate nodes, right? (A few commands to verify all three points are sketched below.)
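
A rough sketch for checking all three, assuming the default rook-ceph namespace and the standard Rook labels and CR names:

# Rook-side spec of each pool/filesystem: look for failureDomain and replicated.size
kubectl -n rook-ceph get cephblockpools.ceph.rook.io -o yaml
kubectl -n rook-ceph get cephfilesystems.ceph.rook.io -o yaml

# Mon placement: each mon pod should land on a different node
kubectl -n rook-ceph get pods -l app=rook-ceph-mon -o wide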

@akash123-eng (Author)

@travisn

We are using failure domain rack.
Replica size is 2 for the RBD block pool, 3 for CephFS, and there is also one single-replica RBD pool.
We understand that apps using the single-replica pool's storage class would be affected, but in our case the whole Ceph cluster itself was not operational: even PVCs based on the filesystem, where the replica size is 3, had no reads or writes happening until the rebooted node came back up.
Yes, the mons are on separate nodes.

travisn (Member) commented Mar 29, 2024

  • Rack failure domain sounds good.
  • Was the active MDS pod on the node that shut down? The MDS pod should be scheduled to another node automatically, but it might take a few minutes.
  • The RBD replica-2 pool was also blocked? It is expected to still work since min_size would be 1 (see the commands sketched after this list).
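
For the replica-2 pool specifically, size and min_size can be confirmed with (a rough sketch; "replicapool" is a placeholder for your actual RBD pool name):

# Placeholder pool name; substitute the real RBD pool
ceph osd pool get replicapool size
ceph osd pool get replicapool min_size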

akash123-eng (Author) commented Mar 29, 2024

@subhamkrai @travisn
There was no MDS pod running on the node that was rebooted.
Yes, the RBD block pool of size 2 was also affected, as mentioned earlier. Even PVCs based on the CephFilesystem with replica size 3 were affected; no reads or writes were happening on those PVCs.

@akash123-eng (Author)

@travisn With the information provided above, can you please help us find the root cause of this issue and how to avoid it in the future?
What can we search for in the operator logs that would confirm no reads/writes were happening on rook-ceph? There are thousands of log lines from that period, and it is very hard to find the root cause.

travisn (Member) commented Apr 1, 2024

Please share the full output of these two commands, to confirm the cluster can handle the loss of a node:

ceph osd crush dump
ceph osd pool ls detail
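
If the toolbox pod is deployed (assuming the default deployment name), the output can be captured like this:

# Run the requested commands via the Rook toolbox and save the output to attach here
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd crush dump > crush-dump.json
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd pool ls detail > pool-ls-detail.txt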

@akash123-eng (Author)

@travisn OK, I will share it tomorrow. But as mentioned above, we are using the rack failureDomain, and the server that was rebooted without draining is in a rack that also contains another server whose OSDs were running properly during that time. So we suspect that Rook was not able to identify that the server had rebooted and kept sending requests to it, which increased slow ops up to a certain point and then stopped all operations completely.
How can we check this in the logs?

travisn (Member) commented Apr 3, 2024

The issue would be at the data layer (ceph). Rook doesn't detect when nodes go down. And if Ceph is getting stuck, I suspect an issue with the CRUSH map or pool configuration, thus the request for those details.
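
A few quick checks from the toolbox can confirm both points, i.e. whether the OSDs on the rebooted node were actually marked down by the mons, and what failure domain the CRUSH rules really use (a rough sketch; these are all standard Ceph commands):

# Were the OSDs on that node marked down? (Ceph's heartbeats do this, not Rook)
ceph osd tree down

# Timers that control how quickly down OSDs are detected and later marked out
ceph config get osd osd_heartbeat_grace
ceph config get mon mon_osd_down_out_interval

# CRUSH hierarchy and rules: each rack should contain more than one host,
# and the chooseleaf step in the replicated rules should use type "rack" (or "host"), not "osd"
ceph osd crush tree
ceph osd crush rule dump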
