One ceph node reboot caused whole rook-ceph cluster inaccessible #13995

Open
akash123-eng opened this issue Mar 29, 2024 · 12 comments

akash123-eng commented Mar 29, 2024

Hi Team,

We are using rook-ceph with Rook operator 1.10.8 and Ceph 17.2.5, deployed on Kubernetes 1.25.9.
We have 10+ nodes, with CephFS and RBD block pools in a multi-replica setup.
Yesterday we faced a very strange issue.

One of the rook-ceph nodes, which hosted OSD pods and one mon pod, was rebooted without draining due to some issue.
The mon pod on that node went down and was rescheduled onto another node, and the other two mon pods on different nodes kept running, but the OSD pods from the rebooted node went into Pending state.
Ideally this should only have reduced data redundancy, but our whole Ceph cluster became inaccessible: no reads or writes were happening on rook-ceph PVCs, and there were no Prometheus metrics during that time.

Once the rebooted node came back healthy the issue was resolved, but this should not have made the whole Ceph cluster inaccessible: less than 10% of the PGs were active+stale, all other PGs were active+clean, and all OSDs except the ones on that node were up.

During the issue, slow ops started to increase, but at some point the count stopped changing, neither increasing nor decreasing.
ceph status was showing:

2 MDSs report slow metadata IOs
2 MDSs report slow requests
Reduced data availability: 44 pgs stale
1817 slow ops, oldest one blocked for 4234 sec

Can you please let us know what might have caused this and how to prevent it?

@subhamkrai (Contributor)

@akash123-eng What was the ceph status during that time, and how many OSDs were running on that node?

Also, Rook 1.10.8 is very old; please try to upgrade.
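
For reference, a quick way to answer both questions (a rough sketch, assuming the default rook-ceph namespace and the standard Rook pod labels):

# Ceph's view: OSDs grouped by rack/host, with their up/down state
ceph osd tree

# Kubernetes' view: which OSD pods were scheduled on the rebooted node
kubectl -n rook-ceph get pods -l app=rook-ceph-osd -o wide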

@akash123-eng (Author)

@subhamkrai ceph status was in a warning state, as mentioned above, with the warnings below. Only 44 PGs were in active+clean+stale state; all other PGs were active+clean.

2 MDSs report slow metadata IOs
2 MDSs report slow requests
Reduced data availability: 44 pgs stale
1817 slow ops, oldest one blocked for 4234 sec

On that node there were some OSDs running, but they accounted for less than 10% of our total OSDs.
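
If it helps narrow things down, the stale PGs can be mapped back to their pools with something like the following (a rough sketch using standard Ceph CLI commands, run from the toolbox):

# Stuck/stale PGs; the number before the dot in each PG id is the pool id
ceph pg dump_stuck stale

# Map pool ids to pool names
ceph osd lspools

# Full health detail, including which daemons the slow ops are blocked on
ceph health detail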

@subhamkrai (Contributor)

@akash123-eng The Rook operator logs may help (see the sketch below for a way to narrow them down), but again, 1.10 is an old version. Also, it is very unlikely to hit this scenario when only one node out of 10+ is down.

In about a week we'll have the 1.14 release, so I'd suggest planning an upgrade to get the new fixes and features.
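
To keep the amount of log data manageable, something like this is usually enough to start with (a rough sketch, assuming the default namespace and deployment names):

# Operator-side errors/warnings around the incident window
kubectl -n rook-ceph logs deploy/rook-ceph-operator --since=24h | grep -iE "error|warn|fail"

# The I/O stall itself is best confirmed from Ceph, e.g. via the toolbox pod
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status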

@akash123-eng (Author)

@subhamkrai We are planning to upgrade to 1.12, but it will take some time.
For now we want to identify the root cause of the issue so that it does not reoccur.
Can you please help with that?
We cannot post all the operator logs here due to some constraints.

What in particular should I search for in the operator logs?

travisn (Member) commented Mar 29, 2024

@akash123-eng Please check:

  • failureDomain: host in all of your pools, filesystems, and object stores. If the failureDomain is instead osd, you could easily hit this issue when multiple OSDs go down, but if the failureDomain is host, only one replica of the data is affected when one node goes down.
  • Replica size is 3 in all pools, filesystems, and object stores?
  • Mons are on separate nodes, right? (A few commands to verify all three points are sketched below.)
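
A rough sketch for checking all three, assuming the default rook-ceph namespace and the standard Rook labels and CR names:

# Rook-side spec of each pool/filesystem: look for failureDomain and replicated.size
kubectl -n rook-ceph get cephblockpools.ceph.rook.io -o yaml
kubectl -n rook-ceph get cephfilesystems.ceph.rook.io -o yaml

# Mon placement: each mon pod should land on a different node
kubectl -n rook-ceph get pods -l app=rook-ceph-mon -o wide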

@akash123-eng (Author)

@travisn

We are using failure domain rack.
Replica size is 2 for the RBD block pool, 3 for CephFS, and there is also one single-replica RBD pool.
We understand that apps using the single-replica pool's storage class would be affected, but in our case the whole Ceph cluster itself was not operational: even PVCs based on the filesystem, where the replica size is 3, had no reads or writes happening until the rebooted node came back up.
Yes, the mons are on separate nodes.

travisn (Member) commented Mar 29, 2024

  • Rack failure domain sounds good.
  • Was the active MDS pod on the node that shut down? The MDS pod should be scheduled to another node automatically, but it might take a few minutes.
  • The RBD replica-2 pool was also blocked? It is expected to still work since min_size would be 1 (see the commands sketched after this list).
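
For the replica-2 pool specifically, size and min_size can be confirmed with (a rough sketch; "replicapool" is a placeholder for your actual RBD pool name):

# Placeholder pool name; substitute the real RBD pool
ceph osd pool get replicapool size
ceph osd pool get replicapool min_size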

akash123-eng (Author) commented Mar 29, 2024

@subhamkrai @travisn
There was no MDS pod running on the node that was rebooted.
Yes, the RBD block pool of size 2 was also affected, as mentioned earlier. Even PVCs based on the CephFilesystem with replica size 3 were affected; no reads or writes were happening on those PVCs.

@akash123-eng (Author)

@travisn With the information provided above, can you please help us find the root cause of this issue and how to avoid it in the future?
What can we search for in the operator logs that would confirm no reads/writes were happening on rook-ceph? There are thousands of log lines from that period, and it is very hard to find the root cause.

travisn (Member) commented Apr 1, 2024

Please share the full output of these two commands, to confirm the cluster can handle the loss of a node:

ceph osd crush dump
ceph osd pool ls detail
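
If the toolbox pod is deployed (assuming the default deployment name), the output can be captured like this:

# Run the requested commands via the Rook toolbox and save the output to attach here
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd crush dump > crush-dump.json
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd pool ls detail > pool-ls-detail.txt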

@akash123-eng (Author)

@travisn OK, I will share it tomorrow. But as mentioned above, we are using the rack failureDomain, and the server that was rebooted without draining is in a rack that also contains another server whose OSDs were running properly during that time. So we suspect that Rook was not able to identify that the server had rebooted and kept sending requests to it, which increased slow ops up to a certain point and then stopped all operations completely.
How can we check this in the logs?

travisn (Member) commented Apr 3, 2024

The issue would be at the data layer (ceph). Rook doesn't detect when nodes go down. And if Ceph is getting stuck, I suspect an issue with the CRUSH map or pool configuration, thus the request for those details.
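
A few quick checks from the toolbox can confirm both points, i.e. whether the OSDs on the rebooted node were actually marked down by the mons, and what failure domain the CRUSH rules really use (a rough sketch; these are all standard Ceph commands):

# Were the OSDs on that node marked down? (Ceph's heartbeats do this, not Rook)
ceph osd tree down

# Timers that control how quickly down OSDs are detected and later marked out
ceph config get osd osd_heartbeat_grace
ceph config get mon mon_osd_down_out_interval

# CRUSH hierarchy and rules: each rack should contain more than one host,
# and the chooseleaf step in the replicated rules should use type "rack" (or "host"), not "osd"
ceph osd crush tree
ceph osd crush rule dump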
