New BlockPool / SC + Parallel RBD Volume Creation hangs and fails #8696

Closed
DandyDeveloper opened this issue Sep 13, 2021 · 11 comments · Fixed by #8923

@DandyDeveloper commented Sep 13, 2021

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:
After creating a new CephBlockPool and StorageClass, issuing several PVC create requests in parallel against that pool causes the csi-provisioner to hang, and nothing is provisioned.

This is a known upstream Ceph issue with a fix here: ceph/ceph#43113

Expected behavior:
Parallel image creation should succeed: all PVCs should be created and bound without issue.

How to reproduce it (minimal and precise):

  • Create a CephBlockPool + StorageClass against the CSI provisioner.
  • Scale a StatefulSet that relies on volumeClaimTemplates to two or more replicas (a minimal sketch follows this list).
  • Watch your PVCs sit in Pending while the provisioner struggles to process the requests.
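
A minimal reproduction sketch; every name below (pool, StorageClass, StatefulSet, clusterID, secret names) is an illustrative assumption based on a default Rook install, not a value taken from the reporter's cluster:

```sh
# Hypothetical manifests: a new pool, a StorageClass over it, and a
# StatefulSet whose replicas trigger parallel PVC create requests.
kubectl apply -f - <<EOF
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: test-pool
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: test-pool-sc
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: test-pool
  imageFormat: "2"
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  csi.storage.k8s.io/fstype: ext4
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: repro
spec:
  serviceName: repro
  replicas: 3               # several replicas => parallel PVC creation
  selector:
    matchLabels:
      app: repro
  template:
    metadata:
      labels:
        app: repro
    spec:
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.5
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: test-pool-sc
      resources:
        requests:
          storage: 1Gi
EOF
# The data-repro-* PVCs stay Pending while the provisioner is deadlocked.
kubectl get pvc
```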

Workaround
Fortunately, there's a relatively simple workaround to this problem:

  • Exec into the Ceph Toolbox
  • Create a random image in the RBD pool; rbd create <pool>/test --size 1G
  • Restart the provisioners: kubectl delete pod -l app=csi-rbdplugin-provisioner -n rook-ceph
  • PVCs should start being created (the full sequence is sketched below)
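
The same steps as one command sequence; a sketch assuming the default rook-ceph namespace, the standard rook-ceph-tools toolbox deployment, and a pool named replicapool:

```sh
# Create a throwaway image in the affected pool from the toolbox pod,
# then restart the stuck provisioner pods.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- rbd create replicapool/test --size 1G
kubectl -n rook-ceph delete pod -l app=csi-rbdplugin-provisioner
# Pending PVCs should begin binding shortly afterwards.
kubectl get pvc --watch
```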

Discussion about the problem here: https://rook-io.slack.com/archives/CK9CF5H2R/p1631503838341700

Environment
NR

@DandyDeveloper (Author)

Once the upstream Ceph fix is merged, the new CSI build will fix this problem.

@Rakshith-R (Member) commented Sep 13, 2021

Thanks @DandyDeveloper!

Specifically, Ceph-CSI v3.4.0 (built with the Ceph Pacific base image) and Rook v1.7.1 (which ships with Ceph-CSI v3.4.0 by default) are affected by this issue.

Updated workaround (recommended; it will not leave any stale omap entries):

  • Execute the rbd pool init <pool_name> command from the toolbox or the ceph-csi pods (similar to this).

  • Restart the csi-rbdplugin-provisioner-xxx pods.

    kubectl -n rook-ceph delete pods -l app=csi-rbdplugin-provisioner

See the discussion here. [Thanks @Madhu-1]
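
The updated workaround as one sequence; again a sketch assuming the default rook-ceph namespace, the rook-ceph-tools toolbox deployment, and a pool named replicapool:

```sh
# Initialize the new pool for RBD use, then restart the provisioner
# pods to clear the deadlock.
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- rbd pool init replicapool
kubectl -n rook-ceph delete pods -l app=csi-rbdplugin-provisioner
```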


Another workaround, without the need for the Rook toolbox (sketched as commands after this list):

  • Delete the PVCs in progress for the newly created pool.
  • Restart the rbd provisioner pod.
  • Issue just a single PVC create request.
  • Continue creating PVCs once the initial PVC is in Bound state.

The above steps will also resolve the deadlock.
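
Sketched as commands; the PVC names and the single-pvc.yaml manifest are placeholders for whatever claims are pending in your cluster:

```sh
# Delete the PVCs stuck in Pending for the newly created pool.
kubectl delete pvc data-repro-0 data-repro-1 data-repro-2
# Restart the rbd provisioner.
kubectl -n rook-ceph delete pods -l app=csi-rbdplugin-provisioner
# Issue a single PVC create request and watch until its STATUS shows Bound,
kubectl apply -f single-pvc.yaml
kubectl get pvc --watch
# then continue creating the remaining PVCs in parallel.
```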

@idryomov

If you are going to use the toolbox pod, an even simpler workaround is rbd pool init <pool>.

@ricosega commented Sep 20, 2021

If you are going to use the toolbox pod, an even simpler workaround is rbd pool init <pool>.

Thank you, I hit this issue a week ago and didn't know how to solve it.

@Madhu-1 (Member) commented Sep 20, 2021

@travisn we have 3 options here

cc @Rakshith-R

@travisn (Member) commented Sep 20, 2021

This will remain an issue until ceph v16.2.7? Until then, let's pin this github issue so it's more visible. If possible, it would be nice to add this to the csi troubleshooting guide as well.

@travisn travisn pinned this issue Sep 20, 2021
@travisn travisn moved this from Blocking Release to In progress in v1.7 Sep 20, 2021
@Madhu-1 (Member) commented Sep 21, 2021

This will remain an issue until ceph v16.2.7? Until then, let's pin this github issue so it's more visible. If possible, it would be nice to add this to the csi troubleshooting guide as well.

I hope ceph/ceph#43113 will be part of 16.2.7. Sounds good to update the troubleshooting guide. @Rakshith-R can you please add this to the csi troubleshooting guide?

@idryomov

I hope ceph/ceph#43113 will be part of 16.2.7.

Definitely.

Rakshith-R added a commit to Rakshith-R/rook that referenced this issue Sep 21, 2021

This commit adds workaround for Parallel RBD PVC Creation hangs on
new pools in ceph-csi-troubleshooting.md.

Refer: rook#8696

Signed-off-by: Rakshith R <rar@redhat.com>
@travisn travisn moved this from In progress to Blocking Release in v1.7 Oct 5, 2021
Rakshith-R added a commit to Rakshith-R/rook that referenced this issue Oct 6, 2021

This is done in order to prevent deadlock when parallel
PVC create requests are issued on a new uninitialized
rbd block pool due to https://tracker.ceph.com/issues/52537.

Fixes: rook#8696

Signed-off-by: Rakshith R <rar@redhat.com>
v1.7 automation moved this from Blocking Release to Done Oct 6, 2021
mergify bot pushed a commit that referenced this issue Oct 6, 2021
This is done in order to prevent deadlock when parallel
PVC create requests are issued on a new uninitialized
rbd block pool due to https://tracker.ceph.com/issues/52537.

Fixes: #8696

Signed-off-by: Rakshith R <rar@redhat.com>
(cherry picked from commit ab87e1d)
parth-gr pushed a commit to parth-gr/rook that referenced this issue Feb 22, 2022

This is done in order to prevent deadlock when parallel
PVC create requests are issued on a new uninitialized
rbd block pool due to https://tracker.ceph.com/issues/52537.

Fixes: rook#8696

Signed-off-by: Rakshith R <rar@redhat.com>
@travisn travisn unpinned this issue Apr 26, 2022
@voarsh2 commented Jan 4, 2023

Why is this issue closed? I still get this issue.

@travisn (Member) commented Jan 4, 2023

Why is this issue closed? I still get this issue.

With #8923 Rook is initializing the pool, so this same issue is not expected. What version of ceph and rook? Does it happen only at initial creation of the pool, or do you observe the hang later?

@voarsh2 commented Jan 5, 2023

With #8923 Rook is initializing the pool, so this same issue is not expected. What version of ceph and rook? Does it happen only at initial creation of the pool, or do you observe the hang later?

I am on Rook 1.10.8.

  • Perhaps a different issue, but when Googling it sounded similar.
    My pools are not new, but when I create lots of StatefulSets with PVCs, they hang and I have to delete the RBD provisioner pod.
