
failed to provision volume with StorageClass "ceph-block": an operation with the given Volume ID pvc-ID already exists #8749

Closed
cmanzur opened this issue Sep 17, 2021 · 6 comments

@cmanzur

cmanzur commented Sep 17, 2021

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:

PVCs with StorageClass ceph-filesystem work.
PVCs with StorageClass ceph-block get stuck in Pending status.

Example:

kubectl describe pvc -n monitoring alertmanager-monitoring-alertmanager-db-alertmanager-monitoring-alertmanager-0 

Name:          alertmanager-monitoring-alertmanager-db-alertmanager-monitoring-alertmanager-0
Namespace:     monitoring
StorageClass:  ceph-block
Status:        Pending
Volume:        
Labels:        alertmanager=monitoring-alertmanager
               app=alertmanager
Annotations:   volume.beta.kubernetes.io/storage-provisioner: rook-ceph.rbd.csi.ceph.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      
Access Modes:  
VolumeMode:    Filesystem
Used By:       alertmanager-monitoring-alertmanager-0
Events:
  Type     Reason                Age                  From                                                                                                       Message
  ----     ------                ----                 ----                                                                                                       -------
  Warning  ProvisioningFailed    3m23s                rook-ceph.rbd.csi.ceph.com_csi-rbdplugin-provisioner-97779fc75-wtpdx_7bb07313-28c7-4da8-82c6-639e208094a2  failed to provision volume with StorageClass "ceph-block": rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Normal   Provisioning          77s (x9 over 5m53s)  rook-ceph.rbd.csi.ceph.com_csi-rbdplugin-provisioner-97779fc75-wtpdx_7bb07313-28c7-4da8-82c6-639e208094a2  External provisioner is provisioning volume for claim "monitoring/alertmanager-monitoring-alertmanager-db-alertmanager-monitoring-alertmanager-0"
  Warning  ProvisioningFailed    77s (x8 over 3m23s)  rook-ceph.rbd.csi.ceph.com_csi-rbdplugin-provisioner-97779fc75-wtpdx_7bb07313-28c7-4da8-82c6-639e208094a2  failed to provision volume with StorageClass "ceph-block": rpc error: code = Aborted desc = an operation with the given Volume ID pvc-76e57307-83ba-4093-8d16-5349062ae5d9 already exists
  Normal   ExternalProvisioning  7s (x25 over 5m53s)  persistentvolume-controller                                                                                waiting for a volume to be created, either by external provisioner "rook-ceph.rbd.csi.ceph.com" or manually created by system administrator

Expected behavior:

The PVC should be in Bound status.

How to reproduce it (minimal and precise):

1- Deploy a k8s cluster with 5 workers and 3 control-plane nodes. I'm using the default configuration from kubeadm init, nothing special; in fact, it worked before with Ceph v15.2.8. Each worker has 4 disks, and /dev/sdc is a raw device (not formatted, no partitions).
2- Deploy rook-ceph helm chart v1.7.3 with default values
3- Deploy the rook-ceph-cluster helm chart v1.7.3 with the following values changed:

useAllNodes: true
useAllDevices: false
deviceFilter: "sdc"

4- Create a deployment using a PVC with storageClass ceph-block (a minimal example is sketched right after these steps)

On the workers, /dev/sdc (the disk dedicated to the OSD) has its PV, VG, and LV created correctly.
I don't see any errors in the Ceph dashboard; everything is green.
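
For completeness, a minimal example of step 4. The names are illustrative (not the exact manifests from my cluster); any workload that requests a ceph-block PVC ends up in the same Pending state:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-ceph-block          # hypothetical name, only for illustration
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: ceph-block   # the RBD-backed StorageClass from the rook-ceph-cluster chart
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-ceph-block
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-ceph-block
  template:
    metadata:
      labels:
        app: test-ceph-block
    spec:
      containers:
        - name: app
          image: busybox
          command: ["sh", "-c", "sleep 3600"]
          volumeMounts:
            - name: data
              mountPath: /data   # mounts the RBD-backed volume once the PVC binds
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: test-ceph-block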

File(s) to submit:

  • Cluster CR (custom resource), typically called cluster.yaml, if necessary
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  annotations:
    meta.helm.sh/release-name: rook-ceph-cluster
    meta.helm.sh/release-namespace: rook-ceph
  creationTimestamp: "2021-09-17T13:52:16Z"
  finalizers:
  - cephcluster.ceph.rook.io
  generation: 3
  labels:
    app.kubernetes.io/managed-by: Helm
  name: rook-ceph
  namespace: rook-ceph
  resourceVersion: "14683"
  uid: bce4b1f9-d210-405c-864f-cd3119401ade
spec:
  annotations:
    prepareosd:
      sidecar.istio.io/inject: "false"
  cephVersion:
    image: quay.io/ceph/ceph:v16.2.5
  cleanupPolicy:
    sanitizeDisks:
      dataSource: zero
      iteration: 1
      method: quick
  crashCollector: {}
  dashboard:
    enabled: true
  dataDirHostPath: /var/lib/rook
  disruptionManagement:
    machineDisruptionBudgetNamespace: openshift-machine-api
    managePodBudgets: true
    osdMaintenanceTimeout: 30
  external: {}
  healthCheck:
    daemonHealth:
      mon:
        interval: 45s
      osd:
        interval: 1m0s
      status:
        interval: 1m0s
    livenessProbe:
      mgr: {}
      mon: {}
      osd: {}
  labels:
    all:
      app.kubernetes.io/name: rook-ceph
    cleanup:
      app.kubernetes.io/component: cleanup
    mgr:
      app.kubernetes.io/component: mgr
    mon:
      app.kubernetes.io/component: mon
    osd:
      app.kubernetes.io/component: osd
    prepareosd:
      app.kubernetes.io/component: prepareosd
  logCollector: {}
  mgr:
    count: 1
    modules:
    - enabled: true
      name: pg_autoscaler
  mon:
    count: 3
  monitoring:
    rulesNamespace: rook-ceph
  network: {}
  resources:
    osd:
      limits:
        cpu: 1000m
        memory: 2Gi
      requests:
        cpu: 40m
        memory: 1Gi
  security:
    kms: {}
  storage:
    deviceFilter: sdc
    useAllDevices: false
    useAllNodes: true
  waitTimeoutForHealthyOSDInMinutes: 10
status:
  ceph:
    capacity:
      bytesAvailable: 1610514444288
      bytesTotal: 1610591764480
      bytesUsed: 77320192
      lastUpdated: "2021-09-17T14:05:08Z"
    health: HEALTH_OK
    lastChanged: "2021-09-17T14:05:08Z"
    lastChecked: "2021-09-17T14:05:08Z"
    previousHealth: HEALTH_WARN
    versions:
      mds:
        ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable): 2
      mgr:
        ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable): 1
      mon:
        ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable): 3
      osd:
        ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable): 5
      overall:
        ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable): 12
      rgw:
        ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable): 1
  conditions:
  - lastHeartbeatTime: "2021-09-17T14:06:18Z"
    lastTransitionTime: "2021-09-17T13:53:46Z"
    message: Cluster created successfully
    reason: ClusterCreated
    status: "True"
    type: Ready
  - lastHeartbeatTime: "2021-09-17T14:07:00Z"
    lastTransitionTime: "2021-09-17T14:07:00Z"
    message: Configuring Ceph OSDs
    reason: ClusterProgressing
    status: "True"
    type: Progressing
  message: Configuring Ceph OSDs
  phase: Progressing
  state: Creating
  storage:
    deviceClasses:
    - name: ssd
  version:
    image: quay.io/ceph/ceph:v16.2.5
    version: 16.2.5-0
  • Operator's logs, if necessary
  • Crashing pod(s) logs, if necessary


Environment:

  • OS (e.g. from /etc/os-release): Debian 11 bullseye (also tried with Debian 10)
  • Kernel (e.g. uname -a): 5.10.0-8-amd64
  • Cloud provider or hardware configuration: baremetal with 5 workers and 3 CP
  • Rook version (use rook version inside of a Rook Pod): 1.7.3
  • Storage backend version (e.g. for ceph do ceph -v): 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)
  • Kubernetes version (use kubectl version): 1.22.1
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): baremetal
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -o name | egrep rook-ceph-tools ) -- /bin/bash
[root@rook-ceph-tools-647ccbc595-t72lg /]# ceph osd status
ID  HOST                USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE      
 0  dev-k8s-worker-01  20.2M   299G      0        0       1      28   exists,up  
 1  dev-k8s-worker-03  14.3M   299G      0        0       0       0   exists,up  
 2  dev-k8s-worker-02  15.8M   299G      0        0       0      56   exists,up  
 3  dev-k8s-worker-04  16.1M   299G      0        0       2      148   exists,up  
 4  dev-k8s-worker-05  15.2M   299G      0        0       0       36   exists,up 

[root@rook-ceph-tools-647ccbc595-t72lg /]# ceph health
HEALTH_OK

[root@rook-ceph-tools-647ccbc595-t72lg /]# ceph status
  cluster:
    id:     bb07c0cb-c74a-4279-a8fc-a1caa9040b05
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 11m)
    mgr: a(active, since 9m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 5 osds: 5 up (since 6s), 5 in (since 10m)
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   11 pools, 177 pgs
    objects: 249 objects, 12 KiB
    usage:   76 MiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     149 active+clean
             28  active+clean+wait
 
  io:
    client:   1022 B/s rd, 1 op/s rd, 0 op/s wr
cmanzur added the bug label Sep 17, 2021
@BlaineEXE
Member

@Madhu-1 want to take a look?

@huangxuelun

huangxuelun commented Sep 19, 2021

I hit the same problem with Rook 1.7.3 and 1.7.2. I installed it with Helm, changing only one value (to enable the toolbox); the toolbox reports the cluster health as OK.
Running k8s v1.21.1 on CentOS 7.9, with 3 master nodes and 3 worker nodes.
Linux master1 4.19.12-1.el7.elrepo.x86_64 #1 SMP Fri Dec 21 11:06:36 EST 2018 x86_64 x86_64 x86_64 GNU/Linux

[root@master1 ~]# kubectl -n rook-ceph get pod
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-44l6g 3/3 Running 15 6d13h
csi-cephfsplugin-4h6rm 3/3 Running 15 6d13h
csi-cephfsplugin-66zt5 3/3 Running 15 6d13h
csi-cephfsplugin-796h6 3/3 Running 15 6d13h
csi-cephfsplugin-hl4wc 3/3 Running 15 6d13h
csi-cephfsplugin-provisioner-75d5d845f4-9r4gp 6/6 Running 18 2d7h
csi-cephfsplugin-provisioner-75d5d845f4-p6z2d 6/6 Running 30 6d13h
csi-cephfsplugin-xtgzf 3/3 Running 15 6d13h
csi-rbdplugin-66zvf 3/3 Running 15 6d13h
csi-rbdplugin-cv46v 3/3 Running 15 6d13h
csi-rbdplugin-dgvzx 3/3 Running 15 6d13h
csi-rbdplugin-h6l98 3/3 Running 15 6d13h
csi-rbdplugin-provisioner-5845579d68-rxqpw 6/6 Running 32 6d13h
csi-rbdplugin-provisioner-5845579d68-ww6gh 6/6 Running 18 2d7h
csi-rbdplugin-tqsch 3/3 Running 15 6d13h
csi-rbdplugin-xs4wn 3/3 Running 15 6d13h
rook-ceph-crashcollector-192.168.104.120-6f58f55679-7gh6h 1/1 Running 5 6d13h
rook-ceph-crashcollector-192.168.104.122-9f884b655-4z9jf 1/1 Running 5 6d13h
rook-ceph-crashcollector-192.168.104.123-5bd85595bd-n8lwn 1/1 Running 5 6d13h
rook-ceph-crashcollector-192.168.104.124-648886fb65-7s9j6 1/1 Running 5 6d13h
rook-ceph-crashcollector-192.168.104.125-778cc5966b-z5r2m 1/1 Running 5 6d13h
rook-ceph-mds-ceph-filesystem-a-64cc6cd5db-4x974 1/1 Running 5 6d13h
rook-ceph-mds-ceph-filesystem-b-7d8df9864-qqbdb 1/1 Running 5 6d13h
rook-ceph-mgr-a-5778b4b494-h699v 1/1 Running 5 6d13h
rook-ceph-mon-a-6dbf5cdd4f-7txxj 1/1 Running 5 6d13h
rook-ceph-mon-b-f56f99cb-p77ss 1/1 Running 9 6d13h
rook-ceph-mon-c-bdcd4d89d-zn79n 1/1 Running 9 6d13h
rook-ceph-operator-6967668c59-jxch8 1/1 Running 3 2d7h
rook-ceph-osd-0-7bb7bc66f7-m6m7z 1/1 Running 9 6d13h
rook-ceph-osd-1-5cf8655cdf-4dqr5 1/1 Running 5 6d13h
rook-ceph-osd-2-666965b8bf-pzt62 1/1 Running 9 6d13h
rook-ceph-osd-prepare-192.168.104.120-5c9lm 0/1 Completed 0 51m
rook-ceph-osd-prepare-192.168.104.121-64pd8 0/1 Completed 0 51m
rook-ceph-osd-prepare-192.168.104.122-g8zdr 0/1 Completed 0 51m
rook-ceph-osd-prepare-192.168.104.123-z67pd 0/1 Completed 0 51m
rook-ceph-osd-prepare-192.168.104.124-5bnb5 0/1 Completed 0 51m
rook-ceph-osd-prepare-192.168.104.125-9h5xx 0/1 Completed 0 51m
rook-ceph-rgw-ceph-objectstore-a-7b5bc99c88-rmqlr 1/1 Running 9 6d13h
rook-ceph-toolbox-job-dbj5s 0/1 Completed 0 6d12h
rook-ceph-tools-54fc95f4f4-wqghz 1/1 Running 5 6d13h

[root@rook-ceph-tools-54fc95f4f4-wqghz /]# ceph status
  cluster:
    id:     a0472cbe-c704-47ca-bb42-75c52f2f5b88
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 12m)
    mgr: a(active, since 15m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 13m), 3 in (since 6d)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   11 pools, 177 pgs
    objects: 416 objects, 36 KiB
    usage:   112 MiB used, 6.0 TiB / 6.0 TiB avail
    pgs:     177 active+clean

  io:
    client:   3.8 KiB/s rd, 511 B/s wr, 4 op/s rd, 1 op/s wr

[root@rook-ceph-tools-54fc95f4f4-wqghz /]# ceph osd status
ID  HOST             USED   AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE
 0  192.168.104.123  36.7M  2047G      0        0       0        0   exists,up
 1  192.168.104.125  36.0M  2047G      0        0       1        0   exists,up
 2  192.168.104.124  34.4M  2047G      0        0       2      105   exists,up

[root@rook-ceph-tools-54fc95f4f4-wqghz /]# ceph osd stat
3 osds: 3 up (since 76m), 3 in (since 6d); epoch: e501

[root@rook-ceph-tools-54fc95f4f4-wqghz /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME                 STATUS  REWEIGHT  PRI-AFF
-1         6.00000  root default
-5         2.00000      host 192-168-104-123
 0    hdd  2.00000          osd.0                 up   1.00000  1.00000
-7         2.00000      host 192-168-104-124
 2    hdd  2.00000          osd.2                 up   1.00000  1.00000
-3         2.00000      host 192-168-104-125
 1    hdd  2.00000          osd.1                 up   1.00000  1.00000

[root@rook-ceph-tools-54fc95f4f4-wqghz /]# ceph pg stat
177 pgs: 177 active+clean; 36 KiB data, 107 MiB used, 6.0 TiB / 6.0 TiB avail; 2.7 KiB/s rd, 425 B/s wr, 5 op/s


W0919 01:27:00.958652 1 controller.go:958] Retrying syncing claim "aed49dce-632e-4506-a725-320f754175da", failure 13
E0919 01:27:00.958716 1 controller.go:981] error syncing claim "aed49dce-632e-4506-a725-320f754175da": failed to provision volume with StorageClass "ceph-block": rpc error: code = Aborted desc = an operation with the given Volume ID pvc-aed49dce-632e-4506-a725-320f754175da already exists
I0919 01:27:00.958767 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"finder-nacos-0", UID:"aed49dce-632e-4506-a725-320f754175da", APIVersion:"v1", ResourceVersion:"585317", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "ceph-block": rpc error: code = Aborted desc = an operation with the given Volume ID pvc-aed49dce-632e-4506-a725-320f754175da already exists
I0919 01:27:01.296739 1 controller.go:1332] provision "default/my-pv-claim1" class "rook-ceph-block": started
I0919 01:27:01.298449 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"my-pv-claim1", UID:"447862f6-28bf-4ec1-92e8-3bd84c913014", APIVersion:"v1", ResourceVersion:"601523", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/my-pv-claim1"
W0919 01:27:01.344225 1 controller.go:958] Retrying syncing claim "447862f6-28bf-4ec1-92e8-3bd84c913014", failure 13
E0919 01:27:01.344342 1 controller.go:981] error syncing claim "447862f6-28bf-4ec1-92e8-3bd84c913014": failed to provision volume with StorageClass "rook-ceph-block": rpc error: code = Aborted desc = an operation with the given Volume ID pvc-447862f6-28bf-4ec1-92e8-3bd84c913014 already exists
I0919 01:27:01.344394 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"my-pv-claim1", UID:"447862f6-28bf-4ec1-92e8-3bd84c913014", APIVersion:"v1", ResourceVersion:"601523", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "rook-ceph-block": rpc error: code = Aborted desc = an operation with the given Volume ID pvc-447862f6-28bf-4ec1-92e8-3bd84c913014 already exists

@ricosega

The same problem happened to me after moving from static YAMLs with v1.5.7 to Helm with versions 1.7.2 and 1.7.3.
It gets stuck with pods in Pending status forever:

failed to provision volume with StorageClass "ceph-block": rpc error: code = Aborted desc = an operation with the given Volume ID pvc-6372c4d8-b6d0-4e18-a6b6-7917892f08ce already exists

@Madhu-1
Member

Madhu-1 commented Sep 20, 2021

Is this the same as #8696 or #8727?

@cmanzur
Author

cmanzur commented Sep 20, 2021

Is this the same as #8696 or #8727?

Yes! Thanks! It worked after running the following from the toolbox:

rbd pool init <pool>
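
For anyone hitting this later, a sketch of running that from outside the toolbox pod. The pool name ceph-blockpool is the rook-ceph-cluster chart's default block pool and is an assumption here; check yours with ceph osd pool ls first:

# list the Ceph pools to find the RBD pool backing the ceph-block StorageClass
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd pool ls

# initialize that pool for RBD use (adjust the pool name to match your cluster)
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- rbd pool init ceph-blockpool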

@Madhu-1
Member

Madhu-1 commented Sep 20, 2021

Thank you. Closing this as a duplicate; will track it in #8696.

Madhu-1 closed this as completed Sep 20, 2021