
OSD Resize Increases Used Capacity Not Available Capacity #14099

Closed
jameshearttech opened this issue Apr 19, 2024 · 22 comments

@jameshearttech

jameshearttech commented Apr 19, 2024

**Previous bug report for the same issue: #12511. Only this time with a different OS, VMware controller, Kubernetes version, Rook version, and Ceph version. Where is the bug?!**

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:

After resizing the underlying disk at the hypervisor and OS levels, resizing the OSD increases the cluster's total capacity and used capacity.

Expected behavior:

After resizing the underlying disk at the hypervisor and OS levels, resizing the OSD increases the cluster's total capacity and available capacity.

How to reproduce it (minimal and precise):

Build a Kubernetes cluster from virtual machines where Rook consumes virtual disks as OSDs. Resize a virtual disk used as an OSD at the hypervisor level. Resize the disk at the OS level. Resize the disk at the Ceph level (e.g., restart the OSD pod).
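
For reference, a minimal sketch of the OS-level and Ceph-level steps on a generic Linux node with a SCSI virtual disk (the device name /dev/sdb and the deployment name rook-ceph-osd-0 are placeholders for whichever disk and OSD you resized):

# as root on the node: make the guest OS see the new disk size after growing it at the hypervisor (SCSI rescan)
$ echo 1 > /sys/class/block/sdb/device/rescan
# restart the OSD so Rook's expand-bluefs init container can grow the BlueStore device
$ kubectl -n rook-ceph rollout restart deployment rook-ceph-osd-0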

File(s) to submit:

image
image

  • Cluster CR (custom resource), typically called cluster.yaml, if necessary: rook-ceph.yaml.txt

Logs to submit:

Cluster Status to submit:

  cluster:
    id:     e1ebb901-75ad-4b7c-90d9-69edf914c04e
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,d (age 2d)
    mgr: b(active, since 90m), standbys: a
    mds: 1/1 daemons up, 1 hot standby
    osd: 8 osds: 8 up (since 91m), 8 in (since 2h)
    rgw: 2 daemons active (2 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 265 pgs
    objects: 493.10k objects, 748 GiB
    usage:   1.3 TiB used, 1.5 TiB / 2.8 TiB avail
    pgs:     265 active+clean
 
  io:
    client:   103 KiB/s rd, 1.4 MiB/s wr, 22 op/s rd, 14 op/s wr

Environment:

  • OS (e.g. from /etc/os-release): Talos (v1.6.4)
  • Kernel (e.g. uname -a): 6.1.74-talos
  • Cloud provider or hardware configuration: VMware
  • Rook version (use rook version inside of a Rook Pod): v1.13.3
  • Storage backend version (e.g. for ceph do ceph -v): v18.2.2
  • Kubernetes version (use kubectl version): v1.29.1
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): Talos
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox): HEALTH_OK
@parth-gr
Member

@jameshearttech can you share the ceph osd df tree outputs from before and after the resize?

@jameshearttech
Author

For context, there is this Slack thread. I see ceph osd df output there from when there were 7 K8s nodes and OSDs (i.e., 1 OSD per node). Is that output good enough? If not, I'll have to attempt another resize and capture the output at that time. Here is the current output from ceph osd df tree.

$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/tool/ {print $1}') -n rook-ceph -- ceph osd df tree
ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME            
 -1         2.73438         -  2.8 TiB  1.1 TiB  1.0 TiB  493 MiB   16 GiB  1.7 TiB  40.42  1.00    -          root default         
 -5         0.34180         -  350 GiB  121 GiB  119 GiB   31 MiB  1.8 GiB  229 GiB  34.49  0.85    -              host mgmt-worker0
  0    ssd  0.34180   1.00000  350 GiB  121 GiB  119 GiB   31 MiB  1.8 GiB  229 GiB  34.49  0.85   89      up          osd.0        
 -9         0.34180         -  350 GiB  141 GiB  139 GiB   37 MiB  1.7 GiB  209 GiB  40.23  1.00    -              host mgmt-worker1
  2    ssd  0.34180   1.00000  350 GiB  141 GiB  139 GiB   37 MiB  1.7 GiB  209 GiB  40.23  1.00  109      up          osd.2        
 -3         0.34180         -  350 GiB  125 GiB  123 GiB   68 MiB  2.1 GiB  225 GiB  35.76  0.88    -              host mgmt-worker2
  1    ssd  0.34180   1.00000  350 GiB  125 GiB  123 GiB   68 MiB  2.1 GiB  225 GiB  35.76  0.88   92      up          osd.1        
 -7         0.34180         -  350 GiB  134 GiB  133 GiB   80 MiB  1.7 GiB  216 GiB  38.43  0.95    -              host mgmt-worker3
  3    ssd  0.34180   1.00000  350 GiB  134 GiB  133 GiB   80 MiB  1.7 GiB  216 GiB  38.43  0.95  102      up          osd.3        
-11         0.34180         -  350 GiB  132 GiB  129 GiB   72 MiB  2.3 GiB  218 GiB  37.68  0.93    -              host mgmt-worker4
  4    ssd  0.34180   1.00000  350 GiB  132 GiB  129 GiB   72 MiB  2.3 GiB  218 GiB  37.68  0.93  103      up          osd.4        
-13         0.34180         -  350 GiB  147 GiB  145 GiB   63 MiB  1.8 GiB  203 GiB  41.97  1.04    -              host mgmt-worker5
  5    ssd  0.34180   1.00000  350 GiB  147 GiB  145 GiB   63 MiB  1.8 GiB  203 GiB  41.97  1.04  100      up          osd.5        
-15         0.34180         -  350 GiB  137 GiB  134 GiB   64 MiB  2.3 GiB  213 GiB  39.11  0.97    -              host mgmt-worker6
  6    ssd  0.34180   1.00000  350 GiB  137 GiB  134 GiB   64 MiB  2.3 GiB  213 GiB  39.11  0.97  100      up          osd.6        
-17         0.34180         -  450 GiB  235 GiB  133 GiB   78 MiB  2.4 GiB  215 GiB  52.29  1.29    -              host mgmt-worker7
  7    ssd  0.34180   1.00000  450 GiB  235 GiB  133 GiB   78 MiB  2.4 GiB  215 GiB  52.29  1.29  100      up          osd.7        
-19               0         -      0 B      0 B      0 B      0 B      0 B      0 B      0     0    -              host mgmt-worker8
                        TOTAL  2.8 TiB  1.1 TiB  1.0 TiB  493 MiB   16 GiB  1.7 TiB  40.42

@parth-gr
Member

parth-gr commented Apr 22, 2024

The Ceph-side values are correct:
TOTAL 2.8 TiB 1.1 TiB 1.0 TiB 493 MiB 16 GiB 1.7 TiB 40.42

So it is probably the dashboard that has the error.
@rkachach wanna take a look?

@jameshearttech
Author

jameshearttech commented Apr 22, 2024

I'm not following how you concluded that the Ceph side values are correct from:
TOTAL 2.8 TiB 1.1 TiB 1.0 TiB 493 MiB 16 GiB 1.7 TiB 40.42

The queries in the dashboard are pretty straightforward:
Query A: ceph_cluster_total_bytes{cluster="$cluster"}-ceph_cluster_total_used_bytes{cluster="$cluster"}
Query B: ceph_cluster_total_used_bytes{cluster="$cluster"}
Query C: ceph_cluster_total_bytes{}

The dashboard uses these queries to visualize available, used, and total capacity. How do we conclude that the dashboard is showing incorrect used/available capacity, but the Ceph CLI is showing correct used/available capacity? Is it an issue with the Prometheus metrics from Ceph?

EDIT: I fixed query C: ceph_cluster_total_bytes{cluster="$cluster"}
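
As a cross-check, the raw metrics can be scraped directly from the Ceph mgr Prometheus module and compared with the CLI. This is only a sketch, assuming the default rook-ceph-mgr metrics Service on port 9283 and that curl exists in the toolbox image:

$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/tool/ {print $1}') -n rook-ceph -- curl -s http://rook-ceph-mgr:9283/metrics | grep '^ceph_cluster_total'

If ceph_cluster_total_used_bytes already includes the jump here, the dashboard is simply rendering what the exporter exposes.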

@parth-gr
Member

parth-gr commented Apr 22, 2024

From your screenshots:

                                 Available   Used   Total
Previous values (dashboard):     1.51        1.23   2.73
New values (dashboard):          1.51        1.33   2.83
New values (ceph osd df tree):   1.7         1.1    2.8

Which says the Ceph-side values increased correctly.

@jameshearttech
Author

I don't understand. My whole point was that used space increased rather than available space. You seem to be interpreting that differently?

@parth-gr
Member

But if you look at the Ceph output, the values are correct.

From the Ceph side (osd df tree, new values): 1.7 available, 1.1 used, 2.8 total.

@jameshearttech
Author

jameshearttech commented Apr 23, 2024

At this point I have replaced all the OSDs with slightly larger OSDs. I'm trying to get back down to 4 K8s nodes. I wanted to take another shot at resizing an OSD.

Here is the current screenshot from Grafana.

image

Here is the current output of ceph osd df tree.

$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/tool/ {print $1}') -n rook-ceph -- sh -c "ceph osd df tree"
ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME            
 -1         3.51599         -  3.5 TiB  1.1 TiB  1.1 TiB  418 MiB   13 GiB  2.4 TiB  31.01  1.00    -          root default         
 -5         0.43950         -  450 GiB  125 GiB  124 GiB      0 B  1.5 GiB  325 GiB  27.81  0.90    -              host mgmt-worker0
  0    ssd  0.43950   1.00000  450 GiB  125 GiB  124 GiB      0 B  1.5 GiB  325 GiB  27.81  0.90   89      up          osd.0        
 -9         0.43950         -  450 GiB  145 GiB  144 GiB   29 MiB  1.2 GiB  305 GiB  32.19  1.04    -              host mgmt-worker1
  2    ssd  0.43950   1.00000  450 GiB  145 GiB  144 GiB   29 MiB  1.2 GiB  305 GiB  32.19  1.04  108      up          osd.2        
 -3         0.43950         -  450 GiB  130 GiB  129 GiB   60 MiB  1.5 GiB  320 GiB  28.97  0.93    -              host mgmt-worker2
  1    ssd  0.43950   1.00000  450 GiB  130 GiB  129 GiB   60 MiB  1.5 GiB  320 GiB  28.97  0.93   92      up          osd.1        
 -7         0.43950         -  450 GiB  140 GiB  138 GiB   75 MiB  1.1 GiB  310 GiB  31.02  1.00    -              host mgmt-worker3
  3    ssd  0.43950   1.00000  450 GiB  140 GiB  138 GiB   75 MiB  1.1 GiB  310 GiB  31.02  1.00  102      up          osd.3        
-11         0.43950         -  450 GiB  146 GiB  144 GiB   66 MiB  2.0 GiB  304 GiB  32.44  1.05    -              host mgmt-worker4
  4    ssd  0.43950   1.00000  450 GiB  146 GiB  144 GiB   66 MiB  2.0 GiB  304 GiB  32.44  1.05  107      up          osd.4        
-13         0.43950         -  450 GiB  151 GiB  149 GiB   56 MiB  1.6 GiB  299 GiB  33.58  1.08    -              host mgmt-worker5
  5    ssd  0.43950   1.00000  450 GiB  151 GiB  149 GiB   56 MiB  1.6 GiB  299 GiB  33.58  1.08   99      up          osd.5        
-15         0.43950         -  450 GiB  134 GiB  132 GiB   64 MiB  1.9 GiB  316 GiB  29.69  0.96    -              host mgmt-worker6
  6    ssd  0.43950   1.00000  450 GiB  134 GiB  132 GiB   64 MiB  1.9 GiB  316 GiB  29.69  0.96   96      up          osd.6        
-17         0.43950         -  450 GiB  146 GiB  144 GiB   67 MiB  2.1 GiB  304 GiB  32.43  1.05    -              host mgmt-worker7
  7    ssd  0.43950   1.00000  450 GiB  146 GiB  144 GiB   67 MiB  2.1 GiB  304 GiB  32.43  1.05  102      up          osd.7        
                        TOTAL  3.5 TiB  1.1 TiB  1.1 TiB  418 MiB   13 GiB  2.4 TiB  31.01                                          
MIN/MAX VAR: 0.90/1.08  STDDEV: 1.88

I drained and then shut down mgmt-worker0. I resized the virtual disk for /dev/sdb, which Rook consumes as osd.0, from 450 GB to 850 GB. I started mgmt-worker0 and then uncordoned the node. I waited for Ceph to rebalance and then checked the result.
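
Roughly, the node-side sequence was the following (a sketch; the drain flags depend on what else is running on the node):

$ kubectl drain mgmt-worker0 --ignore-daemonsets --delete-emptydir-data
# grow the virtual disk at the hypervisor, then power the node back on
$ kubectl uncordon mgmt-worker0
# wait until all PGs are active+clean again
$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/tool/ {print $1}') -n rook-ceph -- ceph -s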

Here is the post osd.0 resize screenshot from Grafana.

image

Here is the post osd.0 resize output of ceph osd df tree.

$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/tool/ {print $1}') -n rook-ceph -- sh -c "ceph osd df tree"
ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME            
 -1         3.51599         -  3.9 TiB  1.5 TiB  1.1 TiB  440 MiB   12 GiB  2.4 TiB  37.90  1.00    -          root default         
 -5         0.43950         -  850 GiB  546 GiB  146 GiB   22 MiB  723 MiB  304 GiB  64.27  1.70    -              host mgmt-worker0
  0    ssd  0.43950   1.00000  850 GiB  546 GiB  146 GiB   22 MiB  723 MiB  304 GiB  64.27  1.70   99      up          osd.0        
 -9         0.43950         -  450 GiB  141 GiB  139 GiB   29 MiB  1.2 GiB  309 GiB  31.24  0.82    -              host mgmt-worker1
  2    ssd  0.43950   1.00000  450 GiB  141 GiB  139 GiB   29 MiB  1.2 GiB  309 GiB  31.24  0.82  106      up          osd.2        
 -3         0.43950         -  450 GiB  130 GiB  129 GiB   60 MiB  1.5 GiB  320 GiB  28.97  0.76    -              host mgmt-worker2
  1    ssd  0.43950   1.00000  450 GiB  130 GiB  129 GiB   60 MiB  1.5 GiB  320 GiB  28.97  0.76   92      up          osd.1        
 -7         0.43950         -  450 GiB  140 GiB  138 GiB   75 MiB  1.1 GiB  310 GiB  31.03  0.82    -              host mgmt-worker3
  3    ssd  0.43950   1.00000  450 GiB  140 GiB  138 GiB   75 MiB  1.1 GiB  310 GiB  31.03  0.82  102      up          osd.3        
-11         0.43950         -  450 GiB  146 GiB  144 GiB   66 MiB  2.0 GiB  304 GiB  32.44  0.86    -              host mgmt-worker4
  4    ssd  0.43950   1.00000  450 GiB  146 GiB  144 GiB   66 MiB  2.0 GiB  304 GiB  32.44  0.86  107      up          osd.4        
-13         0.43950         -  450 GiB  140 GiB  138 GiB   56 MiB  1.6 GiB  310 GiB  31.13  0.82    -              host mgmt-worker5
  5    ssd  0.43950   1.00000  450 GiB  140 GiB  138 GiB   56 MiB  1.6 GiB  310 GiB  31.13  0.82   94      up          osd.5        
-15         0.43950         -  450 GiB  134 GiB  132 GiB   64 MiB  1.9 GiB  316 GiB  29.69  0.78    -              host mgmt-worker6
  6    ssd  0.43950   1.00000  450 GiB  134 GiB  132 GiB   64 MiB  1.9 GiB  316 GiB  29.69  0.78   96      up          osd.6        
-17         0.43950         -  450 GiB  139 GiB  137 GiB   67 MiB  2.1 GiB  311 GiB  30.97  0.82    -              host mgmt-worker7
  7    ssd  0.43950   1.00000  450 GiB  139 GiB  137 GiB   67 MiB  2.1 GiB  311 GiB  30.97  0.82   99      up          osd.7        
                        TOTAL  3.9 TiB  1.5 TiB  1.1 TiB  440 MiB   12 GiB  2.4 TiB  37.90                                          
MIN/MAX VAR: 0.76/1.70  STDDEV: 11.50

Here is a post osd.0 resize screenshot from Ceph dashboard.

image

In all 3 cases the used capacity appears to have increased by 400 GB, which is how much I increased the size of the virtual disk underlying osd.0. Looking a bit closer, I see that DATA remained at 1.1 TiB while RAW USE increased to 1.5 TiB. Does this get us closer to an answer?

Looking back through my previous issue and at some other similar issues, the expand-bluefs container is supposed to solve this problem of used vs. available capacity; however, in my case it does not seem to work as expected. Why? I can see from the logs that the expand-bluefs container is running.

Explore-logs-2024-04-23 10_42_46.txt
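
For anyone following along, the init container output can also be pulled directly. A sketch, assuming the OSD pod naming used elsewhere in this thread:

$ kubectl -n rook-ceph logs $(kubectl get pod -n rook-ceph | awk '/osd-0/ {print $1}') -c expand-bluefs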

@jameshearttech
Author

jameshearttech commented Apr 23, 2024

Looking at the work I have been doing over the course of the day, you can see that when I replace an OSD the total and available capacity move together, but when I resize one they diverge. Or maybe I should say the use and raw use move together?

image

@travisn
Member

travisn commented Apr 23, 2024

I spun up an AWS test cluster and confirmed that I see the same behavior...

The initial cluster has three OSDs:

sh-4.4$ ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA     OMAP  META    AVAIL   %USE  VAR   PGS  STATUS
 1    ssd  0.00980   1.00000  10 GiB   27 MiB  724 KiB   0 B  26 MiB  10 GiB  0.26  1.00    1      up
 2    ssd  0.00980   1.00000  10 GiB   27 MiB  724 KiB   0 B  26 MiB  10 GiB  0.26  1.00    1      up
 0    ssd  0.00980   1.00000  10 GiB   27 MiB  720 KiB   0 B  26 MiB  10 GiB  0.26  1.00    1      up
                       TOTAL  30 GiB   81 MiB  2.1 MiB   0 B  79 MiB  30 GiB  0.26                  

After resizing, the OSD config is:

sh-4.4$ ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA     OMAP     META    AVAIL   %USE   VAR   PGS  STATUS
 1    ssd  0.00980   1.00000  15 GiB  5.0 GiB  976 KiB    1 KiB  27 MiB  10 GiB  33.51  1.00    1      up
 2    ssd  0.00980   1.00000  15 GiB  5.0 GiB  976 KiB    1 KiB  27 MiB  10 GiB  33.51  1.00    1      up
 0    ssd  0.00980   1.00000  15 GiB  5.0 GiB  972 KiB    1 KiB  27 MiB  10 GiB  33.51  1.00    1      up
                       TOTAL  45 GiB   15 GiB  2.9 MiB  3.5 KiB  80 MiB  30 GiB  33.51                   

This would be a Ceph issue. Rook doesn't have any influence on the sizes that the OSDs report, and I don't see how the OSD resize could produce that incorrect raw size. Would you mind opening a Ceph tracker for this?

@rkachach
Contributor

rkachach commented Apr 24, 2024

@nizamial09 is this a known issue?

@jameshearttech
Author

jameshearttech commented Apr 26, 2024

I got a response on Ceph issue 65659 stating it is probably the same issue as Ceph issue 63858. The suggested workaround is in Ceph issue 63858, note 7.

I'm not sure how to apply this workaround in Rook. I drained the node, marked the OSD out, rebooted the node, marked the OSD in, uncordoned the node, and waited for the rebalance to complete. The used space did not change (i.e., go down by 400 GB). @travisn I'm happy to test this and confirm the workaround, but I'm not sure how. Any ideas?
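
For completeness, marking the OSD out and back in was done from the toolbox with the standard commands, roughly (osd.0 shown):

$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/tool/ {print $1}') -n rook-ceph -- ceph osd out 0
# ...reboot the node...
$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/tool/ {print $1}') -n rook-ceph -- ceph osd in 0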

@jameshearttech
Author

[From Igor Fedotov](https://tracker.ceph.com/issues/65659#note-11)

Generally, what you need is to shut down the OSD process in a non-graceful manner and let it rebuild the allocmap during the following restart. It has nothing to do with OSD draining or a node restart (unless you power it off, which I'd prefer not to do).

In a bare metal setup this implies running kill -9 against the ceph-osd process. You need to achieve the same in a Rook environment. Sorry, I'm not an expert in it, hence I'm unable to provide a more detailed guideline...

@jameshearttech
Author

jameshearttech commented Apr 27, 2024

Following Igor's suggestion worked as expected.

$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/osd-3/ {print $1}') -n rook-ceph -c osd -it -- sh
sh-4.4# ps aux
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
65535         1  0.0  0.0    996     4 ?        Ss   Apr26   0:00 /pause
ceph        465  1.2  6.6 2665832 1644016 ?     Ssl  Apr26   6:53 ceph-osd --foreground --id 3 --fsid e1ebb901-75ad-4b7c-90d9-69edf914c04e --setuser ceph --setgroup ceph --crush-location=root=default host=mgmt-worker3 --default-log-to-stderr=true --default-err-
root        471  0.0  0.0  14096  2984 pts/0    Ss   Apr26   0:00 /bin/bash -x -e -m -c  CEPH_CLIENT_ID=ceph-osd.3 PERIODICITY=daily LOG_ROTATE_CEPH_FILE=/etc/logrotate.d/ceph LOG_MAX_SIZE=500M ROTATE=7  # edit the logrotate file to only rotate a specific daemo
root      29350  0.0  0.0  23144  1524 pts/0    S+   01:36   0:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 15m
root      29695  0.0  0.0  14228  3328 pts/0    Ss   01:42   0:00 sh
root      29757  0.0  0.0  49828  3708 pts/0    R+   01:43   0:00 ps aux
sh-4.4# kill -9 465
sh-4.4# command terminated with exit code 137
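
The same non-graceful kill can probably be done in a single command, assuming pkill is available in the OSD container (ps from procps-ng clearly is):

$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/osd-3/ {print $1}') -n rook-ceph -c osd -- pkill -9 ceph-osd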

RAW USE is now the same size as DATA for OSD3.

$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/tool/ {print $1}') -n rook-ceph -- ceph osd df tree
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME            
-1         2.92978         -  3.3 TiB  1.0 TiB  1.0 TiB  391 MiB  7.6 GiB  2.3 TiB  31.30  1.00    -          root default         
-5         0.83009         -  850 GiB  290 GiB  289 GiB   63 MiB  1.9 GiB  560 GiB  34.17  1.09    -              host mgmt-worker0
 0    ssd  0.83009   1.00000  850 GiB  290 GiB  289 GiB   63 MiB  1.9 GiB  560 GiB  34.17  1.09  128      up          osd.0        
-9         0.83009         -  850 GiB  289 GiB  286 GiB  132 MiB  2.7 GiB  561 GiB  34.03  1.09    -              host mgmt-worker1
 1    ssd  0.83009   1.00000  850 GiB  289 GiB  286 GiB  132 MiB  2.7 GiB  561 GiB  34.03  1.09  146      up          osd.1        
-3         0.83009         -  850 GiB  294 GiB  292 GiB   83 MiB  2.2 GiB  556 GiB  34.59  1.11    -              host mgmt-worker2
 2    ssd  0.83009   1.00000  850 GiB  294 GiB  292 GiB   83 MiB  2.2 GiB  556 GiB  34.59  1.11  136      up          osd.2        
-7         0.43950         -  850 GiB  190 GiB  189 GiB  113 MiB  880 MiB  660 GiB  22.41  0.72    -              host mgmt-worker3
 3    ssd  0.43950   1.00000  850 GiB  190 GiB  189 GiB  113 MiB  880 MiB  660 GiB  22.41  0.72   97      up          osd.3        
                       TOTAL  3.3 TiB  1.0 TiB  1.0 TiB  391 MiB  7.6 GiB  2.3 TiB  31.30                                          
MIN/MAX VAR: 0.72/1.11  STDDEV: 5.14

The numbers still look off to me. Why is osd.3 smaller, and why does it have fewer PGs than the other three?

@travisn
Member

travisn commented Apr 29, 2024

From the linked Ceph tracker, since ceph/ceph#55777 was merged to reef, this is expected to be fixed in v18.2.3.

Good to see that the workaround with kill -9 in the pod fixed it.

The weight of osd.3 looks about half that of the other OSDs, which would explain why the PGs are not balanced. But since the raw size is the same for all the OSDs, I'm not sure why the weight would be smaller. It seems the weight is the same as before the resize? Try ceph osd crush reweight to adjust it.

@jameshearttech
Author

jameshearttech commented Apr 29, 2024

It seems the weight is the same as before the resize?

Yeah, that seems like a reasonable guess.

$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/tool/ {print $1}') -n rook-ceph -- ceph osd crush reweight
Invalid command: missing required parameter name(<string(goodchars [A-Za-z0-9-_.])>)
osd crush reweight <name> <weight:float> :  change <name>'s weight to <weight> in crush map
Error EINVAL: invalid command
command terminated with exit code 22

Not sure how this is supposed to work. Do I manually specify the new weight? For example, if I want it to be the same as the other OSDs, do I specify 0.83009? However, if I do that, the OSDs do not sum to 2.92978. Should I reweight all the OSDs to 2.92978/4 = 0.732445? Why is this not done automatically, as it is when a new OSD is created?

@jameshearttech
Author

$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/tool/ {print $1}') -n rook-ceph -- ceph osd crush reweight osd.3 0.83009
reweighted item id 3 name 'osd.3' to 0.83009 in crush map
$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/tool/ {print $1}') -n rook-ceph -- ceph osd df tree
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME            
-1         3.32036         -  3.3 TiB  1.1 TiB  1.1 TiB  411 MiB   10 GiB  2.3 TiB  32.17  1.00    -          root default         
-5         0.83009         -  850 GiB  299 GiB  296 GiB   70 MiB  2.3 GiB  551 GiB  35.12  1.09    -              host mgmt-worker0
 0    ssd  0.83009   1.00000  850 GiB  299 GiB  296 GiB   70 MiB  2.3 GiB  551 GiB  35.12  1.09  127      up          osd.0        
-9         0.83009         -  850 GiB  297 GiB  294 GiB  130 MiB  3.0 GiB  553 GiB  34.95  1.09    -              host mgmt-worker1
 1    ssd  0.83009   1.00000  850 GiB  297 GiB  294 GiB  130 MiB  3.0 GiB  553 GiB  34.95  1.09  144      up          osd.1        
-3         0.83009         -  850 GiB  302 GiB  299 GiB   88 MiB  2.6 GiB  548 GiB  35.52  1.10    -              host mgmt-worker2
 2    ssd  0.83009   1.00000  850 GiB  302 GiB  299 GiB   88 MiB  2.6 GiB  548 GiB  35.52  1.10  133      up          osd.2        
-7         0.83008         -  850 GiB  196 GiB  194 GiB  123 MiB  2.1 GiB  654 GiB  23.10  0.72    -              host mgmt-worker3
 3    ssd  0.83008   1.00000  850 GiB  196 GiB  194 GiB  123 MiB  2.1 GiB  654 GiB  23.10  0.72  103      up          osd.3        
                       TOTAL  3.3 TiB  1.1 TiB  1.1 TiB  411 MiB   10 GiB  2.3 TiB  32.17

@jameshearttech
Author

jameshearttech commented Apr 29, 2024

I just went for it. Looks like it worked? Is it a coincidence that the SIZE and WEIGHT are approximately the same? I noticed that osd.3 is ever so slightly smaller than the others at a weight of 0.83008 vs 0.83009. 3.3/4 = 0.825, so maybe I should set them all to 0.82500?

$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/tool/ {print $1}') -n rook-ceph -- ceph osd crush reweight osd.3 0.83009
reweighted item id 3 name 'osd.3' to 0.83009 in crush map
$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/tool/ {print $1}') -n rook-ceph -- ceph osd df tree
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME            
-1         3.32036         -  3.3 TiB  1.1 TiB  1.1 TiB  415 MiB  9.0 GiB  2.3 TiB  32.16  1.00    -          root default         
-5         0.83009         -  850 GiB  276 GiB  274 GiB   70 MiB  2.6 GiB  574 GiB  32.51  1.01    -              host mgmt-worker0
 0    ssd  0.83009   1.00000  850 GiB  276 GiB  274 GiB   70 MiB  2.6 GiB  574 GiB  32.51  1.01  122      up          osd.0        
-9         0.83009         -  850 GiB  279 GiB  277 GiB  130 MiB  1.9 GiB  571 GiB  32.86  1.02    -              host mgmt-worker1
 1    ssd  0.83009   1.00000  850 GiB  279 GiB  277 GiB  130 MiB  1.9 GiB  571 GiB  32.86  1.02  134      up          osd.1        
-3         0.83009         -  850 GiB  282 GiB  279 GiB   88 MiB  2.9 GiB  568 GiB  33.19  1.03    -              host mgmt-worker2
 2    ssd  0.83009   1.00000  850 GiB  282 GiB  279 GiB   88 MiB  2.9 GiB  568 GiB  33.19  1.03  125      up          osd.2        
-7         0.83008         -  850 GiB  256 GiB  254 GiB  127 MiB  1.7 GiB  594 GiB  30.08  0.94    -              host mgmt-worker3
 3    ssd  0.83008   1.00000  850 GiB  256 GiB  254 GiB  127 MiB  1.7 GiB  594 GiB  30.08  0.94  126      up          osd.3        
                       TOTAL  3.3 TiB  1.1 TiB  1.1 TiB  415 MiB  9.0 GiB  2.3 TiB  32.16                                          
MIN/MAX VAR: 0.94/1.03  STDDEV: 1.22

@travisn
Member

travisn commented Apr 29, 2024

By default, the weight of the OSD is based on its size, so this sounds expected. If it's off by such a small amount, it shouldn't impact PG placement enough to worry about. PGs will never be perfectly distributed across OSDs anyway, since placement is based on hashing.
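
To put numbers on that: by default the CRUSH weight is simply the device capacity expressed in TiB, which is why SIZE and WEIGHT track each other here, and the 0.83008 vs 0.83009 difference is just rounding in how the weight is stored and displayed.

850 GiB / 1024 = 0.830078125 TiB  ->  shown as ~0.83008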

@jameshearttech
Author

@travisn really appreciate your help. I'm closing this one out.
