Bug #65659 (open): OSD Resize Increases Used Capacity Not Available Capacity

Added by James Ringer 10 days ago. Updated 5 days ago.

Status:
Triaged
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
04/24/2024
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Deviation from expected behavior

After resizing the underlying disk at the hypervisor and OS levels, resizing the OSD increases the cluster's total capacity and used capacity.

Expected behavior

After resizing the underlying disk at the hypervisor and OS levels, resizing the OSD increases the cluster's total capacity and available capacity.

How to reproduce it (minimal and precise)

  1. Build a Kubernetes cluster from virtual machines where Rook consumes virtual disks as OSDs.
  2. Resize a virtual disk used as an OSD at the hypervisor level.
  3. Resize the disk at the OS level.
  4. Resize the disk at the Ceph level (e.g., restart the OSD pod).

https://github.com/rook/rook/issues/14099
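For reference, outside of Rook the OS- and Ceph-level parts of these steps might look roughly like the following (a sketch, assuming a SCSI-attached virtual disk /dev/sdb given whole to a non-containerized OSD with id 0; the device name and OSD id are hypothetical):

$ echo 1 | sudo tee /sys/class/block/sdb/device/rescan        # have the kernel re-read the new disk size
$ sudo systemctl stop ceph-osd@0                              # the OSD must be stopped before expanding
$ sudo ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-0
$ sudo systemctl start ceph-osd@0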

Actions #1

Updated by Igor Fedotov 9 days ago

Hi James!
I presume you haven't run ceph-bluestore-tool's bluefs-bdev-expand command against the expanded OSD, have you?

Without doing that, the OSD considers the expanded space as OCCUPIED, hence there is no increase in Available capacity.
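For reference, one way to check whether BlueStore has picked up the new size after running the tool is to inspect the device label (a sketch, assuming OSD id 0 at the default data path; the size field in the label should match the expanded device):

$ ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-0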

Actions #2

Updated by James Ringer 9 days ago

Igor Fedotov wrote in #note-1:

Hi James!
I presume you haven't run ceph-bluestore-tool's bluefs-bdev-expand command against the expanded OSD, have you?

Without doing that, the OSD considers the expanded space as OCCUPIED, hence there is no increase in Available capacity.

Hi Igor,

Rook runs an init container for OSD pods that automatically expands the OSD (e.g., ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-foo). This is the log:

inferring bluefs devices from bluestore path
1 : device size 0xd480000000 : using 0x834b868000(525 GiB)
Expanding DB/WAL...
1 : expanding from 0x7080000000 to 0xd480000000
2024-04-23T19:58:15.762+0000 7f9e5aaf6980 -1 bluestore(/var/lib/ceph/osd/ceph-0) _read_bdev_label failed to read from /var/lib/ceph/osd/ceph-0: (21) Is a directory
2024-04-23T19:58:15.762+0000 7f9e5aaf6980 -1 bluestore(/var/lib/ceph/osd/ceph-0) unable to read label for /var/lib/ceph/osd/ceph-0: (21) Is a directory

Thanks,
James

Actions #3

Updated by Igor Fedotov 9 days ago

James,
would you please share the output of 'ceph tell osd.N perf dump bluefs' after such an expansion then?
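For reference, a one-liner to pull just the BlueFS totals out of that dump (a sketch, assuming the rook-ceph toolbox pod naming used elsewhere in this thread and OSD id 3):

$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/tool/ {print $1}') -n rook-ceph -- ceph tell osd.3 perf dump bluefs | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_total_bytes, slow_used_bytes}'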

Actions #4

Updated by Igor Fedotov 9 days ago

Actions #5

Updated by James Ringer 9 days ago

I'm actively working on this cluster, and I have already replaced osd.0 to move forward with my work. I'll need to perform another resize and then run ceph tell osd.N perf dump bluefs.

Actions #6

Updated by James Ringer 9 days ago

I reached a point where I could resize another OSD to get the output from ceph tell osd.N perf dump bluefs. I followed these steps to expand the OSD on mgmt-worker3. Notice that RAW USE increased by 400 GB, which is how much I expanded the underlying disk by, while the actual usage (DATA) did not increase. Looking at Prometheus metrics, this translates to increased used capacity rather than increased available capacity.

  1. Drain Kubernetes node.
  2. Shutdown Kubernetes node.
  3. Resize underlying virtual disk. In this case, expand disk by 400 GB.
  4. Start node.
  5. Uncordon node.
$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/tool/ {print $1}') -n rook-ceph -- ceph osd df tree
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME            
-1         2.92978         -  3.3 TiB  1.4 TiB  1.0 TiB  387 MiB  7.2 GiB  1.9 TiB  43.08  1.00    -          root default         
-5         0.83009         -  850 GiB  292 GiB  289 GiB   59 MiB  2.8 GiB  558 GiB  34.32  0.80    -              host mgmt-worker0
 0    ssd  0.83009   1.00000  850 GiB  292 GiB  289 GiB   59 MiB  2.8 GiB  558 GiB  34.32  0.80  130      up          osd.0        
-9         0.83009         -  850 GiB  289 GiB  287 GiB  128 MiB  1.7 GiB  561 GiB  33.95  0.79    -              host mgmt-worker1
 1    ssd  0.83009   1.00000  850 GiB  289 GiB  287 GiB  128 MiB  1.7 GiB  561 GiB  33.95  0.79  148      up          osd.1        
-3         0.83009         -  850 GiB  294 GiB  292 GiB   81 MiB  1.8 GiB  556 GiB  34.57  0.80    -              host mgmt-worker2
 2    ssd  0.83009   1.00000  850 GiB  294 GiB  292 GiB   81 MiB  1.8 GiB  556 GiB  34.57  0.80  136      up          osd.2        
-7         0.43950         -  850 GiB  591 GiB  190 GiB  120 MiB  942 MiB  259 GiB  69.49  1.61    -              host mgmt-worker3
 3    ssd  0.43950   1.00000  850 GiB  591 GiB  190 GiB  120 MiB  942 MiB  259 GiB  69.49  1.61   93      up          osd.3        
                       TOTAL  3.3 TiB  1.4 TiB  1.0 TiB  387 MiB  7.2 GiB  1.9 TiB  43.08                                          
MIN/MAX VAR: 0.79/1.61  STDDEV: 15.25
$ kubectl logs $(kubectl get pod -n rook-ceph | awk '/osd-3/ {print $1}') -n rook-ceph -c expand-bluefs
inferring bluefs devices from bluestore path
1 : device size 0xd480000000 : using 0x93c17bb000(591 GiB)
Expanding DB/WAL...
1 : expanding  from 0x7080000000 to 0xd480000000
2024-04-25T19:10:20.138+0000 7fcba23f1980 -1 bluestore(/var/lib/ceph/osd/ceph-3) _read_bdev_label failed to read from /var/lib/ceph/osd/ceph-3: (21) Is a directory
2024-04-25T19:10:20.138+0000 7fcba23f1980 -1 bluestore(/var/lib/ceph/osd/ceph-3) unable to read label for /var/lib/ceph/osd/ceph-3: (21) Is a directory
$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/tool/ {print $1}') -n rook-ceph -- sh -c "ceph tell osd.3 perf dump bluefs" | jq
{
  "bluefs": {
    "db_total_bytes": 912680550400,
    "db_used_bytes": 1109327872,
    "wal_total_bytes": 0,
    "wal_used_bytes": 0,
    "slow_total_bytes": 0,
    "slow_used_bytes": 0,
    "num_files": 47,
    "log_bytes": 5627904,
    "log_compactions": 1,
    "log_write_count": 5405,
    "logged_bytes": 22138880,
    "files_written_wal": 8,
    "files_written_sst": 6,
    "write_count_wal": 5590,
    "write_count_sst": 543,
    "bytes_written_wal": 168718336,
    "bytes_written_sst": 289062912,
    "bytes_written_slow": 0,
    "max_bytes_wal": 0,
    "max_bytes_db": 1199112192,
    "max_bytes_slow": 0,
    "alloc_unit_main": 0,
    "alloc_unit_db": 65536,
    "alloc_unit_wal": 0,
    "read_random_count": 55947,
    "read_random_bytes": 587609944,
    "read_random_disk_count": 8034,
    "read_random_disk_bytes": 397036575,
    "read_random_disk_bytes_wal": 0,
    "read_random_disk_bytes_db": 397036575,
    "read_random_disk_bytes_slow": 0,
    "read_random_buffer_count": 48053,
    "read_random_buffer_bytes": 190573369,
    "read_count": 5399,
    "read_bytes": 317697662,
    "read_disk_count": 4592,
    "read_disk_bytes": 294891520,
    "read_disk_bytes_wal": 0,
    "read_disk_bytes_db": 294895616,
    "read_disk_bytes_slow": 0,
    "read_prefetch_count": 5379,
    "read_prefetch_bytes": 317476469,
    "write_count": 11545,
    "write_disk_count": 11548,
    "write_bytes": 480141312,
    "compact_lat": {
      "avgcount": 1,
      "sum": 0.001374156,
      "avgtime": 0.001374156
    },
    "compact_lock_lat": {
      "avgcount": 1,
      "sum": 0.000236034,
      "avgtime": 0.000236034
    },
    "alloc_slow_fallback": 0,
    "alloc_slow_size_fallback": 0,
    "read_zeros_candidate": 0,
    "read_zeros_errors": 0
  }
}
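For reference, db_total_bytes above matches the expanded device size printed by the expand-bluefs log, and the difference from the old size is exactly the 400 GiB the disk was grown by (a quick check with the shell's printf, converting the hex sizes):

$ printf '%d\n' 0xd480000000    # 912680550400 bytes = 850 GiB, size after expansion
$ printf '%d\n' 0x7080000000    # 483183820800 bytes = 450 GiB, size before expansion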
Actions #7

Updated by Igor Fedotov 8 days ago

Hi James,
so I'm pretty sure this is a duplicate of https://tracker.ceph.com/issues/63858

Please see https://tracker.ceph.com/issues/63858#note-7 for a workaround.

Actions #8

Updated by Igor Fedotov 8 days ago

  • Status changed from New to Triaged
Actions #9

Updated by Igor Fedotov 8 days ago

  • Project changed from Ceph to bluestore
  • Category deleted (OSD)
Actions #10

Updated by James Ringer 8 days ago

I read https://tracker.ceph.com/issues/63858#note-7, but I'm not sure how to apply the workaround. I have tried deleting the OSD pod, which didn't resolve the issue. I also tried draining the node, marking the OSD out, rebooting the node, marking the OSD in, and uncordoning the node, but the issue is still not resolved. What am I missing? If there is a reliable workaround, we need to implement it in Rook until a fix is released.

Actions #11

Updated by Igor Fedotov 8 days ago

Generally, what you need is to shut down the OSD process in a non-graceful manner and let it rebuild the allocmap during the following restart. It has nothing to do with OSD draining or node restarts (unless you power the node off, which I'd prefer not to do).

In a bare-metal setup this implies running kill -9 against the ceph-osd process. You need to achieve the same in a Rook environment. Sorry, I'm not an expert in it, hence I'm unable to provide a more detailed guideline...
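For reference, in a Rook cluster that might look roughly like the one-liner below (a sketch, assuming the osd-3 pod naming used earlier in this thread; Kubernetes then restarts the container, which triggers the allocmap rebuild on startup):

$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/osd-3/ {print $1}') -n rook-ceph -c osd -- bash -c 'kill -9 $(pgrep ceph-osd)'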

Actions #12

Updated by James Ringer 8 days ago

Yeah, I ran kill -9 against the OSD process in the pod and it resolved the problem. Thanks!

Actions #13

Updated by James Ringer 8 days ago

$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/osd-3/ {print $1}') -n rook-ceph -c osd -it -- sh
sh-4.4# ps aux
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
65535         1  0.0  0.0    996     4 ?        Ss   Apr26   0:00 /pause
ceph        465  1.2  6.6 2665832 1644016 ?     Ssl  Apr26   6:53 ceph-osd --foreground --id 3 --fsid e1ebb901-75ad-4b7c-90d9-69edf914c04e --setuser ceph --setgroup ceph --crush-location=root=default host=mgmt-worker3 --default-log-to-stderr=true --default-err-
root        471  0.0  0.0  14096  2984 pts/0    Ss   Apr26   0:00 /bin/bash -x -e -m -c  CEPH_CLIENT_ID=ceph-osd.3 PERIODICITY=daily LOG_ROTATE_CEPH_FILE=/etc/logrotate.d/ceph LOG_MAX_SIZE=500M ROTATE=7  # edit the logrotate file to only rotate a specific daemo
root      29350  0.0  0.0  23144  1524 pts/0    S+   01:36   0:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 15m
root      29695  0.0  0.0  14228  3328 pts/0    Ss   01:42   0:00 sh
root      29757  0.0  0.0  49828  3708 pts/0    R+   01:43   0:00 ps aux
sh-4.4# kill -9 465
sh-4.4# command terminated with exit code 137
$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/tool/ {print $1}') -n rook-ceph -- ceph osd df tree
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME            
-1         2.92978         -  3.3 TiB  1.0 TiB  1.0 TiB  391 MiB  7.6 GiB  2.3 TiB  31.30  1.00    -          root default         
-5         0.83009         -  850 GiB  290 GiB  289 GiB   63 MiB  1.9 GiB  560 GiB  34.17  1.09    -              host mgmt-worker0
 0    ssd  0.83009   1.00000  850 GiB  290 GiB  289 GiB   63 MiB  1.9 GiB  560 GiB  34.17  1.09  128      up          osd.0        
-9         0.83009         -  850 GiB  289 GiB  286 GiB  132 MiB  2.7 GiB  561 GiB  34.03  1.09    -              host mgmt-worker1
 1    ssd  0.83009   1.00000  850 GiB  289 GiB  286 GiB  132 MiB  2.7 GiB  561 GiB  34.03  1.09  146      up          osd.1        
-3         0.83009         -  850 GiB  294 GiB  292 GiB   83 MiB  2.2 GiB  556 GiB  34.59  1.11    -              host mgmt-worker2
 2    ssd  0.83009   1.00000  850 GiB  294 GiB  292 GiB   83 MiB  2.2 GiB  556 GiB  34.59  1.11  136      up          osd.2        
-7         0.43950         -  850 GiB  190 GiB  189 GiB  113 MiB  880 MiB  660 GiB  22.41  0.72    -              host mgmt-worker3
 3    ssd  0.43950   1.00000  850 GiB  190 GiB  189 GiB  113 MiB  880 MiB  660 GiB  22.41  0.72   97      up          osd.3        
                       TOTAL  3.3 TiB  1.0 TiB  1.0 TiB  391 MiB  7.6 GiB  2.3 TiB  31.30                                          
MIN/MAX VAR: 0.72/1.11  STDDEV: 5.14
Actions #14

Updated by James Ringer 8 days ago

Something still looks off with the numbers, though. It's like Ceph isn't balanced. Sure, osd.3's RAW USE and DATA are in line now, but why is it using less space and carrying fewer PGs than the other three OSDs?

Actions #16

Updated by James Ringer 5 days ago

$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/tool/ {print $1}') -n rook-ceph -- ceph osd crush reweight osd.3 0.83009
reweighted item id 3 name 'osd.3' to 0.83009 in crush map
$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/tool/ {print $1}') -n rook-ceph -- ceph osd df tree
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME            
-1         3.32036         -  3.3 TiB  1.1 TiB  1.1 TiB  415 MiB  9.0 GiB  2.3 TiB  32.16  1.00    -          root default         
-5         0.83009         -  850 GiB  276 GiB  274 GiB   70 MiB  2.6 GiB  574 GiB  32.51  1.01    -              host mgmt-worker0
 0    ssd  0.83009   1.00000  850 GiB  276 GiB  274 GiB   70 MiB  2.6 GiB  574 GiB  32.51  1.01  122      up          osd.0        
-9         0.83009         -  850 GiB  279 GiB  277 GiB  130 MiB  1.9 GiB  571 GiB  32.86  1.02    -              host mgmt-worker1
 1    ssd  0.83009   1.00000  850 GiB  279 GiB  277 GiB  130 MiB  1.9 GiB  571 GiB  32.86  1.02  134      up          osd.1        
-3         0.83009         -  850 GiB  282 GiB  279 GiB   88 MiB  2.9 GiB  568 GiB  33.19  1.03    -              host mgmt-worker2
 2    ssd  0.83009   1.00000  850 GiB  282 GiB  279 GiB   88 MiB  2.9 GiB  568 GiB  33.19  1.03  125      up          osd.2        
-7         0.83008         -  850 GiB  256 GiB  254 GiB  127 MiB  1.7 GiB  594 GiB  30.08  0.94    -              host mgmt-worker3
 3    ssd  0.83008   1.00000  850 GiB  256 GiB  254 GiB  127 MiB  1.7 GiB  594 GiB  30.08  0.94  126      up          osd.3        
                       TOTAL  3.3 TiB  1.1 TiB  1.1 TiB  415 MiB  9.0 GiB  2.3 TiB  32.16                                          
MIN/MAX VAR: 0.94/1.03  STDDEV: 1.22
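For reference, a CRUSH weight is conventionally the device size in TiB, so the reweight above simply brings osd.3 in line with its expanded 850 GiB size; the old 0.43950 weight corresponded to the pre-resize ~450 GiB disk, which is why osd.3 was holding fewer PGs in the earlier output (a quick check, assuming bc is available):

$ echo 'scale=5; 850/1024' | bc    # .83007, roughly the 0.83009 weight used above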