Bug #65659
OSD Resize Increases Used Capacity Not Available Capacity
Status: open
Description
Deviation from expected behavior
After resizing the underlying disk at the hypervisor and OS level, resizing the OSD increases cluster total capacity and used capacity.
Expected behavior
After resizing the underlying disk at the hypervisor and OS level, resizing the OSD increases cluster total capacity and available capacity.
How to reproduce it (minimal and precise)
- Build a Kubernetes cluster from virtual machines where Rook consumes virtual disks as OSDs.
- Resize a virtual disk used as an OSD at the hypervisor level.
- Resize the disk at the OS level.
- Resize the disk at the Ceph level (e.g., restart the OSD pod).
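For reference, a minimal sketch of the OS-level and Ceph-level steps, assuming a SCSI virtual disk that shows up as /dev/sdb and a Rook OSD deployment named rook-ceph-osd-3 (both names are illustrative):
# OS level: rescan the device so the kernel sees the new size
$ echo 1 | sudo tee /sys/class/block/sdb/device/rescan
$ lsblk /dev/sdb
# Ceph level: restart the OSD pod so Rook's expand-bluefs init container runs
$ kubectl -n rook-ceph rollout restart deployment rook-ceph-osd-3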
Updated by Igor Fedotov 9 days ago
Hi James!
I presume you haven't run ceph-bluestore-tool's bluefs-bdev-expand command against expanded OSD, have you?
Without doing that OSD considers the expanded space as OCCUPIED hence there is no increase in Available capacity.
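For reference, the command Igor refers to is run against a stopped OSD and looks like this (the OSD path here is illustrative):
$ ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-0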
Updated by James Ringer 9 days ago
Igor Fedotov wrote in #note-1:
Hi James!
I presume you haven't run ceph-bluestore-tool's bluefs-bdev-expand command against expanded OSD, have you? Without doing that OSD considers the expanded space as OCCUPIED hence there is no increase in Available capacity.
Hi Igor,
Rook runs an init container for OSD pods that automatically expands the OSD (e.g., ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-foo). This is the log.
2024-04-23 09:58:06.920 inferring bluefs devices from bluestore path
2024-04-23 09:58:15.718 1 : device size 0xd480000000 : using 0x834b868000(525 GiB)
2024-04-23 09:58:15.718 Expanding DB/WAL...
2024-04-23 09:58:15.718 1 : expanding from 0x7080000000 to 0xd480000000
2024-04-23 09:58:15.765 2024-04-23T19:58:15.762+0000 7f9e5aaf6980 -1 bluestore(/var/lib/ceph/osd/ceph-0) _read_bdev_label failed to read from /var/lib/ceph/osd/ceph-0: (21) Is a directory
2024-04-23 09:58:15.765 2024-04-23T19:58:15.762+0000 7f9e5aaf6980 -1 bluestore(/var/lib/ceph/osd/ceph-0) unable to read label for /var/lib/ceph/osd/ceph-0: (21) Is a directory
Thanks,
James
Updated by Igor Fedotov 9 days ago
James,
would you please share the output of 'ceph tell osd.N perf dump bluefs' after such an expansion then?
Updated by Igor Fedotov 9 days ago
And please be aware of https://tracker.ceph.com/issues/63858
Updated by James Ringer 9 days ago
I'm actively working on this cluster. I have already replaced osd.0 to move forward with my work. I'll need to perform another resize, then run ceph tell osd.N perf dump bluefs.
Updated by James Ringer 9 days ago
I reached a point where I could resize another OSD to get the output from ceph tell osd.N perf dump bluefs. I followed these steps to expand the OSD on mgmt-worker3. Notice that RAW USE increased by 400 GB, which is how much I resized the underlying disk by, while AVAIL did not increase. Looking at Prometheus metrics, this translates to increased used capacity rather than increased available capacity.
- Drain Kubernetes node.
- Shut down Kubernetes node.
- Resize underlying virtual disk. In this case, expand disk by 400 GB.
- Start node.
- Uncordon node.
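For reference, the node-level part of those steps looks roughly like this (the drain flags are an assumption; the disk is grown at the hypervisor while the VM is powered off):
$ kubectl drain mgmt-worker3 --ignore-daemonsets --delete-emptydir-data
# shut the VM down, grow its virtual disk by 400 GB at the hypervisor, start it again
$ kubectl uncordon mgmt-worker3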
$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/tool/ {print $1}') -n rook-ceph -- ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
-1 2.92978 - 3.3 TiB 1.4 TiB 1.0 TiB 387 MiB 7.2 GiB 1.9 TiB 43.08 1.00 - root default
-5 0.83009 - 850 GiB 292 GiB 289 GiB 59 MiB 2.8 GiB 558 GiB 34.32 0.80 - host mgmt-worker0
0 ssd 0.83009 1.00000 850 GiB 292 GiB 289 GiB 59 MiB 2.8 GiB 558 GiB 34.32 0.80 130 up osd.0
-9 0.83009 - 850 GiB 289 GiB 287 GiB 128 MiB 1.7 GiB 561 GiB 33.95 0.79 - host mgmt-worker1
1 ssd 0.83009 1.00000 850 GiB 289 GiB 287 GiB 128 MiB 1.7 GiB 561 GiB 33.95 0.79 148 up osd.1
-3 0.83009 - 850 GiB 294 GiB 292 GiB 81 MiB 1.8 GiB 556 GiB 34.57 0.80 - host mgmt-worker2
2 ssd 0.83009 1.00000 850 GiB 294 GiB 292 GiB 81 MiB 1.8 GiB 556 GiB 34.57 0.80 136 up osd.2
-7 0.43950 - 850 GiB 591 GiB 190 GiB 120 MiB 942 MiB 259 GiB 69.49 1.61 - host mgmt-worker3
3 ssd 0.43950 1.00000 850 GiB 591 GiB 190 GiB 120 MiB 942 MiB 259 GiB 69.49 1.61 93 up osd.3
TOTAL 3.3 TiB 1.4 TiB 1.0 TiB 387 MiB 7.2 GiB 1.9 TiB 43.08
MIN/MAX VAR: 0.79/1.61 STDDEV: 15.25
$ kubectl logs $(kubectl get pod -n rook-ceph | awk '/osd-3/ {print $1}') -n rook-ceph -c expand-bluefs
inferring bluefs devices from bluestore path
1 : device size 0xd480000000 : using 0x93c17bb000(591 GiB)
Expanding DB/WAL...
1 : expanding from 0x7080000000 to 0xd480000000
2024-04-25T19:10:20.138+0000 7fcba23f1980 -1 bluestore(/var/lib/ceph/osd/ceph-3) _read_bdev_label failed to read from /var/lib/ceph/osd/ceph-3: (21) Is a directory
2024-04-25T19:10:20.138+0000 7fcba23f1980 -1 bluestore(/var/lib/ceph/osd/ceph-3) unable to read label for /var/lib/ceph/osd/ceph-3: (21) Is a directory
$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/tool/ {print $1}') -n rook-ceph -- sh -c "ceph tell osd.3 perf dump bluefs" | jq
{
"bluefs": {
"db_total_bytes": 912680550400,
"db_used_bytes": 1109327872,
"wal_total_bytes": 0,
"wal_used_bytes": 0,
"slow_total_bytes": 0,
"slow_used_bytes": 0,
"num_files": 47,
"log_bytes": 5627904,
"log_compactions": 1,
"log_write_count": 5405,
"logged_bytes": 22138880,
"files_written_wal": 8,
"files_written_sst": 6,
"write_count_wal": 5590,
"write_count_sst": 543,
"bytes_written_wal": 168718336,
"bytes_written_sst": 289062912,
"bytes_written_slow": 0,
"max_bytes_wal": 0,
"max_bytes_db": 1199112192,
"max_bytes_slow": 0,
"alloc_unit_main": 0,
"alloc_unit_db": 65536,
"alloc_unit_wal": 0,
"read_random_count": 55947,
"read_random_bytes": 587609944,
"read_random_disk_count": 8034,
"read_random_disk_bytes": 397036575,
"read_random_disk_bytes_wal": 0,
"read_random_disk_bytes_db": 397036575,
"read_random_disk_bytes_slow": 0,
"read_random_buffer_count": 48053,
"read_random_buffer_bytes": 190573369,
"read_count": 5399,
"read_bytes": 317697662,
"read_disk_count": 4592,
"read_disk_bytes": 294891520,
"read_disk_bytes_wal": 0,
"read_disk_bytes_db": 294895616,
"read_disk_bytes_slow": 0,
"read_prefetch_count": 5379,
"read_prefetch_bytes": 317476469,
"write_count": 11545,
"write_disk_count": 11548,
"write_bytes": 480141312,
"compact_lat": {
"avgcount": 1,
"sum": 0.001374156,
"avgtime": 0.001374156
},
"compact_lock_lat": {
"avgcount": 1,
"sum": 0.000236034,
"avgtime": 0.000236034
},
"alloc_slow_fallback": 0,
"alloc_slow_size_fallback": 0,
"read_zeros_candidate": 0,
"read_zeros_errors": 0
}
}
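For what it's worth, db_total_bytes = 912680550400 = 0xd480000000 bytes = 850 GiB, so BlueFS is already reporting the expanded device size; the expansion itself appears to have succeeded and the discrepancy is in the used/available accounting.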
Updated by Igor Fedotov 8 days ago
Hi James,
so I'm pretty sure this is a duplicate of https://tracker.ceph.com/issues/63858
Please see https://tracker.ceph.com/issues/63858#note-7 for a workaround.
Updated by Igor Fedotov 8 days ago
- Project changed from Ceph to bluestore
- Category deleted (OSD)
Updated by James Ringer 8 days ago
I read https://tracker.ceph.com/issues/63858#note-7, but I'm not sure how to apply the workaround. I have tried deleting the OSD pod, which didn't resolve the issue. I also tried draining the node, marking the OSD out, rebooting the node, marking the OSD in, and uncordoning the node, but the issue is still not resolved. What am I missing? If there is a reliable workaround, we need to implement it in Rook until a fix is released.
Updated by Igor Fedotov 8 days ago
Generally what you need is to shut down the OSD process in a non-graceful manner and let it rebuild the allocmap during the following restart. It has nothing to do with OSD draining or node restarts (unless you power the node off, which I'd prefer not to do).
In a bare-metal setup this implies running kill -9 against the ceph-osd process. You need to achieve the same in the Rook environment. Sorry, I'm not an expert in it, hence unable to provide a more detailed guideline...
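For reference, one way to do that in a Rook cluster, following the pod and container naming used elsewhere in this thread (the actual session is in the next comment), is roughly:
$ kubectl exec -n rook-ceph $(kubectl get pod -n rook-ceph | awk '/osd-3/ {print $1}') -c osd -- ps aux
# note the PID of the ceph-osd process, then:
$ kubectl exec -n rook-ceph $(kubectl get pod -n rook-ceph | awk '/osd-3/ {print $1}') -c osd -- sh -c 'kill -9 <ceph-osd PID>'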
Updated by James Ringer 7 days ago
Yeah, I ran kill -9 on the OSD process in the pod and it resolved the problem. Thanks!
Updated by James Ringer 7 days ago
$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/osd-3/ {print $1}') -n rook-ceph -c osd -it -- sh
sh-4.4# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
65535 1 0.0 0.0 996 4 ? Ss Apr26 0:00 /pause
ceph 465 1.2 6.6 2665832 1644016 ? Ssl Apr26 6:53 ceph-osd --foreground --id 3 --fsid e1ebb901-75ad-4b7c-90d9-69edf914c04e --setuser ceph --setgroup ceph --crush-location=root=default host=mgmt-worker3 --default-log-to-stderr=true --default-err-
root 471 0.0 0.0 14096 2984 pts/0 Ss Apr26 0:00 /bin/bash -x -e -m -c CEPH_CLIENT_ID=ceph-osd.3 PERIODICITY=daily LOG_ROTATE_CEPH_FILE=/etc/logrotate.d/ceph LOG_MAX_SIZE=500M ROTATE=7 # edit the logrotate file to only rotate a specific daemo
root 29350 0.0 0.0 23144 1524 pts/0 S+ 01:36 0:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 15m
root 29695 0.0 0.0 14228 3328 pts/0 Ss 01:42 0:00 sh
root 29757 0.0 0.0 49828 3708 pts/0 R+ 01:43 0:00 ps aux
sh-4.4# kill -9 465
sh-4.4# command terminated with exit code 137
$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/tool/ {print $1}') -n rook-ceph -- ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
-1 2.92978 - 3.3 TiB 1.0 TiB 1.0 TiB 391 MiB 7.6 GiB 2.3 TiB 31.30 1.00 - root default
-5 0.83009 - 850 GiB 290 GiB 289 GiB 63 MiB 1.9 GiB 560 GiB 34.17 1.09 - host mgmt-worker0
0 ssd 0.83009 1.00000 850 GiB 290 GiB 289 GiB 63 MiB 1.9 GiB 560 GiB 34.17 1.09 128 up osd.0
-9 0.83009 - 850 GiB 289 GiB 286 GiB 132 MiB 2.7 GiB 561 GiB 34.03 1.09 - host mgmt-worker1
1 ssd 0.83009 1.00000 850 GiB 289 GiB 286 GiB 132 MiB 2.7 GiB 561 GiB 34.03 1.09 146 up osd.1
-3 0.83009 - 850 GiB 294 GiB 292 GiB 83 MiB 2.2 GiB 556 GiB 34.59 1.11 - host mgmt-worker2
2 ssd 0.83009 1.00000 850 GiB 294 GiB 292 GiB 83 MiB 2.2 GiB 556 GiB 34.59 1.11 136 up osd.2
-7 0.43950 - 850 GiB 190 GiB 189 GiB 113 MiB 880 MiB 660 GiB 22.41 0.72 - host mgmt-worker3
3 ssd 0.43950 1.00000 850 GiB 190 GiB 189 GiB 113 MiB 880 MiB 660 GiB 22.41 0.72 97 up osd.3
TOTAL 3.3 TiB 1.0 TiB 1.0 TiB 391 MiB 7.6 GiB 2.3 TiB 31.30
MIN/MAX VAR: 0.72/1.11 STDDEV: 5.14
Updated by James Ringer 7 days ago
Something still looks odd with the numbers, though. It's like Ceph isn't balanced? Sure, OSD 3's RAW USE and DATA are the same now, but why is it using less space and carrying fewer PGs than the other 3 OSDs?
Updated by James Ringer 5 days ago
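The CRUSH weight of osd.3 was still 0.43950 (the pre-expansion ~450 GiB size), so I reweighted it to match the other OSDs: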
$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/tool/ {print $1}') -n rook-ceph -- ceph osd crush reweight osd.3 0.83009
reweighted item id 3 name 'osd.3' to 0.83009 in crush map
$ kubectl exec $(kubectl get pod -n rook-ceph | awk '/tool/ {print $1}') -n rook-ceph -- ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
-1 3.32036 - 3.3 TiB 1.1 TiB 1.1 TiB 415 MiB 9.0 GiB 2.3 TiB 32.16 1.00 - root default
-5 0.83009 - 850 GiB 276 GiB 274 GiB 70 MiB 2.6 GiB 574 GiB 32.51 1.01 - host mgmt-worker0
0 ssd 0.83009 1.00000 850 GiB 276 GiB 274 GiB 70 MiB 2.6 GiB 574 GiB 32.51 1.01 122 up osd.0
-9 0.83009 - 850 GiB 279 GiB 277 GiB 130 MiB 1.9 GiB 571 GiB 32.86 1.02 - host mgmt-worker1
1 ssd 0.83009 1.00000 850 GiB 279 GiB 277 GiB 130 MiB 1.9 GiB 571 GiB 32.86 1.02 134 up osd.1
-3 0.83009 - 850 GiB 282 GiB 279 GiB 88 MiB 2.9 GiB 568 GiB 33.19 1.03 - host mgmt-worker2
2 ssd 0.83009 1.00000 850 GiB 282 GiB 279 GiB 88 MiB 2.9 GiB 568 GiB 33.19 1.03 125 up osd.2
-7 0.83008 - 850 GiB 256 GiB 254 GiB 127 MiB 1.7 GiB 594 GiB 30.08 0.94 - host mgmt-worker3
3 ssd 0.83008 1.00000 850 GiB 256 GiB 254 GiB 127 MiB 1.7 GiB 594 GiB 30.08 0.94 126 up osd.3
TOTAL 3.3 TiB 1.1 TiB 1.1 TiB 415 MiB 9.0 GiB 2.3 TiB 32.16
MIN/MAX VAR: 0.94/1.03 STDDEV: 1.22