Bug #56650
ceph df reports invalid MAX AVAIL value for stretch mode crush rule
Description
If we define a crush rule for a stretch mode cluster with multiple take steps, then MAX AVAIL for pools associated with that crush rule reports an available size equal to the available space of a single datacenter.
Consider a crush rule stretch_rule defined as per the https://docs.ceph.com/en/latest/rados/operations/stretch-mode/ documentation:

rule stretch_rule {
    id 1
    type replicated
    step take DC1
    step chooseleaf firstn 2 type host
    step emit
    step take DC2
    step chooseleaf firstn 2 type host
    step emit
}

and another crush rule stretch_replicated_rule with a similar placement strategy:

rule stretch_replicated_rule {
    id 2
    type replicated
    step take default
    step choose firstn 0 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}
then "MAX AVAIL" for pools from stretch_rule show incorrect value whereas pools from stretch_replicated_rule shows correct value.
Because of the way stretch_rule is defined, PGMap::get_rule_avail considers only one datacenter's available size rather than the total available size of both datacenters.
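As a rough back-of-the-envelope check (this is only an approximation of what PGMap::get_rule_avail does, not the exact formula): MAX AVAIL is essentially driven by the free space of the fullest OSD the rule can use, scaled up by the number of OSDs the rule selects from and divided by the pool's replica count. With the fullest OSD at ~73 GiB free (see the ceph osd df tree output in the comments below), 6 OSDs per datacenter and pool size 4:

$ echo "DC1 only: $(( 73 * 6 / 4 )) GiB, both DCs: $(( 73 * 12 / 4 )) GiB"
DC1 only: 109 GiB, both DCs: 219 GiB

which lines up with the ~108 GiB reported for the stretch_rule pools versus the ~216 GiB reported for the stretch_replicated_rule pools.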
More details:

$ ceph osd crush rule ls
replicated_rule
stretch_rule
stretch_replicated_rule

$ ceph osd crush rule dump stretch_rule
{
    "rule_id": 1,
    "rule_name": "stretch_rule",
    "type": 1,
    "steps": [
        {
            "op": "take",
            "item": -5,
            "item_name": "DC1"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 2,
            "type": "host"
        },
        {
            "op": "emit"
        },
        {
            "op": "take",
            "item": -6,
            "item_name": "DC2"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 2,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

$ ceph osd crush rule dump stretch_replicated_rule
{
    "rule_id": 2,
    "rule_name": "stretch_replicated_rule",
    "type": 1,
    "steps": [
        {
            "op": "take",
            "item": -1,
            "item_name": "default"
        },
        {
            "op": "choose_firstn",
            "num": 0,
            "type": "datacenter"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 2,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

$ ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 19 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'cephfs.a.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 88 lfor 0/0/62 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 3 'cephfs.a.data' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 64 lfor 0/0/62 flags hashpspool stripe_width 0 application cephfs
pool 4 'rbdpool' replicated size 4 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 126 flags hashpspool stripe_width 0
pool 5 'rbdtest' replicated size 4 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 139 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 6 'stretched_rbdpool' replicated size 4 min_size 1 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 130 flags hashpspool stripe_width 0
pool 7 'stretched_rbdtest' replicated size 4 min_size 1 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 143 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 8 'stretched_replicated_rbdpool' replicated size 4 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 134 flags hashpspool stripe_width 0
pool 9 'stretched_replicated_rbdtest' replicated size 4 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 147 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd

$ ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    1.2 TiB  960 GiB  252 GiB   252 GiB      20.81
TOTAL  1.2 TiB  960 GiB  252 GiB   252 GiB      20.81

--- POOLS ---
POOL                          ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
.mgr                           1    1  1.5 MiB        2  4.5 MiB      0    289 GiB
cephfs.a.meta                  2   16  2.3 KiB       22   96 KiB      0    289 GiB
cephfs.a.data                  3   32      0 B        0      0 B      0    289 GiB
rbdpool                        4   32      0 B        0      0 B      0    216 GiB
rbdtest                        5   32   20 GiB    5.14k   80 GiB   8.46    216 GiB
stretched_rbdpool              6   32      0 B        0      0 B      0    108 GiB
stretched_rbdtest              7   32   20 GiB    5.14k   80 GiB  15.60    108 GiB
stretched_replicated_rbdpool   8   32      0 B        0      0 B      0    216 GiB
stretched_replicated_rbdtest   9   32   20 GiB    5.14k   80 GiB   8.46    216 GiB
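(In the ceph osd pool ls detail output above, crush_rule 1 is stretch_rule and crush_rule 2 is stretch_replicated_rule, matching the rule dumps. The rule a given pool uses can also be queried directly, e.g. ceph osd pool get stretched_rbdpool crush_rule.)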
Updated by Prashant D almost 2 years ago
- Status changed from New to Fix Under Review
Updated by Prashant D almost 2 years ago
Before applying PR#47189, MAX AVAIL for stretch_rule pools is incorrect:

$ ceph osd crush rule ls
replicated_rule
stretch_rule
stretch_replicated_rule

$ ceph osd crush rule dump stretch_rule
{
    "rule_id": 1,
    "rule_name": "stretch_rule",
    "type": 1,
    "steps": [
        {
            "op": "take",
            "item": -5,
            "item_name": "DC1"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 2,
            "type": "host"
        },
        {
            "op": "emit"
        },
        {
            "op": "take",
            "item": -6,
            "item_name": "DC2"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 2,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

$ ceph osd crush rule dump stretch_replicated_rule
{
    "rule_id": 2,
    "rule_name": "stretch_replicated_rule",
    "type": 1,
    "steps": [
        {
            "op": "take",
            "item": -1,
            "item_name": "default"
        },
        {
            "op": "choose_firstn",
            "num": 0,
            "type": "datacenter"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 2,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

$ ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 19 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'cephfs.a.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 88 lfor 0/0/62 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 3 'cephfs.a.data' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 64 lfor 0/0/62 flags hashpspool stripe_width 0 application cephfs
pool 4 'rbdpool' replicated size 4 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 126 flags hashpspool stripe_width 0
pool 5 'rbdtest' replicated size 4 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 139 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 6 'stretched_rbdpool' replicated size 4 min_size 1 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 130 flags hashpspool stripe_width 0
pool 7 'stretched_rbdtest' replicated size 4 min_size 1 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 143 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 8 'stretched_replicated_rbdpool' replicated size 4 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 134 flags hashpspool stripe_width 0
pool 9 'stretched_replicated_rbdtest' replicated size 4 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 147 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd

$ ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    1.2 TiB  960 GiB  252 GiB   252 GiB      20.81
TOTAL  1.2 TiB  960 GiB  252 GiB   252 GiB      20.81

--- POOLS ---
POOL                          ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
.mgr                           1    1  1.5 MiB        2  4.5 MiB      0    289 GiB
cephfs.a.meta                  2   16  2.3 KiB       22   96 KiB      0    289 GiB
cephfs.a.data                  3   32      0 B        0      0 B      0    289 GiB
rbdpool                        4   32      0 B        0      0 B      0    216 GiB
rbdtest                        5   32   20 GiB    5.14k   80 GiB   8.46    216 GiB
stretched_rbdpool              6   32      0 B        0      0 B      0    108 GiB
stretched_rbdtest              7   32   20 GiB    5.14k   80 GiB  15.60    108 GiB
stretched_replicated_rbdpool   8   32      0 B        0      0 B      0    216 GiB
stretched_replicated_rbdtest   9   32   20 GiB    5.14k   80 GiB   8.46    216 GiB
After applying PR#47189, MAX AVAIL is correctly shown for stretch_rule pools:

$ ceph -s
  cluster:
    id:     3d796a5c-2ad9-4c37-b671-45446254a11b
    health: HEALTH_OK

  services:
    mon: 1 daemons, quorum a (age 19m)
    mgr: x(active, since 19m)
    mds: 1/1 daemons up
    osd: 12 osds: 12 up (since 18m), 12 in (since 18m)

  data:
    volumes: 1/1 healthy
    pools:   9 pools, 241 pgs
    objects: 15.43k objects, 60 GiB
    usage:   252 GiB used, 960 GiB / 1.2 TiB avail
    pgs:     241 active+clean

$ ceph osd crush rule ls
replicated_rule
stretch_rule
stretch_replicated_rule

$ ceph osd crush rule dump stretch_rule
{
    "rule_id": 1,
    "rule_name": "stretch_rule",
    "type": 1,
    "steps": [
        {
            "op": "take",
            "item": -5,
            "item_name": "DC1"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 2,
            "type": "host"
        },
        {
            "op": "emit"
        },
        {
            "op": "take",
            "item": -6,
            "item_name": "DC2"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 2,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

$ ceph osd crush rule dump stretch_replicated_rule
{
    "rule_id": 2,
    "rule_name": "stretch_replicated_rule",
    "type": 1,
    "steps": [
        {
            "op": "take",
            "item": -1,
            "item_name": "default"
        },
        {
            "op": "choose_firstn",
            "num": 0,
            "type": "datacenter"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 2,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

$ ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 19 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'cephfs.a.meta' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 90 lfor 0/0/62 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 3 'cephfs.a.data' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 64 lfor 0/0/62 flags hashpspool stripe_width 0 application cephfs
pool 4 'rbdpool' replicated size 4 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 126 flags hashpspool stripe_width 0
pool 5 'rbdtest' replicated size 4 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 139 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 6 'stretched_rbdpool' replicated size 4 min_size 1 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 130 flags hashpspool stripe_width 0
pool 7 'stretched_rbdtest' replicated size 4 min_size 1 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 143 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 8 'stretched_replicated_rbdpool' replicated size 4 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 134 flags hashpspool stripe_width 0
pool 9 'stretched_replicated_rbdtest' replicated size 4 min_size 1 crush_rule 2 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 147 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd

$ ceph osd df tree
ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP  META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
 -1         1.18301         -  1.2 TiB  252 GiB  240 GiB   0 B  1.8 GiB  960 GiB  20.81  1.00    -          root default
 -5         0.59151         -  606 GiB  127 GiB  121 GiB   0 B  928 MiB  479 GiB  20.90  1.00    -              datacenter DC1
 -9         0.19716         -  202 GiB   50 GiB   48 GiB   0 B  440 MiB  152 GiB  24.64  1.18    -                  host osd-0
  0    hdd  0.09859   1.00000  101 GiB   22 GiB   21 GiB   0 B  214 MiB   79 GiB  21.73  1.04   71      up              osd.0
  3    hdd  0.09859   1.00000  101 GiB   28 GiB   27 GiB   0 B  226 MiB   73 GiB  27.54  1.32   99      up              osd.3
-10         0.19716         -  202 GiB   36 GiB   34 GiB   0 B  191 MiB  166 GiB  18.05  0.87    -                  host osd-1
  1    hdd  0.09859   1.00000  101 GiB   18 GiB   17 GiB   0 B   95 MiB   83 GiB  17.79  0.86   72      up              osd.1
  2    hdd  0.09859   1.00000  101 GiB   18 GiB   17 GiB   0 B   96 MiB   83 GiB  18.31  0.88   66      up              osd.2
-11         0.19716         -  202 GiB   40 GiB   38 GiB   0 B  297 MiB  162 GiB  20.00  0.96    -                  host osd-2
  7    hdd  0.09859   1.00000  101 GiB   24 GiB   23 GiB   0 B  160 MiB   77 GiB  24.16  1.16   80      up              osd.7
 11    hdd  0.09859   1.00000  101 GiB   16 GiB   15 GiB   0 B  137 MiB   85 GiB  15.85  0.76   70      up              osd.11
 -6         0.59151         -  606 GiB  126 GiB  120 GiB   0 B  923 MiB  480 GiB  20.71  1.00    -              datacenter DC2
-12         0.19716         -  202 GiB   46 GiB   44 GiB   0 B  368 MiB  156 GiB  22.64  1.09    -                  host osd-3
  4    hdd  0.09859   1.00000  101 GiB   19 GiB   18 GiB   0 B  199 MiB   82 GiB  18.58  0.89   69      up              osd.4
  8    hdd  0.09859   1.00000  101 GiB   27 GiB   26 GiB   0 B  169 MiB   74 GiB  26.69  1.28   89      up              osd.8
-13         0.19716         -  202 GiB   40 GiB   38 GiB   0 B  258 MiB  162 GiB  19.98  0.96    -                  host osd-4
  5    hdd  0.09859   1.00000  101 GiB   20 GiB   19 GiB   0 B  104 MiB   81 GiB  19.70  0.95   74      up              osd.5
  9    hdd  0.09859   1.00000  101 GiB   20 GiB   19 GiB   0 B  154 MiB   81 GiB  20.25  0.97   78      up              osd.9
-14         0.19716         -  202 GiB   39 GiB   37 GiB   0 B  297 MiB  163 GiB  19.52  0.94    -                  host osd-5
  6    hdd  0.09859   1.00000  101 GiB   16 GiB   15 GiB   0 B   82 MiB   85 GiB  15.48  0.74   66      up              osd.6
 10    hdd  0.09859   1.00000  101 GiB   24 GiB   23 GiB   0 B  215 MiB   77 GiB  23.57  1.13   81      up              osd.10
                         TOTAL  1.2 TiB  252 GiB  240 GiB   0 B  1.8 GiB  960 GiB  20.81
MIN/MAX VAR: 0.74/1.32  STDDEV: 3.81

$ ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    1.2 TiB  960 GiB  252 GiB   252 GiB      20.81
TOTAL  1.2 TiB  960 GiB  252 GiB   252 GiB      20.81

--- POOLS ---
POOL                          ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
.mgr                           1    1  1.5 MiB        2  4.5 MiB      0    289 GiB
cephfs.a.meta                  2   16  2.3 KiB       22   96 KiB      0    289 GiB
cephfs.a.data                  3   32      0 B        0      0 B      0    289 GiB
rbdpool                        4   32      0 B        0      0 B      0    217 GiB
rbdtest                        5   32   20 GiB    5.14k   80 GiB   8.46    217 GiB
stretched_rbdpool              6   32      0 B        0      0 B      0    217 GiB
stretched_rbdtest              7   32   20 GiB    5.14k   80 GiB   8.46    217 GiB
stretched_replicated_rbdpool   8   32      0 B        0      0 B      0    217 GiB
stretched_replicated_rbdtest   9   32   20 GiB    5.13k   80 GiB   8.46    217 GiB
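After the fix, the pools using stretch_rule report the same MAX AVAIL (~217 GiB) as the pools using stretch_replicated_rule, i.e. twice the ~108 GiB reported before the fix, consistent with the available space of both datacenters now being counted.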
Updated by Radoslaw Zarzynski 10 months ago
A workaround from Prashant:
The workaround for this issue is, instead of defining the stretch cluster rule as

rule stretch_rule {
    id 1
    type replicated
    step take DC1
    step chooseleaf firstn 2 type host
    step emit
    step take DC2
    step chooseleaf firstn 2 type host
    step emit
}

to define it as:

rule stretch_rule {
    id 2
    type replicated
    step take default
    step choose firstn 0 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}
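A minimal sketch of one way to apply the workaround (file names are placeholders, the pool name is just the example from this report): edit the rule in the decompiled CRUSH map, re-inject it, and make sure the stretched pools reference it.

$ ceph osd getcrushmap -o crushmap.bin
$ crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt: replace the old stretch_rule definition with the one above
$ crushtool -c crushmap.txt -o crushmap.new
$ ceph osd setcrushmap -i crushmap.new
# optional sanity check of the placements the rule produces (2 = the rule's id)
$ crushtool -i crushmap.new --test --rule 2 --num-rep 4 --show-mappings
# if the rule was added under a new name rather than replacing the existing one,
# point the stretched pools at it explicitly
$ ceph osd pool set stretched_rbdpool crush_rule stretch_rule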
Updated by Sake Paulusma 9 months ago
Radoslaw Zarzynski wrote:
A workaround from Prashant:
[...]
Thank you for the workaround! Once the issue is fixed, will this workaround become the default way to define the CRUSH rule for a stretched cluster, or should we revert to the original rule from the documentation?
Updated by Radoslaw Zarzynski 9 months ago
The fix is undergoing extended testing.