Bug #56046
Changes to CRUSH weight and upmaps cause PGs to go into a degraded+remapped state instead of just remapped
Status: Open
Description
I have a brand new virtual 16.2.9 cluster (and a physical 16.2.7 cluster) with 0 client activity. Both were built initially on Pacific. The clusters have only been partially filled with rados bench objects.
When changing the CRUSH weight of an OSD ("ceph osd crush reweight osd.10 0") or introducing upmap entries (manually or via the balancer), the cluster responds with degraded PGs instead of remapped PGs. This is counter to how things worked in past clusters on Nautilus.
In Nautilus we would make use of the norebalance flag when weighing in new capacity, to prevent data movement until we had finished adding OSDs. Because in Pacific this causes the PGs to go degraded, data movement begins immediately, and we can't make use of tools which modify the upmaps to facilitate gradual movement.
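As a concrete illustration of that older workflow (a minimal sketch only; osd.10 and the weight are examples, not the exact commands we ran):

ceph osd set norebalance                 # hold back backfill of misplaced PGs
ceph osd crush reweight osd.10 0         # change weights / weigh in new capacity
ceph pg ls remapped                      # on Nautilus these PGs showed up as remapped, not degraded
# ... let an upmap-modifying tool drain the PGs gradually, then:
ceph osd unset norebalance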
Updated by Wes Dillingham almost 2 years ago
Steps to reproduce:
[root@p3plocephmon001 ~]# ceph health detail
HEALTH_OK

[root@p3plocephmon001 ~]# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME                    STATUS  REWEIGHT  PRI-AFF
 -1         84.00000  root default
 -5         42.00000      rack SI06-05
 -3         21.00000          host p3plcephosd706
  0    ssd   7.00000              osd.0                up   1.00000  1.00000
  1    ssd   7.00000              osd.1                up   1.00000  1.00000
  2    ssd   7.00000              osd.2                up   1.00000  1.00000
 -7         21.00000          host p3plcephosd707
  3    ssd   7.00000              osd.3                up   1.00000  1.00000
  4    ssd   7.00000              osd.4                up   1.00000  1.00000
  5    ssd   7.00000              osd.5                up   1.00000  1.00000
-11         42.00000      rack SI06-06
 -9         21.00000          host p3plcephosd708
  6    ssd   7.00000              osd.6                up   1.00000  1.00000
  7    ssd   7.00000              osd.7                up   1.00000  1.00000
  8    ssd   7.00000              osd.8                up   1.00000  1.00000
-13         21.00000          host p3plcephosd709
  9    ssd   7.00000              osd.9                up   1.00000  1.00000
 10    ssd   7.00000              osd.10               up   1.00000  1.00000
 11    ssd   7.00000              osd.11               up   1.00000  1.00000

[root@p3plocephmon001 ~]# ceph osd crush reweight osd.0 1; sleep 5; ceph pg ls degraded
reweighted item id 0 name 'osd.0' to 1 in crush map
PG    OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES       OMAP_BYTES*  OMAP_KEYS*  LOG  STATE                                               SINCE  VERSION   REPORTED   UP          ACTING    SCRUB_STAMP                      DEEP_SCRUB_STAMP
5.2   510      510       0          0        2139095040  0            0           510  active+recovery_wait+undersized+degraded+remapped  3s     3614'510  3684:6592  [5,8,2]p5   [5,8]p5   2022-06-13T23:50:25.082857-0700  2022-06-12T13:20:26.472922-0700
5.5   521      521       0          0        2185232384  0            0           521  active+recovery_wait+undersized+degraded+remapped  3s     3614'521  3684:4496  [7,11,2]p7  [7,11]p7  2022-06-13T07:04:38.311757-0700  2022-06-11T00:49:20.304572-0700
5.7   507      507       507        0        2126512128  0            0           507  active+recovery_wait+undersized+degraded+remapped  3s     3614'507  3684:4448  [5,8,10]p5  [0,8]p0   2022-06-13T22:31:06.343793-0700  2022-06-13T22:31:06.343793-0700
5.9   501      501       0          0        2101346304  0            0           501  active+recovery_wait+undersized+degraded+remapped  3s     3614'501  3684:4367  [5,2,6]p5   [5,6]p5   2022-06-14T00:47:39.500568-0700  2022-06-07T19:17:48.542668-0700
5.d   540      1080      0          0        2264924160  0            0           540  active+recovery_wait+undersized+degraded+remapped  3s     3614'540  3684:7209  [7,11,5]p7  [7,5]p7   2022-06-13T11:17:13.239320-0700  2022-06-13T11:17:13.239320-0700
5.12  512      512       0          0        2147483648  0            0           512  active+recovery_wait+undersized+degraded+remapped  3s     3614'512  3684:4309  [6,4,11]p6  [6,11]p6  2022-06-14T01:26:49.775034-0700  2022-06-12T16:11:10.758423-0700
5.13  519      519       0          0        2176843776  0            0           519  active+recovery_wait+undersized+degraded+remapped  3s     3614'519  3684:4381  [4,6,2]p4   [4,6]p4   2022-06-14T04:30:27.175902-0700  2022-06-08T21:20:37.383154-0700
5.1d  492      492       492        0        2059403288  0            0           493  active+recovery_wait+undersized+degraded+remapped  3s     3614'493  3684:4333  [8,4,2]p8   [1,4]p1   2022-06-14T02:22:00.852826-0700  2022-06-10T15:25:32.139164-0700

* NOTE: Omap statistics are gathered during deep scrub and may be inaccurate soon afterwards depending on utilization. See http://docs.ceph.com/en/latest/dev/placement-group/#omap-statistics for further details.
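(For reference, the up vs. acting set of any single PG from the listing above can be checked directly; the sets below are copied from the pg ls output, the osdmap epoch will differ:)

ceph pg map 5.7
# -> up [5,8,10] acting [0,8]   (acting set smaller than pool size, hence undersized+degraded)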
pool details:
pool 5 'wes_test_1' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 3602 lfor 0/0/236 flags hashpspool stripe_width 0 application testing
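(For completeness, a pool like this can be created and partially filled roughly as follows; this is a sketch, not the exact commands used, and the bench duration is arbitrary:)

ceph osd pool create wes_test_1 32 32 replicated
ceph osd pool set wes_test_1 pg_autoscale_mode off
ceph osd pool application enable wes_test_1 testing
rados bench -p wes_test_1 120 write --no-cleanup    # leave the benchmark objects in place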
CRUSH Rule:
[root@p3plocephmon001 ~]# ceph osd crush rule dump replicated_rule
{
    "rule_id": 0,
    "rule_name": "replicated_rule",
    "ruleset": 0,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -1,
            "item_name": "default"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}
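(If useful, the rule's raw mappings can also be checked offline with crushtool; the file names below are arbitrary:)

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt             # decompile for inspection
crushtool -i crush.bin --test --rule 0 --num-rep 3 --show-mappings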
Updated by Wes Dillingham almost 2 years ago
I have disabled the upmap balancer and removed all upmaps from the osdmap and the problem is still reproducible.
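(For reference, disabling the balancer and clearing upmaps amounts to something like the following; this is a sketch, not a verbatim transcript:)

ceph balancer off
ceph osd dump | grep pg_upmap_items     # list existing upmap entries
ceph osd rm-pg-upmap-items <pgid>       # repeated for every PG listed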
Updated by Dan van der Ster almost 2 years ago
I cannot reproduce on a small 16.2.9 cluster -- I changed osd crush weights several times and the PGs never go degraded, only remapped.
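(A minimal version of such a test might look like this; the weight values are arbitrary:)

for w in 0 1 7; do
    ceph osd crush reweight osd.0 $w
    sleep 5
    ceph pg ls degraded     # expect no output on an unaffected cluster
    ceph pg ls remapped
done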
Updated by Wes Dillingham almost 2 years ago
I have found that I can only reproduce this on clusters built initially on Pacific. I have reproduced it on 3 separate Pacific clusters built directly on Pacific, but not on my cluster which went Nautilus -> Pacific.
My working theory is that it may be related to this:
rocksdb sharding: https://docs.ceph.com/en/quincy/rados/configuration/bluestore-config-ref/#rocksdb-sharding
OSDs deployed in Pacific or later use RocksDB sharding by default. If Ceph is upgraded to Pacific from a previous version, sharding is off.
To enable sharding and apply the Pacific defaults, stop an OSD and run
ceph-bluestore-tool \
--path <data path> \
--sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \
reshard
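(For what it's worth, whether an OSD already uses RocksDB sharding can be checked with show-sharding; the path below assumes the default data directory for osd.10:)

ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-10 show-sharding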
Updated by Wes Dillingham almost 2 years ago
I am attaching the ceph.log and the OSD log for the OSD that was marked out; the log covers the period from the weight change until recovery is complete.
ceph.log: https://drive.google.com/file/d/1uoPIcA66D35hwCd0mZVlqKH6IBvMPLrT/view?usp=sharing
ceph-osd.10.log : https://drive.google.com/file/d/19itr2Q-XAagUUbuI7bIz7nHbW2YiJ1dY/view?usp=sharing
The logs were larger than Redmine allows, so I used Google Drive as an alternative location.
Updated by Wes Dillingham almost 2 years ago
- File osdmap.748.decompiled added
- File osdmap.748 added
- File osdmap.747.decompiled added
- File osdmap.747 added
- File osdmap.746.decompiled added
- File osdmap.746 added
- File osdmap.745.decompiled added
- File osdmap.745 added
I am attaching osdmap epochs 745 to 748, corresponding to the above.
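(For reference, maps and decompiled text versions like these are typically produced along the following lines; I'm assuming osdmaptool --print here, and only epoch 745 is shown, the others follow the same pattern:)

ceph osd getmap 745 -o osdmap.745
osdmaptool --print osdmap.745 > osdmap.745.decompiled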
Updated by Wes Dillingham almost 2 years ago
I should note that the ceph osd tree etc. in my initial "Steps to reproduce" came from a different cluster than the one I generated the logs from. Both clusters have the same issue.