Bug #56046

Changes to CRUSH weight and upmaps cause PGs to go into a degraded+remapped state instead of just remapped

Added by Wes Dillingham 6 months ago. Updated 5 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
ceph-ansible
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I have a brand-new virtual 16.2.9 cluster (and a physical 16.2.7 cluster) with zero client activity; both were built initially on Pacific. The cluster has only been partially filled with rados bench objects.

When changing the CRUSH weight of an OSD ("ceph osd crush reweight osd.10 0") or introducing upmap entries (manually or via the balancer), the cluster responds with degraded PGs instead of remapped PGs. This is counter to how things worked in past clusters on Nautilus.

On Nautilus we would set the norebalance flag while weighting in new capacity, to prevent data movement until we had finished adding OSDs. On Pacific, because these changes cause the PGs to go degraded, data movement begins immediately, and we can't make use of tools that modify upmaps to facilitate gradual movement.
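The Nautilus-era workflow described above can be sketched roughly as follows (a hedged outline; the OSD IDs and weights are illustrative, not taken from this report):

```shell
# Illustrative sketch of the Nautilus-era capacity-add workflow.
# OSD IDs and target weights here are hypothetical examples.
ceph osd set norebalance                # hold off rebalancing while adding capacity
ceph osd crush reweight osd.10 7.0     # weight in new OSDs; PGs should go remapped, not degraded
ceph osd crush reweight osd.11 7.0
ceph osd unset norebalance             # then let upmap-based tools move data gradually
```

On the affected Pacific clusters, the reweight step alone reportedly leaves PGs degraded, so this flag-based staging no longer prevents immediate data movement.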

osdmap.748 (9.85 KB) Wes Dillingham, 06/16/2022 10:11 PM

osdmap.748.decompiled (5.4 KB) Wes Dillingham, 06/16/2022 10:11 PM

osdmap.747.decompiled (5.4 KB) Wes Dillingham, 06/16/2022 10:11 PM

osdmap.747 (9.85 KB) Wes Dillingham, 06/16/2022 10:11 PM

osdmap.746.decompiled (4.95 KB) Wes Dillingham, 06/16/2022 10:11 PM

osdmap.746 (9.18 KB) Wes Dillingham, 06/16/2022 10:11 PM

osdmap.745.decompiled (4.95 KB) Wes Dillingham, 06/16/2022 10:11 PM

osdmap.745 (9.18 KB) Wes Dillingham, 06/16/2022 10:11 PM

History

#1 Updated by Wes Dillingham 6 months ago

Steps to reproduce:

[root@p3plocephmon001 ~]# ceph health detail
HEALTH_OK

[root@p3plocephmon001 ~]# ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME                    STATUS  REWEIGHT  PRI-AFF
 -1         84.00000  root default
 -5         42.00000      rack SI06-05
 -3         21.00000          host p3plcephosd706
  0    ssd   7.00000              osd.0                up   1.00000  1.00000
  1    ssd   7.00000              osd.1                up   1.00000  1.00000
  2    ssd   7.00000              osd.2                up   1.00000  1.00000
 -7         21.00000          host p3plcephosd707
  3    ssd   7.00000              osd.3                up   1.00000  1.00000
  4    ssd   7.00000              osd.4                up   1.00000  1.00000
  5    ssd   7.00000              osd.5                up   1.00000  1.00000
-11         42.00000      rack SI06-06
 -9         21.00000          host p3plcephosd708
  6    ssd   7.00000              osd.6                up   1.00000  1.00000
  7    ssd   7.00000              osd.7                up   1.00000  1.00000
  8    ssd   7.00000              osd.8                up   1.00000  1.00000
-13         21.00000          host p3plcephosd709
  9    ssd   7.00000              osd.9                up   1.00000  1.00000
 10    ssd   7.00000              osd.10               up   1.00000  1.00000
 11    ssd   7.00000              osd.11               up   1.00000  1.00000

[root@p3plocephmon001 ~]# ceph osd crush reweight osd.0 1; sleep 5; ceph pg ls degraded
reweighted item id 0 name 'osd.0' to 1 in crush map
PG    OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES       OMAP_BYTES*  OMAP_KEYS*  LOG  STATE                                              SINCE  VERSION   REPORTED   UP          ACTING    SCRUB_STAMP                      DEEP_SCRUB_STAMP
5.2       510       510          0        0  2139095040            0           0  510  active+recovery_wait+undersized+degraded+remapped     3s  3614'510  3684:6592   [5,8,2]p5   [5,8]p5  2022-06-13T23:50:25.082857-0700  2022-06-12T13:20:26.472922-0700
5.5       521       521          0        0  2185232384            0           0  521  active+recovery_wait+undersized+degraded+remapped     3s  3614'521  3684:4496  [7,11,2]p7  [7,11]p7  2022-06-13T07:04:38.311757-0700  2022-06-11T00:49:20.304572-0700
5.7       507       507        507        0  2126512128            0           0  507  active+recovery_wait+undersized+degraded+remapped     3s  3614'507  3684:4448  [5,8,10]p5   [0,8]p0  2022-06-13T22:31:06.343793-0700  2022-06-13T22:31:06.343793-0700
5.9       501       501          0        0  2101346304            0           0  501  active+recovery_wait+undersized+degraded+remapped     3s  3614'501  3684:4367   [5,2,6]p5   [5,6]p5  2022-06-14T00:47:39.500568-0700  2022-06-07T19:17:48.542668-0700
5.d       540      1080          0        0  2264924160            0           0  540  active+recovery_wait+undersized+degraded+remapped     3s  3614'540  3684:7209  [7,11,5]p7   [7,5]p7  2022-06-13T11:17:13.239320-0700  2022-06-13T11:17:13.239320-0700
5.12      512       512          0        0  2147483648            0           0  512  active+recovery_wait+undersized+degraded+remapped     3s  3614'512  3684:4309  [6,4,11]p6  [6,11]p6  2022-06-14T01:26:49.775034-0700  2022-06-12T16:11:10.758423-0700
5.13      519       519          0        0  2176843776            0           0  519  active+recovery_wait+undersized+degraded+remapped     3s  3614'519  3684:4381   [4,6,2]p4   [4,6]p4  2022-06-14T04:30:27.175902-0700  2022-06-08T21:20:37.383154-0700
5.1d      492       492        492        0  2059403288            0           0  493  active+recovery_wait+undersized+degraded+remapped     3s  3614'493  3684:4333   [8,4,2]p8   [1,4]p1  2022-06-14T02:22:00.852826-0700  2022-06-10T15:25:32.139164-0700

* NOTE: Omap statistics are gathered during deep scrub and may be inaccurate soon afterwards depending on utilization. See http://docs.ceph.com/en/latest/dev/placement-group/#omap-statistics for further details.

pool details:

pool 5 'wes_test_1' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 3602 lfor 0/0/236 flags hashpspool stripe_width 0 application testing


CRUSH Rule:
[root@p3plocephmon001 ~]# ceph osd crush rule dump replicated_rule
{
    "rule_id": 0,
    "rule_name": "replicated_rule",
    "ruleset": 0,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -1,
            "item_name": "default" 
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host" 
        },
        {
            "op": "emit" 
        }
    ]
}
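To rule out the rule itself, the mapping behavior can be checked offline; a sketch, assuming the CRUSH map is extracted to a local file (file name is illustrative):

```shell
# Extract the cluster's compiled CRUSH map to a local file.
ceph osd getcrushmap -o crushmap.bin
# Show which OSDs rule 0 would pick for 3 replicas over a sample of inputs;
# each line should list OSDs spanning distinct hosts per chooseleaf_firstn.
crushtool -i crushmap.bin --test --rule 0 --num-rep 3 \
    --min-x 0 --max-x 9 --show-mappings
```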

#2 Updated by Wes Dillingham 6 months ago

I have disabled the upmap balancer and removed all upmaps from the osdmap and the problem is still reproducible.
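For reference, an upmap entry only substitutes destination OSDs in a PG's mapping while the data stays on the previous acting set, which is why the expected state is misplaced/remapped rather than degraded. A minimal Python sketch of that substitution (the function name and values are hypothetical, mirroring the [5,8,10] vs. [0,8] mapping shown for PG 5.7 above):

```python
# Sketch of how a pg_upmap_items entry rewrites a PG's raw CRUSH mapping:
# each (from_osd, to_osd) pair replaces one OSD in the up set.
def apply_upmap_items(raw_up_set, upmap_items):
    """Return the up set after applying (from_osd, to_osd) substitutions."""
    result = list(raw_up_set)
    for from_osd, to_osd in upmap_items:
        for i, osd in enumerate(result):
            if osd == from_osd:
                result[i] = to_osd
                break
    return result

# Moving the replica on osd.2 to osd.10 changes only the destination;
# the data itself still exists on the old OSD, so the PG should be
# "misplaced", not "degraded".
print(apply_upmap_items([5, 8, 2], [(2, 10)]))  # [5, 8, 10]
```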

#3 Updated by Dan van der Ster 5 months ago

I cannot reproduce on a small 16.2.9 cluster -- I changed osd crush weights several times and the PGs never go degraded, only remapped.

#4 Updated by Wes Dillingham 5 months ago

I have found that I can only reproduce this on clusters built initially on Pacific. I have reproduced it on 3 separate clusters built directly on Pacific, but not on my cluster which went Nautilus -> Pacific.

My working theory is that it may be related to this:

rocksdb sharding: https://docs.ceph.com/en/quincy/rados/configuration/bluestore-config-ref/#rocksdb-sharding

From the docs: OSDs deployed in Pacific or later use RocksDB sharding by default. If Ceph is upgraded to Pacific from a previous version, sharding is off. To enable sharding and apply the Pacific defaults, stop an OSD and run:

ceph-bluestore-tool \
    --path <data path> \
    --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \
    reshard

#5 Updated by Wes Dillingham 5 months ago

I am attaching the ceph.log and the OSD log for the OSD that was marked out; the logs cover the period from the weight change until recovery completed.

ceph.log: https://drive.google.com/file/d/1uoPIcA66D35hwCd0mZVlqKH6IBvMPLrT/view?usp=sharing
ceph-osd.10.log : https://drive.google.com/file/d/19itr2Q-XAagUUbuI7bIz7nHbW2YiJ1dY/view?usp=sharing

The logs were larger than Redmine allows, so I used Google Drive as an alternative location.

#6 Updated by Wes Dillingham 5 months ago

I am attaching osdmap epochs 745 through 748, corresponding to the above.
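The attached epochs can be inspected and compared offline with osdmaptool (a sketch; the file names match the attachments above):

```shell
# Dump each attached epoch to text, then diff consecutive epochs to see
# exactly what changed (e.g. new pg_upmap_items or CRUSH weight entries).
osdmaptool osdmap.747 --print > osdmap.747.txt
osdmaptool osdmap.748 --print > osdmap.748.txt
diff osdmap.747.txt osdmap.748.txt
```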

#7 Updated by Wes Dillingham 5 months ago

I should note that the "ceph osd tree" output etc. in my initial "Steps to reproduce" was from a different cluster than the one I generated the logs from. Both clusters have the same issue.
